All right, hello. This is a huge conference, oh my gosh. Hi, my name is Joe Arnold. I'm the CEO of SwiftStack, and in this session we're going to talk about a globally distributed storage cluster with Swift. What this enables folks to do is use multiple data centers, replicate data between them, and have one storage system, so they can flip back and forth between those data centers to get access to the data.

We wrote a book on OpenStack Swift, and we're giving copies away at the SwiftStack booth. We're doing book signings tomorrow and Wednesday, and I have a few up here if anyone wants one after we do Q&A. A few more announcements: we have a party going on at the Tabernacle on Wednesday at 8:30, so please do come check it out. There have been a ton of Swift sessions going on. Tomorrow there's a Swift 101 session, which I'm doing with John Dickinson, the Swift PTL, and right after that there's a case study with Fred Hutchinson, so it'll be a good back-to-back tomorrow. And if that's not enough Swift for you, on Thursday you can spend all day immersing yourself in Swift at a workshop we're doing near where most of the hotels are. Come check it out and sign up.

So, a globally distributed Swift cluster. We implemented this feature about a year ago, and the thing we were trying to solve was that people were deploying applications where they wanted to upload data and have it available in multiple places. The first customer we worked with, our first user of this, was announced at the Hong Kong Summit, and what they do is expense reporting and travel documents. They wanted a better user experience for serving content out. The use case, and we showed a video for the keynote then, is that an operator can set up two regions, and content can be uploaded and directed to, say, one of those data centers. What they wanted was the ability to upload entirely in one region, and then also the ability to read that data from another region. In this case, using Hong Kong as an example, the data gets replicated asynchronously over to that environment. Then another user who wants to access that data, say from the other side of the world, can pull up that picture: a GeoDNS lookup directs them to the nearer data center, and the object is served from there. So depending on where the user is, the user will be redirected to a different data center. That was one of the main use cases.
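To make that upload-in-one-region, read-from-another pattern concrete, here's a rough sketch against the plain Swift HTTP API. The hostnames, account, container, and token are made up; in practice a single GeoDNS name would resolve to whichever regional endpoint is nearest.

```
# Upload an object through the "local" region's endpoint (all names here are hypothetical).
curl -i -X PUT \
     -H "X-Auth-Token: $TOKEN" \
     --data-binary @receipt.jpg \
     https://hongkong.swift.example.com/v1/AUTH_expenses/receipts/receipt.jpg

# Some time later, replication has copied the object to the other region, and a user
# near that region reads the very same object through their local endpoint.
curl -H "X-Auth-Token: $TOKEN" \
     -o receipt.jpg \
     https://europe.swift.example.com/v1/AUTH_expenses/receipts/receipt.jpg
```

The application itself doesn't copy anything between regions; the second request works once Swift's replication has moved the object over.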
What we were solving for were some hard problems. The first was that most people setting up disaster recovery ended up with a larger storage footprint: you had your primary storage that the application uses, and then a one-way replication to the disaster recovery site, which was more or less identical; some people set it up with maybe the previous generation of equipment. The problems with that are that you have more infrastructure, and not all of it is being used to service user requests. You also have larger failure domains, because larger volumes are being served out to users: more eggs in one basket, so if one of them goes down, it's more consequential. And the failover wasn't transparent: if the main site went down, there was a hiccup in that application before the failover took effect and the application could continue.

One of the ways people try to solve this is user sharding. They'll take the knowledge of all the users they have and say, we're going to put this group of users in this storage pool, this group in that other storage pool, and so on, and when they add new farms of capacity, they grow that way. Some of the issues with that: you're adding new users to new equipment, while your old equipment is older and full, so it's slower, and the new equipment doesn't have any users on it yet, so it's fast. It didn't make any sense. It made upgrades harder, and there was more to manage, because you have more storage environments to work through. The other issue, if you were to do a distributed file system with POSIX semantics, is that consistency needs to be addressed, and waiting for a global write lock can add latency into the system. That was a design goal of POSIX that isn't necessarily needed for all applications, and it causes things to break down once the sites reach a certain latency distance from each other.

So why not just use the public cloud for everything? If you're familiar with the Amazon ecosystem, they have a product called S3, which is also object storage in the cloud, and they too have multiple regions. But you, as an application, have to know to put data in this region or that region, and you have to move the data between the two regions yourself. One of the first service providers to use this multi-region capability in Swift is EnterIT. They have a multi-region European cloud, they have a booth here, so you can go talk to them about it, and it's powered by some of the software that we built for them.

So how does Swift work? Swift has a data placement strategy, and sometimes it helps to think about things in extremes. If you take a Swift environment and collapse it down into just a single box, what it does is unique data placement: each drive is used individually to put data onto. There's no RAID in the system; you don't put Swift on top of a RAID volume. Data is replicated across those different drives. If you move up to a medium-size Swift cluster, say three nodes, which is really common, maybe 50 to a few hundred terabytes, you might have three or four boxes, and Swift will make sure that objects are uniquely placed across each one of those nodes. If you grow a bit larger, you might have multiple racks of gear, and Swift will again make sure that data is placed across those different racks. The analogy keeps holding: with multi-region, data is distributed across multiple data centers. That's the general strategy.

The first thing that was introduced to support this is the concept of zones. A zone in Swift allows an operator to say, here's a section of my data center: it's running under this power source, or it has this cooling environment, or it's this network segment, and here's another one. It basically gives an operator a tool, inside the same data center, to say: this is an isolated failure domain. That was one of the first things Swift implemented to let an operator express that.
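As a rough sketch of how an operator expresses those zones, here's what the stock swift-ring-builder commands look like; the partition power, IPs, ports, device names, and weights are all illustrative, not from the talk.

```
# Create an object ring: partition power 17, 3 replicas, 1-hour minimum between moves.
swift-ring-builder object.builder create 17 3 1
# Add one drive in each of three zones; Swift then keeps the replicas of any
# given object in different zones.
swift-ring-builder object.builder add z1-10.0.1.10:6000/sdb1 100
swift-ring-builder object.builder add z2-10.0.2.10:6000/sdb1 100
swift-ring-builder object.builder add z3-10.0.3.10:6000/sdb1 100
swift-ring-builder object.builder rebalance
```

Regions, which come up next, extend this same device syntax with an r prefix, for example r2z1-10.1.1.10:6000/sdb1.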
One of the components in Swift is the proxy tier. I just showed one proxy, but usually there's more than one, obviously, because you need to be able to fail over. The proxy tier routes requests from a client to wherever the storage is. On a write, the proxy streams that write down to each one of the storage nodes simultaneously, and the client receives an OK only when what we call a quorum responds: if you're using three replicas, when two respond back correctly, then hey, we've got your data, you're good. On a read, the proxy goes to one of those nodes, reads the data back, and serves it out; if it can't reach that data, it tries an alternative. And even if a failure happens while the read is in progress, it keeps the connection to the client open, finds another location for that data, fetches the remainder of the object, and continues feeding it up to the client, so there's no interruption.

So we went, aha: we have these components in place; what if we add another abstraction on top of that? That's where regions come in. Now we have the ability to say, here's a region for a data center, then zones within that data center, and inside those there are nodes. Swift uses all three tiers to do data placement.

What this means is that when a client makes a request to read data, the proxy has some knowledge about where it is. When a request comes in, it's not going to send that user off to fetch data over a WAN link; it's going to fetch it as close as it possibly can. It follows two rules. The first is: which region am I in, and which zone am I in? That's something you as an operator tell it, and it's the first prioritization. The second prioritization is that it keeps track of the latency to each of the storage nodes and routes the request to the nearest storage pool.

With writes, there are two configuration options. The first is what we call a normal write, and this is what we recommend for most use cases; not to say the other one's abnormal, it's just the default setting. The client puts data to the proxy, and the proxy streams it to all the storage locations at the same time. The reason we recommend this for most use cases, and this is just what we see out in the field, is that when you do an upload, you usually want the data to go live in all the locations at the same time. It's great for backups, it's great for content, things like that, and it prevents replication from falling behind client ingest. Then, for what we call affinity writes, the client uploads to a proxy, the proxy is aware of where it is, and it writes only locally at first; then, asynchronously, Swift puts the data in the off-site location. We recommend that when you're trying to speed up ingest and users are waiting on uploads.
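Here's a hedged sketch of what those read and write affinity knobs look like in the proxy configuration; the region and zone values are illustrative and would be set per data center.

```
# Sketch of proxy-server.conf affinity settings on the region-1 proxies; values are examples.
[app:proxy-server]
use = egg:swift#proxy
# Reads: prefer nodes in region 1, zone 1, then anything else in region 1 (lower = preferred).
read_affinity = r1z1=100, r1=200
# "Affinity writes": commit writes to region-1 nodes first and let replication move copies
# to the remote region asynchronously. Leave these unset for the default behavior of
# writing to all regions at the same time.
write_affinity = r1
write_affinity_node_count = 2 * replicas
```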
The use cases for affinity writes, specifically, would be things like media ingest. If you're working with folks in media and entertainment and you want multiple geographically distributed clusters, you can upload just locally, very quickly, and then let the off-site replication trickle out. Of course, networking is very important here: you can specify a dedicated replication network or WAN link and apply separate QoS between the two regions.

Then, when you go to deploy and you tell the users or the applications how to be routed to one of these proxy pools, that needs to be figured out. Again, this isn't part of Swift; it sits above Swift, conceptually, and it's done with DNS routing to either one of those storage pools. Some people get fancy and use a GeoDNS product, which Akamai and others offer, that knows, based on the client's DNS query, roughly how far away everything is and resolves the client to the nearest pool. That's how people plug that in.

Multi-region works really well with storage policies. What storage policies do is allow an operator to say: I'm going to create a policy in one region, a policy in another region, and another one that spans the two. That allows all sorts of different configurations. If users put data into, say, that purple one, the region-one-plus-region-two policy, it means the data is going to be in both locations; or they can choose one of the others, and you can charge back differently for each. If you missed it, Paul and John gave a good talk on storage policies this morning, and I'm sure the links will be up on YouTube, so keep posted for that.

The reason all of this works, and works sanely, is that there's no file or object locking happening in the system. We kind of cheat, in a way. It means uploads can occur while other people are uploading the same objects at the same time, and conflict resolution is very simple to mediate: it's done through timestamps, newest file wins, and there's no logic in the code to distribute any global file locks around. That makes it faster to access objects, because you're not waiting on locks, and it increases the number of requests the cluster can take at any given time, because nothing is shared between any of the nodes that would need to be locked up. It's very efficient that way.

TCO is another great reason. Because you don't need to set up a site A and a site B and have them replicate between the two, you can buy less hardware; it's one system. The Concur use case was presented at the Hong Kong OpenStack Summit, and this is their TCO: with four replicas, including all the power, space, cooling, management, and personnel costs, and including licensing costs, it was less expensive than the public cloud and less expensive than traditional storage. And this is a very intensive application. Then, for an active archive example, there's Fred Hutchinson, who are presenting tomorrow at 12:05, and this is what their costs ended up being. They have really, really low-cost power because they're in the Northwest, but it's still a compelling number. They did three data centers on the same campus and three replicas.
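Going back to storage policies for a second, here's a rough sketch of how those region-scoped policies might be expressed in swift.conf. The policy names are made up, and each policy gets its own object ring, built only from devices in the region or regions it should cover.

```
# Illustrative storage-policy sections in /etc/swift/swift.conf (other sections omitted).
# Policy 0: replicas placed in both regions (the "purple" case from the talk).
[storage-policy:0]
name = region1-and-region2
default = yes

# Policies whose rings contain only region-1 or only region-2 devices.
[storage-policy:1]
name = region1-only

[storage-policy:2]
name = region2-only
```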
OK, so that was all the fun stuff to talk about; this is the hard stuff. If you go to do this, there are some challenges, some hard problems to solve.

The first is adding and managing the storage capacity. The way Swift works, when you add a device, it takes in an individual device, an individual hard drive, and for that hard drive you have to say: here's its location, here's how much capacity it has, and here's how to access it, both for storage access and for replication access. At the end of the day, that means you have lots of stuff to manage: lots of devices, lots of zones and regions and little details, Swift configuration details, and then the building of the Swift rings, which we'll go into tomorrow in the Swift 101 talk. Distributing those rings around, across multiple data centers and multiple machines, so it's all orchestrated and acts as a whole, is pretty challenging.

The other thing is adding capacity itself. The way you add capacity in Swift is by tuning some variables: when you add a new hard drive, you add it with what we call a weight, and we slowly ratchet that weight up, mostly so that you're not flooding that drive by turning it on at 100% all at once, because there's a lot of network traffic involved. Adding a region often involves adding another replica, so you're going from three replicas to four replicas, and you're doing it over a WAN link. I don't know how many of you have been in the situation where you start saturating your WAN link, but I know I've been yelled at before for blowing that thing up. So the strategy is to incrementally change the number of replicas over a longer period of time. You go from three replicas to 3.1 replicas, which means you go from 100% of your objects having three replicas to 10% of them having four. You push that out and let the replication settle, then you do it again and let it settle, and you do this in a coordinated, orchestrated way so that you're adding capacity gracefully.

The other thing to solve for is authentication. Swift has, in the proxy tier, a memcache ring; it's actually a really cool piece of technology that came out of the Swift community. What it does is cache things like access control lists for objects, and the way Swift does authentication, it hands out auth tokens, and those tokens are cached in that memcache ring. What you don't want is for that memcache ring to span a proxy tier that sits across a WAN link, because then you'd get random latency issues whenever a request happened to traverse the WAN to fetch that cached bit of data. So configure the memcache ring so it encompasses only the proxy pool in each region; that's the strategy to solve that.

At SwiftStack we focused entirely on building deployment tools to orchestrate, manage, and scale Swift, so here's how we solved it: we have agents running on the nodes to identify devices and fold them in, a way for operators to place them into regions and zones and manage that configuration, and then we automate the storage provisioning. That whole process of readjusting the weights, readjusting the number of replicas, rolling it out, pushing it out, and confirming the configuration is durably distributed across all the nodes is something we do as a product.
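To make that replica ramp concrete, here's roughly what it looks like with the stock ring-builder tool; the builder file name is illustrative, and the orchestration around pushing the ring out is what the tooling automates.

```
# Nudge the ring a tenth of a replica at a time instead of jumping straight from 3 to 4,
# so the new region fills over a longer period without saturating the WAN link.
swift-ring-builder object.builder set_replicas 3.1
swift-ring-builder object.builder rebalance
# Push the new ring to all nodes, wait for replication to settle, then repeat:
swift-ring-builder object.builder set_replicas 3.2
swift-ring-builder object.builder rebalance
# ...and so on, until the ring reaches 4.0 replicas.
```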
One of the other hard things is network monitoring. You want to watch that the replication network is not getting behind what you're ingesting, so keeping tabs on that, so you know if there are any issues, is an important thing to do. But even if that WAN link goes down while you're writing in, Swift will still find places to put the data durably, and when the WAN link comes back, it'll push the data over.

And of course, the management tooling needs disaster recovery too. However you're doing deployment and configuration management, make sure it's not some VM parked in a data center somewhere that's now not accessible. Have a disaster recovery strategy for however it is you're managing your system. The way we do it is a warm failover, usually in the other environment, and we use the same DNS strategy to flip over when we need to.

As for our product: we work on Swift, that's what we do as a company. We have a lot of the core team members, the project technical lead works for us, and we try to do the best we can to rally a lot of developers in the ecosystem. And we build a deployment tool around it. It's Swift at the core, on commodity hardware; we try to make it as easy to deploy and scale as possible, and we want the control of the storage system to be in the software, not in bundled hardware and software. So that's what we provide: deployment, integration, scalability.

To sum up the benefits: you can distribute data across zones within a data center and across multiple data centers. The software corrects for failures, so that rerouting happens in software whether it's a networking failure or a drive failure. You can use all your capacity to serve all your users, so you provision for all the capacity and all the incoming requests that you need, and you can route across all that available capacity instead of having dedicated disaster recovery equipment. And through that process you can route users to nearby data, which gives them their data faster and makes them happier.
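On the replication-monitoring point above, here's one way an operator might keep an eye on whether replication is keeping up, assuming the standard recon middleware is enabled on the storage nodes and the command is run from a box that has the rings.

```
# Query replication stats across the cluster.
swift-recon object --replication
swift-recon container --replication
# Each reports per-node replication times and how long ago replication last completed,
# which helps spot a region falling behind after a WAN hiccup.
```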
And that's it. We have the party, come see the book, and thank you very much. Any questions?

So the question was about daisy-chaining of replication. If you saw some of the earlier diagrams, sometimes you might have two copies of the same object traversing the WAN. So we had a choice to make: do we try to add some smarts into how the replication is done and make efficient use of the WAN, or do we just naively push the data across the WAN? If we had wanted to be smart about it, we would have had to introduce some shared state into the system that said, hey, this is the one copy that's going to go across the WAN. We chose to be more naive in exchange for better scale-out properties. There's a trade-off there, and we didn't want to introduce shared state into the system.

Yes, so the question is, is Swift the back-end storage for it? Well, this whole conversation was Swift as the back-end storage, and because the data is stored in Swift with an eventually consistent model, it allows for use cases such as this. It's more challenging to build this strategy on top of strongly consistent storage underneath. And yes, this goes down to the disk; it's a full solution down to the disk.

Yep, a question in the back. So the question was, how do you configure storage policies: is it something set by an application, or is it set by the operator? Storage policies are a tool for operators to specify options for how you want your users to use the cluster. It's not a user-defined setting; it's an operator-defined setting. But if you had an application that was significant enough and you wanted to create a specific storage policy for that application, you could do so.

And yes, one more question, about GeoDNS. GeoDNS is outside the scope of Swift, so it would be something you use or acquire through some other service. The way those services typically work is they keep track of the distance between the client making the request and all the different options for the host, and when the client makes its DNS query, they resolve it to the host name that's going to be good for that client; it makes a guess at that. So it's outside the scope of Swift as a project, but it's often how people plug this in. Does that answer the question? Oh, right: by the time the read request comes in, the client already knows where to go. You might have swift.example.com as the top level, and then an East Coast swift.example.com and a West Coast swift.example.com, and when you're making those API requests, you've already resolved down to that specific host name.

So, all right, thank you very much, and if anyone's interested in a book, come on up; I'd be happy to provide one.
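As a footnote to the storage-policy answer above: once the operator has defined a policy, an application opts into it simply by creating its container with that policy. A rough sketch, reusing the hypothetical policy name from the earlier swift.conf sketch (hostname, account, and container are also made up):

```
# Create a container pinned to the operator-defined "region2-only" policy; every object
# uploaded to this container then follows that policy's placement.
curl -i -X PUT \
     -H "X-Auth-Token: $TOKEN" \
     -H "X-Storage-Policy: region2-only" \
     https://swift.example.com/v1/AUTH_videoapp/media-assets
```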