So I work in the cloud platform engineering group, and we use Swift for a bunch of different use cases: storing images, logs, and other application data. We went into production in two data centers about six months ago. First we're going to do a very quick overview of Swift, then go into an interesting use case that we had, then explain what we're doing across data centers, then some examples and problems that we've had, and then what we can do to improve.

So, a quick overview. Swift is a massively scalable, multi-tenant object store. It's eventually consistent, which means we don't want to use it for anything that requires strong consistency, like databases or file hierarchies. The ring in Swift is the data structure that allows Swift to find the actual location of the object data. It's a consistent hashing ring, and it's a static structure.

Here's the high-level architecture. First you have the access tier at the top, with load balancers as you would have for most REST applications. The proxy servers handle the inbound requests from the user and authentication, and that's where most of the logic and decision-making happens. Below that is the storage tier, which actually stores the object data, the accounts, and the containers. The storage nodes are broken up into different zones for availability and data protection purposes.

Now, the customer use case we're going to demonstrate. It's an application with two different parts. One part writes fairly small objects, a manifest that it rewrites repeatedly, and stores the actual data parts as roughly 10 MB objects. The other part of the application uses the container listing to find out which objects are available to be read, reads them, and then deletes them. There's a fairly high concurrency rate on this, and it's sharded across accounts and containers. They wanted to use this strategy because they get cross-data-center replication for free.
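To make the ring a little more concrete, here's a minimal Python sketch of how a Swift-style ring maps an object path to storage devices. This is a toy illustration, not Swift's actual implementation: the partition table and device ids are made up, and real rings are built with swift-ring-builder and include regions, zones, weights, and a hash path suffix.

```python
import hashlib

PART_POWER = 4   # 2**4 = 16 partitions; real clusters use a much larger power
REPLICAS = 3

# replica -> partition -> device id (toy data; built by swift-ring-builder in reality)
replica2part2dev = [
    [0, 1, 2, 3] * 4,
    [1, 2, 3, 0] * 4,
    [2, 3, 0, 1] * 4,
]

def get_part(account, container, obj):
    """Hash the object path and keep the top PART_POWER bits, as Swift's ring does."""
    digest = hashlib.md5(f"/{account}/{container}/{obj}".encode()).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - PART_POWER)

def get_nodes(account, container, obj):
    """Look up one device per replica for the object's partition."""
    part = get_part(account, container, obj)
    return [mapping[part] for mapping in replica2part2dev]

print(get_nodes("AUTH_demo", "images", "photo.jpg"))
```

Because the table is static, every proxy can compute an object's locations locally with no lookup service, which is what makes this scale.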
Now let's go over what this looks like across data centers. For this case, the data centers are about 50 or 60 milliseconds apart, so a few thousand miles. Typically for multi-region you get an active-active cluster, and it's also good for disaster recovery. You then have to choose how many partitions you want in each data center and balance the rings appropriately. There are a few nuances to this, though: increased latency on operations, how you deal with read and write affinity and what the impact is there, an increased eventual-consistency window, and you have to understand how much bandwidth you have between the data centers and how reliable the WAN is.

As for storage policies, they aren't strictly required, but for multiple data centers they're really, really helpful. For example, you can have one policy that lives in just one data center, and there are plenty of use cases for that: if you're storing Glance images, you don't need them replicated, so you can just store them locally. But if you're storing artifacts for a CDN, then you're going to want those replicated, so you use the replicated policy. Storage policies are basically a way of choosing a particular ring for your object data.

So here's a high-level view of a multi-region Swift cluster. We're going to walk through a write operation with write affinity on. The first step is that the application has to talk to Keystone to get a token and also get the endpoint. Then it connects to the load balancer and sends the write request, which gets forwarded down to a proxy node. The proxy node processes the request and checks whether the token is cached; depending on your token type, it may have to go to Keystone to validate it. After that completes successfully, the proxy server opens four connections down to the storage nodes. For this case, we're going to use a replication factor of four.
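For reference, read and write affinity are configured on the proxy. The option names below are the ones documented for Swift's proxy-server.conf; the region numbers are illustrative for a two-region cluster like the one in this talk.

```ini
[app:proxy-server]
# prefer region 1 for reads; lower value = higher priority
sorting_method = affinity
read_affinity = r1=100, r2=200
# send object writes to region 1 first, using local handoffs to reach quorum
write_affinity = r1
write_affinity_node_count = 2 * replicas
```

With write affinity set, the proxy acknowledges the write once enough local copies (primaries plus handoffs) exist, and replication moves the handoff copies to the remote region afterward, which is exactly the behavior walked through next.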
So it's going to open connections to two primary nodes and two handoff nodes, and the data gets written. After that completely succeeds and the containers are updated, the request returns to the user, and we have four copies in this one data center. Then, during the object replication process, the two local handoff copies get transitioned to the other data center. So now you have full protection.

Now let's see what it looks like for a multi-data-center read. For the first time we get an object, we're going to assume read affinity is on. Here we've got two data centers, with the account, container, and object rings shown below. A few things are left out to make this a little less complicated. We do a GET on the object, and it lands on the proxy. The proxy checks the cache to see if the information for this account is cached, and it's not, so now we need to go across the WAN to get the account information. Since the RTT is 50 milliseconds, the actual request overhead is a minimum of 100 milliseconds, because we first have to complete the connection round trip and then actually send the request and get the response. So we've already burned 100 milliseconds there. Now we store it in cache for next time. Then we need to check the container, and it's not in cache either, so we do the same thing: go across, burn another 100 milliseconds, then store it in cache. Only then can we read the data and return it to the user.

There's another nuance that happens with write affinity on: if you write an object, delete the object, and then read the object, you can still get it back, if that happens before the objects get replicated. We're going to go through the whole process and explain a few other nuances along the way. So we write the four objects; the two dotted ones are the handoff locations. We're now going to write the container updates.
And we're going to have to spend 100 milliseconds doing that part, even though the object data is actually local. So for small objects, you still take a 100 millisecond hit. We return to the user.

Now let's do the delete. We send the request to the proxy node, and we write tombstones to all four primary locations; the delete does not obey write affinity. So we write tombstones to the four primary locations, two local and two remote. But if replication hasn't happened yet, the two handoff copies that are local will stay there for now. Eventually, if you wait long enough, they'll get replicated, the replicator will find that the tombstones are newer, and those objects will be deleted. We return success to the user.

Now let's try the GET. We check one primary, find a tombstone, and the proxy server moves on. We find another tombstone and move on. Now we go across to the other data center: not there, not there. Then we come back to the handoffs, and the proxy finds the copy and returns it to the user, even though it's supposed to be deleted. So you can see, in this use case, if we're using container listings and the listings are out of sync, the user could get back an object they already processed.

Another thing that's useful to know is that affinity has no effect on accounts and containers. So for a container listing, you have a 50% chance of going across the WAN, and the same for account operations. If you were to, say, create a container, it would take 200 milliseconds, because you have to write the four copies, two local and two remote, and those copies potentially need to talk to the other side of the cluster, so you burn 100 milliseconds in each direction. And the successful read after a delete, I just explained that.
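The read-after-delete behavior above comes down to the proxy serving the first live copy it finds, without remembering the tombstones it already saw. Here's a small sketch of the fix discussed later in the talk; the response format and function name are made up for illustration, this is not Swift's actual proxy code.

```python
def resolve_get(responses):
    """responses: list of (kind, x_timestamp) gathered from storage nodes,
    where kind is 'object', 'tombstone', or 'missing' (hypothetical format).
    Only serve an object copy that is newer than every tombstone seen."""
    newest_tombstone = max(
        (ts for kind, ts in responses if kind == "tombstone"), default=-1.0
    )
    live = [ts for kind, ts in responses if kind == "object" and ts > newest_tombstone]
    if live:
        return "200", max(live)
    return "404", None

# object written at t=1, deleted at t=2; a stale local handoff still holds the t=1 copy
print(resolve_get([("tombstone", 2.0), ("missing", 0.0), ("object", 1.0)]))
# -> ('404', None): the tombstone wins, instead of the deleted object coming back
print(resolve_get([("tombstone", 2.0), ("object", 3.0)]))
# -> ('200', 3.0): a copy newer than every tombstone is served normally
```

The key point is that the comparison costs nothing extra: the proxy already has the tombstone timestamps in hand when it moves on to the handoff nodes.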
So WAN operations, for certain requests, will cost double or quadruple the RTT at a minimum, on top of the actual processing time, under optimal conditions. One thing that can help control this landed, I believe, in the 2.4 release: a way of controlling the timeout on the container update, which you can set to a fairly low limit. Before that change, you could get a similar effect by setting the connection timeout on the object server to a value you found acceptable, say five milliseconds, if you can assume that the container servers in your local data center will be responsive enough. The remote update would time out, an async_pending file would be written locally, and that would get processed later. So there are ways around this currently, but there's room for improvement, and the new change in, I believe, 2.4 really helps, because it just spawns a new thread and processes the update without writing the async_pending file to the file system.

Another really important thing is to understand what your worst-case scenario over the WAN is. If there's road construction and one of your paths has to go down, you might have an extra 20 milliseconds you didn't account for, and you need to know how that will affect your SLAs. Application response times also increase, because there's extra latency in all of these requests: when you move to multi-region, all of these processes slow down. You might need to think about increasing concurrency to deal with that, but be careful not to increase it too much and end up with too many open TCP connections.

So what can we do in Swift to help improve this? For the read after delete, it's a pretty simple fix.
I don't have the patch submitted yet, but you can simply check the timestamps of the tombstones you've already found and not return the object if a tombstone is newer than the object you have. Read affinity for accounts and containers shouldn't be too hard to add either. And if we could have container updates happen locally and then go asynchronously to the remote side, combined with read affinity for accounts and containers, then you could have fairly quick local read/write access with fairly consistent results. If the application failed over to the other data center, there would just be the replication window for the account and container listings and the objects.

Then there's storage policies: there's a patch out for this, not reviewed yet, for having affinity rules set per storage policy. That would be really powerful, because then the application can choose which level of consistency it wants. Does the application want to take the latency penalty and have everything written synchronously to ensure all copies are there, or are asynchronous, write-affinity writes better? That matters if you have a generalized system and you don't want to be spinning up one-off clusters for a single use case.

Another important issue is that objects written with write affinity just sit there until the object replicator goes and replicates them, and the replicator has to walk the whole file system to find the objects before sending them. If there are problems, that can take hours or days; if you're operating well, it might only be a few minutes, depending on the number of items in your system. But it makes managing replication, and making sure it actually happens, a lot more difficult and time-consuming.
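One way the proposed improvement could work is to record each write-affinity handoff in a queue at write time, so a sender process can push those objects to the remote region immediately instead of the replicator discovering them through a full filesystem walk. This is a sketch of the idea only; the names (record_handoff, drain) and the in-memory queue are made up for illustration, and a real implementation would need a durable queue.

```python
from collections import deque

# queue of (object_path, remote_node) pairs written under write affinity
handoff_queue = deque()

def record_handoff(obj_path, remote_node):
    """Called at write time, when the proxy already knows which copies are handoffs."""
    handoff_queue.append((obj_path, remote_node))

def drain(send):
    """Push everything queued; 'send' is whatever transfers one object."""
    sent = 0
    while handoff_queue:
        path, node = handoff_queue.popleft()
        send(path, node)
        sent += 1
    return sent

record_handoff("/AUTH_demo/images/photo.jpg", "dc2-obj-01")
record_handoff("/AUTH_demo/images/photo2.jpg", "dc2-obj-03")
print(drain(lambda path, node: None))  # 2 objects pushed, no filesystem walk needed
```

The win is that the remote-replication delay becomes proportional to the queue depth rather than to the total number of objects on disk.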
There are ways around this: queuing these async writes and having a process send them as fast as possible, versus waiting for a full file system walk. The other potential improvement is TCP connection pooling: if we keep the cross-data-center connections established, we don't have to pay double the RTT, just a single RTT, so that could cut those operations roughly in half. Any questions?

Have you looked at the three-data-center case for this, or are you only doing two-way replication? I'm only doing two right now. Yes, correct. The question was about an active-passive application, where the application fails over from one data center to the other. You have the window of the replication time, so getting that reduced would be very beneficial in a failure scenario, so you don't have objects orphaned for too long.

You're using four-way replication, which requires that three of the copies be accessible. Two questions: one, are you also doing that for the containers and the accounts? Yes. And what happens when the WAN goes down, does that mean you can't create objects? You can still create objects and do that over the WAN. But you may want to look at doing five replicas for the account and container rings, because there are cases where you get 503s when there are problems with the WAN. And how has your experience been with the four-replica setup where you need three copies? It works pretty well, but you do get some issues, mostly with container operations, create and delete. Those are fairly infrequent, so it depends on the use case.

Have you done anything in terms of tuning and monitoring the WAN for the replicator, any tooling around bandwidth or the behavior of replication going over the WAN?
We haven't had too many problems with that in terms of reaching bottlenecks on the WAN. Yeah. All right, thank you.