All right, everyone, let's take a seat, and we'll get started here in just a few seconds. Hey, everyone. I'm Joe Arnold from SwiftStack. I'm Clay Gerrard. I'm a core contributor to Swift. And we're going to be talking about Swift, bigger than big. And I guess the topic is Swift scales massively. How massively does it scale, Clay? As big as you want it to go. And so it's architected for that. And you can add nodes. Also, things like new drives and new servers are coming along, where density is going up. And there's been an evolution in Swift over the years. Because when did Swift start? Well, OpenStack started about five years ago. Yeah. And even a couple of years before that, we were working on Swift. So there's been evolution as not only the use cases have grown in size, but also the type of hardware that people have used has changed. It's grown more dense. So we're going to talk about that. All right. Building for scale. So what are we seeing? Yesterday we saw Expedient talk. They're a service provider, and they've been building out storage as a service. That means lots of users are all on the same storage. We have folks from OVH doing exactly the same thing. So we're seeing more concentration of more storage in fewer pools. That puts some pressure on the size of the environment. That was one of the original use cases. But then it's opened up from there. It became open source technology, and more people have started adopting it and applying it in all kinds of ways that we didn't expect. Correct. And you have Swift being deployed all by its lonesome in these environments, in places like entertainment, where media files are going from 4K to 6K to 8K. So you have these very large repositories. OMRF spoke earlier today. They're in the life sciences space. Similarly, there's no OpenStack compute environment. It's just storage. And they're storing whole human genomes in that storage system.
Scientific computing use cases where you have microscopes and lots of sensor data being put into the cluster. Enterprises, we're seeing backup. We've been talking about NetBackup and Commvault. So many different applications out there in the ecosystem that people are just picking up off the shelf and plugging right into their object storage. And then software as a service applications. Instead of everyone having an Exchange server running in their closet somewhere, they're using Gmail. But that pattern repeats itself all over the place. So more concentration, more data in fewer domains, fewer instances. The other thing that's been interesting: six, seven years ago, we saw two terabyte, four terabyte drives. That was very exciting. Those were big drives. But now, 10 terabyte drives are a standard order. Nowadays it's 10 terabyte, 8 terabyte drives, and I think you can buy 12 terabyte drives now. And in the next 12 to 18 months, I think Seagate talked about having 16 to 18 terabyte drives, with 20 terabyte drives right around the corner. I mean, these are big monster drives. And we won't think so in another 10 years, but right now, to us, they feel big. So the other thing that's happened is the chassis. In the beginning, it was a 24-drive 4U chassis, and that was big a few years back. And now, 60 drives. That's the Cisco S3260, and that's a pretty common chassis for people to roll in. That's a lot of disks per CPU and core and system. Yeah, well, it's worse than that. As the chassis are getting denser, they're not really trying to apply more CPU and more memory. They want to do with even less. The JBODs are taking fewer resources. And so this has been one of the biggest challenges that we've been working on over such a long time. We've been doing this for a long time. And the mix of hardware that's coming together, and that people want to apply to their storage problems, is changing.
And so we've been keeping up. Yep, all right. So let's start. I think we have to cover some of the basics of Swift, though, to give some context for why some of these things are the way they are. It's another dimension of scale. The nodes themselves are getting smaller, but the clusters are getting bigger too. The geographies that we're tackling are changing. Right, and also multi-region is a thing now, where we have multiple regions that people can deploy. So that means you need to communicate and replicate data across more networks that aren't necessarily close to each other. And even in the public cloud case, where we're replicating data out to the public cloud. So you have to deal with more of these things happening, rather than having a rack of gear with everything in the same data center. So it's like this perfect storm: Swift is being adopted more and more in the ecosystem for all kinds of new challenges, we're getting it onto denser hardware that we didn't have available to us when we started architecting Swift, and they're deploying in even more complex topologies. So as they're wanting more out of it, they're asking more of it, and we're doing it with less resources and applying it to new areas. So, want a lot of fun? A lot of fun. All right, so the architecture components. We have just one slide on this, just as an overview. So: reliable, scalable object storage core. That's the meat and potatoes of the storage system. And the architecture goals, like high durability, high availability, and scale, and needing to achieve all those things. So it's a shared-nothing architecture, meaning that each node in the system is a worker bee. It's doing what it needs to do, and it's not necessarily needing to coordinate with the other nodes in the system in a synchronized way. Fault tolerant: any single node in the system can go down. Or you could have two halves of the cluster not be able to talk to each other for a period of time.
And then come back online and be able to reconnect and re-synchronize with each other. The automation around replication, where if things like drives go bad, the system can automatically repair itself and bring the durability level back up to where it needs to be. And continuous auditing. So detecting errors, even before a user might detect an error, constantly sweeping through the system. And then the durability strategies of using replicas in the system and erasure codes in the system. Those erasure codes are new in the past few years, and they have an impact on some of the processes doing all the things on that right-hand side, which needs to be taken into account. And so let's talk about the ring. OK, so do it. So one of our saving graces, as we're applied to all of these new challenges, is those design tenets of the architecture. What we wanted the system to do. It's interesting how one design goal sort of gets into the implementation. If you want to have horizontal scalability, you have to design a system that's shared nothing. You have to have ways to reduce contention and coordination between the nodes. So one of the foundational pieces of the Swift architecture is the ring, which is just a consistent hashing ring. That was proven at the time. I mean, consistent hashing as a concept has been around since the 90s. And Swift takes that concept and extends it further. But that design sort of gets into the implementation in the way that we have to do rebalancing. Take your entire namespace: rather than keeping an index where, for every one of the 10 billion or more objects that you're loading into your cluster, you have to go update indexes and maintain a lookup table, you can use a hashing algorithm to distribute things across the nodes. And then you break that up into partitions, in order to have units that you can float around as you add capacity to the cluster, without having to move around a lot of data.
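The hashing idea described here can be sketched in a few lines of Python (Swift's own language). This is a toy, not Swift's actual ring code: the part power is deliberately tiny, and the names are made up, but the md5-and-top-bits mapping is the general shape of how a name lands in a partition.

```python
import hashlib

# Hypothetical, tiny part power for illustration; real rings use much
# larger values so partitions spread finely across many drives.
PART_POWER = 4  # 2**4 = 16 partitions

def object_to_partition(account, container, obj, part_power=PART_POWER):
    """Hash the object's full path and keep only the top `part_power` bits,
    so any name deterministically lands in one of 2**part_power partitions,
    with no per-object lookup table to maintain."""
    path = "/{}/{}/{}".format(account, container, obj).encode()
    digest = hashlib.md5(path).digest()
    raw = int.from_bytes(digest[:4], "big")  # first 32 bits of the hash
    return raw >> (32 - part_power)

part = object_to_partition("AUTH_demo", "photos", "cat.jpg")
print(part)  # a stable partition number in [0, 15]
```

Because the mapping is pure arithmetic on the name, every node computes the same answer independently, which is what makes the shared-nothing design work.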
So when we add some capacity in here, you can see node three came in. A few of the partitions have to be reassigned. Every name that could potentially come into the cluster has a place that it has to go. But whenever you add new capacity, some of the partitions have to go to other nodes. So in designing for a system that can extend out, as you add capacity to the cluster, you're actually growing the available resources that are there for you to do things like aggressive, active replication, where we want to repair the system back to a fully consistent state as soon as possible. You want to have background processes that are always auditing. So as you're adding new capacity to the system, you want the whole system to grow, not just have the few nodes coming online handle the new ingress as you sort of stack them on the end. The whole system has to grow together. Because there's two choices when you're adding capacity into the system: do you want all the new data going into the new storage nodes, or do you want the new storage nodes to take the brunt, take some of the existing storage in the system and the workload of bringing that data into the system? Right. But as we get into bigger nodes and larger geographies, we're being asked to do all of that same consistency work, but we need to optimize it. And luckily we've had a lot of time and a lot of deployments from which we can learn how to make that work really well. And that's what we have a lot of data around, right? So we have a product to help deploy and manage Swift. One of the key parts of that tool is the ability to add capacity to the cluster. It's building the rings, it's redistributing those rings. And so we can watch a bunch of stuff about the rebalancing activity that's happening. How long does it take for us to add the new capacity into the cluster?
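The "only a few partitions move" property can be seen in a naive sketch. This is not Swift's actual ring builder (which also handles replicas, weights, and failure domains); it is a minimal single-replica toy showing that adding a node relocates only enough partitions to even out the load, and leaves the rest alone.

```python
def rebalance(assignment, nodes):
    """Naive single-replica rebalance sketch: move just enough partitions
    onto under-filled nodes to even things out, leaving every other
    partition assignment untouched."""
    target = len(assignment) // len(nodes)
    counts = {n: 0 for n in nodes}
    for node in assignment.values():
        counts[node] = counts.get(node, 0) + 1
    new_assignment = dict(assignment)
    for part, node in assignment.items():
        under = [n for n in nodes if counts[n] < target]
        if not under:
            break  # everyone is at (or near) the target; stop moving data
        if counts.get(node, 0) > target:
            dest = under[0]
            new_assignment[part] = dest
            counts[dest] += 1
            counts[node] -= 1
    return new_assignment

# 16 partitions on two nodes; node3 joins the cluster
old = {p: ("node1" if p < 8 else "node2") for p in range(16)}
new = rebalance(old, ["node1", "node2", "node3"])
moved = sum(1 for p in old if old[p] != new[p])
print(moved)  # 5, only about a third of the partitions relocate
```

Contrast this with naive modulo hashing, where changing the node count would remap nearly every partition; that difference is the whole point of consistent hashing.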
And adding capacity to the cluster involves not just changing that data structure, but also, obviously, moving the data onto that new capacity. And so we can watch: okay, how long does it take to add a new server? Especially these new 60-bay, eight-terabyte-drive chassis, or when people are expanding out and adding a whole rack of them. How long does it take? How long are the replication cycle times? How long are the audit times taking? And we can watch that and have it inform some of the designs. In fact, we have a whole monitoring page. I mean, Clay, you're involved in a lot of this. One of Clay's graphs of the week that he always likes to point out. Just watching the rebalancing time in customer accounts. And this has been a thing of just adding the visibility to understand when, in a particular environment, some environmental issue is causing things to not behave the way that we want. How can we get a real clear picture of that at a glance? So that we know what to go drill down into and think about for the next time we get into a similar situation. But also, it's an education thing. Working with people to understand the way that object storage and distributed systems scale, and the way that things work, is maybe different from what they've dealt with in the past. So having some clear visualizations, and a lot of insight under the covers into what's going on in the system, can help people who are operating, monitoring, or managing their Swift cluster as just one of their part-time jobs get up to speed really quickly on what they need to do and understand about how these clusters grow. Right, okay, so let's talk about some of the optimizations. This is ostensibly what this talk is about. So the first was just targeting being efficient about writing data to disk: queuing it up so that you don't have to be as random about writing data, and can be more sequential about it.
Yeah, just anything that we can do to eliminate a minor stat call here or a read call there. A lot of times you get this via profiling. We have various metrics that are being emitted from the system and a large community of deployers that are all looking at these together. So there's a constant feedback of, hey, are you guys seeing this? Or, I see this thing on one particular node. Or we get some tracing data, and then we go look at the code and reason about it and say, well, what if we organized this differently here? And so, especially over the past year, there's been a series of small, minor patches that have made changes down in the disk file and the updates that we do to the hashes pickle and other things, to do append-only writes instead of a read, deserialize, update, rewrite cycle. And every little IOP that you can save really pays off a lot. And it's a continuing thing. We've got more work going on in the community to change some of the things that we do in the background of replication, even some dramatic things that we do in the disk files for lots-of-small-files use cases. So, really, really exciting, and it just pays off so much at scale. It seems like a minor thing. It feels like we're micro-optimizing, but because this is something that's happening tens of thousands of times on disk, it really pays off. One of the next things was rebalancing. We talked about all those auditing and replication processes that are going on, and those are there to detect when things are not connected, or something is failing in the system, or a disk has gone bad. But when we're in a rebalancing state, like we know we're adding capacity or we've detected that there's a disk down, can you talk about some of the optimizations that we've made there? Yeah. So, there's a number of different things that you can do.
So, first of all, with that continuous auditing in the system, one thing that you have to be able to do is apply some back pressure, so that you're trading off against the new incoming requests that are coming into the system as the replication's going on. And it's a trade-off that operations needs to make: they have some new capacity online, and they need to get that capacity participating in the cluster as quickly as possible, but they're also servicing their client requests on the same system, so they need to be able to trade off there. So, focusing the work that you are doing during a rebalance so that it's aimed just at making that new capacity fully in use on the cluster, and trying to avoid or deprioritize other work that is not directly related to the rebalance, is a big change that is going on in the reconstructors and the replicators right now. I've actually taken a few ideas from the erasure code implementation, which is slightly different from the replicator implementation, but they're both involved in the rebalance process, and so there's been some feedback and cross-pollination, and I think those two will continue to converge in the code base. So, okay, so the analogy here is: don't pass Go, don't collect $200, right? It's like, don't wait for the replicator to tell you you need to move some data, just start moving the data right away. How is that different with replicas versus erasure codes? Because with erasure codes you have two choices. With replicas you can just move the data, because it's a whole replica. With erasure codes you can either reconstruct, or you can move data from where it was to where it needs to be. What are some of the choices made there? Right, so those are the parallels.
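The "focus on the rebalance work first" idea can be sketched as a tiny ordering policy. The names here are hypothetical; Swift's daemons implement this with much more nuance, but the core move is the same: do the work that gets new capacity into use before routine consistency passes.

```python
def order_rebalance_work(local_parts, primary_parts):
    """Put handoff partitions (ones this node is no longer primary for
    after a rebalance) at the front of the work queue: reverting them
    gets the new capacity into use, and frees local space, before the
    routine consistency checks of partitions we are still primary for."""
    handoffs = [p for p in local_parts if p not in primary_parts]
    primaries = [p for p in local_parts if p in primary_parts]
    return handoffs + primaries

# partitions 7 and 15 now belong elsewhere, so they jump the queue
print(order_rebalance_work([3, 7, 11, 15], {3, 11}))  # [7, 15, 3, 11]
```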
So in the reconstructor, because we've actually got fragments of data in EC, no individual piece of data on disk constitutes the entire object. It's just parity, or cleartext data that's part of the entire object. None of them are the same. They're not replicated; each is its own unique little piece of the object. So during a partition reassignment, when a ring rebalances, we just have to move those partitions onto the new primary holder for that specific fragment. There's no coordination that needs to happen with the other instances that are also holding fragments during a rebalance. You only need that whenever you're cross-checking for a rebuild. If a disk is gone, if you've lost a server, or even a controller went out and you have to rebuild all of the data on those disks, then you use the other pieces of the data in the cluster to reconstruct each of those missing objects. That's different from the replicated case, where any one of the replicas can stand in for another. So on one hand, erasure coding can move more quickly during a rebalance, because it just has to put that one fragment right where it goes, as opposed to checking with its peers. But on the other hand, it can be more expensive if you end up trying to rebuild, as opposed to just moving that data. So one of the big things that we're working on in the reconstructor, maybe that's actually coming up. Yeah, this is exactly it. This is touching on exactly that, which is erasure codes, where calculating erasure codes does use more RAM, it uses more CPU. And so yeah, continue right where you were. Yeah, so erasure coding has been great. It's an opportunity for operators to reduce their total storage costs. But they're asking for that at the same time that they're giving us less resources.
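The fragment idea can be made concrete with a single-parity XOR toy. This is not Swift's real EC (Swift uses liberasurecode with schemes like Reed-Solomon that support multiple parity fragments), but it shows why a missing fragment can be rebuilt from the surviving ones rather than copied from a replica, and why that rebuild costs more reads than moving a whole replica would.

```python
from functools import reduce

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_fragments(data, k):
    """Split data into k equal fragments plus one XOR parity fragment.
    Toy single-parity scheme: losing any ONE fragment is survivable."""
    frag_len = -(-len(data) // k)                 # ceiling division
    padded = data.ljust(frag_len * k, b"\0")      # pad to split evenly
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    return frags, reduce(xor, frags)

def rebuild(frags, parity, missing):
    """Reconstruct one missing fragment by XOR-ing everything we still
    have; note this reads every other fragment, which is the extra cost
    of rebuilding versus just copying a replica."""
    others = [f for i, f in enumerate(frags) if i != missing]
    return reduce(xor, others + [parity])

frags, parity = make_fragments(b"whole human genome sample", 4)
lost = frags[2]                           # pretend this disk died
print(rebuild(frags, parity, 2) == lost)  # True
```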
And I'm not fussing at anybody, but these chassis are getting denser and denser, and we have less CPU and less RAM to work with. So erasure coding has been one of the big drivers in understanding what we need to do to work better on these larger chassis, and the optimizations that we need to make. So there have been, tactically, a few things that have been developed to make that work. For example, having multiple reconstructors running per set of drives. That way, if you have a big chassis with 60 drives, you can say, hey, these six drives or these 10 drives, you get your own dedicated reconstructor to operate and recheck the data to make sure that it's... And again, prioritizing the rebalance work over the replicating, the reconstructing work. So next is multi-region. Multi-region clusters, of course, allow a deployment to have data in multiple locations. And the way that it had been implemented is even placement. So you have region one and region two, and you set a storage policy that has a certain number of replicas, and then those would be assigned in each of those regions based on the total capacity in the system. Well, when we started interacting with customers, we realized there were use cases where people wanted to go, oh no, I want one replica here, I want two replicas there. They wanted to be really specific about what they wanted per policy. And they could have multiple policies, so that meant they could say, on the same set of hardware, they can start stacking those policies with different durability properties for each one of them across these different regions. And we wanted to be able to accommodate that in a better way. So it wasn't just that they were dreaming of everything that they might want.
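The "dedicated reconstructors per set of drives" and "prioritize the rebalance" tuning described above maps onto daemon options along these lines. Treat this fragment as a hedged sketch: option names and defaults vary by Swift release, so check the deployment guide for your version before using it.

```ini
[object-reconstructor]
# Spread reconstructor work across several worker processes so one
# process is not responsible for all 60 drives in a dense chassis.
reconstructor_workers = 6
# During a capacity-add, prioritize reverting handoff partitions and
# skip the routine consistency pass; remember to turn this back off
# after the rebalance completes.
handoffs_only = True
```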
In a lot of the multi-region deployments that we're seeing today that are set up like this, they'll have multiple policies, where one policy is primary in region one and a separate policy is primary in region two. Generally speaking, the applications that are running nearest region two are accessing that region, and the ones running nearest region one are accessing that region. But in addition to that being the primary storage, they want to have some off-site availability, particularly if they have a hybrid case where they need to pull some data off the local region: if it's there waiting for them, that's beneficial. So even placement was great when there was only one policy, when all of the data was considered even. But as you overlay more complex policy configurations and offer multiple policies, it became more and more clear that deterministic placement on the rings was something that we needed to support. It also ended up being a requirement for... So dig a little bit more into what composite rings are and how they work. So composite rings: it's the same ring concept that we had, but it allows the cluster to group two rings together into one logical concept. So within one region, you could have a ring that's placing two replicas of the data on just the devices that are in region one, and then you'd have another ring that evenly places all of the partitions, a single replica, in region two. Those two partition assignments can be stitched together, so that when you present to the cluster a composite ring composed of the region one ring and the region two ring, it would be able to find the partition assignment in both locations. And then the logical extension of that, of course, is to overlay this with erasure codes. So erasure codes were another one of the drivers for composite rings.
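The stitching idea can be sketched in a few lines. This is a conceptual toy, not Swift's actual composite-ring builder API; the node names and the dict-of-lists representation are made up for illustration.

```python
def compose(ring_a, ring_b):
    """Stitch two independently built per-region partition assignments
    into one composite assignment: each region keeps its own
    deterministic placement, but the cluster sees one list of locations
    per partition."""
    assert set(ring_a) == set(ring_b), "rings must share a partition space"
    return {part: ring_a[part] + ring_b[part] for part in ring_a}

# region one holds two replicas per partition, region two holds one
region1 = {0: ["r1-node-a", "r1-node-b"], 1: ["r1-node-c", "r1-node-a"]}
region2 = {0: ["r2-node-x"], 1: ["r2-node-y"]}
composite = compose(region1, region2)
print(composite[0])  # ['r1-node-a', 'r1-node-b', 'r2-node-x']
```

Because each component ring was built only from its own region's devices, operators get the deterministic "two replicas here, one replica there" placement that even placement could not guarantee.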
We were talking earlier about how, in erasure codes, each piece of data isn't a replica of another one; it's a unique piece of data. So in order to do multi-region EC, you can't just willy-nilly or evenly place the fragments that make up the object into each region. You have to make sure that the specific fragments that are getting placed in each region can reconstitute the entire object. So composite rings allowed us to say that you're gonna have a set of data and parity in region one and a set of data and parity in region two. Those are separately and independently placed, but when dealing with the cluster logically, you can treat the two of them together. So let me ask a couple of questions about this setup, right? Okay, so if I'm in one region and I get a request for an object and I don't have enough parity, say something tragic happened to the hardware in the cluster and I'm not able to reconstruct, do I ask the other side to reconstruct, or do I just pull? Is this the consistency engine, or is this a client request? I'm getting client requests. So, first of all, each site is gonna be independently its own failure domain and can support a number of failures up to your parity count. So if it's just a minor thing, a few things offline, it will just ask within that local region. But if you have massive failures and access requests are still coming through that region, it'll absolutely, well, we don't like to not serve data: we're gonna reach across the WAN, pull it in, and make sure that we handle that client request. And then the consistency engine. So the consistency engine gets to be a little more optimistic. The direction that things are heading now is that it's gonna be cheaper in most cases to get a copy of the missing fragment data than it is to have everyone in this cluster rebuild it.
And it's just gonna be quicker, so it reduces the mean time to recovery. So that's the way that we're gonna be going. Cool, that's a big deal. So a couple more things. One of the things that we've been working on with Intel is being able to throw some fast storage at doing some specific metadata caching for the file system. What this does is allow requests to the underlying file system to respond more quickly. And I think one of the big benefits of this is being able to do things like reduce the replication cycle times by leveraging a small amount of SSD, taking up a PCIe slot which may not even be used in the system anyway, to get some great performance improvements in the replication cycle times. There are also improvements to be made in time-to-first-byte latency. So adding a little hardware to the problem, adding some software like the Intel cache acceleration on the file system, can give some benefit as well, particularly on these dense chassis. And for those of you at the conference, they have a kiosk in their booth talking about this, and a white paper is available on it. So it's something to check out. And lastly, just a plug to try out SwiftStack: test it out, do some deployments across multiple regions with erasure codes, and get your hands dirty with it. You can test drive it at swiftstack.com slash test drive. And thank you. We're happy to answer any questions that folks may have. Thanks. Hey, Mark. So I'm glad you guys mentioned the Intel cache thing, 'cause we've been doing that for a number of years, and it does allow us to do a lot more with less. We're using dm-cache, actually. But what's the recommended percentage of SSD per terabyte, or whatever? I would point to the two folks sitting in the back row to answer that question. I don't have that information off the top of my head, but I think they'd be happy to answer after the talk.
But generally speaking, you're gonna be targeting the file system directory cache size. So it depends on how dense: is it lots of small files or bigger files? You wanna have as much flash as you can to keep all of the journaling and file system information in memory. Yeah, so it's not caching the data, it's caching the file system metadata. What are you caching, data? Oh, you're caching data? Okay, cool. Per few hundred gigabytes of storage. All right, yeah, so John's confirming. So it's not a lot, right? It's a few hundred gigabytes. I think we have terabyte cards in some of the denser chassis, because it's only file system metadata. It's not caching object data with the Intel cache stuff. Yeah, please. So I suppose this is sort of a question for some feedback. From what I understand, the limit at present is five gigabytes per object. And I understand you can do trickery with multi... it supports, Swift behind the scenes will sort of join things together for you, in multi-part, whatever it's called. Are there thoughts around increasing that, or does it not need to happen, or are there reasons why it won't, or... Yeah, that's a really good question, right? Because we've just matched parity with the... There has to be some sort of standardization, because people are building clients, people are building... For example, if you're building a file system client, you can't just say, hey, I'm gonna have some arbitrary block size that I'm gonna write to this. People have kind of settled in on a size for these object APIs, because they're building clients that are working across lots of different installations. Yeah, sure, you can go configure the max object size to be whatever you want. If you can fit it on a single drive. But only if you can fit it on a single drive.
But in practice, what people are doing is building clients, building these tools to work across lots of different clusters and lots of different deployments, including Amazon, including Google. And all of them have... we've all settled in on this five gigabytes. Yeah, I think it's all about the clients. That tends to be what's driving that conversation: people are building the transparency of the multi-part uploads into the clients, because there's a lot of benefit you can get in resumable uploads and the ability to parallelize the uploads. And if you're doing that at the data source and then using the object APIs to create a manifest and pull them back together, and then serving that on a CDN or some other client... Yeah, and what we're seeing with customers is they start using the clients that are provided, and to them, they're just uploading a file into the system. Behind the scenes, we're parallelizing it across tons of servers, and they go, oh my God, this is so fast. Well, it's because we're doing a giant parallel upload for them into the cluster. And so users benefit from that, because it's already ingrained in the client ecosystem that's out there. So I actually have two questions. One is, is there any quantitative way to measure the impact brought by the data movement? For example, when I want to add new devices, or remove devices, that will definitely cause data movement, right? So is there any quantitative way to measure the impact? So how to measure the impact of data movement, is that the question? Yeah, absolutely. Fine-grained monitoring in the cluster. I mean, there's lots of vectors along which you can measure that. The way that it impacts the object servers that are filling, the way that it impacts the object servers that are draining. So the answer is absolutely yes. It has a quantifiable impact, and there are knobs and tools that you can use to control that.
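The segments-plus-manifest pattern discussed above can be sketched as the planning step a client would do before uploading: split the file into object-sized byte ranges, upload each as its own object, then write a manifest that stitches them back together. The function below is a hypothetical illustration, not a real client library.

```python
FIVE_GB = 5 * 1024 ** 3  # the de facto per-object cap across object APIs

def plan_segments(total_size, segment_size=FIVE_GB):
    """Plan the (offset, length) byte ranges a client would upload as
    separate segment objects before PUT-ting a manifest that joins them,
    the pattern behind Swift's large-object support. Each range can be
    uploaded in parallel and retried independently (resumable uploads)."""
    ranges = []
    offset = 0
    while offset < total_size:
        length = min(segment_size, total_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges

# a 12 GiB file becomes three segments: 5 GiB + 5 GiB + 2 GiB
segments = plan_segments(12 * 1024 ** 3)
print(len(segments))  # 3
```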
But I feel like we could talk a long time about it. What was the second question? The second question is, I saw you mentioned cloud native apps, right? So is there any particular example of how a cloud native app uses Swift? I think the biggest benefit that you get in a cloud native design from talking to storage behind an API is that you don't have to think about where that application is running to figure out where you're gonna store the data, right? Either you're caching it locally because you want it to be close to the application, and that's not your final store, or you throw it into object storage. That's the final resting place of all data, whether it's a backup or user-generated content, whatever it is. So for cloud native apps, I think just the idea of API-based storage is really important in most of their designs. I can't add anything to that. Yeah, last question. All right, so when you're doing multi-region deployments, how is the object metadata distributed, and what are the consistency claims on it? So when we say the term object metadata, we're normally thinking about the metadata about the object that's stored with the object, and that stays the same. So you'd have that metadata available in both. But the way your question was framed made me think about the container metadata, which is the listing of which objects are in a container and how many bytes are stored in that container, and that gets replicated. Most deployments are deploying the container and account metadata layers across the entire cluster, so it would have replicas in each. Eventually consistent on directory listings, like S3, or is it more strongly consistent? Just like S3, it's always eventually consistent for the listings inside of the containers. Thanks. But it's sort of actively, optimistically consistent. So with every object update, we try to update the listings. It's all on the back end; it's latencies and timeouts.
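That optimistic listing update, inline on the write path, with a retry queue when the container is slow or unreachable, might be sketched like this. The names are made up; in real Swift the analogous mechanism is the object server writing async pending files that the container updater drains later.

```python
import queue

container_listing = {}          # stand-in for the container databases
async_pendings = queue.Queue()  # stand-in for queued listing updates

def update_listing(container, obj, container_reachable=True):
    """Try to update the container listing inline with the object write;
    if that fails, queue the update to be retried in the background.
    The object write itself still succeeds either way."""
    if container_reachable:
        container_listing.setdefault(container, set()).add(obj)
        return True
    async_pendings.put((container, obj))
    return False

def run_updater():
    """Background daemon: drain the queue until listings converge."""
    while not async_pendings.empty():
        container, obj = async_pendings.get()
        container_listing.setdefault(container, set()).add(obj)

update_listing("photos", "a.jpg")                            # inline success
update_listing("photos", "b.jpg", container_reachable=False) # queued instead
run_updater()                                                # converges later
print(sorted(container_listing["photos"]))  # ['a.jpg', 'b.jpg']
```

This is why listings are eventually rather than strongly consistent: a successful object write never waits on the listing update.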
We can respond to the client, have something fail, and asynchronously update it later. Hi, great talk. So regarding your erasure coding, you mentioned that you need to check whether the data can be reconstructed or not, right? Yeah. Is that happening periodically? Yes. Okay. But if you have multi-region-based erasure coding, then how do you know what happens on the other side? So this gets into the design of the multi-region erasure coding. It actually draws a lot of parallels back to mirrored RAID. So in erasure coding, in order to keep the amount of parity calculation that we have to do during ingest to a reasonable minimum, we introduce a duplication factor, where each independent cluster can store a completely reconstructable entity on either side, in each region. So the primary consistency checks just have to make sure, internally within the region, that the object is fully reconstructable. But if it's not, they can then reach across the WAN to get the information that they need. Okay. But this seems like a lot of overhead from a management perspective, right? So does it mean replication would actually be better? The reconstruction operation is more expensive than replicating a piece of data, because it has to read more data and parity in order to rebuild an object. So with erasure coding, it's more important that we only rebuild when we have to. But when we do, you actually get to write less data, because you don't need to write down an entire replica, just the piece that you're rebuilding. Also, there are a lot of optimizations that we have made to the consistency checks: basically forming the consistency checks up in a ring, so that each individual node only has to check with its peers, watching out for gaps in the ring, so that everybody doesn't have to constantly talk to everybody. They can just watch the metadata as it floats around. Yes, maybe I should ask in another way.
So for example, if I ask you for a suggestion for a multi-region-based design or multi-region-based data storage, is replication better, or is erasure coding better? It's gonna depend on the use case, right? Like you pointed out, there is a trade-off to be made in terms of how you want to store it. If you're going above two regions, replicated is better. Okay. That's a good answer. All right, thanks for sticking with us tonight. Hope to see you at the event later tonight at the ballpark. All right, thanks everyone.