Well, welcome. Thank you for coming this evening and joining us as we take an intimate look at scaling Swift at Rackspace. My name is Chuck Thier. I'm a principal engineer at Rackspace and I've been there for quite a few years. I've worked on Cloud Files and Swift and Cloud Block Storage, among other projects there. But today we're going to take a deep look, open up the kimono a little bit, and get a better understanding of what it's like running Swift at Rackspace.

But before we do that, let's talk about the beginning. Almost four years ago, Rackspace had decided that the implementation they had for Cloud Files was not scalable for them. They had this moment where they had to decide what to do, and they decided the current architecture couldn't be taken any further, so they wanted to try again with a new idea. They formed a new team, and that's when I joined, to help them build Cloud Files 2.0. It was just five developers and four operations people. It took us about nine months in a little tiny room secluded away from everybody else; I think half the people didn't think we could even do it. We came out with ten thousand lines of code and released to the public. It was a quiet software release, no major announcements. We did it with no customer impact, migrated all the data, and it was very successful.

And I would like to highlight those ten thousand lines of code. People like to celebrate having millions of lines of code, and I think it's quite an achievement to do what Swift was capable of doing when we first released it with just ten thousand lines. A lot of that code had been rewritten over and over and over as we were trying to get the best implementation possible. And then there came OpenStack, and somehow we got to where we are today. We have all sorts of Swift deployments now. We have a startup, SwiftStack; I see some of you back there are working around Swift, and it's very exciting stuff. It's used at Disney, at Wikimedia, all over the place.

So the original goal when we designed Swift: we thought this was a very lofty goal. We set a very high bar that we weren't even thinking about being able to attain any time in the near future. We wanted to build a system, this was four years ago, that could hold and manage a hundred petabytes of data, that could take care of a hundred billion objects, that could withstand 100 gigabits per second of internet throughput, and that could handle 100,000 requests per second. And I think we've done a pretty good job of that.

So this is probably the slide everybody wants to see; we've always been pretty secretive about our clusters. We currently run Swift in six data centers. We're controlling more than 85 petabytes of raw disk. That's a lot of hard drives. We've also handled over half a trillion requests since the release of Swift for Cloud Files. By the time this talk is over, our Swift clusters will have handled more than 40 million requests, and that's depending upon how fast I talk, if I get too nervous and talk too fast. We also right now handle, in a single data center, at times almost 60 gigabits of sustained peak throughput to that single cluster. And for the back end services, all the storage nodes while that's going on, we'll have some cool network graphs to look at in a minute.
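To put those numbers in perspective, here's a quick back-of-envelope in Python. The talk length is an assumption, and the replica multiplier is just the rough reasoning, not a measurement:

```python
# Back-of-envelope check on the numbers above. Assumed values are
# marked; the rest come straight from the talk.
TALK_SECONDS = 40 * 60        # assumed ~40-minute slot
requests = 40_000_000         # "more than 40 million requests"
print(f"~{requests / TALK_SECONDS:,.0f} requests/second overall")
# -> ~16,667 requests/second across the clusters

# Every PUT is streamed to 3 replicas, so back-end write traffic runs
# roughly 3x the front-end write traffic, before replication and
# auditing add their own share on top.
frontend_gbps = 60            # sustained peak into a single cluster
print(f"~{frontend_gbps * 3} Gbps to storage nodes if it were all PUTs")
# -> 180 Gbps, in the ballpark of what we see on the back end
```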
There's over 200 gigabits of traffic across the back end services from all of that. So I think we can safely say we're at the point now where we feel we're achieving those goals.

We have a lot of use cases at Rackspace for this, both external and internal. Some of you might know that Rackspace had acquired JungleDisk a while back, and JungleDisk uses Cloud Files, like many other backup solutions you may use today without realizing your files are being backed up to Cloud Files. Tons of companies store backups, and media that they serve their websites from. Several different gaming companies stream out their assets and their content; so when you're downloading a game, you're getting it from Cloud Files or our content distribution network. A lot of our customers leverage the CDN integration capabilities, and all the requests I was talking about on the previous slide are just the requests handled by Swift itself; our CDN handles orders of magnitude more requests than that. We also store a lot of logs, internally and externally. People who build systems and need to archive their logs for long periods of time put them in Swift. We actually even store Swift's own logs inside of Swift itself; it makes a nice little place to put them, and we can pull them down at any time and search through them. If you're a Rackspace customer, you use Swift all the time: the cloud images you might spin a cloud server up from, the images you create of your server, the backups you create of your Cloud Block Storage, the attachments you look at on your tickets in our ticketing system. Those are all served from Cloud Files.

When you're installing Swift, you have to think of Swift as a complete system. While Swift is designed to run on tons of commodity hardware, and different types of hardware, it's not meant to just be thrown on a bunch of random hardware you have left over. Just because you have a couple of 300 gig hard drives and you throw those in a chassis doesn't mean you're going to have a good Swift install. It's funny: a lot of people who run into issues at first are just cobbling together a bunch of hardware and trying to make it work. You need to think about your overall system design. One of the things that has really made us successful deploying and scaling Swift is thinking that through: designing, looking at our hardware profile, going through iterations of looking at the usage, understanding how much the disks get used and how much CPU is needed, and further refining that. You also have to think about the network, and not only the network going to the servers, but the overlaying network and the topology that you create, to make sure you can handle the traffic going into the system. And of course, there's monitoring. We'll talk about those in more depth.

So when we first started, we had machines with 24 one-terabyte hard drives per box. They each had 1 gig network connectivity, and that worked pretty well, but the overall hardware density and power utilization didn't work out, so we've been tuning that and working through it. Hard drives have gotten bigger and better, so now we run 90 three-terabyte drives per unit, just in basic DAS or JBODs per machine, and each machine has a 10G network. That allows us to put more dense storage into a single node.
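Just to make that density jump concrete, here's the rough math. Usable capacity assumes Swift's three replicas and ignores real-world headroom, file system overhead, and so on; it's illustrative only:

```python
# Rough capacity math for the two hardware generations mentioned above.
REPLICAS = 3

def node_capacity(drives, drive_tb):
    raw_tb = drives * drive_tb
    return raw_tb, raw_tb / REPLICAS

for label, drives, drive_tb in [("old (24 x 1 TB, 1 GbE)", 24, 1),
                                ("new (90 x 3 TB, 10 GbE)", 90, 3)]:
    raw, usable = node_capacity(drives, drive_tb)
    print(f"{label}: {raw} TB raw, ~{usable:.0f} TB usable per node")
# old (24 x 1 TB, 1 GbE): 24 TB raw, ~8 TB usable per node
# new (90 x 3 TB, 10 GbE): 270 TB raw, ~90 TB usable per node
```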
We also use SSD drives for the account and container servers. If you were in the MercadoLibre talk, which is kind of a hard talk to follow, they're very enthusiastic, they talked about how much that improved their services as well. The SQLite databases used for the account and container services are very IO limited, so if you want to be able to put a lot of requests through to a single container, it helps to put those on something that has a lot of IO capability. We're finding that putting them on SSD drives really helps a lot.

We also use commodity SATA drives for our storage nodes. Since we own our data centers and have a good relationship with them, we're able to have good SLAs on changing out drives and things like that, so we're able to make use of cheaper drives, even though we have to swap them out more often. The more important thing is: don't take my word for it, don't take somebody else's word for it. If you're going to install a Swift cluster, install some samples and test and try it out. Even with a single vendor, where one type of hard drive has worked very well for you and they release a new two-terabyte version of that same drive, you'd think it should be the same; well, often that's not exactly the case. Hard drive to hard drive, there are differences. There are differences in CPU architecture. There's a lot of variability there, so you really need to test and think about how you're designing your systems.

For the network, you really need to think about the overall topology. When we started, we started at one gig per host, and that works pretty well. The problem is that it's very easy for a single stream to saturate that one gig, and when you add on replication and things like that, you can run into some issues. So we upgraded to 10 gig per host. It allows us to handle a fair number of streams going to a single host, plus all the replication that goes on in the back end. You also have to think about your network aggregation layer and how that's going to look: making sure that above, spreading out to your customers, you have enough throughput, but also looking at the network layer below, behind the proxies. Since we're streaming out three copies to the back end servers when a put is done, you're going to have a lot more bandwidth in the back end services than you have in front. So you need to think about that.

We also use HAProxy for load balancing now, which has been very useful and works very well. It does SSL termination, with some of the offload capabilities that are in the new Intel chips, very, very well. We can fully saturate a 10G network going through our load balancers with it right now, and we're about to test pushing 20G of traffic through a single node. We're very excited about that, and it's working really well for us.

So, monitoring. First, the general monitoring question has been relatively solved. Everybody has a favorite tool that everybody uses; we use one of the usual suspects. It's not a big deal. But if you really want to monitor the overall health of a Swift cluster, there are some key indicators that'll let you know when something's going wrong and you need to do a little further inspection. One of the things that's actually really useful to us is just watching the error log lines. We have a couple of monitors at work that are dashboards to let us know how things are going.
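To give a flavor of how a signal like that gets fed, here's a minimal sketch, assuming a StatsD/Graphite pipeline like the one I'll describe shortly. The log path and metric name are hypothetical, not our production setup:

```python
# A minimal sketch of feeding an errors-per-second graph: follow a log,
# count error lines, and emit a StatsD counter that Graphite can graph.
import socket
import subprocess

STATSD_ADDR = ("127.0.0.1", 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

tail = subprocess.Popen(
    ["tail", "-F", "/var/log/swift/all.log"],   # assumed log location
    stdout=subprocess.PIPE, text=True)

for line in tail.stdout:
    if "ERROR" in line:
        # StatsD counter format is "<name>:<value>|c"; the dashboard
        # side just graphs the per-second rate of this counter.
        sock.sendto(b"swift.error_lines:1|c", STATSD_ADDR)
```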
And for each data center, we have a little graph that just shows us our error lines per second. And yes, we have to measure our error lines per second, which is pretty crazy. But if it starts going up, then you know that probably something's wrong, and you need to dig deeper and see what's going on.

You also look at your replication times. One of the other graphs we watch quite a bit is how long it takes to do a complete replication pass, on average, per node. If that starts creeping up, there's likely something wrong as well. So that's a very nice metric to watch.

A couple of other things that are nice to look at: the dispersion report. If you haven't used the dispersion report before, it basically puts an object out onto every partition, and then, whenever you run the dispersion report, it goes and queries for that object on every node. It reports back whether it was able to find it on all three nodes, or whether it was missing from a node. So it returns a nice little report that shows you how many copies of each object it could or could not find, and that lets you know pretty easily if maybe you have a bad hard drive, or something's going on that you didn't realize before.

We also watch async pendings. If you don't know what an async pending is, it's one of the eventual-consistency mechanisms we have within Swift. If a container update can't happen fast enough, or a container server is down, Swift stores the update locally on the storage node and keeps retrying to make that container update down the road. Now, if those can't happen fast enough, they'll start stacking up, and that's a good indicator that you have issues on your container servers or something like that.

We have a couple of extra tools in our toolbox that we use. Some of them we've open sourced; they're still a little experimental, and some of them are a little specific to our use cases and how we use Swift. The first one is Swift Ring Master. What that basically does is allow us to very easily manage our rings, push those rings out to all of our servers, and make sure those all stay up to date and everything's good. It helps us when we need to add new storage, making sure that all the rings get updated and rebalanced and everything. So we're doing some interesting work there.

Swift Stalker: so in addition to monitoring, there are some home-grown monitoring type things you need to do. All those things we were talking about earlier in monitoring, we actually push up into our program called Stalker, and that helps notify us when issues are happening. It integrates with Mailgun to send out emails, and it integrates with PagerDuty to let us know when major issues are happening. It's been very useful for us.

Graphite is also a tool that has been invaluable to us, in combination with StatsD and a set of middleware called Swift Informant that pushes statistics out to StatsD. All of that is loaded into Graphite, so we get graphs of everything from CPU to disk usage to even requests in Swift; all sorts of great metrics. And then we have another tool that we haven't open sourced yet, called Swift Spy, and I'll show you a picture of it. It goes out, pulls some of this information, and gives us nice little graphs that show us overall usage of the system, so you can tell at a glance really quickly if something's going wrong.
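Before we look at the graphs, a quick aside on the dispersion report I mentioned: conceptually it works something like the sketch below. The real tool is swift-dispersion-report; `check_copy()` here is a hypothetical stand-in for a HEAD request asking one primary node for its copy:

```python
# Conceptual sketch of what the dispersion report does.
def dispersion_report(partitions, replicas, check_copy):
    missing = {}
    complete = 0
    for part in partitions:
        found = sum(1 for r in range(replicas) if check_copy(part, r))
        if found == replicas:
            complete += 1
        else:
            missing[part] = found
    pct = 100.0 * complete / len(partitions)
    print(f"{pct:.2f}% of partitions have all {replicas} copies")
    for part, found in sorted(missing.items()):
        print(f"partition {part}: only {found}/{replicas} copies found")

# Toy run: partition 7 is "missing" a copy, e.g. from a dead drive.
dispersion_report(range(10), 3, lambda part, r: not (part == 7 and r == 2))
```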
I know this is a little small, but this is just an example of Swift Spy. On the top, you see the bar going across with little mini bar graphs going up. That's total writes to a couple of the zones in the system, and you can see an even gradient as the write usage goes up and down throughout the day. We can look at that and see if something's off; if a zone suddenly is getting over-replicated or something's going on, you'll see a lot of extra activity. This lower graph right here is one of our storage servers, showing the writes to each drive. This is kind of interesting because you see some interesting patterns, like these big green blocks that go up and down and form a continuous band over a period of time. That's actually the auditors going through, and you can see that the auditors actually use up a lot of disk; that's one of the pieces we need to work on to make a little more efficient. The other box is just replication going on on one of the drives. You can actually see when things are going on, and it helps us debug issues: if a change goes in and all of a sudden our servers are going crazy and we're trying to figure out what's happening, that will often help us.

So these are some network graphs over a 12-hour period of time. The first graph is the incoming network, and this is just our internal network, from all the machines through our internal services, not our external. On this peak we're getting close to 60 gigs right here, and we often hit all the way up to 60 gigs. And then in the graph in the lower right, you see it going up over 200 gigabits of traffic; that's the overall network in the back end systems. We very often set new records at Rackspace on overall throughput going through the networks at Rackspace.

So, the road ahead: things that we're very interested in working on at Rackspace. What we've been working on continually is mostly making these things really scale at very large scale, making sure that replication is going to work better. One of the big things we're working on right now is a replacement for some of the bits of our replication path to make it more efficient and faster. Once we have that in there, we can also add even more innovation in that area to Swift, to make replication work better, to replicate only the little bits that we really need to replicate.

We also need to work on better handling of full disks. When we first created Swift, we kind of had this ambitious idea: oh, it'll never fill up. So we never worked on any of those edge cases. Well, it turns out that sometimes forces come that you don't have control over, and a cluster starts filling up, and some very weird things happen. So we've actually done some work already: we've added functionality in Swift that will prevent replication and things like that from filling up the hard drives all the way, so that if hard drives do start getting full, you can help manage that and manage those systems. But there's still some more work that can be done there, and we want to work on that. We also want to work on better error handling and limiting within the system, so that it can work around issues.
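To give a feel for that full-disk safeguard, the idea is a per-device free-space reserve, roughly like this sketch. Paths and numbers are illustrative; in Swift the real mechanism sits behind a config setting, and this just shows the check itself:

```python
# The gist of the full-disk safeguard: keep a free-space reserve per
# device and refuse new writes or replication into it below that line.
import os

RESERVE_BYTES = 50 * 1024**3   # e.g. keep 50 GB free on each disk

def has_room(device_path, incoming_bytes):
    st = os.statvfs(device_path)
    free_bytes = st.f_bavail * st.f_frsize
    return free_bytes - incoming_bytes >= RESERVE_BYTES

# A write that would eat into the reserve gets rejected up front, so
# replication can't pack a nearly-full drive all the way to 100%.
if not has_room("/srv/node/sdb1", 5 * 1024**2):
    print("rejecting write -- send it to a handoff node instead")
```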
On the error handling front, there's been a lot of work in that area, but there's still more that can be done to optimize it and make it better. When a server goes down or a hard drive stops working, Swift does a pretty good job of working around that now, but we can make that better.

We're also really interested in rethinking container sync and making it what it was really supposed to be. Container sync was an experiment for us to figure out a way to replicate from a container in one Swift data center to a container in another location somewhere. The initial implementation wasn't really great, but we want to pick that up again and try to make it work really well, so that we can offer it to our customers. So those are some of the areas we're looking into. We're also looking into, I forgot to put it up here but I just remembered, being able to take containers and split them up: instead of having everything in a single container, taking those DBs and sharding them even more, to get more throughput through a single container.

And that's actually all I had. I went through it a little faster than I thought I would, but I hope we have some time for questions, and I'm happy to answer any questions you might have. That's kind of an inside look at what we do at Rackspace and the scale we run Swift at, and hopefully it gives some of you more confidence that, yes, you can run Swift at a very high scale. We've been doing it for over three years now, and doing it fairly well, I would say. So, special thanks to our ops guys and everybody who helps out with it. So, questions?

Oh, sure, sure, absolutely. So you asked about the RAID level of the hardware. When we very first started working on Swift, we actually tried to get it to work with RAID 5, or some RAID level, so that we could reduce some of the ops burden before we had a bunch of error handling and things within the system. But then we found a lot of issues with write throughput to RAID 5, and since our use case is very write-heavy, especially when the objects are fairly small, you run into a lot of issues. So we actually run with no RAID at all, and for most Swift clusters the recommendation is not to run with RAID at all. Except, I will say there is one exception: the container servers where we run the SSDs. We run those in a RAID 10, just for extra durability; it makes that layer a little nicer and easier to maintain.

And then you asked about memory as well. The memory is pretty important. It's hard to explain: the more memory you can have, the better it's going to be, because you can cache more of the inodes within the file system, and the better everything will perform. We try to keep it at a sweet spot, and that sweet spot keeps moving around, so we're working on making that as good as possible. In our current nodes, I think, are we running 64 gigs? Or was it 32? The next gen? Yeah, I think it's 64 gigs right now in our current hardware, and that seems to work fairly well. Especially with some of the recent changes: initially, when we released Swift, we had recommended setting the inode size to 1024 bytes, and XFS has actually made a lot of improvements lately and does a lot better with inode allocation.
So you can now use the default of 256 bytes per inode, and that actually allows you to use a lot less memory on all of your inodes. We found the overall time for replication was actually halved by making that change on current kernels.

Yes, Chmouel. So, how many people? Well, did you miss my first slide? I talked about it. It was like two people. So, Chmouel wanted to get his due for being on the team for a little while while we were working on it. But no, we had about five people starting in a small room, so it was fun times. And remember the heat coming through the window and everything. We had another question? No, that's okay.

Right, so the question is asking about drive failure, and there are a couple of different ways you can handle drive failure. One way is that you let the drive fail, leave it in your data center, and remove it from the ring, and then the system will start removing the data from the drive. That's for instances where your data center is remote, you don't have people there, and you don't have an SLA to replace drives and things like that. Then you go in maybe once a week, pull out all the bad hard drives, put new ones in, put them back in the ring, and let it go. Since we have a fairly good SLA with our data centers internally to be able to replace drives soon, when a drive is down we just leave it down and Swift works around it. We have a system that auto-creates tickets to our DC ops when drives go bad, they'll swap them out for us, and then replication starts filling the data back into the server.

Oh, cool. Well, if it's just one disk, we just leave it at full weight and let it replicate back in. That is one thing you have to watch, especially as your Swift cluster gets bigger: if you have too many incoming replication or rsync connections, a disk can basically be stampeded, because the data is spread out throughout the whole system, and all of those systems are then trying to replicate that data into that single disk. So you do have to watch that; sometimes it can stampede that single disk.

This is one of the guys that actually makes it all happen. Hello. This is Danieli, everybody. Yeah, if you're doing that method, you definitely want to replace them as fast as possible. So when a total data node fails, it kind of depends. If the failure is, for example, just a backplane or a motherboard, what we'll do is actually have them replace it and fix it and then just put it back in, and replication will fix up whatever was missed. If that's going to take more than a certain amount of time, and, is it about a day or so, you guys? If it's going to be more than a day or so, then they actually remove that node from the ring and wait for it to get fixed, and when they bring it back up, they'll completely wipe it, put it back into the ring, and go from there. Yep.

Question right back here. Sure. So the question is asking whether we use a separate network for replication and for our normal traffic. We don't right now. That's why we have the 10 gig network going everywhere, and that really helps a lot with that problem. No, not right now; it hasn't been a problem for us. If it does become a problem for us, we could certainly shape it.
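Coming back to the inode-size change for a second, here's some illustrative arithmetic on why it matters. The RAM budget is made up; the two inode sizes are the real before and after:

```python
# Why the smaller inode size helps: replication walks every file, so a
# pass is much faster when the inodes it touches fit in the kernel's
# cache. Illustrative numbers only.
CACHE_BUDGET = 16 * 1024**3          # pretend 16 GB of RAM caches inodes

for inode_bytes in (1024, 256):      # old recommendation vs. XFS default
    cacheable = CACHE_BUDGET // inode_bytes
    print(f"{inode_bytes}-byte inodes: ~{cacheable / 1e6:.0f}M cacheable")
# 1024-byte inodes: ~17M cacheable
# 256-byte inodes:  ~67M cacheable -- 4x more files stay hot per pass
```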
And that was actually something else that I meant to mention about the networking, too: you also want to design the network in a manner so that all your storage nodes, and the network behind the scenes, are on a private network. You don't want that network publicly accessible.

Yes. Disk failure rates. I don't have a good number for you; we get somewhere around 10%, maybe a little bit higher than that, for our annual failure rate. Well, I mean, it's standard hard drives, and that pretty much matches most of the others. It's absolutely not what most manufacturers advertise, but there are other great documents, from Google and white papers that people have put out, that show similar failure rates to what we've seen as well.

Any other questions? Yes. So the file system we use is the default file system that Swift recommends, which is XFS. We have found that gives you the best overall balance of performance for Swift, and reliability and durability. One of the big things I think a lot of people don't realize when they look at benchmarks is that file systems really change a lot as you fill up your drives. And in a system like this, you're usually running most of your systems with the disk drives half or more full. Most benchmarks just start writing data to a fresh file system and say, look how awesome this is. If you run that benchmark over a period of time as the disk gets full, you'll see a drastic reduction, and XFS is one of the few that gives the least amount of reduction over time as the disk fills up. So if you're benchmarking systems and disks and things like that, it's very valuable to look at performance when the disk is actually mostly full and see how it performs then.

Yes? The inode size or the block size? Block size? So, all of our new systems going out are using the default inode size now of 256 bytes, and whenever we replace drives on newer systems, we're switching over to that as well. And block size, we're just using the standard block size of the drive; we're not messing with any of that.

Any other questions? Yes? Postgres? I'm sorry, I can't hear you all the way up here. Okay, just one person. So, what was it, sorry? Oh, Gluster. So, GlusterFS: we had looked at it a long, long, long time ago when it was first getting started. There were a bunch of systems out there, and it just really wasn't ready yet at the time. Now, I don't know, I can't speak to it now; I haven't looked at it. It certainly has a following, and a lot of people are using it. But GlusterFS tries to solve a little bit of a different problem: they're looking at a file system, and they'll put the object storage on top of that. If you want a file system with it, that could be useful. But we were really focused on building something that purely focuses on object storage and being good at doing that.

Any other questions? Well, thank you very much for your time. And I'll be around if any of you all have questions.