All right, I think we're ready to get started. Would you guys mind taking a seat? All right, thank you, everybody, for coming today. My name is Brian Klein. I'm the Development Lead for Object Storage at IBM SoftLayer. If you haven't heard yet, as of yesterday the SoftLayer brand name is now part of IBM Bluemix, as the infrastructure arm of the Bluemix brand. So if you're a SoftLayer customer, you'll probably have some information in your email about that already, if you haven't seen it. Today we're going to go through some history, some interesting numbers, and some lessons we've learned over the last four and a half years or so of operating Swift clusters around the world.

We'll start off with a little bit of history. In early 2012, our first three clusters went live out of the gate: Dallas, Amsterdam, and Singapore. They also had some home-grown CDN integration with our provider, and metadata search as well. It wasn't until about 2014 that we had a dedicated development team for Object Storage. Initially it was kind of a labs project, with a group I was in for a bit that built POCs of new products, took them out into production, and handed them off to actual product groups after that. Swift was one of those. From 2014 to 2016, we launched a whole bunch of additional clusters around the world. We tried to keep pace with every new SoftLayer data center that went live; we would try to have Object Storage available in that one. And for expansions to existing data centers, of course, we would always still expand if we needed to, but that was sometimes a bit more of a challenge. And starting last year, we began rolling out integrations with IBM Bluemix: both directly ordering Object Storage through the Bluemix catalog, and acting as a back end for some of the IBM Bluemix products that would either store backups in Swift or use us as a temporary store for some data.

Back in the beginning in 2012, when things were a lot simpler for us, we started each of these clusters with somewhere in the ballpark of 7 to 10 nodes. We had two node types originally: the proxy, and an all-inclusive data node that housed all three of the account, container, and object services. We had one load balancer per cluster. And, for better or worse, we started out running this on FreeBSD on top of ZFS. I don't recommend this. It's probably not even possible anymore, because Swift has so many optimizations around Linux-specific syscalls that you couldn't do it without a lot of heavy core modification. But historically, that's what happened, and some of our first contributions upstream were around making that a little less painful for us. We didn't have any centralized logs or log analysis tools at that time, sadly. There wasn't a whole lot of demand for those, though.

Fast forward about four and a half years, and we've gone up to a couple hundred nodes in a few of our clusters; some are still relatively small compared to that. We now have three node types. We've moved the account and container services out onto what we call a meta node. The container service is the biggest reason for that: it's a bit more CPU bound, and the account and container services need a little more speed on I/O, so SSDs are recommended for those.
And we found that moving the SSDs out of our existing data nodes was more effective for us than trying to house everything in the same node. We also now have a full cluster of load balancers in every Swift cluster, just to help with availability and with doing upgrades and maintenance. I'm happy to say we've been running Debian for quite a while now. We have centralized, searchable logs thanks to syslog-ng, Logstash, and Elasticsearch. And out of band from all these clusters, we have a single analytics cluster running, I think, Cloudera CDH5, where we run Spark and Hadoop jobs against all of our logs on an hourly basis, and sometimes on demand when we need to answer some questions or dig into other issues.

So, to talk about a few numbers about our scale, the reason everybody came: you can see here on the map our global footprint. These are all the countries, plus two that were too small to actually fit into the pixels here. This is where we are in the world. We have 22 Swift clusters today, and these are all distinct clusters; they're not connected with any sort of additional regions in the ring. This is across 24 data centers, because two of these clusters are multi-DC clusters. For some of SoftLayer's bigger data centers, there are usually multiple facilities in the same region, and where we've needed the capacity, we've taken advantage of that. One of the clusters we launched this year was a two-data-center cluster right out of the gate, so it had two Swift regions. And we took one of the existing clusters, actually one of the original three, and converted it to a multi-facility cluster this year, just as a natural evolution, trying to keep pace with the usage there. And like I said, 16 countries.

We currently have in the range of tens of billions of objects, with about 7 million containers, which is a good ratio. There are some clear outliers in there, but that's about what it is globally. And hundreds of thousands of Swift accounts; this is looking purely at Swift account databases, how many of those are out there globally. Capacity-wise, we have about 90 petabytes across thousands of nodes, and about 40,000-plus disks, give or take a few thousand. I can't actually share any of our percentages or exact numbers on usage, but this entire slide is a graph showing the increase in usage over time that we've had to deal with. And this increase started occurring well before we brought in the majority of our new capacity, so it's grown quite a bit, and getting all that extra capacity in place was a bit of a race against time. We get tens of thousands of requests per second globally, and about three quarters of those are GETs and HEADs, with PUTs and DELETEs following that. We still get a considerable amount of COPY and POST, but relative to the others, it's too small to actually see up here.

Now, for hardware, I have a few numbers for you. Everybody knows that SoftLayer loves Supermicro, so of course we use those here. For our data nodes, we use a 36-disk chassis with about 12 to 16 physical cores, depending on the age of the node itself. Our proxies have quite a bit of RAM considering what they do, 128 gigs, and the data nodes have 256; we'll get into the reason for that a little later. We use 10-gigabit NICs.
We have physically separate networks for API traffic coming in from the outside versus the actual storage and replication networks on the other end. As far as disks go, we use a mixture of three- and four-terabyte disks, depending on the age of the cluster and the node; some of them we're still bringing up to four. The controller card we use gives us the ability to use all those disks. You can see here we've got some of them carved out for the OS and redundancy, a few for SSD caching of the data, which helps us tremendously with read performance because of the extremely GET-heavy workloads that we see, and then 29 actual data disks that are set up in a JBOD-type configuration on the controller. As far as physically expanding capacity, we usually do so by about a half row to a row at a time; in SoftLayer data centers, that's roughly 12 racks at a time, give or take a few.

Software-wise, just a quick run-through here. Debian, like I said, for the OS. Of course we use Swift. Depending on our actual version in production and how recently we've upgraded, we might pull in some backports from upstream, partially because we have a longer upgrade process than we do a backport testing process, and so, until we develop more automation around those, we sometimes have to pull in specific fixes from upstream. Authentication-wise, we use both Swauth and Keystone with Bluemix accounts. You get Keystone out of the box in a reduced set of regions; we're still in the process of rolling that out globally in conjunction with the Bluemix teams. And with Swauth we have some internal patches and enhancements that have helped us a little with its performance and reliability. For metadata search, we use Elasticsearch. I think we started out with Solr, and about a year into it we found that, for a variety of reasons, we preferred the characteristics of Elasticsearch a bit more.

Monitoring and logging: just about everything you can think of is probably up there. Though they're all pretty self-explanatory, we do have a capacity dashboard that we developed internally that shows us a histogram of disk usage per cluster. It also shows us an actual map of disks across the clusters, so if we've got some problem disks, or ones that are missing, we can pinpoint exactly where those are and track down what the problem is. It also tells us, in a little hover item, what the weights and the usage on those are, so if we see some strange anomaly, we can just look and see very quickly what it is. It's also really helpful when we have capacity expansions that may take a while to fully bleed in: it color-codes the disks in green, yellow, and red shades based on what percentage of space is actually being used on them, so it's really helpful for us to see that visually. And then of course logging, which we use for proxy logs; those actually get uploaded into Swift, which is where our analytics cluster gets the logs to process, and it's also helpful for billing.

Automation-wise, we use Chef primarily, a few things in Jenkins, and for ad hoc type things we use Fabric a bit. If you don't know what Fabric is, just do yourself a favor and Google "Fabric Python" once you get out of here. It saves a tremendous amount of time.
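To give a taste before you Google it: a minimal sketch of an ad hoc task in the Fabric 2-style API. The hostnames and the command here are hypothetical, just to show the shape of it, not our actual tooling.

```python
# A quick taste of ad hoc Fabric usage (Fabric 2-style API). The
# hostnames and the command are hypothetical placeholders.
from fabric import SerialGroup

# Run a quick check across a few data nodes, one host at a time.
nodes = SerialGroup('data-node-01', 'data-node-02', 'data-node-03')
for conn in nodes:
    result = conn.run('uptime', hide=True)
    print('%s: %s' % (conn.host, result.stdout.strip()))
```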
Custom middlewares: we have kind of three groups here, or three code bases, I guess. For our CDN, there are two different middlewares. One handles the operations that customers can request be done on the CDN on behalf of Object Storage, a pretty typical set of operations for any CDN provider. The other is for CDN origin pull: if somebody changes settings like TTL or compression, there are sometimes specific things that have to be written with the object so that we know what to give back to the CDN provider in the headers, to control things like TTL and give them the correct expiration.

For metadata search, we've also got two. There's an indexer that, on write and write-like operations, including COPY (which is not listed there), indexes into an Elasticsearch index. Search query operations run through the other middleware, which handles parsing the actual query and translating it into an optimal Elasticsearch query.

And as far as management of our clusters goes, we've got another two. One's called Checkpoint, which handles the ability for resellers like Bluemix, who may have their own front end for billing and metering and things like that, to stay integrated with us while still applying their own business rules: disabling an account for a while if somebody hasn't paid their bill before they actually purge it, and a few other cases like abuse. It allows them to do that just by exposing a few roles, some of which are reseller level. And then we have internal management functions in another middleware, mostly around sysmeta reads and writes. As you know, the gatekeeper middleware keeps a lot of that out, but we wrote this as a way to manage that kind of thing in band through some different means, and also to expose some proxy-level recon to some of our monitors.
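Just to give a feel for the shape of these, here's a minimal sketch of a Swift-style WSGI middleware with the standard paste.deploy filter_factory entry point. The header name and the behavior are illustrative placeholders, not our actual middleware code.

```python
"""Minimal sketch of a Swift-style WSGI middleware, to show the shape
of the pieces described above. Header name and behavior are
illustrative placeholders only."""


class SimpleMiddleware(object):
    def __init__(self, app, conf):
        self.app = app
        self.tag = conf.get('tag', 'example')

    def __call__(self, env, start_response):
        # Inspect the request on the way in; a write-indexing
        # middleware would act here on PUT/POST/COPY before passing
        # the request along.
        method = env.get('REQUEST_METHOD', 'GET')

        def annotated_start_response(status, headers, exc_info=None):
            # ...and stamp the response on the way out.
            headers.append(('X-Example-Tag', self.tag))
            return start_response(status, headers, exc_info)

        return self.app(env, annotated_start_response)


def filter_factory(global_conf, **local_conf):
    """Entry point referenced from a [filter:...] section of
    proxy-server.conf (standard paste.deploy convention)."""
    conf = dict(global_conf, **local_conf)

    def factory(app):
        return SimpleMiddleware(app, conf)
    return factory
```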
So, lessons learned. This was the most difficult part of this presentation to distill into a few slides. Automation was a big one: if you're not doing it already, you need to automate everything, and I'm not even exaggerating. For anything that you launch, you should have that on day one; it should be part of the actual launch itself. You're going to need it, and you're going to need it even more when you get the call that says you need to expand into all these extra nodes and all these extra data centers. You want to have it in place already, time-tested and solid. And you especially want it to work across all environments: whether you're talking about a single-node Swift installation that might have your custom middleware built into it, or a thousand-node production cluster, they should be using the same automation. Automation is also code, so it needs tests, and metrics too, even if the metrics come out of whatever is orchestrating the automation. That's just as important.

Functional testing: at every step along the way in your pipeline to get a change into production, you should be doing functional testing at least. I know a lot of people do performance testing as well; that's also very helpful. We've had a few situations where we've upgraded and, because of some settings that we have, we've noticed a difference in performance. Sometimes good, sometimes bad. And orchestration is always key, especially when you're talking about Swift. One of the advantages of having zones in the ring is that those are not just your failure isolation boundaries, but also boundaries for doing relatively safe maintenance in the cluster without disturbing too many replicas of the partitions that sit in that zone.

On monitoring: you should be scale testing any of the monitoring and logging solutions that you put into place, because for the software that's essentially telling you whether or not your cluster is healthy and working, you need to know where that software is going to break too, and what's going to happen when it breaks. Establish those as known data points ahead of time, so that as you approach them, you can stay prepared and do what you need to do to retrofit.

Some very basic, obvious metrics to monitor: space and IOPS are two of the biggest; errors on your disks, your controllers, and your boxes from a variety of system-level sources; and your HTTP response codes and latency aggregates from Swift itself, as far as its own metrics go. If you don't do anything else out of the box, at least start with async pendings. There are so many things that metric can tell you about your cluster. It tells you when something's unhappy, for sure, but it also tells you when you might need to start thinking about expanding your container and account layers as well. We've seen that a couple of times, and we've used it to tweak the ratios of proxies to meta nodes to data nodes in some of our clusters. Replicator failures and partitions-per-second rates are a key way to track really slow disks, or ones that are having issues that aren't bubbling up elsewhere. If you see really, really slow partitions-per-second rates on a node repeatedly, there's probably something wrong, so you want to be looking for those thresholds. And the last completion timestamp of a replicator versus the last time you actually pushed a ring is a key thing to watch after a ring push, to get a good overall view of the rebalance progress; you can see how many nodes have actually finished their first cycle.

Any middleware you create should be emitting useful ops metrics and usage metrics as well. The ops metrics are obviously going to help you troubleshoot issues and find anomalies, and having those lined up with other events on Grafana-type dashboards is really important; it saves you a lot of time. The usage metrics are really good because you occasionally need to be able to answer business questions about how many people are using such-and-such feature, and a whole set of other questions that you'll get. Don't forget about your debug-level log messages: while you should have debug turned off in production, there are times when you'll need it turned on, and you don't want to shoot yourself in the foot by not having any useful information at that level available to you when there is a problem. And when you're setting up Logstash filters or alerts for various types of monitors, don't limit yourself to just checking for the log lines or the metrics that indicate a failure. Over time you'll see clear patterns in the types of failures that you get, and you should be setting up monitors along the way that track those. If you have repeated network issues, for instance, don't wait for Swift to complain about it. It will, but it may not always be clear that that's the problem. Set up a monitor that, for instance, checks connectivity to all the places it has to connect to, and try to be as proactive about fixing that as early as possible, so that Swift doesn't have to complain about it. Depending on the issue, it can put things in a weird state, and you don't want to be in that place.
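That replicator-versus-ring-push check can be quite simple. A minimal sketch, assuming the default recon cache that the object-replicator maintains ('object_replication_last' in /var/cache/swift/object.recon); the paths and the five-minute grace period are illustrative, not our exact tooling:

```python
#!/usr/bin/env python
"""Rough sketch: flag a node whose object replicator hasn't completed
a pass recently, or hasn't finished one since the last ring push.
Assumes the default recon cache location and key; thresholds are
illustrative."""
import json
import os
import time

RECON_CACHE = '/var/cache/swift/object.recon'
OBJECT_RING = '/etc/swift/object.ring.gz'
GRACE = 300  # the replicator should report in within ~5 minutes

def check_replication():
    with open(RECON_CACHE) as f:
        recon = json.load(f)
    last_pass = recon.get('object_replication_last', 0)
    ring_pushed = os.path.getmtime(OBJECT_RING)
    now = time.time()
    if now - last_pass > GRACE:
        print('WARN: no replicator completion in %ds' % (now - last_pass))
    if last_pass < ring_pushed:
        print('WARN: replicator has not finished a full pass since the '
              'last ring push (%ds ago)' % (now - ring_pushed))

if __name__ == '__main__':
    check_replication()
```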
Rebalancing: this was a very difficult one to distill down to a few points. Always keep tabs on your rebalance times, not just on whether the replicators are happy and doing their jobs; the times are really important. There are a few things that can happen. I think it was in Tokyo that Rackspace talked about a lot of operational issues they'd seen, specifically in the context of the Hummingbird work they've been doing, and they mentioned something about dark data. This is one of those situations: if you don't keep tabs on rebalance times and keep them in check within your reclaim age for tombstones, you could get into a similar situation. Also, coordinate rebalances around your node and cluster maintenance; don't just do them whenever. A rebalance should only be affecting one replica, but if there's maintenance touching the other two replicas across different zones, you need to finish that out before you go and move a third replica around.

Don't let your IOPS levels get too out of hand before you decide to expand your cluster. It's not just about the space usage or the network throughput; it's very much about IOPS. That's the most customer-visible metric you're probably going to have, other than availability. It affects their latency, it affects time to first byte, and a whole slew of other cluster health issues. And if you don't expand while it's still in a comfortable range, it's going to be even more difficult to do later without disrupting those customer-visible metrics anyway. In the same vein, know your limits around customer API-type IOPS versus replicator and auditor IOPS. There are a number of things you can do to keep either of those from taking up too much. If you're in a situation where there's a lot of contention and customers are starting to see a slight increase in latency, you can use, I think, ionice for certain things. But don't get too crazy; you still want those other daemons to do their job. Just be cognizant of those limits and where they start to affect customer traffic too much, especially during a rebalance.
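As a rough illustration of keeping a ring change in check before you push it, here's a sketch using the RingBuilder class that swift-ring-builder is built on. The single-digit threshold echoes the rule of thumb mentioned in the Q&A later, but all the numbers and the path are illustrative, and the rebalance() return shape varies a bit across Swift versions:

```python
"""Rough sketch: sanity-check a ring change before pushing it, using
swift's RingBuilder (the class behind swift-ring-builder). The 5%
threshold and builder path are illustrative, not a hard rule."""
from swift.common.ring import RingBuilder

MAX_MOVE_PCT = 5.0  # keep partition movement in the single digits

builder = RingBuilder.load('/etc/swift/object.builder')

# min_part_hours guards against moving a partition's replicas twice in
# quick succession; rebalance() honors it. Older Swift versions return
# a 2-tuple, newer ones a 3-tuple, so take the first element.
result = builder.rebalance()
parts_moved = result[0]
moved_pct = 100.0 * parts_moved / builder.parts

print('balance: %.2f, partitions moved: %d (%.2f%%)'
      % (builder.get_balance(), parts_moved, moved_pct))
if moved_pct > MAX_MOVE_PCT:
    print('WARN: more than %.0f%% of partitions would move; '
          'consider staging the weight change' % MAX_MOVE_PCT)
```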
On the general Swift side, use the default inode sizes. The reason we have 256 gigs of RAM in our data nodes is that we can't fit all of our inodes in the inode cache, which is a problem for the speed of replication and auditing, and things like directory scans. It's a consequence of how the clusters were originally set up before the default was 256; I think it recently changed from 512, for whatever reason, and we started out with 1K, so that's definitely a bit more than we need. If you're using Swauth, tell it to use an SSD storage policy when it sets up all of its containers in the auth account. If you don't, you'll be surprised later how much IOPS it can eat up in a very busy cluster, just for simple things like token validation; the initial authentication for someone can actually hit several objects in the course of that single request. For any custom middlewares that you write, or tools that you use, that need an API: namespace those, and be consistent, not just with other things you've written but also with the way Swift works in general and the general profile of the contracts it exposes. We have some very strange ones with our CDN and search middlewares, and that's something we're trying to go back and fix. Doing it right the first time around saves you a lot of heartache. When possible, ask the community about new middleware thoughts that you might have; somebody else has probably already done it, or if they haven't, they've had thoughts about it as well. It's also good to try to recognize an opportunity where you could affect some upstream development and get something out there that everybody can use, instead of just writing it internally and keeping it to yourself. And if it involves some of the guts of Swift, you want to make sure that what you're doing is both recommended and not going to hurt you later.

And last but not least, the upstream community is very important. You need to stay involved and give back whenever possible. Even if you can't do code review every week or every day, which can be a full-time job in itself, there are plenty of other ways to give back. This talk is one of the ways we want to give back and share some numbers and thoughts and lessons. If you're an operator, you should do this as well if you haven't already. One of the biggest things the community needs is more of these discussions about how things perform in different environments, under different types of hardware and scenarios. Bring those to the table, because that's only going to help the upstream community and help them deliver a better product to you. And I believe that's about it. So, are there any questions? Please use the mic over there on the left side of the room.

Brian, thank you as always. You're such a valued member of our community, and we appreciate you giving this talk very much. I have a lot of questions; I'm sure I'll be talking with you through the week, catching up on stuff. But one that I wanted to ask for everybody was: Jesus, man, 60 petabytes. Congratulations. 40,000 disks. I'm sorry, 90? I'll check my notes here. How many people besides yourself are responsible for keeping this thing up and running for your customers? I mean, I'm not asking you to thank all of SoftLayer, but how many people are really focused on Swift besides yourself?

In development, on this flavor of the product, we actually have more than one now. I would say six or seven on the development side, and probably another six or seven on the operations side, some more than others: some are completely dedicated to it, some are on call. And then we have plenty of product-level folks as well. There's a lot there too.

That's not a huge team, frankly, for what you guys are doing. Congratulations again.

Thank you.

So, thanks Brian, first. I have a question regarding adding new clusters, or new hardware to your clusters. You were saying that you're adding half a row or even a row at once, and that means up to 12 racks, if I understood correctly. How do you rebalance then? Do you add small amounts of capacity at a time, or do you do it all at once and just put everything in? Do you automate it?

We don't automate the rebalance yet. There are things we want to do with that; we've written code we're not 100% comfortable with yet, and that's kind of been on the back burner. I would say we do it very carefully. The biggest thing is that, for most of our ring changes, we try to keep the percentage of partitions changed, or that will move, in the single digits. That usually gives us anywhere between a one and five or six percent increase in actual weight within the cluster. But it is very iterative.
We have to make sure there's not too much disruption, and that the rebalance, especially on an IOPS-heavy cluster, doesn't have too much of a negative impact. So it's a bit of a balancing act, and there have been times when we've overdone it and had to come back the next time around, adjust our assumptions, and go from there. So no easy answer; it's just kind of an iterative thing.

Okay, thanks.

Please feel free to join us if you guys have some questions for Brian as well. You talked about monitoring; well, you talked about a lot of different monitoring things that I'm really excited to talk to you more about this week. You talked about, what did you call it? Partition... the timings on your partitions, the metrics there. The only place that I'm aware of emitting that is in some log lines, I think info-level log lines that come out of the replicator. Is your ELK stack what you're using to suck that up? And then how are you aggregating that? How are you visualizing it? Obviously, looking at log lines from thousands of nodes isn't ideal. What have you guys come up with?

Well, there are two places where that shows up for us. One is on that capacity dashboard that I mentioned. We have a tab in there that shows us a full list of all the data nodes in the cluster that are actually in the ring, and there's a column that shows us the latest message, or the latest ETA, for the replicator actually finishing on that node, which should be within the last five minutes. If we haven't gotten one in five minutes, there's probably a replicator issue. The other place it shows up: because we parse out all the different numbers from that log line and throw them into Elasticsearch, we have kind of a replication burn-down panel on our Kibana dashboards. Kibana tries to show you everything, but the main thing we're looking for on that is just which nodes are way out of whack, and those usually jump out right away.

Yeah, we need to get one of those dev or QA clusters, and you need to give us a whirlwind tour of the capacity dashboard and some of the other stuff you guys have cooked up. Maybe at the next summit. All right. Any other questions? All right. Oh, one more.

Any lessons learned in relation to geo-replication with Swift? Are you guys replicating it?

Not a ton yet. The first cluster that we rolled out with that was about six months ago, I think, almost seven. The best tidbits I can think of for that are: test the heck out of your network. Load the cluster up with a large amount of data before you turn it on, then cut the replicators loose and see what that does to your network. In a large cluster with that turned on, your WAN link is going to be the one key thing you're concerned about, along with all the other Swift stuff, of course. But that network component is really key, and if something goes wrong, all sorts of strange problems are going to manifest in ways you didn't even imagine. And then I would say, early on, establish what you want to expose to your customers from an availability perspective of having two data centers. You might want to have a full load balancer and proxy layer on both ends, or you may just want data nodes on both sides, with the proxy nodes and load balancer on one end of that.
It kind of depends on what you want to get out of that setup. So that'd be my recommendation.

Just one more. Can you talk about whether you've tested the Hummingbird code at all in your clusters? And also, are you using the EC storage policy at all?

We don't use any erasure coding today. As far as the Hummingbird object server, it's pretty high on my list, next to a lot of other things that are high on the list. It's something I think we could benefit a lot from. I've tested it in our UAT cluster, which is a small cluster that uses all the same hardware that we have in production. I did it there just over the course of a weekend, to see if I could notice any difference functionality-wise, and I couldn't, of course. But the latencies, when I stressed it, were also a good bit lower. I don't have the numbers on it. But I'm hoping that within the next month or so, we can work with our ops team to get that out on, first, maybe a node in a heavily loaded zone, and then maybe the full zone after that, and just see what happens; just try to get some good numbers from that that we can move forward with.

All right, thank you guys very much.