Good morning, everybody. Thanks for coming. My name is Eamon O'Toole; I work in the HP Swift team. This is my colleague Mark Seger, who works in HP's Cloud Services performance team. We're going to talk today about benchmarking Swift and about some measurements we've made on Swift. So, Mark, do you want to go ahead?

OK, thanks, Eamon. I'm hoping that everybody who went to the show last night had a good time, and that you're bright-eyed and bushy-tailed and ready to hear all about something as exciting as benchmarking. The presentation is divided into two pieces: I'm going to talk about benchmarking, and Eamon is going to present data from some case studies we worked on together.

One important thing to keep in mind about benchmarking is that there are at least two major kinds. In the first kind, someone has an existing system and just wants to know how fast it can go, or how well it performs under different loads. You run a set of tests, generate some data, and present some plots and graphs. Quite frankly, I don't get all that excited about those kinds of benchmarks. The kind I get more excited about is when you run something and it doesn't behave the way you expected and you want to know why, or it runs pretty well but you want to see whether you can make it run better. Doing that is a lot more involved, and that's the benchmarking I want to talk about. Eamon will be showing some results based on that kind of methodology.

So I have what I'd call my benchmarking bible. For anybody who has done a lot of benchmarking, this is stuff everybody already knows; for those who haven't, some of these bullets may not be as obvious as you'd think.

One of the big things in benchmarking is repeatability. You run a set of tests, you run off and tell the world, "look at the results I got," then somebody else runs a set of tests and gets totally different results, and you realize you had better re-run yours. But it was a quick script you threw together, you don't have the script anymore, and now you really can't run it again. The more you script, the more you can repeat, and benchmarking is all about repeatability.

The other important thing, which again some people miss: they say, OK, I want to know how fast Swift goes, so I'm just going to run a whole bunch of Swift tests and see what happens. This applies to anything, by the way. You run a bunch of tests and generate a whole bunch of numbers. But wait a second: if you get poor numbers, is that because Swift was misbehaving, or because your disk was misbehaving, or your network? You've got this great big stack going all the way down to your hard drives, your disk controllers, your network stack, your object servers, your proxy servers, your network switches. There are a lot of pieces and parts in here, and it's really important to make sure the lowest levels of the stack are behaving properly before you move up the food chain. Obvious to a lot of us; sometimes people forget.
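To make that bottom-of-the-stack point concrete, here is a minimal sketch of the kind of raw-disk sanity check you might run before blaming Swift. The mount point and sizes are assumptions; point it at whatever filesystem your object servers actually use.

```python
# Minimal disk sanity check: time a large sequential write plus fsync so
# you know what the drive alone can do before testing Swift on top of it.
# The target path and sizes are assumptions; adjust for your own cluster.
import os, time

PATH = "/srv/node/d1/benchmark.tmp"   # hypothetical object-server mount
CHUNK = 1024 * 1024                   # write in 1 MiB chunks
TOTAL = 1024 * 1024 * 1024            # 1 GiB total

buf = b"\0" * CHUNK
start = time.time()
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
for _ in range(TOTAL // CHUNK):
    os.write(fd, buf)
os.fsync(fd)                          # force the data past the page cache
os.close(fd)
elapsed = time.time() - start
print("wrote %d MB in %.1fs = %.1f MB/s"
      % (TOTAL // 2**20, elapsed, TOTAL / 2**20 / elapsed))
os.unlink(PATH)
```

If that number is far below what the drive is rated for, that's worth fixing before you run a single Swift test.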
Also, there are caches everywhere: in your disk controllers, in Linux, in a lot of places, and cache effects can skew your results. If you run something like a 10-second test, you might wind up measuring how long it takes to get data into the disk controller but not onto the drive, or into the Linux cache but not out over the network. You can't entirely eliminate that, but a longer test run tends to reduce the impact of those cache effects.

This next one is a particular hot button of mine: I claim the middle of the test is just as important as the whole test. Somebody says, "I just did a test on Swift and I got 8,000 IOPS." Well, that's cool. Could it have been more? Did you get a steady 8,000 IOPS for the entire duration of the test, or did you have spikes at 20,000 and valleys at 20 and stuff in the middle? I believe it's really important to see what's going on throughout the duration of the test.

One of my favorites is the one about changing more than one thing at a time. A lot of times people get really anxious because something isn't behaving correctly, and they say: we'd better put in more memory, we'd better upgrade the operating system, we'll change some of these system tunables; we'll make a whole bunch of changes. Then if things work better, it's "huh, I wonder why," and you really don't know. It's really important to understand the impact of individual changes.

And this is the one that management hates to hear: it's going to take as long as it's going to take. You might think it will take a week or a couple of days, but sometimes you discover problems, or you find yourself going down paths you weren't planning on going down, and you just keep slogging along.

Another really key thing: there's really no such thing as a coincidence. A lot of times you run a test, there's some anomalous behavior, two or three things correlate, and you want to write it off as a coincidence. More often than not, it's not a coincidence.

I would also claim that size matters, at least in the case of Swift testing. We've basically got large objects and small objects, and on one hand an object is an object is an object. But think about it for a minute: if you're doing large objects, you tend to have a very low number of IOPS and a high throughput rate, and if you're trying to do a lot of large-object puts or gets, you're tying up a lot of network bandwidth that you may not have available. These are the kinds of things you have to take into consideration when you're doing your testing.

And, as I mentioned earlier, sometimes you get surprised and go off in different directions. One of the big surprises in my Swift testing was discovering that the Swift client is actually CPU bound for large objects. That was something that had never even occurred to me. You'd run a test on a single-core VM and watch the CPU load on the VM go up to 100% while you were trying to do a large-object put. That took me down a whole different path, and once you start profiling, it's "oh my God, it's spending all its time doing SSL encryption."
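You can see this for yourself with nothing fancier than the standard-library profiler. A hedged sketch, assuming credentials in the usual ST_AUTH, ST_USER, and ST_KEY environment variables and a container named bench (both names are our assumptions): on a CPU-bound client, the SSL routines dominate the output.

```python
# Profile a large-object put to see where the client burns its CPU.
# Assumes ST_AUTH/ST_USER/ST_KEY are set; "bench" is an arbitrary name.
import cProfile, os, pstats
from swiftclient.client import Connection

conn = Connection(authurl=os.environ["ST_AUTH"],
                  user=os.environ["ST_USER"],
                  key=os.environ["ST_KEY"])
conn.put_container("bench")
payload = b"x" * (100 * 1024 * 1024)     # one 100 MB object

prof = cProfile.Profile()
prof.enable()
conn.put_object("bench", "bigobj", contents=payload)
prof.disable()
pstats.Stats(prof).sort_stats("cumulative").print_stats(15)
```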
Put in a faster CPU and it still has to do the encryption, but now it has a little more time left over to do the I/O itself, so you will definitely see a performance improvement with faster clients versus slower clients. That was totally counter-intuitive to me.

Small objects are a little bit different, because now you have a lot of IOPS. If you're doing a lot of small-object puts, you're spending an awful lot of time updating containers and talking to container services; Eamon is going to get into this in a few minutes, and it's pretty interesting. Network bandwidth is probably less important, because if you want to do a couple hundred one-kilobyte puts a second, that's only a couple hundred kilobytes a second, which is not a big deal. And of course the CPU requirement is lower, because there's less data to encrypt.

OK, just a couple of words about collectl. This is a tool I wrote a number of years ago that I use for performance monitoring in clusters. Think of it like iostat or top or netstat: a tool that lets you look in real time at what's going on in the CPU, the disks, the network, and so on. The thing that's cool for our purposes is that it lets you look at process statistics, and furthermore at process I/O statistics. By looking at process I/O statistics while you run a test, you can take all those Swift processes out there (container services, account services, object services, and lots of each, not just one), add up the numbers from collectl, and see how much time and I/O goes into each one of those processes. Eamon is going to show some more details on that. There's also another tool we use from time to time called colplot, which plots the data collectl generates.

The main thing I wanted to talk about, though, is the benchmarking tool I wrote, which for lack of a better name I call getput. getput is specifically for Swift benchmarking, and it has lots of switches and lots of options, because, like I said, I'm interested in seeing what's going wrong, not what's going right, so I can help figure out how to make things go better. At one level it lets you say whether you want to do puts, gets, or deletes and what object sizes you want to write. You can say how many clients are going to run the tests in parallel, and you might want to run multiple threads on each client, so it lets you do that too. It also lets you control container sharing, which is an interesting one: you could have 20 clients with 100 threads each all writing into the same container, or 20 clients with 100 threads each writing into different containers, and that can have an impact on performance. There are also options for how you want to run the tests, so instead of running a test for just 1K objects, you can run it for 1K, 2K, 4K, and 8K, all in a single run. That's kind of a neat thing to do.
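To give a feel for what a tool like this is measuring underneath all those switches, here is a stripped-down sketch of the core loop: time a batch of puts, gets, and deletes against one container, then report the rate and the latency range. This is a toy illustration using python-swiftclient, not getput itself, and it again assumes ST_AUTH, ST_USER, and ST_KEY are set.

```python
# The core of any put/get/delete benchmark: time each operation, then
# report operations per second and the latency range. A toy, not getput.
import os, time
from swiftclient.client import Connection

conn = Connection(authurl=os.environ["ST_AUTH"],
                  user=os.environ["ST_USER"],
                  key=os.environ["ST_KEY"])
conn.put_container("bench")
payload = b"x" * 1024                    # 1 KB objects
lats = {"put": [], "get": [], "del": []}

for i in range(100):
    name = "obj-%04d" % i
    t = time.time(); conn.put_object("bench", name, contents=payload)
    lats["put"].append(time.time() - t)
    t = time.time(); conn.get_object("bench", name)
    lats["get"].append(time.time() - t)
    t = time.time(); conn.delete_object("bench", name)
    lats["del"].append(time.time() - t)

for op, l in sorted(lats.items()):
    print("%s: %5.1f ops/s, latency %.3f-%.3f s"
          % (op, len(l) / sum(l), min(l), max(l)))
```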
Another thing that's interesting is that you can have pre- and post-test processing, so you can analyze the data as you're collecting it from the individual tests. There are a lot more switches, way beyond the scope of this presentation, but you're certainly welcome to get a copy, play with it, and see what you see. There's documentation, a man page, help, all the usual stuff.

Getting started with the tool: the main thing to know about getput is that it uses the Python Swift client library that's part of Swift, and you need credentials to talk to Swift. That's the one thing that can sometimes be tricky: you need to set up your environment with your Swift credentials. What I always do is run the swift stat command. If swift stat works, getput is going to work; if swift stat doesn't work, getput probably isn't going to work.

At the very top I wanted to show a quick example of a getput run, and this one is actually pretty boring. It's simply writing a single 1K object, doing a put, a get, and a delete. Reading horizontally across the line, it tells you how long the test took, how many objects it did, how many IOPS, and what the megabytes per second were. The really interesting column is the one on the far right, which for this test is kind of boring: the latency range. A lot of times people run a test and say the latency was 0.08, and I say, well, gee: were all the puts 0.08? Were some of them 0.01, and some of them 0.8? This at least gives you an idea of the range of the latencies. There are even more switches that will print out a latency histogram, but it wouldn't fit on the screen, and I'm trying to keep this relatively short.

One of the really interesting things I found with this tool is at the bottom, where it's circled. I discovered it quite accidentally, and it took me off in a different direction during my testing, which happens all the time; you keep getting surprised. What I found was that a 2K put was actually two or three times faster than a 1K put. My very first thought was that I must have a bug in my code, but I did a lot of experimenting, ran it on a lot of different test clusters, and started playing with tcpdump and strace and all kinds of crazy stuff. It turned out to be a buffer-alignment issue. I don't know how many people are familiar with the Nagle algorithm, but it sits at the lower levels of TCP: depending on your packet sizes, the sender won't put a small packet on the wire until the previous one has been acknowledged, and the receiver delays its ACK because it thinks another packet may be coming. When Nagle kicked in at just the wrong point, which in this case was a 1K object, it added a delay, and that was what made the 1K objects run slow. The short story is that I worked with the Swift core team and they got the client fixed so it no longer does this. These results are from an older Swift client, which I'm using just to demonstrate; if you run with the current version 2 of the Swift client, you'll see it no longer behaves that way.
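For anyone curious what that kind of fix looks like, the standard cure for Nagle-induced stalls is to set TCP_NODELAY on the socket, which tells the kernel to send small segments immediately rather than waiting for the previous ACK. A sketch of the idea only; the host and port are hypothetical, and the actual python-swiftclient change is applied inside its HTTP connection code.

```python
# Disable the Nagle algorithm on a client socket: small writes go out
# immediately instead of waiting for the previous segment's ACK.
# Host and port are hypothetical placeholders.
import socket

sock = socket.create_connection(("swift-proxy.example.com", 8080))
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```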
Here I just wanted to quickly give you an example of the kind of data collectl reports, and I'm doing, I think, a 500-megabyte put; I can't really read it from here. No, it's a one-gigabyte put. getput is reporting that it did something like 77 or 78 megabytes a second, but take a look at the column; and yes, I didn't get my updated slides to Eamon, and I just discovered yesterday that I circled the wrong column. It's actually the one to the right. What you can see there is that you're not doing 78 megabytes every single second. Sometimes it goes up to 90, sometimes down to 40 or 50. The other important thing to notice is that before the test started, the traffic was close to zero. If you're running this on a cluster where other people are talking to Swift, or other people are doing network things on your client, it's going to mess up your test. So it's really important to see what your system is doing before the test starts and then watch the behavior during the run; if you see some really sharp valleys in the data during the test, then perhaps there are switch problems, or network problems, or who knows what kind of problems.

The next slide is more like really running a benchmark. The three columns on the left tell you how many clients you were running the test on, in this case a single client, and how many processes that client was running, here from 1 to 48, all with one-kilobyte objects. I believe these were two-minute tests, so now the number of operations is more than just one. More importantly, look at the latency range on the right-hand side: on most of the tests the range was relatively narrow, which means most of the operations fell in a relatively narrow band, and that's a good thing. Every once in a while you'll see the latency range go up to one second, and when it does, it's telling you that at least one of those operations took a lot longer than it should have. That's a red flag that says there may be a problem with your Swift environment. These were all run on development clusters, and I wish I could remember what all the settings were, but quite honestly I don't.
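Both of those views, the per-second rate over time and the latency range, are easy to compute from raw samples. A minimal sketch, assuming you recorded a list of (completion_time, latency) pairs like the toy loop earlier could produce:

```python
# Two views that keep averages from hiding problems: per-second rates
# show spikes and valleys, and the latency range flags stragglers.
# 'samples' is a list of (completion_time, latency) pairs.
from collections import Counter

def per_second_rate(samples):
    t0 = min(t for t, _ in samples)
    buckets = Counter(int(t - t0) for t, _ in samples)
    for sec in range(max(buckets) + 1):
        print("t+%3ds: %5d ops" % (sec, buckets[sec]))

def latency_summary(samples):
    lats = sorted(l for _, l in samples)
    median = lats[len(lats) // 2]
    print("latency min %.3f  median %.3f  max %.3f s"
          % (lats[0], median, lats[-1]))
```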
This last slide is an example of what happens when things go wrong. This is running up to a thousand threads on eight clients, so I think each client was running 128 threads, and that's really pushing the hell out of it. The important thing is the two circled columns, the first being the IOPS and the second the latency range. At the lower numbers you can see the IOPS kind of scaling: here's what they're doing at one thread, here's 16, here's 32, and it's going up really nicely. But once you get up toward the upper end, you can see we've really maxed out; on the last two rows the number barely changed at all, and the row right before them was pretty close as well. The thing that gets really scary is the latency range, because on those last two rows we're seeing latencies of something like 300 seconds, which is totally insane. Think about what really happens in a benchmark: you do X puts over Y seconds and you divide the two. Well, if one of the latencies is 300 seconds, you've totally blown your numbers. So you really want to look at the latencies in conjunction with the number of operations to figure out exactly what's going on.

And I think that's pretty much what I wanted to cover. I guess we have a couple of minutes if anybody has a quick question, or I can just hand the mic over to Eamon.

[Audience question: where can we get a copy of getput?] It's on the last slide.

[Audience question: how does getput compare with other Swift benchmarks?] To be honest, I haven't dug really deep into them. There are several really cool Swift benchmarking tools; I'm familiar with COSBench, and I'm familiar with ssbench. But part of my goal with the whole getput thing is to be able to do diagnostics. It sounds like something we could certainly talk about later.

One last question, because I want to make sure Eamon gets enough time. [Audience:] I'm not sure whether this is beyond the scope of what you're doing, but one thing I'd be concerned about is this: if one process is writing a Swift object and another process wants to read it, then with the way most Swift installations are set up, with the number of writes in a region, it's going to be eventually consistent. I'd want a way to measure a statistical bound on that consistency. Writing and reading objects as fast as possible is great, but what about the actual consistency of the objects? If I write this many objects per second and then try to read them, am I going to hit a point where I get inconsistencies because everything is hashing around and hasn't settled out yet? I'm wondering how we might address that.

Yeah, that definitely gets complicated. What I've done with my particular sets of tests is a bunch of puts, and then I read them all back. I haven't done any tests where one process is putting while another is reading, primarily because it's hard to synchronize: you can't do your read until the write has finished. It's beyond the scope of this talk, but it might be something to discuss at the developer summit, and I'm going to be talking some more about this stuff on Friday.
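For reference, here is a sketch of the kind of read-after-write probe the questioner is describing: overwrite one object with a version marker, read it straight back, and count stale answers. This is purely illustrative, an assumption about how one might measure it rather than anything getput does, and it again assumes the ST_* credentials are set.

```python
# Read-after-write consistency probe: overwrite an object with a version
# marker, read it back immediately, and count stale reads. Illustrative
# only; this is not a feature of getput.
import os
from swiftclient.client import Connection

conn = Connection(authurl=os.environ["ST_AUTH"],
                  user=os.environ["ST_USER"],
                  key=os.environ["ST_KEY"])
conn.put_container("consistency")
stale = 0
for version in range(1000):
    body = ("version-%d" % version).encode()
    conn.put_object("consistency", "probe", contents=body)
    _, got = conn.get_object("consistency", "probe")
    if got != body:
        stale += 1
print("stale reads: %d / 1000" % stale)
```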
But I think I should sit down and hand Eamon the mic.

OK, thanks, Mark. We're now going to talk about some results we collected on a couple of Swift configurations that we tested. The objective of these measurements was to get a better understanding of how Swift runs on particular hardware, and of how you can optimize the combination of Swift and hardware to get the best performance.

There are two different configurations. The first has 12-disk data servers with dedicated proxy servers; those data servers run all of the object, container, and account services, and we deploy them at a ratio of about five data servers to one proxy server. In the second configuration we have 60-disk data servers that run just the object services; the container and account services run on separate servers, and the ratio of object data servers to proxy servers is about one to one.

We didn't try to cover the entire spectrum of Swift performance behavior. We concentrated on small objects in the initial case, mainly because, looking at the distribution of object sizes on our production systems, most of them are quite small: 50% are less than 20 kilobytes in size. And we looked particularly at puts of those small objects, at the transaction rate as you try to put more and more small objects onto a cluster.

This is the first configuration. The proxy servers have 12 physical cores (24 virtual cores with hyper-threading), 96 GB of RAM, a 10 GbE network interface, and mirrored disks, primarily for availability of the operating system; the servers are one U high and half a U wide. The data servers themselves have twelve 2 TB disks spinning at 7,200 RPM, a single 1 GbE interface, and 24 GB of RAM, and as I said, these servers run all of the object, container, and account services.

One of the first things we noticed on our production systems (these measurements are actually from a development system) was that idle systems still seem to be quite busy CPU-wise. By idle I mean that we're not doing any external puts, gets, or deletes on the system, yet there still seems to be a lot going on. So on this particular configuration, where each server has about 100,000 containers and about 20 million objects, we selectively turned different Swift processes on and off. The far left column is with the object and container servers alone running; the account services aren't really important, their CPU activity is minimal, almost immeasurable. The green part is the container sync service, which was running across all these tests. Then we turned on the auditors; then we turned the auditors off and turned on the updaters; and finally we turned off the auditors and updaters and turned on the replicators. You'll see a fairly major jump in CPU usage with the replicators turned on, and that's container replication: the container replication process itself consumes CPU, but it also imposes a load on the container service itself. The CPU load here is measured in CPU cores, so on this 12-physical-core system we're seeing around four cores being burned just maintaining system health, which is very significant.
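collectl is what we actually used to break this down, but the same per-service accounting can be sketched with psutil: find every swift-* daemon, sample over a window, and sum the CPU per service name. The use of psutil is our assumption for illustration; collectl reads /proc directly.

```python
# Rough per-service CPU accounting of the kind collectl gave us: group
# every swift-* daemon by name and sum CPU cores over a sampling window.
# psutil is used here for illustration; collectl reads /proc directly.
import time
from collections import defaultdict
import psutil

procs = [p for p in psutil.process_iter(["name"])
         if (p.info["name"] or "").startswith("swift-")]
for p in procs:
    p.cpu_percent(None)                # prime the per-process counters
time.sleep(10)                         # sampling window

cores = defaultdict(float)
for p in procs:
    try:
        cores[p.info["name"]] += p.cpu_percent(None) / 100.0
    except psutil.NoSuchProcess:
        pass                           # daemon restarted mid-sample
for name, n in sorted(cores.items()):
    print("%-30s %5.2f cores" % (name, n))
```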
Now on to actual measurements with an imposed load. These are one-kilobyte object puts on the same system, and these are CPU measurements on the data servers; we didn't look in detail at the proxy servers for these runs, though we did for the second configuration. What you'll notice is that, again, most of the CPU load is the container servers and container replication. We start with a fairly high idle burn of about four cores, and it doesn't go much beyond that, so you've got a fairly narrow band of CPU usage. At the maximum obtainable put rate for this configuration, which is about 340 puts per second, we max out at about five CPU cores in use on the data servers. The object servers themselves don't use a huge amount of CPU, but it does grow linearly as you increase the put rate, although you can't tell that from this graph.

On the I/O side, looking at the writes, again graphed by put rate for the different processes, you see that the container and object services are doing most of the writing. However, the highest I/O load is actually on the read side, not the write side; that's the lower graph on the left-hand side. The process consuming most of the reads is the auditor, which is something anyone who runs Swift will be quite familiar with. The object server, though, starts to read quite a bit as you increase the put rate; it turns out it's actually reading about six times as much as it writes at this object size. And what's happening on these servers is that all of those reads are coming from disk, not from cache, which is not good: you're seeing a lot of disk activity just to satisfy the reads on the server.

Some observations summarizing that: there's a fairly high idle CPU burn, and the CPU burn doesn't actually increase much as you approach the maximum. On the data server side, the container services are the major CPU hog. The small amount of memory, and the fact that we're running the container and object services together, is definitely hurting our performance: the container and object data are conflicting in the cache and wiping each other out of it. We're also seeing that the reads on the object side, even for puts, seem to be the major performance limiter. It was a bit of a surprise to us to be limited by reads in this configuration, and not so much by writes. The other thing we quickly concluded from these measurements is that we had better keep the container and object services separate; they don't sit well together on a server, and it's mainly the container services, not the object services, consuming the CPU. It also seemed that increasing the amount of RAM available for buffer cache would have a positive effect on performance.
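That disk-versus-cache distinction is visible per process under /proc on Linux: rchar counts every byte a process read, while read_bytes counts only what actually had to come from the storage layer, so a large gap between the two means the reads are being served from cache. A small sketch:

```python
# /proc/<pid>/io makes the disk-versus-cache distinction visible:
# rchar counts all bytes read by the process, read_bytes only those
# fetched from the storage layer. A big gap means cache hits.
def read_io_stats(pid):
    stats = {}
    with open("/proc/%d/io" % pid) as f:
        for line in f:
            key, val = line.split(":")
            stats[key] = int(val)
    return stats

# Usage (hypothetical pid of an object-server process):
#   s = read_io_stats(1234)
#   print(s["rchar"], s["read_bytes"])
```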
This is the second configuration we tested. We have two types of server in this configuration. The proxy and container/account servers are the same type of hardware: those servers have four disks, reasonably slow 1 TB 7,200 RPM drives (we tested a variety of disk configurations, but for these tests it didn't make any difference; we need to go back and revisit that), a 10 GbE interface, either 96 or 192 GB of RAM (I'll come back to that in a second), and again the same half-U-wide, one-U-high form factor. The object servers themselves have 60 disks; there's a mistake on the slide, those should read 3 TB disks; 96 GB of RAM; and again 12 cores.

Now, we looked at a lot of different combinations of servers and of the services running on those servers, and I'm not going to talk about all the measurements we made. I'm going to concentrate on one particular combination, where we ran the container and account services on the same node as the proxy services. In that configuration we have one server running our proxy, container, and account services, and one server running our object services, at a ratio of one to one.

On the CPU side there are two graphs here, one for the object server and one for the proxy server, again looking at the number of cores in use as you increase the put rate. The first thing to notice is that the put rate we can achieve here is much higher than on the previous configuration: in this measurement, with 4-kilobyte objects, we're getting up to 2,000 puts per second. There's a pretty extreme bump in CPU usage at 2,000; it looks anomalous, but it is reproducible, we get it consistently as we run these tests, and we don't understand why we see that jump, particularly on the object server. The idle CPU burn is less, about two cores, compared with the previous configuration, because these systems are pristine: there's no data on them before we start the measurements. On the proxy side you see a reasonably even balance between the proxy and container CPU burns, and the combined CPU load of both services is actually less than the object-service load on the object server. And when you look at the reads on the object server, compared with the previous configuration where we saw all reads coming from disk, they're now all coming from cache: a complete flip from the first configuration.

Summarizing the observations: we see a much higher maximum put throughput for this configuration, up to 2,000 puts per second with 4-kilobyte objects and up to about 1,600 puts per second with 1-kilobyte objects. We do see that dramatic jump in CPU usage going from roughly 1,600 up to 2,000 puts per second on the object server side; we don't really understand what's going on there, we haven't looked at it in great detail. If you look at the number of cores in use, though, you really are benefiting from hyper-threading: you go well beyond the number of physical cores you have, which is interesting. On the proxy/account/container server side, the major CPU users are again the container and proxy services; the account service is almost immeasurable. And finally, compared with the first configuration, those troublesome reads we saw on the object server side are now coming from cache as opposed to from the disks themselves. That translates into a much higher transaction rate: about five times that of the first configuration per U of rack space, when you work it out. It also turns out that the proxy services and the container services can coexist quite happily, which was a nice surprise. What that seems to imply is that you don't need a separate server just for the account and container services; you can actually combine them with a proxy server and cut down on the number of servers in your configuration.

One potential issue with this configuration is object auditing. The object auditor running on a reasonably full object server, say 70% full, would take about 200 days or so to complete a pass, which is just too long, and you don't really want to bump up the default parameters of the auditor to make it consume more I/O, because it's already doing enough of that. One possible solution to that problem is to parallelize the auditor, so that instead of auditing one disk at a time, which is the way it works right now, you audit multiple disks in parallel. We made that change, and it's up for review.
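To see why the number comes out in that ballpark, and what auditing disks in parallel means: rough arithmetic (our assumptions, not figures from the talk) says 60 disks times 3 TB at 70% full is about 126 TB, and at an auditor rate limit in the 10 MB/s range, one sequential pass is on the order of 10^7 seconds, that is, a few hundred days. A per-disk worker divides that wall-clock time by the disk count. The sketch below shows the concept only; it is not the patch that's up for review.

```python
# Parallel object auditing, conceptually: one worker per disk instead of
# a single sequential walk. Rough arithmetic (our assumption): 60 disks
# x 3 TB x 70% full is ~126 TB; at ~10 MB/s that is ~1.3e7 seconds,
# roughly the few-hundred-day ballpark quoted. Per-disk workers divide
# the wall-clock time by the number of disks. Not the actual patch.
import os
from multiprocessing import Pool

def audit_disk(mount):
    # Stand-in for the real work: walk every object file on this disk,
    # re-hash it, and compare against the stored checksum.
    for root, _, files in os.walk(mount):
        for name in files:
            pass  # open(os.path.join(root, name)), hash, compare...
    return mount

if __name__ == "__main__":
    mounts = [os.path.join("/srv/node", d) for d in os.listdir("/srv/node")]
    with Pool(len(mounts)) as pool:      # one worker per disk
        for done in pool.imap_unordered(audit_disk, mounts):
            print("finished auditing", done)
```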
The measurements I've just presented were for one quite specific use case, and we want to go back and revisit them; here are just a couple of the ways. One is to look at the object auditor itself in detail, running on a reasonably full configuration 2, with the 60 disks on the data server, and see how it behaves, particularly with multiple parallel auditing streams running simultaneously. The second is to look at large containers: the containers we used in these tests are actually quite small, and anyone who's used Swift is probably quite familiar with the fact that large-container performance can be an issue, so we want to go back and visit that. We'd also like to try, instead of striped disks for the proxy and container services, SSDs for instance, and see how that affects performance. Mark mentioned that there are some links, and here they are, so you can get collectl and getput from these locations. And that's it. Are there any questions?

[Audience question about the disk layout on the proxy/container servers.] Oh, you mean in terms of cache and so on? One controller, basically.

[Audience:] Your results were normalized per object server, so that was 2,000 puts per second per object server. What happens as you go to, say, 10 or 20 object servers? No, we haven't measured that. We expect it to scale linearly, but that assumes you've got a workload that will itself scale linearly, that will parallelize correctly. Swift does tend, within the limits of how it scales, to scale pretty well, which is a nice feature of Swift.

[Audience question: why were the 4-kilobyte objects faster than the 1-kilobyte objects?] Ah, you weren't listening to my presentation! [Audience:] So is it the same issue? It wasn't the same machine, but it was the same older version of the Swift client; these tests were run a while back, and version 2 only recently came out. That's why 4K was faster than 1K. You looked at the data and you noticed; good for you.

[Audience question about testing at larger scale.] If we had the hardware, yes; we just didn't have the hardware to test that.

[Audience:] At what point did you run into single-container bottlenecks? After how many objects did you start seeing lower performance on a single container, for creates on a single container? You mean when the container became the bottleneck? Basically at the peak: if you go beyond that point it might saturate for a while, but it will tend to tail off eventually, so you start to come down in performance. The peak rates I showed were basically the points at which the container service was fully subscribed, in that sense.

[Audience:] Was that 17 million objects? Sorry, yes: in the first configuration each server had about 100,000 containers and 17 million objects, so that's not very many objects per container. In the second set of tests the system was pristine, completely clean, and we just created containers on the fly, so we didn't actually have very many. We need to go back and redo those tests with containers already existing in the system and then start running getput, because that definitely will affect performance. But we still expect the performance of the second configuration to be much better than that of the first, given that the bottlenecks we saw in the first configuration seem to have been fixed in the second; I think that's still going to be the case.

[Audience:] When you turn on replication, you spike the CPU by four cores, right? So is that local replication, or remote? Mark, you can answer that question. That's per server, and there are multiple data servers in that configuration, so it's not local; it's actually across the entire Swift cluster, measured on one particular server.
Thank you.