Okay. Good afternoon, everyone. In this session, we want to talk about monitoring and getting insights from a Swift cluster. This is obviously a hot topic: it came up in a talk from HP on Monday and in another talk from Symantec, and I'm assuming many of you were in Christian's talk just before lunch. We want to take you through our experiences, through some stories of what we've done in trying to monitor, diagnose, and get insights from Swift clusters where we're doing work for customers, with the hope that our approaches and methods can help others understand their own Swift clusters. I'm Michael, and I'll be presenting along with Dima. This is work we've done together with Yaron Weizberg and George Goldberg from our lab. Now, let me put this work into context. One of the reasons we're interested in Swift is that Swift enables a hybrid cloud. We're interested in being able to deploy in an on-prem model, a dedicated hosted model, and a public cloud model, and in all of these deployment models, it's important to be able to get a sense of how the cluster is working and, if there are issues, what those issues are. We also need a way to get a set of insights into how the cluster is behaving. So I'll keep talking while you don't see the slides. Monitoring is fairly easy in a simple system. Think about a car: a car is a fairly easy thing to monitor. It has a small set of dials: a speedometer, a tachometer, maybe a temperature gauge. It's fairly easy to get a sense of what's going on in the car. You want to know how fast you're going, you look at the speedometer, it shows 100, and you've got your information.
When you have four gauges, getting insights out of those four gauges is fairly easy. Now, Swift is really simple to use, but it's not a simple system; under the covers there's a lot of complexity. It has a lot of gauges. And when you want to get insights out of 100 or 200 gauges, that's a lot harder. How do you know what to correlate with what? How do you know which information is relevant to which other information? That's really where we're trying to focus. There's been a lot of work done here, and when the slides go online you'll see there's lots of information on how to get monitoring data out of Swift. You can get lots of great data. But once you have that data, what do you do with it? How do you get insights and make sense of this huge mass of data? So what we're looking at is how to take advantage of a set of open source tools (for those of you who were in Christian's talk earlier, you probably heard about some of these) and use them to gain insights. In particular: we have a Swift cluster, and we take advantage of StatsD to collect hundreds of data points from within Swift; collectd, which pulls that StatsD data as well as the underlying system data (memory, CPU, et cetera); and then we ship the data off both to Graphite and to an ELK stack: Elasticsearch, Logstash, Kibana. Once the data is there, we can look at it graphically, but we also take advantage of Apache Spark, a big data analytics platform, to do analytics on the data and try to get deeper insight into it. One more thing we've done is develop our own piece of custom middleware, which we call request stopper. It's a piece of middleware that's only useful for diagnostics.
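As a concrete example of the first link in that chain, Swift daemons emit StatsD metrics once you add a few settings to their config files. The fragment below shows the relevant options for a proxy server; the host, port, and prefix values are placeholders for your own setup:

```ini
# proxy-server.conf -- enable StatsD metric emission from the proxy.
# The same [DEFAULT] options work for the object, container, and
# account servers.
[DEFAULT]
log_statsd_host = localhost         ; where StatsD (or a statsd-speaking collector) listens
log_statsd_port = 8125
log_statsd_default_sample_rate = 1.0
log_statsd_metric_prefix = proxy01  ; distinguishes this node's metrics in Graphite
```

With a prefix per node, the per-node timers and counters stay separate in Graphite and can still be aggregated across the cluster.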
It's not something you would ever run in a production cluster, because it actually stops requests from coming through. But it enables you to diagnose and see where the overheads are in processing a request. So that's the context of where we're working. As I said, we're looking at Swift because it's something where we work with customers in various configurations. In particular, we had a customer we were working with who was dealing with relatively small objects: 15K objects. People often think Swift stores lots of very large objects, but for those of you who were in HP's presentation on Monday morning, their average size was relatively small, and Symantec on Monday talked about 16K objects. So objects at this scale are not uncommon. This customer had a use case with 15K objects, and what they were interested in was being able to create 1,000 objects per second. So what I did was go to Dima and the others on the team and basically say: okay, what size cluster do we need to be able to create 1,000 objects per second? And what we want to talk through is Dima's and the team's experience of trying to answer that question, and a few other questions, and what they had to go through to understand the system behavior well enough to answer it. Okay. As you will see, it's hard to understand the system behavior. So the first thing I did was ask Michael for some hardware to run the tests on, and I got three proxy machines and seven storage nodes, each with hard disks and SSD drives, 256 GB of RAM, and a 10 gigabit network. And I did my first test using COSBench. COSBench is Intel's cloud object storage benchmarking tool. And I got 520 operations per second.
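For reference, a COSBench run of this kind is driven by an XML workload file roughly along these lines. This is a sketch from memory, not the exact file we used; the attribute names and especially the `config` expressions should be checked against the COSBench documentation, and the storage credentials are elided:

```xml
<workload name="small-object-put" description="15KB PUT workload sketch">
  <storage type="swift" config="..." />
  <workflow>
    <workstage name="main">
      <!-- write-only stage: many workers putting 15KB objects -->
      <work name="puts" workers="64" runtime="3600">
        <operation type="write" ratio="100"
                   config="containers=c(1);objects=u(1,100000);sizes=c(15)KB" />
      </work>
    </workstage>
  </workflow>
</workload>
```

The `containers=c(1)` expression is what pins everything to a single container, which matters a great deal later in this story.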
I said, okay, it's a good starting point; now I just need to multiply it. But I couldn't scale my cluster up, because then I would need more hardware. So I decided to try to scale it down, to find the point at which my performance drops, and then I would understand what I need to increase to get the 1,000 operations that I need. So the first thing I did was reduce the number of proxies. I set one proxy instead of three, ran the benchmark, and got exactly the same number. Then I reduced the number of object nodes, and the performance didn't change much. I reduced the number of disks and got the same number. And then I said: stop, you are doing something wrong. Let's look inside; let's understand how Swift works. So Swift has a number of layers, and the first one is the proxy. It gets all the PUT requests from the user, and for each PUT request it creates three other requests to the object servers, to create the three replicas. In the general case it's R replicas, but I'll speak about our use case, where we have three replicas. Then each object storage node writes its data locally on disk and also updates the container server. That's done almost asynchronously: the object server waits half a second for the response from the container server, and if it doesn't get that response, it writes an async_pending file locally. Then, in the background, the object-updater takes this async_pending and updates the container with the relevant data. Okay, so after we get the timeout, or we get the response from the container server, the object server returns to the proxy. And then, if the proxy got a quorum (in our case, two successful responses from the object servers) it returns an acknowledgment to the user. So if I want to understand where my bottleneck is, I can try to replace the real servers with dummy servers that do nothing but catch the request and immediately respond with a success response.
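The PUT flow just described can be sketched in a few lines. This is our own illustrative model, not Swift's actual code: the function names are ours, and the real implementation is far more involved, but the quorum logic and the half-second container-update fallback are as described above.

```python
# Sketch of the Swift PUT flow described in the talk (not Swift's code).
# The proxy fans out to 3 object servers and acks on a quorum of 2;
# each object server gives the container server 0.5 s before falling
# back to an async_pending file for the object-updater to replay later.

CONTAINER_UPDATE_TIMEOUT = 0.5   # seconds, per the talk
REPLICAS = 3
QUORUM = 2

def object_server_put(write_to_disk, update_container, save_async_pending):
    write_to_disk()
    try:
        update_container(timeout=CONTAINER_UPDATE_TIMEOUT)
    except TimeoutError:
        # Container too slow: record the update locally; the background
        # object-updater daemon will replay it against the container.
        save_async_pending()
    return 201  # the object write itself succeeded either way

def proxy_put(object_servers):
    statuses = [srv() for srv in object_servers]
    successes = sum(1 for status in statuses if status == 201)
    return 201 if successes >= QUORUM else 503
```

This is why a slow container server does not directly fail PUTs; it shows up instead as a growing backlog of async_pendings, which becomes important in story two.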
So if I remove all the container servers and put in the dummy, I eliminate all the async_pending overhead. Then I can put it in place of the object servers, and I can see how costly it is to write the data to my disks and to update the container. Then I can put it at the proxy and measure the network round trip from the client to the proxy. Of course, this isn't useful for production, but it's very useful for analysis and for finding bottlenecks. Okay. We don't want to rewrite all the servers. So Gil Wernick, who is sitting here in the first row, helped us and implemented a very simple middleware. You can find it on his GitHub. It's 12 or so lines of code, and this middleware can be inserted into the pipeline at the proxy, object server, or container server, and then we get the behavior I talked about before. Once we had this middleware, we started making measurements. First of all, we measured vanilla Swift. It looks like it works now. Yeah. I will move it. Okay, great. Okay, the full screen doesn't work, but you can see now the numbers we got when we measured vanilla Swift. The Swift version I'm talking about is 1.13, because the story starts in September of the previous year, as Michael said. And I got 192 milliseconds. Then I put my middleware in the container server and got 43 milliseconds. And I moved it up to the object server and the proxy, and got all these numbers. I put them together, and this is the chart that I observed. We can see that the container server takes the majority of the response time. And now the question is: why? We looked into the code and found that the container server takes a lock on the directory, and while the directory is locked to update the SQLite database, other writes can't update the container.
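The idea behind that middleware is small enough to show in full. What follows is a hypothetical reconstruction of the concept, not Gil's actual code: a WSGI filter that swallows every request and immediately returns success, so everything downstream of its position in the pipeline is taken out of the measurement. Diagnostic use only; never run this in production.

```python
# Hypothetical reconstruction of the "request stopper" idea: a tiny
# WSGI middleware that never forwards the request and instead replies
# with success immediately, bypassing everything after it in the
# paste-deploy pipeline.

class RequestStopper(object):
    def __init__(self, app, conf=None):
        # The rest of the pipeline; deliberately never called.
        self.app = app

    def __call__(self, environ, start_response):
        start_response('201 Created', [('Content-Length', '0')])
        return [b'']

def filter_factory(global_conf, **local_conf):
    def stopper_filter(app):
        return RequestStopper(app, local_conf)
    return stopper_filter
```

Placed at the front of the container-server pipeline it makes container updates free; placed at the object server it makes disk writes free; placed at the proxy it leaves only the client-to-proxy network round trip.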
So we figured out how we could resolve this problem: how to reduce the overhead of this lock, and how to do it easily, without changing the Swift code, just for our measurement needs. We decided to use 100 containers instead of one container. Then my lock is distributed across many containers, and they aren't all locked together. And we observed that running the same workload, with the same number of objects, against 100 containers is almost five times better than against one container. And we see that the container part is also reduced to almost nothing. Okay, but time goes on, and there is a new Swift release, 2.2. So what do we get with it? This is the problem with the slides, because I have all of this animated and it's much cooler. Okay. As we see, native Swift 2.2 is already three times faster than the previous version, and we see that most of the improvement was in the container part. But we still see a factor-of-1.5 improvement when we use 100 containers. So we checked the code for the difference between Swift 2.2 and the previous version, and we found that there was indeed a patch to the container update path that makes the huge difference. So thank you to the community for this great work. But still, after this great improvement, we have a factor of 1.5 potential improvement that container sharding can produce, because spreading the objects across 100 containers instead of one improves performance. Okay, so now, with Swift 2.2 and container sharding, or by just asking the client to put the data into 100 containers instead of a single one, my response time is okay, and I can achieve Michael's requirement of 1,000 objects per second. But there was an additional requirement: the response time should be reasonable. Okay, what does that mean? 45 milliseconds on average. Is that reasonable? On average, yes. But what happens in the worst case? To answer this question, we can use Kibana.
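Spreading objects across 100 containers can be done entirely on the client side, with no Swift changes, by deriving the container name from a hash of the object name. This is our illustrative sketch (the function name, prefix, and helper are ours, not a Swift API):

```python
# Client-side container spreading: instead of putting every object into
# one container, pick one of N containers by hashing the object name.
# The per-container directory lock is then contended roughly N times
# less often. Names here are illustrative, not Swift APIs.

import hashlib

def choose_container(obj_name, base='objects', n=100):
    digest = int(hashlib.md5(obj_name.encode('utf-8')).hexdigest(), 16)
    return '%s_%03d' % (base, digest % n)
```

Because the mapping is deterministic, reads need no lookup table: hashing the same object name always yields the same container.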
We can get a histogram that represents the percentiles of the requests, and it looks like this. 60% of the requests are very fast. But on the other hand, we still have some requests (for example, 1% of requests) that take more than a second and a half each. So we tried to understand why that happens. We took Graphite and looked at the response time along the timeline, and we observed two behaviors. This is the same data, but the upper chart shows it as a moving average over one-minute intervals, so it's less noisy, and the lower chart shows it at one-second granularity. The first thing we can observe is that the upper graph has a slope, and it's a bad slope: it's going up. And it's not something that was running in the background; we reran these tests many times, and we see that it keeps going up continuously, all the time. We think that the XFS tmp directory bug that was opened three weeks ago is related to this behavior, and we hope that when the patch for this bug becomes available, we can rerun the test and see a flat line. But the other behavior we observed is a very noisy chart, with peaks every 30 seconds. So we zoomed in. It's the same chart again, but now I'm looking not at an hour interval but only at 10 minutes. And we see that most of the responses are very fast (the response time is awesome) but we still have peaks every 30 seconds. We started digging to understand what is going on, because probably I have one request that is stuck somewhere, takes a lot of time, and drags the whole response time up. At this point, we used Spark. We took the proxy logs and filtered all the responses that took more than half a second, and this is how it looks. You see that it's not one request: there are bunches of requests, and we still see the 30-second interval between these bunches. So where does it come from?
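The Spark job in this step boiled down to a simple filter-and-group over proxy access-log lines. Here is the same idea as a plain-Python sketch; the one assumption (which may not match your proxy logging config) is that the request duration in seconds is the last field of each log line.

```python
# Filter proxy access-log lines slower than 0.5 s and count them per
# time bucket, to spot the 30-second "bunches" described above.
# Assumption: the request duration in seconds is the last whitespace-
# separated field of each line; adjust for your log format.

from collections import Counter

SLOW_THRESHOLD = 0.5  # seconds

def slow_requests(lines, threshold=SLOW_THRESHOLD):
    for line in lines:
        try:
            duration = float(line.rsplit(None, 1)[-1])
        except ValueError:
            continue  # malformed or non-request line
        if duration > threshold:
            yield line

def bunches(lines, second_of):
    """Count slow requests per second; second_of extracts a time bucket."""
    return Counter(second_of(line) for line in slow_requests(lines))
```

In the actual run this was an RDD `filter` in Spark over much larger logs, but the predicate is the same.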
We found that there is an XFS syncd parameter that is also 30 seconds, and it is responsible for flushing XFS metadata to the disk. We started to play with this number: we set one minute, five minutes. And we observed that our peaks really moved; the interval between them changed, and the height of the peaks changed too. So one observation we can make from this whole story is that there is some XFS background activity that hurts our Swift performance, but maybe we can tune it. For example, if the most important thing for our use case is the average time, and the 100th percentile is not interesting because we are offering a 99th-percentile SLA or something like that, we can increase this parameter and get a better average response time and a better 99th percentile, at the price of degradation at the 100th percentile. Okay. And now we can move on to story number two. In this story, I want to show how we look at our Swift cluster using the tools we described previously; here, I am mainly using Graphite. It's another cluster: I have two proxies and four object nodes, this time with SSDs, and I run my container and account servers on other machines, also with SSDs. This is how the COSBench summary of the workload looks. You can see that I ran a one-hour workload that started at 5:40 in the morning. The performance I got was something like 30 megabytes per second, and 21 megabytes per second for the main workload. The first thing we do after we run the COSBench workload is check that the data we sent from the client matches what COSBench reports. This chart represents the total throughput that the client machine sent to the proxy, and we can see that indeed the average is around 21 megabytes per second. We also see that there are some spikes. I don't know why they happen; maybe there is some other, non-XFS, background process that makes these spikes. But on average, it looks okay.
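The knob involved here is, we believe, the `fs.xfs.xfssyncd_centisecs` sysctl, which controls the XFS background sync interval and is expressed in hundredths of a second (default 3000, i.e. 30 seconds). A tuning sketch, with file paths chosen for illustration:

```shell
# Inspect the current XFS background-sync interval (centiseconds; 3000 = 30 s)
sysctl fs.xfs.xfssyncd_centisecs

# Try a 60-second interval to see whether the latency peaks move with it
sudo sysctl -w fs.xfs.xfssyncd_centisecs=6000

# Persist across reboots (path and value are illustrative)
echo 'fs.xfs.xfssyncd_centisecs = 6000' | sudo tee /etc/sysctl.d/90-xfs-sync.conf
```

If the peaks track the interval you set, that is strong evidence the background flush, not the request path, is producing them.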
So if I send 21 megabytes per second, I expect the proxy to receive 21 megabytes per second, and this is what I see. Great. Now I want to check what I'm sending from the proxy to the object storage, and I see three times more traffic. Great, we can understand that: we have three replicas, so for each megabyte we receive, we send three megabytes. Everything is clear. Now I want to see what the storage nodes receive: what was sent is what they get. Great. And now the question is: what is the traffic to my disks? And what we see is that the traffic to the disks is 12 times higher than what we write to the proxy. For each megabyte that you write to your proxy, you write 12 megabytes to your disks. Of course, it's not 12 times more data at the end of the day, because some of the data is rewritten. For example, I write my object to the tmp directory and then move it to the final directory that represents its partition, and I need to update all the inodes and all that stuff locally. This is the overhead that we observe. The other good news is that it's not multiplicative but additive. It's 12 times for small objects, but if I take a bigger object, for example a one-megabyte object, it's only 3.5 times higher, which is almost what I expect, because I expect to see three times (I have three replicas) and there is no further big jump. Okay, great. So now we understand that our traffic to the disks is 12 times higher. It's not bad, it's not good, but we need to remember it, because when we plan our cluster, we need to take into account that our disks will work hard. This is the disk capacity utilization. The first part is new object creation: it's going up, that's okay. But then we expect to see it flat, because we don't create new objects, we only rewrite them. Yet we see that it still keeps growing, and only after some time does the data go down and then reach a flat state. All this behavior is caused by async pendings.
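The "additive, not multiplicative" observation can be captured in a simple model: bytes written to disk per object are roughly replicas × (object size + a fixed per-object overhead). The overhead figure below is back-solved from the talk's 12x number for 15 KB objects, purely for illustration; it is our model, not a measured Swift constant.

```python
# Model of the write amplification described above:
#   bytes_to_disk = REPLICAS * (object_size + per_object_overhead)
# Fixed overhead per object write (tmp-dir rename, inode updates, etc.)
# dominates for small objects and washes out for large ones.

REPLICAS = 3

def amplification(object_size, per_object_overhead):
    total_bytes = REPLICAS * (object_size + per_object_overhead)
    return total_bytes / object_size

# Back-solve the overhead from the measured 12x at 15 KB:
#   12 = 3 * (15 KB + overhead) / 15 KB  =>  overhead = 45 KB
OVERHEAD_15K = 15 * 1024 * (12 / REPLICAS - 1)
```

Plugging a 1 MB object into the same model gives roughly 3.1x, in the same ballpark as the 3.5x measured in the talk, which is what makes the additive interpretation plausible.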
The amount of async pendings that you can write is huge. If your container server is slow, you will write a lot of them to disk, and the time it takes to clean them up is long: it took something like seven hours to clean up the async pendings that I managed to write during a one-hour workload. Another thing we can observe using Graphite is the response time of our servers. The bottom lines represent the response times of three of our object servers, the yellow and brown lines represent the proxies, and the upper blue line represents one object server. Is that okay? No, because no object server should be higher than our proxy. If I see such a chart, I understand that there is some problem with one of my servers. So the first thing we did was take Swift out of the picture and run a microbenchmark using the vdbench tool. And we observed that, indeed, our servers have different performance. We checked what was going on and found that object node one has different hardware. We expected all the nodes to be similar; that was the request for the hardware. And with that older hardware, we get worse performance. Okay, so what can we do? We can remove this bad server. We reduce our cluster size by 25% (we are reducing CPU, RAM, everything by 25%) and we see a 10% improvement in throughput. So it's important that all the machines in the cluster are good; otherwise, your overall performance will suffer. We are also hiring, so you can talk with us or with the recruiters who are walking around in the soccer-like t-shirts. Are there any questions? And apologies for the lack of a reasonable presentation and the technical difficulties we've had. No questions means everything was clear. Okay, a question: are we talking about objects that are mostly 8K, or any particular size? I didn't see the distribution of object sizes you're putting into the cluster.
So in the first story, which was a real customer who came to us, it was 15K objects. Actually, what they said was up to 15K; we took it as 15K. So this was a very specific customer requirement, but we've seen these small object sizes in other places. If you look at the HP talk on Monday, their average or median size was even smaller than that, I don't remember exactly, and Symantec was talking about doing everything with 16K. These smaller object sizes actually turn out to be quite common. Because you had the 12-times-higher workload on the back end with the smaller objects, right? Yeah, and if you have bigger chunks, you're doing close to three times. Yeah, exactly. And probably it is something the community should think about; maybe there should be another path in Swift to serve smaller objects in some other way, because, for example, now we have erasure codes. Yep. What would you do with a 15K object? I mean, you're talking about a local cluster; think about how much worse it could be across multiple sites. If 12 times the writes locally is bad, then based on your data, if you try to do erasure coding across multiple sites, you're dead. Right. And what's going on in the community with erasure coding is really intended for cross-site. Okay, that explains it. And you said a microbenchmark with vdbench; I know vdbench reasonably well. Is a microbenchmark different? We call it a microbenchmark because I take one server and run a read, write, or other workload on it. It's not the Swift workload; I check only, for example, the single server's performance. That's the reason I call it a microbenchmark. Okay, got it. Thank you. Okay. Thank you, guys. A question about your earlier explanation of how the number of containers for object ingest operations factors into performance.
I just want to know, in production, how you control how many containers will be involved for certain object operations. So again, this was a particular customer scenario, and part of what we could go back to the customer with was how to structure their workloads so they could get maximum performance. I've actually encountered a couple of customers who are trying to upload hundreds of millions of objects, and then the question is how to organize those into Swift containers so that they get maximal performance. Now, in general, you don't really have control over that. And as Dima showed, there was a huge improvement in single-container performance with 2.2, so it becomes less significant. There still is room: we have a 1.5x gain, so 50% improvement potential, by spreading stuff out. So it really depends how much you're trying to optimize. And, as Dima said, there was a session this morning in the design summit on container sharding, which would put this into Swift and make it transparent, as opposed to making the application be aware of it. Thanks. Sure. Any other questions? Can you come to the microphone? Have you observed any changes, or do you have any updates, regarding background processes other than the sync, like scrubbing and peering and that kind of stuff? Have you looked into those aspects? So in the particular scenarios we were running, we didn't, but if you were in Christian's talk just before lunch, Christian talked about some of these things. I think it really depends on your workload, and what we tried to do is walk through our specific experience with specific scenarios. It wasn't an issue in our scenarios; that doesn't mean it isn't necessarily an issue in all cases. Thank you. Sure. Other questions? Okay. Thank you, everyone.