Welcome and thank you for joining us. I'm Brent Compton, joined here today by Kyle Bader and Karan Singh. We've spent about the past five months inside a lab provided to us by Quanta Cloud Technology, benchmarking a lot of RGW Ceph clusters. So we're here today to share with you what we found: our results, as well as some recommendations based on those results. We're going to begin and end with the conclusions, and in the middle we'll focus on the empirical data as well as some of the concepts behind it. That's how we're going to spend our 40 minutes together, and we'll try to leave maybe five minutes at the end for Q&A. OK, so again, we're going to begin and end with the conclusions. Here's where we started. Obviously, object storage; I think everybody got that. Here are the architectural dimensions that we considered, listed there: object size, object count, data protection. Actually, we didn't really vary that last dimension, because we mostly used erasure coding. We did compare erasure coding against 3x replication, but we focused mostly on erasure coding for this study. One of the architectural dimensions was placement of bucket indices: whether they're on HDDs, whether they're on SSDs, or whether you use an indexless configuration. So that was one of the parameters. Likewise caching, and we'll get into more detail there. Server density: for a standard-density server we considered a 12-bay server, and for a dense server we considered a 35-bay server. Likewise the client to RGW to OSD ratio. A question we hear all the time is, well, how many RGWs do I need? So that was one of the things we studied as well. And then likewise price/performance, because obviously the servers that offer the highest performance are not necessarily the best price/performance combination. So those are the architectural dimensions we considered in this study. Again, as noted, we're going to start with the conclusions, then in the middle focus on the supporting empirical data as well as the concepts, and at the end finish with the same conclusions. There are four conclusions. Conclusion number one: when optimizing for small object operations per second, we found that a 12-bay OSD host was optimal versus more dense servers, and likewise placing bucket indexes on SSDs, with a 10-gig fabric. And I noted 10-gig fabric; of course that means a 10-gig front end and a 10-gig back end, as opposed to, for instance, a 40-gig fabric. OK, so that's effectively number one of four. Number two of four: when optimizing for high object density. In this case we topped out at about 130 million objects in ratio to our 210-OSD cluster. The important thing here is the ratio between the two; of course you could have lots and lots of objects if you have thousands of OSDs, so it's important to consider the object count in conjunction with the number of OSDs. Here the 35-bay OSD host offered the optimal price/performance. And likewise, this is the first time you'll see the Intel Cache Acceleration Software featured, and just note, we cached metadata only. More on that to come, but in case you're wondering. And this was with our most dense server. For anything over 12 or 16 bays we typically opt for a 25-gig or 40-gig fabric; in this case we were using a 40-gig fabric. So that's number two, optimizing for high object density. Number three, well, we have two number twos.
That's OK, it's close enough. So the second number two: optimizing for large object throughput. From a pure performance standpoint, a 12-bay OSD host was optimal; from a price/performance standpoint, a 35-bay host was optimal. And note the bottom bullet item here: we got about a 40% performance boost by tuning one of the Ceph RGW tunables, the chunk size. 40% was material enough for us to call out. OK, so that's number three of four. And number four of four: optimizing the RGW to OSD ratio. Here's what we found. For this one we focused on writes, because of course that's the most challenging I/O pattern; reads are obviously much easier. So, the most challenging pattern, 100% writes, looking at large objects versus smaller objects. For larger objects, what you see here is that beyond one RGW host per 100 OSDs, adding more RGW hosts did not make a material improvement in performance, hence that ratio. Likewise for smaller objects: beyond one RGW host per 50 OSDs, adding more RGW hosts did not result in a material improvement. OK, so again, the format we've selected is to start off with the conclusions, then go into the body, the supporting empirical data as well as the concepts, and then finish up right where we left off. We're going to keep this a little bit interactive with Kyle and Karan. Kyle and Karan were not only, well, they were the architects as well as the benchmarkers on this project, so it began and ended with Kyle and Karan. OK, so first off, Karan, if you want to tell us a little bit about the benchmarking environment and methodology, just so everyone has the context of what the lab looked like, et cetera. All right, sure, Brent. So here's the lab description. We tested a couple of different configurations: standard-density servers and high-density servers. The difference between the two is that for standard density we had 12 spinners backed by a single Intel P3700 NVMe for journaling and a single 40-gig Ethernet link, and for high density we had 35 7.2k spinners backed by two P3700s for journaling. The other settings, like RGW hosts, number of clients, and monitor machines, were typically the same. So these are the two configurations we tested throughout the rest of the study that you're going to see. One thing I wanted to note, just so you're clear: it was the same set of servers. We changed the configuration to make them look like standard-density servers, just because we didn't have a whole bunch of standard-density servers available. So just so you understand that. Go ahead, Karan. So here's what the lab architecture looked like. It's a single rack with all the nodes installed in it, and yeah, it's pretty standard: front and rear views of the machines in the rack here, nothing fancy. And here is the benchmarking methodology we chose. We typically start with baselines: a single-node disk baseline where we install just the native Linux OS without installing Ceph on top of it, and run the normal FIO benchmarking tool to learn how the disks are doing, what performance I can get from the disks and what performance I can get from the NVMe underneath. So again, without any Ceph binaries installed there.
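As a rough illustration of that baselining step, here is the sort of FIO job you might run. The device paths and job parameters are our own placeholders, not the exact jobs from the study, and the write test is destructive, so only point it at scratch devices.

```shell
# Sequential-write baseline against a raw spinner (destructive: scratch devices only).
fio --name=hdd-seq-write --filename=/dev/sdc --rw=write --bs=4M \
    --ioengine=libaio --direct=1 --iodepth=16 \
    --runtime=300 --time_based --group_reporting

# Random-read IOPS baseline against the NVMe journal device.
fio --name=nvme-rand-read --filename=/dev/nvme0n1 --rw=randread --bs=4k \
    --ioengine=libaio --direct=1 --iodepth=32 --numjobs=4 \
    --runtime=300 --time_based --group_reporting
```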
Moving to the next step, we did full-fledged fabric benchmarking, making sure that all the network links and all the network paths were working optimally, so we wouldn't have problems we'd only discover later. In step three we introduced the Ceph layer. We installed Ceph on the nodes and then did native benchmarking using RADOS bench, which is an internal benchmarking utility provided with Ceph, to understand what's the top limit I can achieve on my cluster, just to set a baseline. Then we moved on to real-world-style workloads, obviously synthetic, using the COSBench benchmarking utility. So in step four we introduced Intel COSBench, which is again an open source benchmarking tool, probably the most popular one for object storage benchmarking, and we ran a number of tests with it. And yeah, that's about it. On the table here: we used RHCS, Red Hat Ceph Storage 2.0, which is based on Jewel, standard RHEL 7.2, and Intel Cache Acceleration Software 3.0.1, which was the latest version at the time. So that was our benchmarking methodology, and then payload selection. We tested a couple of different object sizes to try to mimic real-world workloads. We took a 64K object size representing small images and small files, and then we also tested a few bigger object sizes, 1 meg, 32 meg, and 64 meg, to mimic larger images, text files, backup files, videos, et cetera. So kind of a spread between small objects and large objects. All right, so Kyle, can you walk us through the concepts? One thing to mention on the previous slide first: all of the results shown here, we're about three weeks away from publishing the paper on this, so watch for it. It'll come out on redhat.com. It's about a 60-page document that has all of the empirical results for all of these tests. So Kyle, you're going to take us through the concepts for this first one, and then we'll look at the results. Right, so as we were starting to plan these tests, we had to think about the ways we could stress the different parts of the Ceph system and what sort of workloads would be appropriate for finding the edges of the system. For the first workload, we're discussing small object operations. With the Rados gateway, obviously you have your clients, in this case COSBench, talking to the Rados gateway. When someone writes an object into the Rados gateway, the gateway is first going to create what's called a head object. The S3 object is actually represented by multiple objects, in some cases, in the actual Ceph cluster, which has its own native object store. When you have a really small object, it is usually just the head object, which has a little bit of metadata about the object plus the actual data. But along with that, because protocols like Swift and S3 support buckets that have an index, there has to be metadata associated with which bucket that particular object belongs to, along with the list of other objects in that bucket. So when you're doing the write, not only do you have to write into the data pool, but you have to update another object in the underlying object store that holds this bucket metadata, the index. So if you're writing a bunch of data into the cluster and you have multiple buckets, one way you can increase object throughput is simply by having more buckets, because you have multiple indices.
And those indices are going to be located on more OSDs, so you can kind of scale out with that approach, adding more buckets and having more clients writing into those buckets. That works up to a certain point. But those objects that hold the index metadata come under contention, because underneath they're going to be serviced by an OSD sitting on spinning media, so that becomes a serialization point in the workload. If you want to increase the throughput that a single bucket is capable of achieving, the number one way to do that is to move that contention onto lower-latency media. If you're able to do more updates per second to a given OSD because it's faster media, then you're going to be able to do more updates into a bucket and see higher performance. This also reclaims I/Os that would normally be hitting the hard disks, which you can then use for writing the head objects, which actually contain the object data. So it's kind of doubly good: you increase the bucket's throughput, and you reclaim disk I/O. Is that right? All right, so these are the numbers that we found in our study, starting with small object read operations per second. The key metric we chose for small objects was read operations per second. And we saw linear scalability while adding more load to the cluster, adding more Rados gateways together with more clients. So if you add more clients and more gateways to the system, you see linear scalability, and typically you're limited by the number of gateways available in the setup that we had here. And incidentally, how many worker threads did each client have, do you remember? 128. Yeah, each client had 128 workers, with four physical client machines, so you can do the math. We also tested different configurations of bucket indexing. As Kyle mentioned, moving the bucket indexes to faster media really showed good performance. As you can see in the graph, if you look at the red bar here, as soon as I add more gateways and more load to the system, I'm doubling my ops per second. So yes, bucket index on flash media was optimal for small object workloads in this study. And we saw similar performance with the indexes on SSD as when we completely disabled the indexes. There's actually a way in RGW to make it so that you can't list the objects at all; it's just blind, right? You insert a key, and then if you read that key you get the object, but you can't see the objects that are in the bucket by doing a listing against it. That's useful for some applications, but a lot of applications that have been designed to consume S3 or Swift are going to expect listing. So with the indexes on flash you can get just as good performance without losing that functionality. So, moving on to write operations per second. These are the numbers we found: for write operations per second, we scaled sublinearly when increasing RGW hosts in the tests we ran, and mostly we were limited by disk saturation. We were applying the load from the clients through the gateways, and we saw the disk drives becoming saturated in terms of operations per second. The disks simply weren't capable of doing more than that; they were hitting their 100% utilization limit.
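To make the bucket-index-on-flash configuration a bit more concrete, here is a rough sketch of one way to carve out a flash-only CRUSH branch and point the RGW index pool at it. This is our illustration, not the exact commands from the study: the bucket and host names, OSD IDs, and rule ID are hypothetical, the pool name assumes a default Jewel zone, and `crush_ruleset` is the Jewel-era spelling (newer releases use `crush_rule`).

```shell
# Hypothetical names/IDs throughout; adapt to your own CRUSH map.
ceph osd crush add-bucket flash root                  # new root for flash-backed OSDs
ceph osd crush add-bucket node1-flash host            # a host bucket under that root
ceph osd crush move node1-flash root=flash
ceph osd crush create-or-move osd.72 1.0 root=flash host=node1-flash

# A simple rule that places replicas on hosts under the flash root.
ceph osd crush rule create-simple flash-rule flash host

# Point the RGW bucket index pool at that rule (look up the numeric ID with
# `ceph osd crush rule dump`); Jewel-era syntax shown.
ceph osd pool set default.rgw.buckets.index crush_ruleset <rule-id>

# Optionally pre-shard bucket indexes (ceph.conf on the RGW hosts):
#   rgw override bucket index max shards = 16
```

The Q&A at the end of the session describes essentially this approach: separate NVMe-backed OSDs in their own CRUSH branch, with the index pool rooted there.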
If we had added more OSD hosts to the same setup, we would have gone higher than this; the bottom line is that we were limited by disk saturation in this cluster. As you can see in the graph, moving from two clients and two gateways up to two clients and four gateways, and even increasing the number of gateways and clients further, I'm not able to draw more performance out of the system; it stays somewhere around 30, 31 operations per second. And these numbers are normalized per OSD, just to make them easier to reason about; these are operations per second per OSD, not per cluster. We typically do that when we're comparing standard servers versus dense servers. Of course you can say, oh, a dense server offers so much more throughput or so many more operations per server, but we care about the unit of cost, which is the drive, so that's one of the reasons we normalize per OSD. Go ahead, Karan. All right. We also saw latency improvements. We significantly reduced the 99th percentile, the long tail latencies, by introducing flash for bucket indexing, as Kyle mentioned before. I mean, the labels on this chart got a bit messed up, sorry about that. But if you look at this: on the standard-density servers, where we have 72 OSDs in the cluster, for the smaller 64K objects with the bucket index on NVMe, we reduced the read latency by somewhere around 149 or 150 percent, which is pretty big for us, and we reduced the write latencies by somewhere around 70 percent, just by moving the bucket indexes onto flash media. And this is P99, right? Because the slide says average; it's not average, it's P99. Yes, it's P99, the long tail latency. We saw similar behavior on the high-density servers, the 210-OSD cluster: moving the bucket indexes onto NVMe flash reduced the 99th percentile long tail latencies for that setup as well. OK, moving forward to high object count. Kyle can go through this one. This was a fun one, because we got to make millions of objects, and we ran the test for a long time, so it was pretty cool. So underneath, when you have a Ceph OSD with the current filestore backend (this is not talking about the future BlueStore backend), when you're using filestore there's the journal, and then periodically the journal is flushed, based on either time or fullness, to an XFS file system that's typically on a separate partition. Within this file system, and this is just pseudo, not the exact layout, but it's enough to demonstrate the point, you have the OSD mounted at /var/lib/ceph/osd/something, and within that directory there are a number of placement group directories for the different placement groups that particular OSD is participating in. Within each placement group directory there is what's called a subdirectory, and the number of subdirectories is dictated by the number of objects being stored by that OSD. When you go over a certain threshold of objects, which is based on your filestore split and merge configuration, it will split, and you'll have multiple subdirectories within the different placement group directories on that particular OSD. And as you continue writing and writing, you get maybe a third level, maybe a fourth, ad infinitum. So you can adjust the filestore merge and split parameters to adjust the number of objects allowed in your subdirectories.
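As a rough sketch of those filestore split and merge knobs, here is one way the tuning might look; the values below are commonly cited illustrations, not necessarily the exact values used in this study.

```shell
# Append to ceph.conf on the OSD hosts (as root), then restart the OSDs.
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
# A PG subdirectory splits at roughly:
#   filestore_split_multiple * abs(filestore_merge_threshold) * 16 files.
# The defaults (2 and 10) split at about 320 files per subdirectory; the
# values below push that out to about 5120 files before a split happens.
filestore merge threshold = 40
filestore split multiple = 8
EOF
```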
So you have more objects in each of these subdirectories under the tree of data, which kind of flattens the hierarchy. You don't have as much nesting, and that reduces the number of dentries you have. And when you go to write objects into the cluster or read objects from the cluster, you have to know where those objects are in the file system. If the inodes and dentries are in the kernel cache, then that lookup is coming from memory, and you don't have to seek to find where the object is going to be read from. That's a good thing, because RAM is much faster than disks. There's a kernel knob you can turn called vfs_cache_pressure, and you can adjust it so that it favors the inode and dentry caches over other things like the file system page cache. The other edge of this particular sword, though, is that if you have millions of objects, if you have a dense system with a bunch of six-terabyte disks in it and they have tons and tons, millions, of objects, and it's roughly 4 KB per inode or dentry (I can't remember the exact number), then millions and millions of objects on a particular system is going to equate to gigabytes of kernel memory, and you have to traverse that long list whenever you're doing a syncfs. So you can imagine that if you're doing lots of writes and you have lots of objects hitting the cluster, that's going to be somewhat problematic: either you have to read that metadata from disk, or it's going to take longer to write. And that's not really a great situation, so we wanted to find a solution. Yeah, so here we tested multiple configurations. We tested the default OSD filestore settings, which Kyle was describing before, so no tuning applied to the Ceph OSD filestore tunables. Then we tested a tuned OSD filestore, changing the split and merge threshold parameters in ceph.conf. The third test used the default Ceph filestore, again with no tuning at all on the Ceph side, and then introduced Intel Cache Acceleration Software for metadata caching only, so there was no data caching going on in that configuration. And in the fourth test we applied the filestore tuning parameters together with Intel Cache Acceleration Software, just to see how those four configurations look on a performance chart and which is the best of them. The test details: high-density servers, 35 spinners per node, six of those, so 210 OSDs, and an all-small-object workload, 64K. The indexes were kept default, which again means on the spinners, no tuning there. Each of these tests ran 50 hours, 200 hours of testing in total, filling the cluster up to 130 million objects. We wanted to see how the system performs when you load multiple millions of objects into the cluster. And we didn't just pick an arbitrary number of objects, like, we're going to do 130 million. We did some math: we took the number of OSDs we had in the system and figured out, based on object counts, how much memory the inodes and dentries were going to take.
And given that we knew a certain amount of memory was going to be used by the actual OSD processes for their own internal allocations, we were only going to have a certain amount of headroom for the kernel slab cache. The goal was that 90% of the inodes and dentries would have to be fetched from actual disk, so only 10% of the inodes and dentries for the file system would fit in memory on these machines, because we basically wanted to make the cache as cold as we could (there's a rough sketch of that sizing math and the cache-pressure knob just below). All right. So the result you're seeing here is based on configuration three, default Ceph OSD filestore settings together with Intel CAS, which we found to be the optimal configuration based on the numbers we got. Just to save time we don't have all the results here, but as Brent mentioned, the paper is coming pretty soon and it covers everything in detail. So, coming back to multiple millions of objects in the cluster: the blue bars here represent the number of objects stored in the Ceph cluster, and by the end of the 50th hour of testing we had filled the cluster up to the 130 million objects we had calculated, as Kyle mentioned. The red line here represents the read operations per second. You can see that in the first hour of testing it was around 17,000 operations per second, and then it gradually comes down as we fill up the cluster. So at the beginning of the test you're able to keep more of the inodes and dentries in memory, and then as your population of objects increases, your cache efficiency, in terms of cache hits on inodes and dentries, continues to go down. Right. And this is the slide that shows all four tests on a single chart. The blue line at the bottom is default Ceph without any filestore tunings, and we have quite a lot going on here: the red line is the tuned filestore without Intel CAS, and the violet one is the default Ceph filestore with Intel CAS. The interesting thing here is that with the default configuration, it starts out really high, and then, as the cache hit rate gets lower because it's splitting a lot more, performance degrades and then kind of flatlines. With the tuned filestore you see higher performance, but it's really just pushing that cliff off the end of the screen. If we were to continue running this test out to, I don't know, 400 million objects, we would probably see a similar dip at some point, because you're going to hit the barrier set by the new split and merge thresholds. And the solution isn't just to set a preposterously high split and merge threshold, because that can cause other problems; when you're in recovery, you have to list those directories to compare objects and such. So you don't want to just set it to 400 million or something. Yeah. And we saw somewhere around a 500 percent improvement in read operations per second comparing default Ceph without any tuning against default Ceph with Intel CAS, so it's a big difference in performance. OK, moving on to the same kind of test for write operations per second. Same graph with different numbers, obviously: the blue bars again represent the number of RADOS objects stored in the Ceph cluster, and the red line is the write operations per second.
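To make that back-of-the-envelope reasoning concrete, here is a rough sketch of the cache-pressure knob and the kind of arithmetic involved. The per-entry size is the approximate figure quoted in the talk, not a measured value, and the vfs_cache_pressure value is the one mentioned in the Q&A.

```shell
# Favor keeping inodes/dentries in the kernel slab cache over page cache.
# The default is 100; the team mentioned setting it to around 10.
sysctl -w vm.vfs_cache_pressure=10

# Back-of-envelope sizing (approximate, per the talk's "roughly 4 KB" figure):
#   130,000,000 objects * ~4 KB of inode+dentry metadata each
#   ~= 520 GB of slab cache to keep it all in memory,
# which, per the talk, left only about 10% of that metadata cacheable on the
# six OSD hosts once the OSD daemons took their share, forcing the
# cold-cache worst case the team wanted to test.
```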
Coming back to the write graph, this one should be clearer. This is the comparison graph. The blue line is default Ceph, and you can see a drop here. This is what Kyle was mentioning before: as the OSD receives more write operations and writes into the placement group subdirectories, it will split at some point. If you don't tune your filestore split and merge values, it splits sooner; if you tune them, it splits later, right? So it starts on the blue line, it starts splitting, it goes from having roughly one subdirectory per placement group to multiple, and then it's balancing objects between them, at around hour 9 or 10. It dips for a little bit as more and more of the placement groups start to split into multiple subdirectories and balance the objects around, and then it normalizes and flattens off. Right. And this is a latency comparison between the tests. On the right side we have the write latency comparison, the 99th percentile tail latencies over the first 10 hours. Again, using Intel CAS with the default Ceph configuration, we saw almost 500 percent lower, and much steadier, latencies, and somewhere around a 100 percent drop in latency using Intel CAS. Can I add something here? A couple of things. First is, gents, we need to do the next two sections in about four or five minutes to leave a little time for Q&A. The second is how we got to Intel CAS: it's because of the Yahoo work. Yahoo had done some work with Ceph and Intel Cache Acceleration Software. We'd read the report, we knew some of the people involved, and it looked quite favorable, so that's why we chose to use it. Do you have any questions about CAS? I mean, we just used it because it looked favorable from previous reports. Armon, you want to raise your hand out there? Armon is from Intel; she can answer any questions you have on that. So, gents, four minutes for the next two sections, concepts and then results. Let's see if we can hit both in four minutes. Quick concepts. So with throughput, you have your S3 object. Like I stated before, you're going to stripe that across multiple RADOS objects. When the object is over the four-meg striping boundary, which is the default, you have the head object and then all the subsequent tail objects. When those objects are being written into the cluster, in the case of erasure coding, you're going to split them up. You take those four megs and split them, in this case with a 4+2 erasure code, into four data chunks, which are going to be one meg each, and then you generate two additional parity chunks. Now, when those chunks are written to the OSDs by the RGW, they're not written in one fell swoop. There's a chunking parameter for the RGW, and by default it writes them in 512 KB parts. So you can see how you can take an object that's 16 megs and it all of a sudden turns into a whole lot of small I/Os: you're breaking it into four-megabyte stripes, each of those four-megabyte stripes is getting broken into six one-megabyte chunks, and then those chunks are further broken down into two 512 KB writes each. So you have this large number of operations from one PUT. When you're doing a write, sorry, that's where this is most apparent, because it's writing out in those 512 KB writes.
When you're doing a read, this is a little bit less painful, because when you read the first 512 KB chunk, the readahead kind of keeps things going, and when you do the subsequent reads, some of it is coming out of the page cache. So not too bad on reads, but on writes it can be particularly painful, because you're doing lots of small I/Os from one big I/O. You're losing some of the benefit you would otherwise get from writing a large object to the system. All right, so quickly going through the numbers here for large objects. The metric we were interested in was throughput, megabytes per second. The performance we saw scaled near-linearly: as you add more gateways, the performance goes up, but for reads it's limited by the number of gateway hosts available in the system. Write performance also scaled near-linearly, but there we saw the performance limited by OSD saturation. You can see severe OSD saturation in the graph here, with all the disks hitting 100% utilization; the disks just can't do more. So definitely, if we needed more performance from this cluster, we would need to add more OSD hosts to the system. And this is the graph for what Kyle was just describing with his picture, comparing the default RGW settings with the tuned RGW settings. We're actually reducing the I/O amplification going on here by a big, big number in terms of the write requests going down to the disks; after tuning, we can reduce that to 48. And after changing this tunable, rgw max chunk size, to 4 meg, we saw 40% higher throughput for writes (there's a rough sketch of that tuning below). Right. At 4 meg, because we know that's the striping boundary for the RGW object, we're going to write each of the erasure-coded chunks in one fell swoop; we don't have to worry about the EC chunks being further chunked down into smaller writes. And it clearly helps. OK, the final one: optimizing the RGW to OSD ratio. When people are trying to maximize throughput, they want to know: OK, I have a given cluster, how many gateways do I need to fully saturate it? So you start out: you have your cluster of OSDs and one RGW gateway. You have a client load you're pushing against it, and you add another client, up to the point where you hit saturation. When you hit saturation, that's when you scale out and add another gateway, and then you can start adding more clients. And at some point, despite adding more gateways, the underlying cluster, the OSDs, is going to be saturated, so you get diminishing returns. At that point, that's when you would scale the lower portion of it and add more OSDs to the cluster. Yeah, so in this slide you're seeing a comparison between a dedicated 10-gig Ethernet link for the RGW and 40 gig, again for the RGW. And what you see is that the 10-gig RGW configuration gives better results compared to the 40 gig. Quickly moving to the next one: how many RGWs do you need? Based on the numbers from the study we did, we came up with this. If your workload is 100% large objects, you would typically require one RGW host with a single 10-gig Ethernet link for every 100 OSDs. And by OSD I mean spinners with the journal on flash; these numbers are valid for spinners, in an EC pool configuration. And this is how many you need to fully saturate the underlying cluster.
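Here is a rough sketch of the chunk-size tuning just mentioned; the section name is a placeholder for your RGW instance, and you would restart the gateway afterwards.

```shell
# Append to ceph.conf on the RGW hosts; [client.rgw.gateway1] is a
# placeholder instance name.
cat >> /etc/ceph/ceph.conf <<'EOF'
[client.rgw.gateway1]
# Default is 512 KB. Raising it to the 4 MB RGW stripe boundary means each
# erasure-coded chunk of a stripe goes down to its OSD in a single write.
rgw max chunk size = 4194304
EOF
# Restart the radosgw service so the new chunk size takes effect.
```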
Now, if you're just doing a capacity play and you're not accessing the data frequently, then these ratios don't hold true. But if you want to get the most in terms of throughput, this is what we found to be the right sizing. And if your workload is 100% small objects, 64K or so, then you would require one gateway machine for every 50 spinners. So basically, if you're running a 200-OSD cluster and the question comes up, OK, how many gateways do I need for this cluster, and my workload is small objects, you would require four gateway machines, with bucket indexes on flash. I don't know if you caught this on the previous slide. Sometimes people ask about the thing on the right-hand side, the co-located RGW; that was co-locating RGWs on the OSD hosts. A question we get a lot is, do I need separate hosts to run my RGWs? So a shout-out to Neil Levine here, Director of Product Management for Ceph, and Federico, one of the product managers sitting right there. You see them; they're the ones smiling, trying not to raise their hands. Anyway, this is something they're working on: co-location of various Ceph services on OSD hosts. So this is a sneak preview; it kind of slipped into the slide. It's not yet supported in Red Hat Ceph Storage, but it's possibly a portent of things to come. OK, so that is it. Here's the summary. We told you we were going to go through the conclusions, then the empirical data along with the concepts, and then the key takeaways. So, key takeaways. In fact, it's going to take too long to go through this again; we went through it once, so you can take pictures and read it later. The last thing we'll say is, that's what our team does: we create reference architectures. We listen to the architectural patterns that you're wanting to deploy, both hardware- and software-wise, we go into the lab, we reproduce those architectural patterns, hopefully answer the architectural questions, and we publish. So here is a library of reference architectures and performance and sizing guides that are published. A couple to watch for: there's the one we're talking about right here, which is the first time we've presented it in public, and it's about four weeks away. And a second one we're doing: part of the reason we did this work is because Ceph sitting underneath multiple analytics stacks, all sourcing from a common object store, requires high-performance Ceph storage. So watch for that one as well. OK, so that's the summary. Now that we're exactly out of time, we'd like to open it up for questions. Please. The 40-gig links on the RGWs, those are on the OSD hosts, the middle section right there. So two questions. The first question is about the I/O depth; I don't know if you want to talk about the queue depth. Did we measure the queue depth? Did you look at the queue depth? Well, queue depth in terms of FIO? Yes, threads. Those worker threads were coming from COSBench; COSBench is driving the threads, so there's no FIO involved on that side. Yeah, the I/O depth would only be relevant for the low-level baselines of the block devices. And then on the second question, do you want to talk about what we observed in terms of the RGW's use of more than 10 gig, per the conversation we've been having? Yeah, I mean, we saw diminishing returns; we didn't see a commensurate increase in performance.
When we were saturating the 10-gig hosts and we moved to 40 gig, it didn't give us four times as much headroom, right? We tried tuning; there are threading configurations for both RGW and CivetWeb so that you have more RGW threads that can answer and respond to requests. But there's clearly some sort of contention in there such that it doesn't scale well. We've done some tests where you run multiple RGW instances on a single 40-gig host, and then you can see more throughput. So it may be some sort of contention or locking issue, I'm not sure. Next, please. Maybe I missed it, but what did you use as the front end in front of the Rados gateways? HAProxy. On each of our RGW hosts we ran an instance of HAProxy that just had one back end, the local one, right? So we used it as kind of a connection shield. Each RGW has a single HAProxy instance on the same machine, and the reason we didn't use a single load balancer layer is that we didn't want the load balancer itself to saturate. That's why we used a single HAProxy in front of each RGW. But did you use some load balancer in front of HAProxy? No, no. We had these two COSBench clients talking to this one, and these two talking to this one. So for each? In a production setup, you would have to set up routing to distribute across your virtual IPs or something. You said that you had quite a small amount of slab; can you tell us what value of vfs_cache_pressure you used? It'll be in our paper; I'd have to go back and look, I don't remember offhand. We tuned it from the default; I think we set it to 10 initially. So the idea is not to read from disk, right? We were preferring to keep inodes and dentries in the slab if possible, balanced against Intel Cache Acceleration Software caching onto NVMe flash. So we were looking for a balance between the two. Did you also measure the performance of the bucket index? Oh, yes, we did. Yeah, do you want to go to that slide? While he's going to that slide, there's been one gentleman with his hand raised in the back for a bit; he's actually walking to the microphone. That's the bucket index. Yeah, so we've seen that if you move your bucket index pool to flash media, it gives you better performance. We have an issue, actually, with the bucket index. Everything works fine, and we have pretty large buckets, until some user hits the bucket list command, and then everything just hangs until that command finishes executing. And when I look at the OSD that's holding the bucket index, it's pegging one CPU, totally, one CPU, and everything is effectively blocked until that OSD finishes the job. So I wonder, is it possible to somehow tune that? Because, well, we have one bucket with 500 million objects in it, and yeah, using bucket list is going to be a problem. There are things you can do to make it better. You can configure sharding for the bucket, so its index spreads across a lot of index objects, and you can have faster media underneath your indices. But fundamentally, 500 million objects in a single bucket is just a lot for one bucket. I wonder if we could take the rest of this offline, and make sure we get this gentleman's question in. I invite you absolutely to chat with the team offline. Please, yeah. A couple of questions. First one: did you use HTTPS for S3? No, we didn't get to setting up SSL for the testing. And the second question: for Intel CAS, you used it just for metadata, but aren't you using SSDs for metadata already?
No, we were using the P3700s; we sliced them up, partly for journaling, and the remaining partition was used for caching. So: the Intel P3700 for metadata caching, using Intel CAS. So you used the faster SSD for caching? Yes, yes. Yeah, we partitioned the same SSD into chunks, so some of it was for the journal. Right, if you don't mind, Karan: some of it was for the journal, and then we used the balance of the capacity on the SSD for Intel CAS. So CAS would see, oh, this block I/O is a dentry or an inode, and it would cache it. You could have created an OSD just for the metadata pools, like the index pools, out of those. We did do that as well, not just the caching. Yeah, so there are two separate things. When we ran the indexes on NVMe, we created separate OSDs with the balance of the NVMe instead of using CAS, and in our CRUSH hierarchy we had a separate branch that had just the SSDs. Then we configured the pool for the bucket indices to take root at that particular point in the CRUSH hierarchy. So, I believe the next presenter is here getting ready to set up, is that right? We just need to make sure we wrap up. Come up and ask your question up here, and we'll let the next presenter set up. Thanks, everybody, for coming. Hope you have a good day.