Is this on? Okay, I guess we'll go ahead and get started. My name is Chris Jones, I'm with Bloomberg, and we're going to talk about some of the things involved in scaling Ceph in our environment — if I can get this to work. All right. Real quick, thirty years in under thirty seconds: we are primarily a financial services provider. You can see our Bloomberg Terminal there; it has over 60,000 functions and does a tremendous amount for the financial markets, including data streams and so on.

So what do we use of Ceph? We use the object store, block volumes, and on the OpenStack side we just started using the ephemeral storage. One of the guys who worked on that is here, and I believe the ephemeral storage has now become one of the more popular things we offer, which is kind of cool. I guess it was at Vancouver last year that we saw it, and we went back and implemented it.

Now, this slide talks about hyperconverged. We were kind of hyperconverged before it was cool. Everything now is 100% buzzword compliant — hyperconverged this, super hyperconverged that — but we started doing it probably three years ago, where we had everything on, for example, the head nodes. Head nodes are the controllers in the OpenStack world, but we were also running MONs, OSDs, you name it, on those same machines, and that started becoming a problem. So this past year we re-architected a little bit and went to a pod architecture, which gives us the ability to scale out and do different things of that nature. We can actually have three of these pods per rack. You see there in the middle it shows a ToR — a top-of-rack switch — and we actually put one of those with each pod. So we can make tweaks and scale however we want within the data center, et cetera. It gives us a lot of flexibility.

What we were seeing is that our object store was becoming very popular, so we wanted to scale it. The problem was we didn't want to add a massive amount of compute alongside it, because it's expensive — all of our OpenStack nodes run SSDs; that whole Ceph cluster is SSDs.

I mentioned the ephemeral piece a moment ago, and this gives you a bit of a visual, because I'm kind of a visual guy. If you compare the Ceph-backed side to local ephemeral, you can see there's a lot more network traffic with Ceph, and there are trade-offs. With Ceph the data is safe — if one of our customers stands up however many VMs, they're okay — but it's a little slower. The local ephemeral side is a lot faster, but you're living life dangerously, because it can go away at any time. Now, the numbers here don't represent our production clusters — don't look at them and say, oh, this is what they get in production, because that's not true. These were done with some of our lab equipment, and some of it is old. So just look at the comparison between them and think about whether you run everything on Ceph or mix in some ephemeral.
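Just to make that comparison concrete: whether a VM's disk lives on the local node or on Ceph mostly comes down to how Nova's libvirt driver is pointed at RBD. Here's a minimal sketch of the Ceph-backed variant — the pool and user names are examples, not our actual settings, and leaving images_type at its default keeps disks on local storage, which is the fast-but-dangerous side of the trade-off:

```bash
# Hypothetical nova.conf excerpt for Ceph-backed VM disks (pool/user names are examples).
# Leaving images_type at its default keeps ephemeral disks on local node storage instead.
cat >> /etc/nova/nova.conf <<'EOF'
[libvirt]
images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
EOF
```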
So we get into the thing we're doing right now, which is the object store. We started breaking that out, and basically we started with three racks. In our initial design we had hardware load balancers, but now we've actually created our own custom load balancers. One thing to remember — I don't have a pointer — is that each one of these racks is routed; they're on their own subnet. So you get into a situation with, for example, keepalived, where it doesn't want to move the VIP across subnets, but the fact is it's there. We fixed that with other configuration, and I'll show you that in a minute.

Our OpenStack clusters run on Ubuntu, but our object store actually runs on RHEL. There's no particular technical reason for that; it was more or less an olive branch. We have a lot of storage groups and other groups within the company, and a lot of those folks use RHEL and are comfortable with it. I wanted to get them involved, especially on the ops side, because I didn't want to be doing everything all the time, so we basically put it on RHEL.

In this diagram, the top part is our ToR. Then we have three 1U nodes — those are basically our MON nodes, our RADOS Gateways, and our load balancers. The other 17 are 2U nodes, and those are all OSD nodes. That's important, because I just came from a great talk from Comcast, and they're running large-density servers, roughly 72 drives per node. In our case we have 12 drives per node — 12 spinners at six terabytes each — plus two SSDs for the journals. The interesting thing is that those SSDs aren't just journals; they also co-host the OS. A small partition at the front of each is mirrored for the OS, and then we have six journals on one SSD and six on the other. We also went a little larger on the journal sizes because we had the space — they're running at 20 gig.

On interfaces, we have two NICs, or two ports: one for the cluster side at 10 gig and one for the public side at 10 gig. RADOS Gateways and MONs don't use the cluster network — they only talk on the public side — so on those nodes we bond the two ports in LACP, mode 4, so we can get some aggregate bandwidth out of them.

This slide is mostly things we've already talked about, but one thing I skipped over: to actually get scale, our OpenStack clusters run three replicas on straight SSDs, but the object store is all erasure coding. Figuring out what to do with erasure coding was a challenge — there aren't a lot of docs out there, and there's no good explanation of why you do this and not that.
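As a rough sketch of how the journal and network choices above show up in ceph.conf — the subnets and values here are examples, not our production settings; in practice the public and cluster networks cover each routed per-rack /27:

```bash
# Rough ceph.conf sketch for this layout (subnets and sizes are illustrative only).
cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
  public network  = 10.1.20.0/27
  cluster network = 10.1.21.0/27

[osd]
  osd journal size = 20480        # ~20 GB journals, six per SSD across the two journal SSDs
  osd mkfs type = xfs
EOF
```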
So a lot of this was trial and error, and then of course bringing in some folks from Red Hat to take a look, and we came up with some combinations from there. The interesting thing on the RADOS Gateway side is that, if you've ever created a RADOS Gateway, it creates a pool set — roughly about 14 pools — and each of those has some function within RADOS Gateway. The most important one is the bucket pool, .rgw.buckets. That is the only one you actually erasure code; the rest of them are replicated. You'll see some of that here in a minute.

Another important thing we've done — it was also mentioned in the other talk — is that we run hardware controllers on the OSD nodes, because there was a significant performance increase over doing it in software or with the on-board controllers. One of our guys started testing some new equipment that came in, and somewhere along the line it was missing the controller, and he was asking, why are these drives that are supposedly better than everything else slower than our current architecture? We got in and started digging — he did, primarily — and realized, oh, we don't have our controller. It wasn't an apples-to-apples comparison, and the controller made a significant performance difference.

This gives you a more logical view of how we have our object store. We have our spines, and we have our two load balancers. The interesting thing, like I said before, is that keepalived can't really span multiple subnets in this case. If you put IP space A on the first load balancer and a different one on the second, you'll see the VIPs there, however many VIPs you may have, but they won't fail over cleanly. So instead we use BGP to advertise those routes, and we use BIRD for that, because with everything we're doing we're trying to stay open source — basically non-vendor solutions. That's where we're heading; we're trying to move everything toward non-vendor solutions. So we set up BIRD with BGP so it advertises to its peers — its peers are the spines — and then the rest of the network can see where the VIPs are.

Now, in doing that, you don't want to advertise the secondary naively, because things get confused — and I say that because it did happen. When you start making RADOS Gateway calls and doing different things, you'll just see connections drop, and what it was, the routes were getting confused. There's a configuration setting in BIRD that basically makes the secondary advertise like a primary, and that worked out pretty well.

For the RADOS Gateway, we run multiple instances of RADOS Gateway per gateway node. Remember, it's a 1U box, it has 10 gig ports, and it has 256 gigs of RAM. The original spec called for 128 gigs, but when they came in they had 256, and I wasn't going to turn it away. I kept it and didn't say a thing — life's good.
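Back on the BGP piece for a second: a minimal BIRD sketch of advertising a load balancer VIP to the spines might look like the following. All ASNs, addresses, and interface names here are invented, and the "advertise the secondary like a primary" tweak I mentioned isn't shown — this is just the shape of it, not our actual config:

```bash
# Minimal BIRD (1.x style) sketch: announce a VIP bound to a local interface via BGP.
cat > /etc/bird/bird.conf <<'EOF'
router id 10.1.20.5;

protocol direct {
  interface "lo", "dummy*";                # pick up the VIP bound to a local interface
}

protocol bgp spine1 {
  local as 65020;
  neighbor 10.1.20.1 as 65001;
  export where net ~ [ 192.0.2.10/32 ];    # only announce the VIP to the spine
}
EOF
```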
So we looked at that and said, hey, why don't we approach this a little differently, and we started investigating how we could run multiple instances of RADOS Gateway on a single node. The reason we split it out this way is that we have private networks in our environment. Private network A can't be allowed to see what's going on in private network B, and so on, and typically all of our clusters sit inside a single private network. So if private network A wants a cluster, we have to build out a whole cluster for it — from a security standpoint, the way it had all been worked out in the past, you couldn't just say, here's a converged piece of hardware, why don't you all come use it. The object store is the very first product within Bloomberg that's been allowed to do that: we now have a centralized storage cluster that takes access from private network A, private network B, C, D, whatever.

Each of those RADOS Gateway instances is also weighted in the load balancer, and the reason is that we want to be able to scale them out if we need to, or handle failures. We can, for example, set up OSD nodes as lower-weighted RADOS Gateways if we have to, and the same goes for other things, even MONs — you'll see that in a bit. Each load balancer basically weights these and puts traffic on a different port depending on where it came from: if the traffic is coming in off private network A's VIP it goes to one port, and private network B goes to a different port.

Now, here are some important configuration pieces. It was pointed out in the last session that Ceph has — I don't know how many, but so many knobs you can't even count. You turn one over here and something else happens over there; sometimes it doesn't make sense and sometimes it does. You have to tune it for your given environment. One of the things, like I was telling you before, is that each rack is on its own subnet. All the examples you see show one aggregate network, like a /24, but we actually use /27s, and they're all routed.

Our OSDs are all XFS, with the hardware controllers we talked about. On the RADOS Gateway side, one of the things we're testing right now — we haven't implemented it yet — is federation with regions and zones, and we have another cluster we're about to stand up so we can do that.

And then of course the erasure coding pieces. The thing to keep in mind about erasure coding is that it's different from replicas. Replicas are dead simple: you've got one object, another object, another object. Erasure coding is different — the CRUSH maps are different, everything about it is different — so you really need a reasonably good custom CRUSH map. We created two rules, one for replicated pools and one for erasure, and we've laid them out by rack, then by host, then by OSD, so we can distribute the load. Because the whole thing about Ceph is data distribution. That's the whole point.
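As a sketch of what a "racks, then hosts" erasure rule can look like — the ruleset number and counts are invented, and the chosen counts have to multiply out to your K+M (here 3 racks times 5 hosts for a 10+5 profile); a simpler variant is a single chooseleaf step over hosts:

```bash
# Hedged sketch: decompile the CRUSH map, add a custom erasure rule, recompile, inject.
ceph osd getcrushmap -o /tmp/crushmap
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
cat >> /tmp/crushmap.txt <<'EOF'
rule acd_erasure {
    ruleset 1
    type erasure
    min_size 3
    max_size 20
    step set_chooseleaf_tries 5
    step take default
    step choose indep 3 type rack          # spread chunks across 3 racks...
    step chooseleaf indep 5 type host      # ...then 5 distinct hosts in each rack (3 x 5 = K+M)
    step emit
}
EOF
crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
ceph osd setcrushmap -i /tmp/crushmap.new
```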
Because you don't want one rack over here being full, or almost full, while the other two or three racks aren't, and you're left wondering why all this funky stuff is going on. That comes back to the CRUSH map.

Also, one thing to keep in mind — and I'd recommend this with almost any configuration — is to shard your bucket indexes. The way it works is that there's a setting, I think it's right below there, where you basically tell it a max shard count, like five. That five is just a sample. And you'll see this too, because everything we do is open source — everything except the data about our actual machines, the MAC addresses and IPs and all the other good stuff. But here's the catch: you might create some buckets with it set at five, then decide you need to go a little higher and set it to ten, but all your previous buckets will still be at five; only new buckets pick up the new setting. It's not going to go back and change the old ones.

Also, civetweb: we kind of booted Apache out. It was a memory hog, among other things. The front-end pieces are an interesting little area, because a lot of people don't know you can do this — it's just not really in the docs. Civetweb itself has a lot of options, a lot of config settings you can pass. You have to go look at the civetweb project to see all the different options, then come back and look at the code in RADOS Gateway, and you'll say, oh, I can use that, and that. The one there that says num_threads=100 is actually the default setting; I was playing with increasing and decreasing it to see what happened, and then I just left it in there. But 100 is the civetweb default.

Network. Most of what you'll find performance-wise falls into these two pieces: your load balancers and your network. Like I said before, the RADOS Gateways, the MONs, and the load balancers are all bonded — two 10 gig ports, bonded in mode 4 for aggregation. We're also using jumbo frames, so we set the MTU to 9000. You definitely want to do that on your cluster network; that's a given.

One of the reasons for all of this is that last year we were doing some testing on our clusters in our DMZ, and I was comparing against S3. I had to, because one of our customers was saying, hey, we're going to move over here, we're going to do this, whatever, so we needed good benchmarks. I think it was Canonical in here yesterday saying that when you're doing cost comparisons on storage, it doesn't really matter what's going on everywhere else — you need to base your cost against what Amazon is doing. That needs to be your baseline.
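Back on the RADOS Gateway settings for a second: those two config pieces look roughly like this in ceph.conf. The section name, port, and shard count are examples, and as mentioned, the shard setting only applies to buckets created after it's in place:

```bash
# Sketch of the RGW settings discussed above (section name, port, shard count are examples).
cat >> /etc/ceph/ceph.conf <<'EOF'
[client.radosgw.gateway-1]
  rgw frontends = civetweb port=8080 num_threads=100
  rgw override bucket index max shards = 5    # only applies to buckets created after this is set
EOF
```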
And the reality for us is that it wasn't just cost we had to baseline — we also had to look at performance, because our customers will say, hey, I'm faster doing this, or I'm faster over there. Now, that's only on our DMZ side. We have many, many private, secure networks where that's never a factor, but on the DMZ side it is.

The other thing about MTU 9000: I had problems last year. It basically takes a cloud to test a cloud, so I was testing a lot of different things with JMeter from Amazon back into our cluster, and from Amazon to S3. With the tweaks we'd made to RADOS Gateway, I had parity with S3 — the irony being that I'm on EC2, comparing a path back into our DMZ against a nearby S3 region, and we were at parity performance-wise. But we also saw things dropping along the way, and it was because the discovery mechanisms on some of the devices in between, wherever they were, weren't doing what they were supposed to do, which is auto-adjusting the MTU. So we just went ahead and set the public side to 1500. The BIRD config setting I talked about earlier is on that bottom line there — setting the secondary to advertise with your ASN — and it's really important if you're leveraging any BGP, because you want to make sure your routes are advertised correctly.

Now, obviously, we can't even approach this without automation. There's absolutely no middle ground there. In the last talk they were tweaking this hardware against that hardware, and again, those were density nodes and purpose-built components, and they had the time. In our case we use lower-density nodes because I don't want to care: if a machine dies, I don't care — throw another one in, file a ticket, get it replaced. If a drive fails, file a ticket, get it replaced. We keep spares for those very reasons. So we're not tweaking hardware everywhere we can unless it can be fully automated; if it can be fully automated and it makes sense, then we definitely do it.

So all of this is in Chef. The first one there is our Bloomberg OpenStack distribution. It was originally called BCPC — it's still called that — but after about 400 name changes it's now BCC, Bloomberg Cloud Compute. And matching that, our Bloomberg object storage is BCS, Bloomberg Cloud Storage. If you go to the GitHub link up there you can clone it right now — I actually did a while ago in the other talk, because I was testing the network performance and all that, and I built a Ceph cluster on this laptop while I was sitting in the other session. You can clone it, run vagrant up — well, there are a couple of other things you'd do — and that will build out a full OpenStack environment along with Ceph in that scenario.

The other interesting thing, if you look at the second one, is ceph-chef. If you look at that on GitHub, it's actually managed in the upstream Ceph repo; we created it and we're basically the admins for it. It's a complete cookbook that will give you everything you see here plus more, including CephFS and so on.
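Going back to the NIC bonding and MTU settings for a second: on the RHEL side, an LACP bond looks roughly like the sketch below. The interface names and values are illustrative, not our actual configs — the OSD nodes' cluster-facing interfaces are where MTU 9000 goes, while the public side stays at 1500 because path-MTU discovery wasn't reliable end to end:

```bash
# Hypothetical RHEL-style ifcfg for an LACP (mode 4) bond on the public side.
cat > /etc/sysconfig/network-scripts/ifcfg-bond0 <<'EOF'
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=4 miimon=100 lacp_rate=1"
MTU=1500
BOOTPROTO=none
ONBOOT=yes
EOF
```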
And then the next piece is our object store cookbook, which is what I'm talking about here, and that actually builds on ceph-chef. In essence, when you bring chef-bcs up — in this case for a development environment, because we do everything on VirtualBox for development and then roll it onto the hardware — it will go out and grab all the cookbooks, all the dependencies, all the packages, everything it needs, because in our scenario we cannot access the outside world; for security reasons we operate behind lots of proxies and you name it. And like I said, all of this is on GitHub, completely free — go out and get it right now. Actually, that would be awesome. And issue some pull requests; that would be really good, because there are a lot of enhancements that still need to be made to everything.

So here, just so you know I wasn't kidding, is a screenshot of the actual GitHub page, the same for the cloud storage one, and then our BCPC. That one is a much larger project, obviously, because it has all of the OpenStack components, but again it has the configurations and things necessary to build out full clusters. On the storage side, I don't want to build one way on Vagrant and then do something completely different on the hardware, so we try to keep everything as close as possible. There are some challenges with that, especially on the networking side — there are things you can't do in VirtualBox, where you have to say, okay, I've actually got to put this on hardware to see what happens.

Now, this is something I added, because this question gets asked a whole lot about erasure coding: how much capacity will I get with erasure coding versus replicas? And I'm like, a lot? I don't know. How much what are you going to do? I can't answer that. I got a little tired of answering, or not answering, that question, so I thought, you know what, I'm just going to build out the formulas so you can plug in the number of OSDs you're going to have, what size they are, and so on; it shows you your raw capacity and then lets you do what-if scenarios on your K side and your M side so you can work out where you want your erasure code settings to be. That's also in the GitHub repo, so you can download it right now and start playing with it.

The interesting thing here — and this gives you a better visual, because I'm a visual guy, so it gives me a better picture of what my capacity is going to be — is that, for example, I know if I increase my K or decrease my M, I'm going to better utilize my storage. But there are trade-offs. Remember when I was talking about replicas: copy A, B, and C, simple, no problem. Erasure coding is different. For example — and I'll use 10 for easy math — if I have a 10-gig file out there and I set K to 10, that 10 gig gets split evenly into the number of K chunks, in this case 10. So now I have 10 objects floating around the storage that I have to do something with.
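Back on the capacity spreadsheet: the arithmetic behind it fits in a couple of lines. This is just a sketch — the OSD count below roughly matches the three-rack layout I described earlier (3 racks times 17 nodes times 12 drives), but treat all of these as what-if inputs, and it ignores journal, OS, and full-ratio overhead:

```bash
# Toy capacity estimate: usable ~= raw * K / (K + M). Illustrative inputs only.
OSDS=612 SIZE_TB=6 K=10 M=5
awk -v o="$OSDS" -v s="$SIZE_TB" -v k="$K" -v m="$M" 'BEGIN {
    raw = o * s
    printf "raw: %d TB   usable: %.0f TB   efficiency: %.1f%%\n", raw, raw * k / (k + m), 100 * k / (k + m)
}'
```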
I've got to balance it, because it's distributed. So this lets me look at those two values, K and M, to determine what I want my CRUSH map to look like. Because you can't just go use the default CRUSH map, or a simple one that says, okay, step chooseleaf, type rack, done, and expect everything to look good with nice failure domains. It won't work. You'll go to implement this and ask, why aren't any pools being created? And the reason is that the CRUSH rule is looking for K+M of whatever bucket type you told it to look for. Do you have that many racks? Do you have that many hosts? Things of this nature — it depends on what the failure domain is in your erasure-coded profile. And this gives you the ability to say, okay, I'm going to trade off a given pool for whatever reason so I can use my storage more efficiently, or I want to see something a little different. Let me see if I can go back to that. Okay.

One thing I didn't mention is the M side, and that's important, because those are your parity chunks. In the scenario where I have my 10-gig object, it divides that by K — adding any padding so the chunks are all even — and you get ten 1-gig chunks. But let's say I have M set to 5; that's my parity side, so you're also going to see five more 1-gig chunks, for a total of 15 chunks floating around out there.

This slide shows what I was talking about before, and these percentages are actually taken from the PG calculator — you go to ceph.com/pgcalc and you can go in there and play with how you set up your pools and your PGs and things of that nature. The cookbook itself actually implements that same PG calculator, but it gives you another option: instead of going to the nearest power of two, you can also say, hey, I'm going to go to a higher power of two, so you have the option to do it whichever way you want, and that will help with your PG distribution. But you can see the amount of data in each of these pools in a typical, plain, out-of-the-box RADOS Gateway pool set: 96.9% of the data is stored in your buckets pool. You could probably get away with erasure coding some of the others too, but the recommended way is that only the actual bucket data is erasure coded and everything else is replicated.

So now you're looking at erasure coding, because I know everybody wants to maximize their dollars, but there isn't a whole lot of information on it, and it sounds really complicated. And it sort of is, when you look at it the first time, but when you start working through how the objects are laid out within Ceph, it begins making a lot of sense. It becomes more and more clear, and then you can begin doing some really neat things with your CRUSH map and distributing your load a lot better.

The main thing with erasure coding is to think about your failure domains. For example, the typical failure domain with replicas is a rack — that's what people typically try to do: hey, I can lose a rack or whatever, and it doesn't bother me.
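On the PG side, the pgcalc-style arithmetic for the bucket pool is easy to sketch too — target PGs per OSD, OSD count, and the data percentage below are example inputs, and for an erasure-coded pool the "size" is K+M; sanity-check anything real against ceph.com/pgcalc:

```bash
# pgcalc-style estimate for the bucket pool (example inputs; size = K + M for an EC pool).
awk -v target=100 -v osds=612 -v pct=0.969 -v size=15 'BEGIN {
    raw = target * osds * pct / size
    pgs = 2 ^ int(log(raw) / log(2) + 0.5)    # nearest power of two (round up if you want headroom)
    printf "raw: %.0f  ->  pg_num: %d\n", raw, pgs
}'
```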
But with erasure coding, that's most likely not going to be your failure domain — it definitely won't be unless you have a whole lot of racks. In our particular case, the failure domain is a node, an actual storage host. You could make it the OSD, or other levels. The reason that's important is that your CRUSH map will try to keep spreading things out so it doesn't put two chunks of the same object inside the same host. That would be horrible, because what happens when that host goes out? You just lost your data. So instead it will try to disperse them.

There are a couple of different ways to check on this, play with it, and see the settings. One is the profiles: you set your erasure code profiles, and you can set as many of them as you want. There are defaults too — for example, the plugin defaults to jerasure; you can change that, but I left it at jerasure in this case. You set your values, like the K of 10 from the scenario I was just talking about, and I set the failure domain to host. You can create multiples of these; it doesn't matter — they're just profiles, nothing happens at this stage.

Then you come down. Now, what I like to do — because we manage our whole CRUSH map ourselves, so we set osd crush update on start to false and all that — is avoid churn. If you start having OSDs go down and flap, you have to go roll things back a bit, and I don't like doing that; less work is better for me. So what I typically do before any of this is set the nodown flag, because I want to keep everything up; I don't want to have to go back and fix things later just because I was playing with it. Of course the cluster is then going to say HEALTH_WARN, but that's only because you set a flag.

Then you create a pool: a new pool with your PG and PGP numbers, the word erasure, the name of the profile you just created, and then the name of the CRUSH rule you're going to use — in this case I happened to call mine acd_erasure, because the other rule is for replicas.

Then you come down and look at your PGs afterward. If it goes out and starts building your pools, that's good, but you want to see where they're distributed, because you want to make sure you don't have things, like I said, all in one host, or even all in one rack. It's tempting, when you see HEALTH_OK on your first erasure-coded pool, to say, cool, great, we officially support erasure coding — but then you start looking at it and, oh no, everything's in one rack, so if this rack goes out, I'm hosed. So you've got to play with it a little bit.

So, taking that scenario: you want to look at your pools — ceph osd lspools — and you can do an osd dump and just look at the top part as well.
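Strung together, the profile and pool creation steps from a minute ago look something like this — Hammer-era syntax, and the profile name, pool name, PG counts, and rule name are just examples:

```bash
# Sketch of the sequence described above (names and numbers are examples).
ceph osd set nodown                                     # keep OSDs from flapping while experimenting
ceph osd erasure-code-profile set ec-10-5 \
    plugin=jerasure k=10 m=5 ruleset-failure-domain=host
ceph osd erasure-code-profile get ec-10-5               # sanity-check what you just set
ceph osd pool create mybuckets 4096 4096 erasure ec-10-5 acd_erasure
ceph osd unset nodown                                   # clear the flag when you're done
```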
And the PG's and all that are going to be unique integers, and so it'll plus some other things behind the decimal, but the primary part is the, actually I'm talking about the pool, sorry, and you want to get the pool ID and then you want to do like a PG dump and then you want to grip on the PG number, and I was a couple, like I said, there's several different ways you do it, this is the way I do it, and then what that does is that shows me, okay, now I'm only looking at the, my PG map are dumped on just the pool I'm interested in, so it doesn't have any data in it now, but it's going to tell you where it's mapping to what OSD's, and so you'll see something like, in this case, 10, 0, 5, 2, 12, so it's on basically five, just in this scenario, five OSD's, and then you can find out, do the SF OSD find, and then you can find out where, like 10 being the ID of the first OSD, and you can find out what host it's on, and if you segment in your, obviously your host name is properly, which you should do always, and then you'll know what racks, et cetera, right out there, and then you start making adjustments to your crush map based on that, because that's where you're going to find where your placements are, and that's, like I said, it's critical, you can't skip that part, because you want to kind of understand where the distributions are. Testing, so everything, obviously testing is critical, with the open stack, when we first started doing things, we've used Tempest, and we actually have Rally inside of our cost, we don't reuse it as often now, as we did last year, we used it a little bit, but on the south side, the Rados bench, obviously, the cost bench, and then the FIOs, and the reason for some of that is just you want to test, you want to look at your drives, et cetera, and that is, I've set it up in a master slave configuration, so that I have several many instances that are running, and they're going to be running tests which are going to query objects and things of this nature, so I'm going to see what's actually like a real use case of something that's going on, and then I'm going to most likely do random, which I do, I actually do random bite range requests, or you're not dealing with one, the first 10 of each object or whatever it is you're segmenting, I do random in that particular case, and then you can find out where some of your performance bottlenecks are, and again, most of the time you're going to find that it comes back to your network, your load balancers, those are typically there, but then at the same time, if your OSDs are just screaming and doing work, things of this nature, then you're going to see tremendous amounts along with the other tools. So, what are we looking at going forward on some of this stuff? We're definitely, obviously always looking for improvements, better monitoring, even some DevOps pipelining, and also, too, the one thing about that, the Chef Chef cookbooks and the BCS is built so that you can build it and plug it right into a pipelining system, something like GoCD or that nature, or even Jenkins, Jenkins doesn't do pipelining and all that great. That's a debate, don't throw stuff at me. But, and then the where we're going to is, in essence, Chef with non-vendor storage solutions. That's kind of where we're heading with some of this. So we're definitely looking for performance improvements, no matter what, and we're looking at better multi-tenancy capabilities in Jule. That's where some of that's being kind of laid down a little better. 
We're going to be testing that. And just so you also know, our typical, our smallest cluster is roughly 3.6 petabytes, is our smallest. And so I got somebody to agree. He's sitting in the back of the room and he got it agreed with some others that, hey, I've got another cluster over here, that's 3.6 petabytes that hasn't been stood up yet. Can I use that as a lab? And they're like, okay, you have 90 days. So I'm like, great. So that's what I'm going to be doing with it and testing a lot of these other pieces with it. As well as enhanced securities and then also the other thing that we're kind of looking at too is for specific use cases, some of the NVMEs that we may be looking at some hot swappable components and any other high-performance type SSDs and controllers, et cetera. And then maybe, and this is just a maybe with the RDMA pieces so that that way your OSDs can communicate a little faster directly without going through all the tiers of the networking components, et cetera. And that's it. And here's the, again, I just want to put this back up here. I got a guy yesterday said something about Twitter. He said, hey, I don't tweak much, but here it is. I'll put some stuff out there. I don't tweak much. But the findings that I'll have, the things that I'm allowed to share on our findings with this test cluster, et cetera, I will put it out there. So, and that'll be one way that it'll do it. And then of course we'll always continuously be updating. Matter of fact, I just made pull request and merged this morning on some of the cookbooks, et cetera, et cetera. So, questions? Thank you. Thank you. I'll try to answer them, so I can't guarantee it. Which version of stuff you're using? All right, so we're running beta with our object store right now. And that version is hammer, 94.6. So I didn't mention that, but that's what it is. What we're testing, the new cluster is actually going to be dual. So that's what we're doing on that. Hi. So, your analysis is brilliant. Thank you so much. But what you're using as a hardware looks like or sound to me is enterprise grade SSDs, right? That's true. So if someone asks you, for example, telco service provider, hey, I want to swap, I'll change my tape solution into safe based, low cost storage that cannot be high end SSDs, but regular drives. What will be your input and what will be the variable that will be added into your analysis? Increase your pain tolerance. That's the first thing. No, that's actually true. And the reason I say that part of this talk here is part of a much larger talk and it talks about you have to have buy in along the way. Everybody has to realize you're going to give and take. That's just the way it is. And so you've got to be able to have pain tolerance. So you can get so what you have to look at with SSDs you can talk to some of these vendors that are out here. They know way better than I do. But you have basically they can only write so many times and you know your things of this nature in the meantime between failures is really low on consumer grade but can they be used? Yeah, but it just really depends on your use case. I mean just really look at your use case. You don't use do it because you can do anything that you want. So question about the probably replication part I don't have much experience in it as a core thing but for replication part when you use that you use the default to bucket type in your crash map you have to specify like a straw or Can you go back to the first? I missed that first piece. 
Yeah, I'm just talking about the question context in the first part. I'm asking about which bucket type did you use for the replica based pool? There are a bunch of algorithm lists or straw or tree you have a rack aware crush map definition there. Yeah, so let me pop over here. I don't know if this will work. Can you cut that back on for a second? So here it gives you an example of some of the tunes we're actually using the second release of the straw calculation for tunables. So this is our base and this is actually because it's the same this is actually our base that's in production what we start with and so you'll see the rule sets coming down for the different pieces and then you'll also see different and actually the bottom one down there with the minus three etc. That's actually not in our production piece. I was actually just testing some stuff on Begrant this morning on that build. But from there what happens is the Seth Chef piece in the OSD sections if you have a racer coded enabled in the cookbook then what it does is it moves when it creates the OSD it moves that into the appropriate slot inside your crush map tree and then it balances based on those weights on the rack and the nodes etc. So I was wondering in my company we have a really bad we're starting to put Seth in production and our network is very slow is there any point in your opinion to tune the performance when the network is just slow like do you think we need to talk to infrastructure to get the network better before doing that? Well, yeah. I would get everything I could get. Do you think it makes any sense to do performance tuning now and then wait for them to fix network or just fix network right now and then do performance tuning after? So that's a good question. So in essence what you can do right now is get a baseline with what you have. You get a baseline on your performance you get a baseline on how you're delivering everything and then from there start tweaking it a little bit and see where the deltas are between where your tweaks are hopefully they're in a positive direction and not a negative direction and then you can actually compare that and so then you have a better idea that when this you get better throughput etc etc and you have a little lower latency network then you're going to be able to take advantage of it immediately. Okay, so you think it's like predictable because I was worried when you said you're playing with so many knobs that maybe the reaction will change when the network changes. No, you actually it should be to the good side of that. So definitely on that side of it. Thanks. Anything else? Okay great, thanks.