So, hi everyone. I'm Christopher MacGown. I'm the co-founder and chief scientist of Piston. With me, I've got Mandell Degerness, Jonathan LaCour, and Jordan Tardiff from DreamHost. Mandell's with me at Piston. We started this talk with the idea that we were going to have a panel on how we use Ceph and Cinder together. To that end, there are no slides. We've got a Google Doc open, and we're drinking beer. So the rules are: if you have questions, you're free to interject at any time. It's going to be a moderator-less panel, but the rule is you have to have a beer in your hand when you're at the microphone. You don't have to be drinking the beer, but you do have to be holding a beer. And I'll let everybody go through and make their introductions, so we'll start with Jordan, because we promised him he wouldn't have to talk until he was called on, and I don't believe in that sort of thing.

This is gonna work? Cool. My name's Jordan. I've worked at DreamHost for 11 years, I think, now, and I'm one of the two cloud engineers on our team.

That's right. I'm Jonathan LaCour. I'm the vice president of cloud. It's my fault that there are only two cloud engineers on our team, which is why one of them is not here, because he's tired, which is pretty much true all the time. Our product is called DreamCompute. It's built on top of OpenStack, it's a public cloud, and we have a multi-petabyte storage cluster running on Ceph for block storage.

And I'm Mandell Degerness. I'm working with Piston Cloud, and I've been working on their storage side since it started, pretty much having to do it myself until recently, when we finally hired someone to work with me.

Hi, I'm Christopher MacGown. As I said, I'm the chief scientist and co-founder of Piston. It's actually my fault that Mandell's the only person that's been working on Ceph and Cinder and all of our storage for the past three years. So, yeah, if there's anything wrong with what's going on, blame us.

So I think we were going to start by describing our... you know, we have very different use cases for Ceph and Cinder, so maybe we'll describe a little bit about the architecture and what the footprint looks like for each of us. Right. So you want to kick it off? You want me to kick it off? We could rock, paper, scissors, or we could just have Mandell do it. All right. You go first.

All right. So we use Cinder as both block and... Ceph, sorry. Yeah, that'll help. Yeah, we use Ceph for our block storage and for our object storage. Closer. Does that work better? Oh, there we go. And, oh, ouch. Anyway, describing our architecture is a little difficult because all of our customers use it differently. However, I understand the USDA is giving a talk this afternoon about how they're using us, so that might help. Basically, we've got a configuration manager that allows you to define how you want to use various different types of disks and put them into multiple pools in Ceph. And one of the things that we do, since we have a converged-infrastructure distribution of OpenStack, is we actually run compute and storage all on the same hosts. Because of that, we have a very different use case than most people who are deploying Cinder. So we operate with monitors and OSDs on the same hosts, and those same hosts are also running virtual machines, which puts us into kind of an interesting spot.
We discover a bunch of edge cases that Inktank and the Ceph community wouldn't otherwise have found, because of assumptions they've made and assumptions we've made that don't always necessarily match up.

Cool. Yeah, our use case is completely different. We are a public cloud use case. In our case, we have a multi-petabyte cluster with 126 Dell R515 storage nodes. Those have 12 three-terabyte disks in them. So you can do the math. That's beaucoup data, a bunch of it. And each one of those has 32 gigs of memory. For our block storage, we're using 10-gig networking everywhere, and that helps a lot. We also have three monitors right now that are Dell R415s, and we're going to be adding two pretty much right after the summit because we're opening up our beta quite a bit. So in our use case, we don't have the converged infrastructure. Our hypervisors are separate. Those are Dell C6145s with 192 gigs of memory and 64 cores, and we have 48 of those. So we're supporting thousands and thousands of VMs on a single storage cluster. And we use Cinder for the block storage, and we use Glance to store our images in Ceph as well.

Yeah, and we're doing the same thing, writing our images back to Ceph. And we have some other extensions that we write into Ceph as well. We use it as the backing for the memory page sharing service that we have in our Piston OpenStack. Also, the MySQL database sits on Ceph. Oh, yes. MySQL sits on Ceph. And that's hosted there for the infrastructure services, Cinder, et cetera.

That's interesting. What in particular got you running MySQL on Ceph as well? We don't want the MySQL database running outside of the cloud. Basically, our installation is turnkey, so we had to be able to orchestrate the services that the services depend on. And since we use master election to do that, we needed to be able to ensure that the service could run from any of the hosts in the cluster. And that gave us very few choices. I mean, we could have used Gluster, but we were already deploying Ceph, so it made sense for us to orchestrate Ceph and then deploy the services on top of it.

So we have a question. He's got a beer. He went through the whole thing. So please. And introduce yourself as well.

I've got this, so I can ask this question. I haven't been drinking it for a while. So you've got MySQL on Ceph. Are you having MySQL write to a raw RBD volume? Or are you actually using CephFS to share that out?

We are not using CephFS. When we started off, we were using CephFS. And when I say starting off, I mean when we were doing the initial development. And it turns out that CephFS at the time, in 2011, wasn't performant enough to be able to handle even a single concurrent write or a single read. We discovered this when we put it on there and went, what the fuck's happening? The whole thing's going away. This doesn't make any sense. That was in 2011, and it's changed since then; we haven't tested it since. But if you created a directory tree four levels deep, so A, B, C, D, and then cd'd into D, an ls on an empty directory would take 30 seconds to return. Which is not ideal. Yeah. So since then, we've tended toward the Ceph community's recommendation not to use CephFS, and we really haven't evaluated it since.

So it's InnoDB using an RBD volume as a raw volume? Yes. Thank you. And my understanding is CephFS has come a really, really long way since then, so it's definitely something that we're going to be looking into a lot more very soon now.
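Since the raw-RBD-for-MySQL question comes up a lot, here is a rough operational sketch of what that can look like with the standard rbd CLI and the kernel client. The pool name, image name, and size are placeholders, and this is illustrative only, not Piston's actual orchestration:

    import subprocess

    def run(cmd):
        # Helper: print and run a command, failing loudly if it errors.
        print("+", " ".join(cmd))
        subprocess.check_call(cmd)

    POOL, IMAGE, SIZE_MB = "mysql", "mysql-data", 100 * 1024  # hypothetical names and size

    # Create a 100 GB RBD image to act as the raw block device behind MySQL.
    run(["rbd", "create", "--pool", POOL, "--size", str(SIZE_MB), IMAGE])

    # Map it through the kernel RBD client; udev exposes it as /dev/rbd/<pool>/<image>.
    run(["rbd", "map", "%s/%s" % (POOL, IMAGE)])
    dev = "/dev/rbd/%s/%s" % (POOL, IMAGE)

    # Put a filesystem on it and mount it as the MySQL data directory.
    run(["mkfs.xfs", dev])
    run(["mount", dev, "/var/lib/mysql"])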
And one of the things that we're going to discuss later on is how Ceph has rapidly matured since it started, since we started using it, and since DreamHost has come aboard using it as well.

So I think we should describe our Cinder setups. I think that might be a valuable thing. And I'm going to kick this one to Jordan, because... I'm amazed he still has hair, since he's been battling with... what was that? It's actually a wig. Oh, it's actually a wig? Yeah, I wouldn't be surprised.

So our Cinder setup is something that we've been tinkering with a lot lately. And I can talk a little about the latest stuff that one of our guys, Ryan, has been working on. You can go through what we've been bodging together to make things work so far. Yeah, so we've been looking into a couple of different ways of doing things. Running multiple Cinder volume services isn't necessarily a supported way of doing things, so there's a hack that a couple of people use, which we learned from CERN: you change the host in your cinder.conf to the same value on all of the hosts. It kind of ends up balancing cinder-volume creates and deletes across all of them. But apparently that introduces some race conditions, which we haven't actually hit yet, and I don't think many other people have hit. So we run four cinder-volume services on separate hosts, and then we also run about four API services right now. So yeah, that's where we're at. We're also researching some way to actually split the RabbitMQ messages out so we can separate deletes and creates off to separate cinder-volume nodes too.

Yeah, and the reason that's important... I think the interesting thing about Cinder is that it was really designed for use cases where the driver is going to have consistent performance and very rapid... yeah, I'll finish my statement and then we'll call on you... really rapid responses to each operation, where they're fairly consistent. When you're dealing with a large, multi-petabyte, cloudy storage cluster, it's a little bit of a different situation, and not all operations are always going to be fast. And it's not that Cinder is dumb to do it this way; it's just a newer use case, a different use case. So we're working with a lot of the Cinder devs, and they're doing a really great job of helping us. The approach we're taking right now, that Jordan was alluding to, is we've implemented an RPC driver that allows us to shard the Cinder operations that are going to go to cinder-volume onto different topics. And then we fixed a bug in cinder-volume where, if you configure it to listen on a different topic, it actually listens on that topic. And then we can shard certain operations that we know are slower off to a separate pool of cinder-volume workers. Deletes, for example, are pretty slow in Ceph and RBD compared to, you know, something like LVM, right? So what we do is we send those operations off to a pool that can handle the slow operations, and then really fast operations like creates don't get stuck in a queue behind a slow operation, artificially making them look slow. So that's why we're doing that and what we're working on. We will upstream that once Ryan finishes. This guy right here, you can go talk to him. And I've also volunteered Mandell and Lyssa to work with Ryan to get it into shape so it's able to be upstreamed.
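To make the sharding idea concrete, here is a minimal, purely illustrative sketch of the routing decision. The topic names and the list of slow operations are assumptions, and this is not the patch that's being upstreamed:

    # Hypothetical sketch of topic sharding: slow operations (deletes) go to a
    # dedicated queue so a pool of workers can grind through them without
    # holding up fast operations (creates) on the main topic.
    SLOW_OPERATIONS = {"delete_volume", "delete_snapshot"}

    def pick_topic(operation, base_topic="cinder-volume"):
        """Return the messaging topic an operation should be cast to."""
        if operation in SLOW_OPERATIONS:
            return base_topic + ".slow"
        return base_topic

    # Example: creates stay on the fast queue, deletes get sharded off.
    assert pick_topic("create_volume") == "cinder-volume"
    assert pick_topic("delete_volume") == "cinder-volume.slow"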
Lyssa's in the front row, so yeah, grab her. Go ahead.

Yeah, so this is a question on the converged infrastructure. With the converged infrastructure, did you end up separating the compute traffic from the storage traffic?

We do. We do it kind of interestingly. In the configuration that we use to deploy the cloud, which is sort of a meta-configuration, we specify a bunch of different VLAN networks, primarily for things like dealing with management traffic, because we have a distributed control plane, and the storage traffic is on a separate network from the compute traffic coming from the virtual machines. Those are specified, and in the Neutron deployments they're actually on completely separate, user-defined networks. So we have a host network that handles the Paxos data for Ceph, and at the same time some of the other, less chatty Paxos implementations that we use. And actually, this is something that maybe Jordan could chime in on as well. We also have multiple networks set up to ensure that the tenant traffic is separated from the management traffic for Ceph. So when Ceph is peering and shuffling data around, it's not saturating the networks that customers are using for their actual operations. So, I mean, could you expand on that a little?

Yeah, I mean, there's not much more to it than that. We basically have two networks, both on 10 gig. We have a 10-gig Ceph front-end network where all of the tenant traffic happens, and then we have a Ceph back-end network, all 10 gig as well, where all the migrations and things like that happen. And then we also have another management network that's just gig, for actually talking to Cinder and so on.

You want to interject, Mandell? You haven't talked in a while; you've been drinking. Yeah, well, drinking's more fun. Anyway, the architecture... right, Cinder? Just checking. We use a single, master-elected scheduler and volume service at the moment. The API is spread out across the whole cloud and load balanced. We use multiple Ceph pools, as I mentioned, for different storage. Closer. There we go. Sorry about that. So, multiple Ceph pools backed onto different types of disks for our multiple volume types. Everything else is an operation of the install.

So you guys are storing your images inside of Glance as well, right? Yes, we are. Yeah, and the benefit to that, if you aren't familiar, is that Ceph supports copy-on-write. The great thing about that is, instead of having to suck an image down out of Glance to a hypervisor when you're going to boot, it does basically a super-lightweight copy on the underlying Ceph storage, and you get to boot essentially instantly from that image, and it only stores the delta. So you get some operational efficiencies there. And we've actually had really good luck with that. In terms of our war stories, there really hasn't been much of a war there. That's allowed us to get boot times in the 40-second to 70-second range right now. It spikes higher than that sometimes, which is why we're working on the Cinder sharding; that's really the big reason it can spike. Our goal is to get to around 30 seconds, which we think would be pretty spectacular. And a lot of that is to do with Ceph and copy-on-write.
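For anyone curious what that copy-on-write boot path looks like underneath, here is a hedged sketch using the Python rados/rbd bindings. The pool and image names are made up, the base image needs to be a format-2 RBD image, and in a real deployment the Glance and Cinder RBD drivers do this for you rather than you doing it by hand:

    import rados
    import rbd

    # Connect to the cluster and open the pools Glance and Cinder might use
    # (the pool names here are assumptions for illustration).
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    images_ioctx = cluster.open_ioctx("images")
    volumes_ioctx = cluster.open_ioctx("volumes")

    try:
        # A clone source has to be a protected snapshot of the base image.
        base = rbd.Image(images_ioctx, "ubuntu-14.04")
        try:
            base.create_snap("boot")
            base.protect_snap("boot")
        finally:
            base.close()

        # The clone is a copy-on-write child: nothing is copied up front,
        # and only the blocks the VM changes get stored as a delta.
        rbd.RBD().clone(images_ioctx, "ubuntu-14.04", "boot",
                        volumes_ioctx, "instance-0001-disk",
                        features=rbd.RBD_FEATURE_LAYERING)
    finally:
        volumes_ioctx.close()
        images_ioctx.close()
        cluster.shutdown()

Note the clone can land in a different pool than the parent, which is also why copy-on-write still works across pools within one cluster.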
So I would say that, in the community as a whole, the people at Piston and the people at DreamHost are the ones that have deployed Ceph the longest and probably have the greatest amount of expertise with building for scalability and dealing with scaling out their clouds. Could you guys describe how you're doing the scalability and how you're doing that orchestration, and then I'll have Mandell do the same?

So, you want to maybe... we're using Chef a lot to do all of our configuration management of our Ceph cluster. We have had some interesting fat-finger moments there, and we've had some fun upgrades and such. We'll get to that a little bit later when we get into the horror stories. But Jordan, do you want to expand on that a little bit? Yeah, I mean, we've been using the Inktank Ceph cookbooks for a while. We've had to make a few customizations to spread our OSDs across racks and different power zones, because we've had a few power issues that have actually taken out the whole cluster. So we had to make sure that we can lose certain parts of racks, by setting up different roles for each of our Ceph OSD clusters and making sure that we can lose those.

Before I have Mandell answer, could you describe some common scaling problems you guys have run into? I mean, we could talk about the configuration issue; that's one of our major ones. Well, let's hold off on that one until we get to our section on the Cephpocalypse. We'll get there.

So scaling for us basically means that we have to add new servers, because we've got converged servers. So our Moxie scale-out works great. We've got disk profiles which define what each disk looks like; we profile disks when they come in. We look at them, see the size, see the speed, et cetera, and based on that we make a decision: is this going to be used in model A, model B, model C? It gets divided up, and it just gets added into the Ceph pools. Our CRUSH map was... fun. That was figuring out how to make sure that we ended up with data actually going to multiple... Yes, that's important, making sure data goes to the right places. Yeah, it's really easy when you set up a single CRUSH map where you've got one root and everything goes there; the default works great. But as soon as you've got two different classes of storage, you have to build your CRUSH map individually.

Yeah, and when you have hundreds of storage nodes with a thousand disks, the CRUSH map gets large, bigger than can necessarily fit in your head. So you have to be really careful and make sure you do a lot of auditing. One of our cloud engineers, Jeremy Hanmer, has written some Nagios plugins which are really helpful for auditing your CRUSH maps, to ensure that you don't screw it up, basically, because we're humans and every human makes mistakes. We have errors, so we monitor ourselves, especially the VPs. Pretty sure all of those that we've created are actually up on our GitHub, too. Yeah, they're up on GitHub. And if you need to know more, talk to me. Happy to share.
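To give a flavor of that kind of auditing (this is a toy sketch, not DreamHost's actual plugin), one cheap check is to shell out to the real ceph osd tree command and flag any OSD that is in the tree but carries zero CRUSH weight, which means no data will ever be placed on it:

    import json
    import subprocess

    def zero_weight_osds():
        """Return OSDs that appear in the CRUSH tree with zero weight."""
        out = subprocess.check_output(["ceph", "osd", "tree", "--format", "json"])
        tree = json.loads(out)
        return [node["name"]
                for node in tree["nodes"]
                if node.get("type") == "osd" and node.get("crush_weight", 0) == 0]

    bad = zero_weight_osds()
    if bad:
        # Nagios-style critical exit: data only goes where weights are nonzero.
        print("CRITICAL: OSDs with zero CRUSH weight: %s" % ", ".join(bad))
        raise SystemExit(2)
    print("OK: all OSDs have nonzero CRUSH weight")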
So, are you guys using the current interface to the Ceph CRUSH map? Or are you doing it by hand? Oh, the Calamari stuff? Yeah. Not... okay, so the funny thing is, the guy who wrote Calamari has it running on his laptop, looking at our Ceph cluster, but we're not using it. But we plan on it, especially now that Calamari is going to be open source. Thank you, Red Hat. That's fantastic.

We're excited about that. Yes, thank you very much. I would clap, but I don't have a hand free, because I would drop it. That's the sound of one hand clapping. Because there was no interface to the Ceph CRUSH map API when we first started, we actually have a bunch of really ugly, ugly orchestration around that, and we really hope that once Calamari is open sourced and available to the community, we'll be able to get rid of all of our ugly manual... it's automated, but it's still manual, like file editing and munging... and use the supported API.

Shall we talk about the Cephpocalypse? Yes, let's. All right, so I'm going to start with the introduction to set the stage for the Cephpocalypse, because we've both had one. Ours didn't end quite so happily as DreamHost's did, and it started off even sadder. My grandfather passed away in 2012, and I had to go back home for the funeral. When I landed, my great-grandmother was in a nursing home, and she was also going, and she passed that night. So I'm in Maine, attending two funerals in a week, and I get a call: oh my God, you won't believe what happened. And I go, what do you mean, I won't believe what happened? And they go, well, you know how Btrfs says you can never lose data? You can. It turns out that Btrfs is the world's first production-ready, write-only file system. What happened was we got hit by a bug where the Btrfs file system for the Ceph OSD got corrupted, and there was a horrific Ceph bug where there was an off-by-one error for all the replicas. So replica one was written perfectly, replica two was here, but Ceph thought it was over there, and so the data disappeared.

You want to go into more detail? Probably not. Not too much, but anyway, the painful part with Btrfs is there's no fsck, because you don't need it, it's a journaling file system, that's what they tell us. It turns out that you can't automate that recovery when you lose power to a disk and it hard-crashes. And the interesting thing about that is we never actually have access to the clusters, because they belong to customers, and because we have a converged and automated installation mechanism. In general, the end user, the operator, shouldn't be getting onto the host at all, because then you can't trust the automation. And we have customers in places like Cambodia, where the guy on the ground doesn't know what's running on the hardware, and the guy who has access to the operating system and the cluster as a whole, as an admin, is five, ten, twelve thousand miles away and has no idea what's going on. This happened in our dev lab, so it was a little bit less terrifying, but getting that call when you're away was kind of terrifying.

Subsequently, Ceph has changed its recommendation from Btrfs to XFS. The performance is slower, but it is safer. And it's actually gotten faster and faster. When we were doing intensive measurement of Btrfs versus XFS, what we also found was that, in the beginning, Btrfs was so fast it was amazing, and then over time, as more and more operations happened, it would get slower and slower and slower, until it was so mind-numbingly slow that it was not worth using. But the real issue is just that Btrfs isn't quite there. It's an amazing thing, and we are excited for what Btrfs can offer us. It's just: use XFS for now. Yeah, it's probably better for now.
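If you're setting up OSD data disks by hand and want to follow that advice, here is a minimal sketch of the XFS side of it, assuming a placeholder partition and OSD path; in practice your deployment tooling (ceph-disk, the Chef cookbooks, and so on) does this for you:

    import os
    import subprocess

    DEV = "/dev/sdb1"                       # placeholder OSD data partition
    OSD_DIR = "/var/lib/ceph/osd/ceph-12"   # placeholder OSD data directory

    # Format the OSD data partition with XFS instead of Btrfs...
    subprocess.check_call(["mkfs.xfs", "-f", DEV])

    # ...and mount it with options commonly used for Ceph OSD data disks.
    os.makedirs(OSD_DIR, exist_ok=True)
    subprocess.check_call(["mount", "-o", "noatime,inode64", DEV, OSD_DIR])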
I would like to point out that even if the Ceph off-by-one bug hadn't existed, we lost three drives on three different servers, and they were all unrecoverable, so we would have lost data anyway.

So in our case, we actually didn't lose data. The Cephpocalypse happened to us in our DreamCompute cluster, oh gosh, three or four months ago now. We had lots of instances running. We don't have any ephemeral storage at all; everybody runs on the Ceph cluster. And we were making a change to the Ceph cluster. When you do that, when you add or remove OSDs, you effectively change the topology of your Ceph cluster, and it will attempt to rebalance the data to ensure that you've got a good, even distribution, which is brilliant and wonderful. The problem was we had a bug where, and I'm reading this from my other engineer so I can make sure I get it right, our CRUSH map had all the OSDs weighted to zero, which you shouldn't do, and all the machines and racks at one. And the result of this was that all of the data in our entire, thousands-of-disks cluster was stored on the first disk of each rack. So it was on something like eight disks. And then when we changed the topology of the cluster and fixed the CRUSH map, it attempted to rebalance all of the data out of those eight disks, all at once, to hundreds and hundreds and hundreds of other disks. This made everything very angry.

And the funny thing is, we were actually able to make it work. We just had to slow things down, and Ceph offers you all sorts of tuning mechanisms to say, okay, don't freak out and move all this data right now; this is all coming out of one or two disks, slow it down, relax. And the really amazing thing for me is, I think it took something like two weeks to rebalance the data, but once we had it all rebalanced, all of the instances I had running, with applications running, just came back up again at the same point. My applications were all there. It was shockingly awesome.

So yeah, this is why we created those Nagios plugins that I recommend you go download from our GitHub. If your CRUSH map is managed properly and monitored properly, and Calamari makes this a lot easier as well, which is why we're excited about it, you won't have these kinds of problems. But we wanted to share what not to do. Just make sure you're careful with the CRUSH map, because computers do what you tell them to, even when you tell them to do stupid things.

I kind of feel like we're the old, old guys telling all these young kids that back in our day, we had to walk up and down hills to get to school. My Ceph cluster ran on Btrfs. Darn kids, get off my lawn.

We had other questions. We have 15 minutes left. So if you guys have questions: beer, go.

So, about XFS: are either of your companies brave enough to run it with barriers turned off? Is that bad news? Isn't there a nobarrier mount option for XFS? I have no idea, do you know? I have no idea if we're doing that. That's a good question. We're not. Because of that week, the week from hell, we have an instinctive reaction against losing customer data. That's been my gut reaction as well. I like to sleep. Awesome. We tend to choose the safe option for data safety.
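Going back to the rebalance story for a second: those "slow down, relax" knobs are just OSD options you can inject into running OSDs. Here is a hedged sketch, with illustrative values rather than a tuning recommendation:

    import subprocess

    # Illustrative values only: throttle backfill and recovery so a big
    # rebalance doesn't starve client I/O. Tune these for your own cluster.
    throttle_args = ("--osd-max-backfills 1 "
                     "--osd-recovery-max-active 1 "
                     "--osd-recovery-op-priority 1")

    # Inject the settings into every running OSD without restarting them.
    subprocess.check_call(["ceph", "tell", "osd.*", "injectargs", throttle_args])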
Do you mind passing off your beer to the guy behind you? He doesn't seem to have one. Thank you. It's empty. Good. Okay, my question is: our customers need disaster recovery, so that means we need to have two installations of Ceph. So when you transfer the image, or the volume, between different Ceph installations, you're going to lose a couple of things. For example, the thin provisioning, right? And you probably also lose the benefit of copy-on-write. So how do you deal with that problem?

Copy-on-write actually works between different volume services, between different pools. So if you're using two pools in the same Ceph cluster, copy-on-write works between both of them. But if it's disaster recovery, one data center to another... If it's a totally different Ceph cluster, then yeah, I think you lose copy-on-write. Well, you lose copy-on-write unless you have the image in both places. And you lose the thin provisioning as well. Right. So you can do it, and actually Ceph has grown some really... we aren't using this right now, so it's hard for me to speak to it too much, but you can talk to some of the Inktank guys, they'll tell you all about it. I know they have some capabilities now for those specific DR use cases, where it'll do replication and so on. I can't speak much for them now. They were talking about that at the Ceph community session on Tuesday morning. Okay. The remote replication stuff is currently only for the object store. So it'll work for, sorry, Glance images, but it won't work so much for volumes that are backed by Cinder. Right. Currently. That's expected to be fixed. But we have... do you have another question? No, thank you. Okay.

Does the guy behind you have a beer? Ah, that... all right. Welcome. Have you had any problems with physical hardware failures, disks, servers, that kind of thing? Sorry, I didn't hear the question. Do we have problems with physical hardware failures, disks going out, et cetera? Oh yes. Disks are not reliable, it turns out. They fail. The great thing is that Ceph deals with it very well. It's designed for failure. That's what it's designed to do. So yes, we have that problem. Disks fail, servers fail, racks fail. NTP fails... don't ever let your NTP servers go away. Oh God, please. The last time that happened, it resulted in a catastrophic temporal cascade failure, and the entire cluster disappeared into the same temporal rift that Tasha Yar came out of in season three of Star Trek, on the Enterprise-C. Just don't do it. Really, that's interesting. It did come back eventually, to be fair. So did Tasha Yar, in seasons four and five. Yes.

Is this thing on? Yeah, it's on. Apologies for not holding a beer in my hand. I still have too much beer in my system from the party yesterday. I hope you accept that as an apology. We'll accept it, yes. You are a beer, that's great. I can't have any more beer for the next week. Anyway, I wanted to follow up on the discussion of DR in a remote data center and replication between different Ceph clusters. What you currently have is only the RADOS Gateway, which means you can't even use that stuff for Glance; all you can use it for is the Swift API. But in the next Ceph release, Giant, which is planned to come out this summer, there will be the same kind of replication for RBD, which will mean Glance, Cinder, Nova ephemeral volumes, the whole thing. We're looking very much forward to that. That will be fantastic. So what he said is there's RBD replication coming for the Giant release.

Are you working on Ceph personally? I work on Ceph support in Mirantis OpenStack. Are you... damn, I forgot what I was going to ask.
Are you working on, or do you know off the top of your head, whether erasure coding, or erasure-coded RBDs, will have the same sort of disaster recovery? I don't know personally, so this was curiosity on my part. I don't think this was brought up in the Ceph community session yesterday, but I don't see any reason why not. Well, you do know that you need to have a cache pool in front of your erasure-coded pool. Right. We actually run that on top of SSDs within the cluster. Yeah. And if anybody does want to ask any questions about the object storage stuff, we both know about that as well, so we're happy to talk about the RADOS Gateway too. Thanks. Thank you.

Are there any other questions, or should we start looking at our list again? We have 10 minutes left. The guys in the back are giving us evil looks to make sure that we stay on time. Apparently the last guy didn't. Two questions. Two questions. Him first. I do have a beer.

My question is about bit rot and silent data corruption. Since you guys had to switch away from Btrfs to XFS, I'm wondering, do you care about it? Or is it just one of those... Well, you know, we've got copies all over the place, so not really. I mean... Well, but it's an inconsistency, that's the potential problem. So, I was at the Library of Congress last year in September talking to their data archivists, and they very much care about this. The OpenStack community on the Swift side and the Ceph community don't actually care about this right now, and we should. There is some middleware we should be building as a community. I built some of it for Swift, middleware to continually check the hash of the data, but neither Swift nor Ceph does that currently. And it's something we really ought to be doing. Especially if you're going to be running Cinder, if you're running all your block storage through it. Right, I agree.

Is anyone brave enough to run ZFS FUSE? Not yet. Nope. If you look up here, we're cowards. Terrified, terrified cowards. We look out into the... Hashtag OpsLife, that's what we all are. Yeah.

Does either party, or anyone else in the room, have any experience offloading journals onto low-latency media for performance? Putting journals on SSDs is a really good idea. Can you quantify "really good"? I'll let Mandell talk about it; we've experienced the whole gamut of journal placement. Yeah, so we originally put journals on the disk. Turns out that's not optimal, especially on spinning disks. Even putting it on a separate partition on a spinning disk just gets really, really painful, especially when it's on the same disk, it turns out, because you're seeking constantly. We put it on SSDs now. We recommend our customers do that. We don't force them to, but we highly recommend it, and usually they can see the difference if they try it.

Actually, with journaling on SSDs, we've had some really interesting benchmarking cases where, if you build a volume on our slow pool, which is running on rotational media like 7200 RPM SATA drives, the I/O benchmarks are small enough that they fit in the journal, so you can't tell the difference between a fast-pool volume and a slow-pool volume, because all of your benchmarking is actually happening on the journal itself. For the writes, anyway. For the writes, anyway.

How big are your journals, on average? How big are the journals? I can't remember offhand, but I think they're four gigabytes on the SSDs. I'm not 100% sure. And for our customers, that's configurable. They can change it if they want; we don't make any recommendations one way or the other.
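A quick way to see where your journals actually ended up, assuming the conventional /var/lib/ceph/osd/ceph-<id> layout where the journal is either a plain file or a symlink to a partition, is a small check like this sketch (not a supported tool):

    import glob
    import os

    def journal_locations():
        """Map each local OSD data dir to wherever its journal really lives."""
        result = {}
        for osd_dir in glob.glob("/var/lib/ceph/osd/ceph-*"):
            journal = os.path.join(osd_dir, "journal")
            if os.path.islink(journal):
                # e.g. a partition on an SSD, or on the data disk itself
                result[osd_dir] = os.path.realpath(journal)
            else:
                # A plain file means the journal is co-located on the data disk.
                result[osd_dir] = "co-located file on data disk"
        return result

    for osd_dir, target in journal_locations().items():
        print(osd_dir, "->", target)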
What about you guys? What's your journal size? Or do you just consume an entire SSD? I have no idea. Yeah, I'm not sure. I think Jeremy would probably have the answer to that. So my recommendation, if you want the answer to that question, is to go to the DreamHost booth and camp out until Jeremy Hanmer comes out. He's mostly dead right now, but I understand he will be mostly living sometime this afternoon. Cool. Thank you.

Do you have a beer, or are you a beer? I actually have a breathalyzer hooked up to sudo, so I'm not allowed to drink and root at this time. Oh, fantastic. Cool. My question is, and I came in late, so my apologies if you covered this, but are you using RAID underneath as the backing for Ceph, or... Absolutely not. Okay, so you just have individual disks managed by Ceph itself? Right. Ceph operates against OSDs, which really should map to one OSD per disk. Doing RAID underneath, you actually lose all of the performance benefits: Ceph, or Swift, object storage in general, is designed to maximize throughput of files through the system, and doing that by putting multiple replicas onto individual drives is better than trying to stripe them across multiple ones. Okay. Yeah, we actually do go through a RAID controller, but we're not doing any RAID on it at all, so we run an OSD per drive. Are you taking advantage of the caching that the RAID controller provides? No, no, we don't use any of that. Okay.

Actually, we hate RAID controllers. We can use them, but they're such a pain in the ass. From the perspective of the end user who's actually deploying it, it's wasted money on the part of the vendors that are selling them. Yes, I agree completely. This is probably one of our biggest performance gripes, just around the hardware. As the software gets smarter, I want the hardware to get dumber. I want it to be the most dense, stupid, high-speed thing possible, focused 100% on I/O. Give me as many disks as I can wedge into a rack as possible. That's what I want, right? But there's all this other stuff that they do, and for this particular use case... the good thing is, I think Inktank has done a good job now of working with the hardware vendors to make sure they know what kind of device makes sense for this use case, and now that they're with Red Hat, I think that'll accelerate a lot. And we've had some success dealing with those guys. It turns out the value-add that the name-brand vendors think they're providing is... there's value there, but it's actually negative value.

You had a question? Yes, you're talking about some interesting things, for example putting the journal on the SSD, but isn't that frequent writing going to kill the SSD pretty fast? Could you repeat the question, please? What I mean is that you put the journal on the SSD, right, to make it perform better, but isn't that frequent writing going to kill the SSD pretty fast, because you have a limited write lifecycle? It can be a limited lifecycle, but, and I don't know how DreamHost sees this, the Cinder volumes have frequent writes anyway, because they're really network file systems, block devices that you're going to be writing to. So if you're doing anything on them, you're continually writing to those disks. And they're 64K blocks? 64-meg blocks?
I don't know. Man, you should know. I should know, but they're... it's small. Yeah, they're really small blocks. Can you go to the... yeah, do you want to answer? Yeah, "you're going to burn through SSDs really quickly" is the question. My apologies. Okay, I think the answer to that question is: it's worth it. That kind of depends on the size of your Ceph cluster too, but as with any SSD usage, you need to monitor and make sure that you're replacing them before they fail. Yeah, get good SSDs, meaning Intel, and, you know, yeah. We've had issues with cheap SSDs; they fail way earlier than you would expect. Friends don't let friends use anything but Intel SSDs. Never mind. Have I had too much beer? Maybe.

So we have three minutes left. Any other questions? This worked really, really well. Thank you guys for drinking with us. Yeah, I appreciate it. Okay, go ahead.

I believe the Ceph default number of copies of each object is two. Do you guys run with a higher number? Three is the magic number. Monitors, we run... oh yes, we run with three copies. I think three actually is the default now, but I could be wrong. It didn't used to be. Okay, recently, okay. But we've always used three on our object storage and our block storage, always have. And for monitors, five. Five is the magic number. Thou shalt not count to four, thou shalt not count to six. Unless you're using three. Unless you're using three. Or seven. But don't use three. Yeah. Go to five as quickly as you can. And we go to seven at 120-ish nodes, but that's just our thing.

You had a question? Do you have a beer? Are you a beer? I am a beer. Okay. How many OSDs do you run per journal? We have one journal for one OSD. Yeah. Anyone have an abacus? So I heard the ratio is normally five OSDs to one SSD. Oh, we... I mean, like I say, our customers have the ability to do more, but we strongly recommend no more than four. No more than four. Okay. Great, thanks.

Upgrades. So, upgradability. One of the things that we were going to cover was upgradability, and nobody asked about it. So yeah. Well, we have one minute, so I'll discuss it very briefly. We know a lot about upgrades because we were the first production deployment of Ceph at scale. We have a multi-petabyte object storage service called DreamObjects, and it's essentially just a deployment of Ceph and the RADOS Gateway. And we were live in production with that on pre-Argonaut releases of Ceph. So we've upgraded through basically everything ever, and all of the QA that has happened on that upgrade process has been us. So you're welcome, everyone in the room. And the answer is, we're getting better and better at that, each upgrade is getting better and better, and Inktank is doing a better job on each release. The good news is, when you're starting from a newer release, the upgrades are so much smoother now. So the upgrades that we've done on DreamCompute's block storage cluster have been easier than the ones we've done on DreamObjects, which comes from the dawn of time, if that makes sense. There are still vestiges; every single time we upgrade, we see something new and different that we're the only people ever to experience. So you guys won't have any of these problems, as long as you don't install that extremely old version of Ceph. Start new, and upgrades are easy. Yeah.

Our only issue with upgrades is that we've got a converged architecture, and so the recommended upgrade path doesn't work really well for us. Ah.
In the Ceph upgrade path, you can upgrade the OSDs and the monitors live, as long as they're as far away from each other as possible. So if you're going to deploy and you don't have a requirement for turnkey converged infrastructure, follow the recommendation, because you have to restart all of your monitors before you restart your OSDs in order to do the upgrade safely.

And I think my biggest recommendation to close out, the one piece of advice I have for you if you're going to run Ceph, is: follow Inktank's advice. There's a reason they come up with the defaults, and they aren't always right, but for the most part they are now. They make mistakes like anyone does, but the recommendations are a really good starting point. Your use case may vary a little bit, and you can tweak from there, but the best place to start is the recommendations from Inktank. They fixed the default of eight PGs per pool, right? They did, I think so.

All right, we're out of time. If there are any other questions, you can come up and talk to us. The beer rule still exists until we get out of the room. But thank you guys, thank you very much. Thanks.