All right, guys, in the spirit of a proper Texan welcome: howdy, y'all. We were at the Women of OpenStack breakfast, and Jessica Marillo, a native Texan, told us that's the proper Texan salutation. So, I'm James Saint-Rossy. I'm a storage engineer for Comcast, responsible for the large-scale Ceph clusters that we run at Comcast. We use block and object out of Ceph for our OpenStack deployments, and we have quite a few of those. I'm John Benton. I'm a consulting systems engineer at WWT. I've been working with Comcast and these guys for a while, helping to tune Ceph, get things faster, and work on some new architectures and some fun stuff like that. So today we're going to be talking about designing for high-performance Ceph at scale.

Quickly, I'm just going to run through today's agenda. We're going to talk about our lab and production environments; holistic architecture, meaning looking at the whole picture and not just certain aspects of it; strategies for benchmarking; performance bottlenecks; lessons learned; and tuning tips and tricks.

To start with our environment, let's talk about our typical node configuration. We have pretty large storage nodes: 72 drives, 6 terabyte SATA, 7.2K — pretty commoditized drives, except that they're pretty large at 6 terabytes. Then we have three NVMe journals — PCIe, 1.6 terabytes each — and two Intel 2.7 GHz processors with 12 cores apiece. The cores are important in Ceph: you've got a lot of parallel operations, so having more cores definitely helps. 256 gig of RAM — if you follow the rule of thumb you'll see on Ceph's website or Red Hat's website, it would recommend more RAM, but we actually haven't had an issue with 256 versus 512, which is a lot. And then dual-port 40 gig NICs. We don't really need that much bandwidth, but it's just to play it safe. Then we have the mon and RADOS gateway nodes: two processors at 2 GHz, 32 gig of RAM, dual-port 10 gig NICs — nothing special. It's really more about isolation for the mons and RADOS gateways than it is about having a lot of horsepower to drive them.

All right. Now, this is a nice diagram of our lab. What we tried to do with the lab environment was make it as similar to the production environment as possible. One of the really important things we found was that as we started growing these things out, we started finding new problems that you just don't experience with one node or two nodes or three nodes or something really small like that. Things like network flow control: we started getting just tons and tons of packet drops, up to something like 10,000 a second — I think that was one of the worst I had seen on any given node. The mons actually started to become a bottleneck. I read on the Ceph site a long time ago that your mons are going to be a bottleneck, and I thought, yeah, they don't know what they're talking about, they've never been a bottleneck. They actually started to become a bottleneck. In very cool and interesting ways. Exactly. Also, probably one of the most important things we found as we started growing out these clusters is that the performance per node started to diminish as the cluster size grew. That was a really key finding, and we'll circle back around to it in a little bit. The clients are just sort of general-purpose boxes. We wanted those to just look like any ordinary client.
Nothing too huge, nothing too small, just so it would be an accurate representation of what you might find in the wild. And then the general-purpose nodes: we used those as tool boxes for benchmarking, and for all kinds of additional mons. We went up and down on the number of mons, and up and down on the number of RADOS gateways we were testing, that kind of stuff.

All right, so we did kind of a cool thing. We were lucky enough that we could take hardware we had for a production region and, before it got to us, actually put it into this lab. So we actually had a full-scale lab to test with. So, talking about our production environment, I can just say: there it is. Which is really, really nice, to have that scale. Some of the minor differences: in some of our regions we actually have two of these side by side — two 16-node storage clusters — and the reason for that is so we can directly correlate the benchmarks and tests we ran in our test environment to our production environment. Also, we have some clients with really big asks in terms of IOPS and storage, and this gave us a very trivial way of partitioning that so we could avoid the noisy-neighbor syndrome. Sadly, we learned that one through experience. I highly recommend you don't learn that one through experience. And then the RADOS gateways: we have six per cluster, but we don't necessarily split them up three and three. We might split them four and two, or something different. It might be six on one cluster if we really want high-performance object storage there and no object storage on the other. So we can slide that dial when we need to.

All right, so now let's talk about our holistic architecture. One thing we've done is try to look at as much as we can: software, hardware, the data center — really just try to get it all in there, because you need it. The more data you can collect, the more things you're looking at, the more problems you're going to avoid going forward. So first we have customer requirements. IOPS: ask your customers, how many IOPS do you need? What's your read/write mix? What's your object size? And then they're going to look at you and go, I have no idea — we haven't deployed yet, so what's it going to be? So then you need to coach them a bit. You ask the simpler questions, like: is it going to be more read than write, or more write than read? And slowly start them down that path. Object size: they'll say our objects can be from 4K to 4 meg, so start to say, well, how about giving me an average? Is there an average object size? Because the more of this information you know, the easier your testing will be, and the better you can architect the solution. If you know you have a very specific object size, or very specific IOPS requirements, you can engineer the solution for those instead of trying to engineer for everything — you're never going to get a solution that's optimized for absolutely everything, and you'll wind up with something that isn't optimized for anything. And then, how much replication? Ceph by default is triple replica, which is a good default, but compared to, say, two replicas, you do take a hit in terms of storage. Triple replica means that essentially only one third of your actual raw storage is available as usable.
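Just to put rough numbers on that, here's a quick back-of-the-envelope using the node counts and drive sizes from the environment described earlier — a sketch, not a sizing tool:

```python
# Back-of-the-envelope: what triple replication does to usable capacity.
# Node counts and drive sizes match the environment described above; swap in your own.
nodes = 16
drives_per_node = 72
drive_tb = 6
replicas = 3

raw_tb = nodes * drives_per_node * drive_tb    # 6912 TB raw
usable_tb = raw_tb / replicas                  # ~2304 TB at 3x replication

print(f"raw: {raw_tb} TB, usable at {replicas}x replication: {usable_tb:.0f} TB")
```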
It's actually less than that when you take into account the near-full ratios and that kind of stuff. Also, the more replication you have, the bigger the performance hit you take, because it's copying all those bits all over your cluster. In some cases, if you've got a customer that says, yeah, we can rebuild all that data from our other systems — like a distribution or CDN workload, where they've got sources and originals and can just rebuild it — maybe you can take them down to two replicas and give them more performance, and see if they want that kind of compromise.

And then one of the beauties of Ceph is its APIs. It's got object, it's got block, it's got CephFS, it's got iSCSI, it's got all these different things. Do your customers need all of those? Because if you don't need object storage, you can get rid of the RADOS gateways — you don't have to have those boxes anymore. If you don't need CephFS, you don't need the metadata servers, for example.

All right, so: keeping cost in mind. A lot of engineers will look at it from a purely technical standpoint, but a lot of the time cost is very much a component of your architecture. You might be able to get more hardware and do more of a scale-out if you look at the individual components and keep your scale-up constrained; then you can scale out more and get more performance that way.

Failure domains. I didn't fix this slide — it says "servers rack, servers row"; it's supposed to be servers, racks, and rows. There's the failure domain for our slide presentation right there. So, do you want to fail on a per-server, per-rack, or per-row basis? There are costs involved, in terms of having to step up your replication and so forth, so you have to take the failure domains into account.

Data center constraints. Real quickly: talk to your data center architect. Know what he's capable of. What is your power budget? What's your thermal budget? Can you pull the unit out of the rack to access the drives from the top? Do you have a fiber network cable that'll snap when you pull the chassis out on its cable management? These are all things to keep in mind, and it's part of the holistic architecture.

Operational complexity. We have a case with one chassis where, because of the density of the system, the sled had two drives on it, so to remove one drive you had to disconnect two. Well, that adds tremendously to your operational complexity: now you've got this huge procedure because you actually have to take two OSDs offline even though you only want to change out one.

And to wrap up holistic architecture — no Ceph talk is complete without mentioning the journals at least once — are you going to go with co-located journals, or SSDs versus NVMes? There are cases for going either way, and we could spend the rest of this presentation talking about the differences there.

All right, so, strategies for benchmarking. When we started benchmarking, we decided to use fio for block and COSBench for object. Reason being, they're good tools and they're publicly available, so you inherit a lot: I can run a job in one lab, another guy can get the exact same config file and hopefully reproduce the exact same result. That was pretty handy when we were sharing with vendors like Red Hat.
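Just to make the "same config file" idea concrete, here's a rough sketch of the kind of block job you might pass around. This is not our actual production job; the device path, block size, mix, and durations are placeholders you'd replace with your customer's numbers, and it assumes fio is installed and /dev/rbd0 is a mapped RBD image you're allowed to write to:

```python
# Minimal sketch: a shareable fio job definition driven from Python so the exact
# same parameters can be handed between labs and vendors.
import subprocess

job = [
    "fio",
    "--name=rbd-70-30",
    "--filename=/dev/rbd0",   # placeholder: a mapped RBD device you can write to
    "--ioengine=libaio",
    "--direct=1",
    "--rw=randrw",
    "--rwmixread=70",         # 70% reads / 30% writes -- ask your customer!
    "--bs=4k",                # average IO size from the requirements discussion
    "--iodepth=32",
    "--numjobs=4",
    "--runtime=300",
    "--time_based",
    "--group_reporting",
]
subprocess.run(job, check=True)
```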
Or Red Hat is sharing it back with Comcast and there's all the back and forth — "we're not seeing that," "well, we are" — and because it can be easily reproduced, it helps you narrow down problems.

One of the other things that's really critical for benchmarking is to analyze your production load, or get very, very good projections. When you're developing synthetic benchmarks, you want to make them look as much like production as you possibly can, and on a storage system like Ceph it can be a little unpredictable: you might be sending it a 4 meg IO or whatever, but what's actually being seen at the disks is completely different — it might be 740K or something weird like that. So you want to be aware of that, and when you design your synthetic benchmarks, make sure they take that sort of behavior into account.

Another thing is that you want to do end-to-end tests and also test all the individual components. By testing the individual components, you get a good picture of whether all of your systems are balanced for your goals, and you find these weird things. For example, when we were testing 40 gig NICs, we even saw one time where we were only getting 12 or 13 gigabit — why on earth is this happening? We were able to narrow that down really quickly, whereas if we had just been running Ceph, we might never have seen it. And then, conversely, when you do an end-to-end test, you can see how all of these systems work together. We just straightened out the cables — the ones and zeros get stuck going around the corners if you have too much of a bend in the Ethernet cable. We'll get to what we actually did to solve that later.

Yeah, so the other thing that's pretty critical is to understand the various APIs your customers will be using — whether it's the RADOS gateway, librados, RGW, or RBD. The reason is that each one of those is nuanced in how it actually shows up at the storage layer. So never take for granted that, oh, my Ceph cluster tested with fio over RBD is going to behave the exact same way RGW does, because they may be completely different.

All right, so I just want to say: IOPS isn't everything. And I have a certain customer that's probably vehemently disagreeing with me right now — Chris Powers, I know you're out there somewhere. This is a pitfall we've fallen into a couple of times, and something I always have to watch out for when I'm doing benchmarks. You'll have a certain number of IOPS and you think, man, if I increase the workers, I keep getting more and more performance. And then all of a sudden you're running 1,000 workers, you've got 30% more performance, and it's great — and then you look at your latency and it takes five seconds to get one object through. That's bad. So always keep latency in mind when you're doing these tests. Yes, you can push a doorbell with a rocket launcher, but it may not be the best strategy. So you have to back off the workers; that's something I always have to keep in mind, and watch the latency.

And verify published stats with benchmarks. This is very important. I don't think anybody's fibbing on their stats or anything like that, but they have different situations or conditions they're working under, and they're definitely probably not using your exact workload.
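To make the "back off the workers and watch the latency" point concrete, here's a rough sketch of the kind of sweep we mean. It isn't our benchmark harness, just the shape of one: do_one_op() is a stand-in for whatever your real client call would be (an S3 PUT through the RADOS gateway, an RBD write, and so on):

```python
# Sketch: sweep the worker count and record throughput AND latency, so you can
# see where adding workers stops buying you ops/s and starts buying you latency.
import time
from concurrent.futures import ThreadPoolExecutor

def do_one_op():
    time.sleep(0.005)          # placeholder for a real request to the cluster

def run_trial(workers, ops=2000):
    latencies = []
    def timed_op(_):
        t0 = time.perf_counter()
        do_one_op()
        latencies.append(time.perf_counter() - t0)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(timed_op, range(ops)))
    elapsed = time.perf_counter() - start
    latencies.sort()
    p99 = latencies[int(len(latencies) * 0.99) - 1]
    print(f"{workers:5d} workers: {ops / elapsed:8.0f} ops/s, p99 {p99 * 1000:.1f} ms")

for workers in (1, 8, 32, 128, 512, 1000):
    run_trial(workers)
```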
Always, always verify those stats. Never take them as law. Also verify the scale-out if you can. Obviously, if you're deploying hundreds of nodes, it may be very hard to verify that, but really, really push. It is not the case that if you get a certain number of IOPS out of a five-node system, you'll get double that out of a ten-node system. So you really have to verify that as much as you can, to avoid surprises in the future.

One other thing when you're doing benchmarks: introduce some randomization. One of the things that really surprised me is that while we were doing some of the synthetic benchmarks, we found that slightly off block sizes — for example 250K objects versus 256K objects — caused some really weird performance results in some of the components. Really weird things, but they translated one to one to the actual performance in the live cluster. So you always want to make sure you're doing that. And along the same lines, make sure you're always running different scenarios. You don't ever want to become myopically focused on a single workload, because if you only test 100% writes, 100% reads, and maybe one thing in between, you might miss some really weird nuance at, say, a 90/10 or 80/20 mix. That's a lesson learned from James: go through every last one of those and don't take anything for granted. Right — a lot of time was spent there.

So, TCMalloc. You guys might have heard of this by now. This goes back to what I hinted at earlier: when we were testing these things in the lab, as the cluster size increased, the performance per node actually started dropping drastically. This is the kind of thing that happens as you start scaling out these clusters that you might never see if you're only testing on, say, three nodes while your production clusters are going to be 12 or 400 or whatever. So we started looking into this, and what we noticed is that as the cluster size increased, the percent sys CPU utilization was increasing disproportionately to the actual per-node performance of these boxes. And that's why scaling out your test can be so important: if we hadn't scaled out, we would never have seen this until we went into production. Exactly — and then been wondering why things were running so slowly. So system profiling with the utility perf top, if you know that one, revealed a bunch of free and alloc functions from a library called tcmalloc that were consuming almost all of the resources. It turned out they were contending on a lock pretty much all the time — waiting for things to be freed so they could allocate, then waiting for things to be freed so they could allocate again — and that's what was causing that weird behavior. The graph here is the per-node performance before tuning. It turns out TCMalloc actually lets you tune its cache size with an environment variable. It's one of the coolest benchmarking things I've ever done, where you can literally change one line in one file and get up to a 50% performance increase out of your whole system.
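To show what "one line in one file" means here: the variable is TCMalloc's TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES (the actual numbers are coming up in a second), and it has to end up in the environment of the ceph-osd processes themselves. Where you set it — an /etc/default/ceph or /etc/sysconfig/ceph style file sourced by the init scripts — depends on your distro and Ceph packaging, so a quick, hedged way to check that your change actually took is to read /proc/<pid>/environ for the running OSDs:

```python
# Sketch: verify the TCMalloc thread-cache variable made it into the environment
# of the running ceph-osd processes. Run as root (reading another process's
# environ requires privileges).
import os

VAR = "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/comm") as f:
            if f.read().strip() != "ceph-osd":
                continue
        with open(f"/proc/{pid}/environ", "rb") as f:
            env = dict(
                item.split(b"=", 1)
                for item in f.read().split(b"\x00")
                if b"=" in item
            )
        value = env.get(VAR.encode(), b"<not set>").decode()
        print(f"pid {pid}: {VAR}={value}")
    except (FileNotFoundError, PermissionError):
        continue
```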
If only it was always that easy. We just needed to find five or ten more of those and we'd be done. Exactly — we could wrap up now and go home. And our 40 gig NICs would actually be running at 80 gig; that would be wonderful.

So what it is, is a parameter called TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES. It's set to a very conservative value, I think 16 megabytes, and you can dial that up to, I think, 128 megabytes. What we saw is that the whole performance deficit you see on the 19-node cluster — where it's roughly 50% slower per node than a six-node cluster — becomes a lot more even. Not complete parity, but you're not suffering nearly as much of a performance hit. And doing that really has no downside. Plan B is using jemalloc. We did not actually test that, but it seems like that's really the fix going forward; we chose not to test it because to use jemalloc you would have to recompile Ceph, and we felt that was too far off the reservation to be supportable in production. So we went with the environment variable fix. One other note, in case you all go home and do this if you haven't already: certain versions of the library — specifically the original Ubuntu 14.04 release, 14.04.0 — had a bug where it ignored that environment variable. There's a little C++ program out there that you can just screen-scrape, compile, and run to verify that it's actually honoring the variable.

So before we move on to the next section, I wanted to take a moment to discuss modern PC architecture for multi-socket systems. There's a thing called QuickPath Interconnect, QPI. What it does is enable bi-directional communication, so that a process located on one CPU socket can access resources — memory, PCIe devices, that kind of thing — located on another CPU socket. It's the bus over which all of that happens, and it's point-to-point, so even if you have a quad socket it's still only going in one direction — let's see, where's the line here. This is NUMA; I'm sure everybody's heard of that. The theoretical bandwidth on modern systems is 25 gigabytes per second in aggregate, which really means about 12 and a half in either direction — Intel wanted to boost those numbers a little bit — so keep that in mind. And you're thinking, well, I'd never actually use 25 gigabytes a second, so why does that matter to me? It's way higher than we'll ever see. One other note on that: cores located on the same CPU socket will not go over QPI; this is literally just for socket-to-socket communication. But when you're using high-density nodes like we are, you do see QPI become a bottleneck.

So, quickly, this is a crude, overly simplified workflow of what actually happens inside each of these Ceph nodes. Data comes in the NIC, goes up and down the TCP stack, goes to the OSD at the application layer, then goes into the storage system — and there you're talking about all the layers of the kernel, the file system, all the various buffers in between, all the various caches in between, all of that stuff. So if you think about it, there are actually a lot of opportunities.
If you have everything sprayed out over the system, which is really the default behavior, you can see there are a lot of opportunities for that same data to just continually hairpin back and forth between sockets. So this is what our nodes' QPI bus looked like. That's about right; I think that's a good approximation. It turns out, though, that the OSD workflow is actually even more complicated than this — if you've ever looked at the diagram of all the functions that get called back and forth, it's quite complicated. He started explaining it to me and I was like, okay, I need a little bit of a rest break and maybe an aspirin.

So the original architecture actually had two NVMe cards. We had an additional PCIe slot, and we thought, well, since performance is dictated almost completely by journal performance in most cases, we'll just pop in another NVMe card and life will be great. So we popped in another NVMe card, and what we noticed is that instead of getting something like 50% more performance, it basically remained the same, even though the individual NVMe cards were each still performing at the exact same level they had before. So, NUMA — our good buddy NUMA. It turns out that in Ceph, because of that inter-socket communication, you've got the journals, you've got the OSDs, you've got so many processes and so many threads and functions, with all that data potentially ping-ponging all over the box. The larger the nodes, and the faster the nodes — if you're running an all-SSD configuration or something like that — the more data you're moving in and out of them, and the more opportunities you have for crossing that QPI bus. So we ended up tuning three areas, trying to optimize those trips.

I put this in here as a reference for everybody, but these are really the pieces of information you need to get. You need to map which CPU cores live on which sockets. Depending on how many cores you have and the architecture of your box, they can be a little weird in how they're numbered — like zero through 11 on one socket and 12 through 23 on the other, or something stranger — so you need to map that out. You will spend a lot of time staring at the /proc file system looking for that information, and /proc/interrupts will drive you nuts because it's got about 80 columns and you're like, where is all this information? Yeah, I highly recommend 4K monitors, they help with that a lot. Or dual monitors — I had dual monitors and I was just like, okay, it's over there somewhere. So you can get that information from /proc/cpuinfo. The other thing you want to do is map which NUMA node each PCI device is actually located on. That's in /sys, in a file called numa_node. It's a little hard to guess — not impossible, because that file system is auto-generated, but it's hard to guess as a human which file it is. So what I actually do is just run find /sys -name numa_node and then grep for the PCI address that you can find in lspci. Just a little tip in case you're wondering where to find that. It'll basically be zero or one or whatever the NUMA node is, which corresponds to the socket that you found in /proc/cpuinfo earlier.
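Pulling those two mappings together, here's a small sketch of the kind of helper we mean — it just parses /proc/cpuinfo for the core-to-socket map and walks /sys for the numa_node files; nothing here is Ceph-specific:

```python
# Sketch: map CPU cores to sockets (from /proc/cpuinfo) and PCI devices to
# NUMA nodes (from /sys) -- the raw data you need before pinning anything.
from collections import defaultdict
import glob
import os

# core -> socket ("physical id") from /proc/cpuinfo
sockets = defaultdict(list)
core = None
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("processor"):
            core = int(line.split(":")[1])
        elif line.startswith("physical id"):
            sockets[int(line.split(":")[1])].append(core)

for sock, cores in sorted(sockets.items()):
    print(f"socket {sock}: cores {cores}")

# PCI device -> NUMA node; equivalent to `find /sys -name numa_node` plus a grep
for path in sorted(glob.glob("/sys/bus/pci/devices/*/numa_node")):
    pci_addr = os.path.basename(os.path.dirname(path))   # e.g. 0000:81:00.0
    with open(path) as f:
        print(f"{pci_addr}: NUMA node {f.read().strip()}")
```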
The other thing you need to do is map the OSD disks, and the journals if applicable (if you're not co-locating), to the HBA or RAID controller they actually live on. In our hardware configuration we had three HBAs. So it was three HBAs and three journals, and of course two processors and one NIC, so it became kind of tricky to do the mappings there.

The first place we wanted to optimize is the soft IRQs. What happens is that whenever data arrives on a PCI device, it sends an IRQ — an interrupt request — to the CPU, and the kernel then runs these things called softirqs. When the IRQ comes in and says "I have data," the softirqs pick that data up; the kernel can balance them across all the CPUs, and they pick the data up, put it in a buffer, and mark it as pending, and that's when it starts moving up your application stack. The whole point of this is that you need to get these softirqs running on the CPU socket where that interrupt is coming from. That saves you a trip across the QPI bus — one or more trips, when you think about where those buffers are located and all that kind of stuff. You want to do this for all of your PCI devices: the network card, all the HBAs, and, if your journals are somewhere different like on NVMe, you want all of those soft interrupts pinned to the CPU socket where that device is located.

The other thing you want to do is enable ONESHOT=yes for the irqbalance daemon. Different operating systems handle that differently — on some, you put ONESHOT=yes and it promptly ignores you. What that does is tell irqbalance to balance the interrupts once and then exit, so the IRQs you're not mucking with still get spread out, but they don't keep getting moved around afterward; they don't really need to move in most cases. And then after that, we developed a script that took all the data we just talked about, and the way you apply it is to echo a CPU core — or a range, or a comma-separated list — as a hex value into /proc/irq/, then the IRQ number, then smp_affinity. So it's actually pretty easy in Linux to pin those.

Do watch the ranges, though. In production, when we were doing this, we wound up pinning individual cores instead of giving a range, because the logic behind the range was like: great, you've given me a range, now I'm going to pick one. Everything would go to the first core, which kind of defeated the point, because it put everything on the first core instead of actually distributing it across the cores. Yeah, just be careful — that might be fixed by now, and it might be different per Linux distribution, but you have to verify it. Exactly, and that's always the key thing to do: look in /proc/irq or /proc/interrupts and validate that what you think is happening is actually happening.
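For reference, the pinning itself is just a hex mask written into /proc/irq/<irq>/smp_affinity. A minimal sketch follows — run as root; the IRQ numbers and core choices below are placeholders, and the real ones come from /proc/interrupts plus the socket map described above:

```python
# Sketch: pin IRQs to specific CPU cores by writing hex masks into
# /proc/irq/<irq>/smp_affinity. Run as root. IRQ numbers and cores below are
# placeholders -- look up the real ones in /proc/interrupts.
def pin_irq(irq, cpus):
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(f"{mask:x}")
    print(f"IRQ {irq} -> CPUs {cpus} (mask {mask:x})")

# e.g. NIC queue IRQs pinned to individual cores on socket 0. We pinned single
# cores rather than ranges -- see the note above about a range collapsing onto
# the first core.
for irq, cpu in [(120, 2), (121, 3), (122, 4)]:
    pin_irq(irq, [cpu])
```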
The other place to look is something like mpstat -P ALL: there's a %soft column, and if you've ever wondered what that is, it's how much CPU time is being spent handling these soft interrupts. If you see any single socket or CPU core handling a disproportionate amount, you probably need to go back in and look at it. So pin your NIC, your HBA, your NVMe all to one CPU socket — in our configuration we actually had two HBAs and two NVMes pinned to one CPU, and then the remaining HBA and NVMe pinned to the other along with the NIC. That's how we balanced an unbalanced situation.

The next thing is NUMA alignment on the mount points. Fairly simple: all you want to do is align the OSD and the journal, if it's not co-located, so that they're on the same NUMA node. What you want to avoid is having an NVMe journal on socket one while the OSD's disk hangs off an HBA on socket zero — you have a potential trip across QPI there, and that's the situation we're trying to avoid. So that was pretty simple. You can find that information in /dev/disk/by-path: that long, crazy-looking string actually contains the PCI address of the HBA the disk is connected to. Just a little FYI.

This is really the whopper, I think — this is where we saw the most performance: pinning the OSD processes so that the cores each process runs on align with the storage it's actually controlling. That's a huge one. And the reason is that the OSD process workflow is so complicated that you have so many opportunities to bounce data back and forth, and it's much, much faster to go through L1, L2, or L3 cache than it is to go over the QPI bus. With a separate journal and a separate disk, you're going to ping-pong across the QPI bus three or four times if you don't have the OSD on the same socket as the controller and the NVMe.

The other thing to be aware of — a potential gotcha where you make this change and go, hey, nothing changed — is if you only pin the process after it's already running. That's again under /proc with the PID, same idea as with the IRQs. Where was I going with that? Oh — don't just set the mask on a running process, pin it properly: you need to make sure the processes are pinned so each one is controlling storage on the socket where its PCI devices are. That's really the key thing. One of the things we did — you can spread them evenly across specific cores, or you can just give CPU ranges; the Linux kernel will do a lot of auto-balancing for you within whatever you pin them to. Either way seems to be okay. The point is that when the OSD processes start, they allocate memory for buffers and that kind of thing. So you need to make sure you pin them to the CPU when you start them, and the way you do that is in the init scripts. That avoids the situation where the process starts on socket zero, allocates its memory on socket zero, then you move the process over to socket one, and now it has to go back to socket zero every time it wants to talk to its buffers. That's the reason you want to do it at start time.
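As a rough illustration of what that start-time pinning can look like (the tooling choice, taskset versus numactl, is covered next), here's the shape of the wrapper we're describing. The OSD id, the device name, and the exact ceph-osd arguments are placeholders that depend on your release and how your init scripts start the daemons:

```python
# Sketch: start a ceph-osd bound to the NUMA node its data disk's HBA lives on,
# so CPU and memory are allocated on the right socket from the start.
import os
import subprocess

def numa_node_of_block_device(dev="sdb"):
    # /sys/block/<dev>/device links down into the PCI tree of the HBA; walking
    # up from there finds the nearest numa_node file.
    path = os.path.realpath(f"/sys/block/{dev}/device")
    while path != "/":
        candidate = os.path.join(path, "numa_node")
        if os.path.exists(candidate):
            with open(candidate) as f:
                return int(f.read().strip())
        path = os.path.dirname(path)
    return 0

osd_id = "12"                                   # placeholder OSD id
node = numa_node_of_block_device("sdb")         # placeholder data device

# numactl binds both the CPUs and the memory allocations to that node.
cmd = [
    "numactl", f"--cpunodebind={node}", f"--membind={node}",
    "/usr/bin/ceph-osd", "-i", osd_id,
]
print("starting:", " ".join(cmd))
subprocess.run(cmd, check=True)
```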
And there are two ways of doing that, two tools: taskset and numactl. taskset will pin the process, but it won't pin the memory — and I don't know why you wouldn't want to do that — whereas numactl has an option, memory affinity or something like that, where it will at least make a very good attempt to also pin the memory, so you're not using memory from the wrong socket for an OSD that's pinned on the right socket. So I would definitely advise using numactl as the tool in the init scripts. It's a newer tool, it does a lot of the same things taskset does — I think Red Hat developed it — and it has that memory affinity feature if you use it, which is not necessarily the default. Yeah, and these changes minimize latency in basically every situation; and really, the more you're bottlenecked by NUMA in the first place, the more you're going to benefit from aligning all of these things.

All right, so, some general performance tips. These are just things we found to be the best practices. Of course, use the latest vendor drivers. There are two schools of thought — be conservative, don't use the latest, use the most well-tested ones — but we found that using the vendor drivers provided a 30% performance improvement, specifically when we upgraded the network and HBA drivers over the stock in-box Linux ones, because what's actually been upstreamed can be very old. Then OS tuning, focusing on increased threads, file handles, et cetera. This is important: Ceph out of the box does not come tweaked. You definitely need to tweak it to get good performance out of it. And not only do you have to tweak the resources on the Ceph nodes; if you're using this with an OpenStack deployment — something that recently happened to us — the ulimits on the computes weren't set high enough, so the Ceph client on the compute node, what was it? — whenever somebody created a volume over two terabytes, it would fail because it ran out of file handles, file descriptors. So yeah, definitely keep that stuff in mind.

Jumbo frames: use jumbo frames, for obvious reasons — more payload, fewer headers. You're going to be dealing with lots of data; it's a good idea. Flow control: we found this with the 40 gig NICs, and then we also found it with the 10 gig NICs, where we had flow control issues that we only saw when we were looking with ethtool and not ifconfig. We would see errors and dropped packets and so on, and it turned out the default setting on the switches was not honoring the Ethernet MAC pause frames. Because of that it was just dropping packets, it was very inefficient, and we weren't getting the right bandwidth out of the network card. So definitely look at flow control in terms of those pause frames. And scan for failing drives: if a drive hasn't completely failed, there's a situation we've seen multiple times where the drive will still be responding, but responding very slowly, like a sloth, and Ceph won't mark it down or out yet. You really should proactively scan for very high latency IO on drives and take those drives out of your cluster; otherwise they can affect your whole cluster's performance.

All right, here we are: questions. We have one minute for questions. All right.
[Audience question, inaudible.] Yeah, I mean, that's definitely something we're considering. I think there are two cases, right? If you're looking at the bulk storage model, where you just want to make it as cheap as possible and you're worried about dollars per terabyte, you want to go with these huge nodes, because that really economizes on that. But I think for most workloads, simplifying it — we've seen several solutions where they just went down to a single CPU and had the OSDs directly attached to it — definitely avoids that issue.

[Audience question, inaudible.] Well, we've got multiple storage systems that we use. We do object and block there. And now that CephFS is available, we're definitely going to be turning that on — I just talked to Sage and Neil today, and it's still brand new, the wrapping's just come off, so we're not going to roll it out to all our customers immediately. That is definitely one of the nice things about Ceph, and I think it's what differentiates it from other storage systems: it really can cover such a large spectrum of different storage use cases. But is it the perfect fit for everything? It's not. When you're talking about object, for instance, it's not going to be the most performant, so if you've got a customer that's only worried about IOPS, you might be advised to compare the performance to other, maybe purely object-based systems — ones that do Swift purely, maybe.

So I have a question, if that's okay. You had mentioned that you used jumbo frames, and the jumbo frames really helped on the cluster network. I was wondering if you had experimented with jumbo frame sizes — did you notice a particular size that helped? Because with 40 gig, 9,000 bytes might not be enough. Seems a little small, right? Yeah. It's tough, because 9K is the standard, and in our environments all we really tested with was 9K — in production we're going through other networking groups that we deal with, and that's the standard they're based on. So we didn't really test past 9K. It's also one of those two-edged swords where if you go too large, you introduce latency, so 9,000 is kind of a nice compromise, and it's so widely supported. That's really all we tried.

On the hardware — one thing I'm curious about, which was lacking from up there — did you go with a straight, dumb, pass-through HBA, or did you use a RAID controller with some sort of cache? Bypass the RAID, get it out of there. This is software-defined storage; reduce the complexity. Don't waste the money on the battery backup or the cache memory — you're better off spending that on your SSDs or NVMes. Yeah, completely agree. I just saw another session that said the exact opposite, so I was kind of curious. Oh, yeah — fight, fight. I did a study on that, actually, and it turns out there actually are some workloads where, if you do a bunch of one-disk RAID 0s, it is faster. But it's a very specific workload; just doing the pass-through HBA is faster in almost every other situation. And the caveat is also that when you're deploying to production, having to deal with setting up the RAID controller and all that is a real headache. Yeah, and you lose visibility — you have to rely on the RAID controller's diagnostics, and you lose all of the smartd data. So nine times out of ten, I would go with just a plain HBA.
Do you have any learnings around erasure coding? Oh, that's a good one — that's a hot topic. So we are very interested in erasure coding; we've started testing it for objects. Please call up your Red Hat rep and tell them that you really want erasure coding on block. I recommend you go do that — they would love to hear it from all of you, and I would love it too, because I would like to use not a quarter of my raw storage but 80% of my raw storage. That would be great. Thank you, sir. The caveat, of course, is that you take a bit of a performance hit when you're using erasure coding, but I would love to have that option; I would love to test it. I think there are definitely use cases — bulk storage, for instance, is a perfect example where IOPS aren't the important thing. And they explained to me that that performance hit is what makes doing block storage so complicated with erasure coding: you're constantly doing read-modify-writes, and that's not great for erasure coding.

I'm sorry, could you talk into the mic? I can't hear you up here. So, RBD with erasure coding is probably coming, targeted for the K release — we are on track, working on that. Great, great, I love it. That's definitely something that's very exciting; it would be nice to get a lot of that storage we're all installing back and usable. Yeah, definitely. Yeah, the dollars per gig is a hard case to beat. Right. So, we're about to be kicked out — okay, last one, we've got to go. What class of disk drive are you using, enterprise or sort of this cloud class? Enterprise. Yeah. All right, thank you. Thanks. Thank you, guys.