All right. We'll go ahead and get started here with the next presenter on our Ceph Day track at Open Source Days here. Thank you, everybody, for making it. Our next speaker is going to be Rob from Mellanox. He's going to talk about Ceph performance across high-speed networks. So, Rob?

Thank you very much. Can everybody hear me in the back? OK. Awesome. All right. So I'm going to talk about how to improve the performance of Ceph using high-performance networks and protocols. I'm going to focus on four areas. First of all, just the wires: making the network faster. Secondly, the architecture: architecting the network to be able to utilize those faster wires. Thirdly, flash storage, and how to get the most performance out of flash in Ceph on a network. And finally, a new protocol, well, not a new protocol, but a protocol new to Ceph, called RDMA, or Remote Direct Memory Access.

A quick bit about the company I'm from. Mellanox builds end-to-end Ethernet and InfiniBand high-performance solutions, so 1 to 100 Gigabit Ethernet and all the different InfiniBand speeds. By end-to-end, I mean we build the adapters and the silicon that go into the servers, the storage systems, and the appliances, the switches that sit in the middle, and the cables that connect them all up. And we provide software drivers to hook them into the operating systems and the different applications. We're pretty good at it. Last year, according to the analysts, we had more than 85% of the market for Ethernet adapters above 10 gigabits, and that's the part of the market that's projected, over the next three years, to grow to be even bigger than the 10 gig segment. So that's where all the growth is, and we know a little bit about high-performance networking.

So I'm going to start off with the wires. This is some testing we've done, and there's a white paper on our website on exactly how we did it and all the configurations. I think they're actually handing out those white papers, or some flyers for them, at our booth over in the marketplace if you want to stop by. What we did was test starting with one gig all the way up to 40 gig. You can see, of course, some major differences between one and 10 gig, but even going from 10 to 40 gig, you get more than 2x the bandwidth and 15% higher IOPS. If we look at the newer network speeds, 25 and 50 gigabit compared to 10 gig, the results are just as dramatic. Even going from 10 to 25 gig, you get almost 100% improvement in bandwidth and 100% improvement in IOPS.

Now, 25 gig is really widely available; there's a saying that 25 is the new 10. In fact, on Cisco switches now, 10 gig and 25 gig is the same switch, and they don't charge a premium for it. The adapter prices at 25 gig are very, very close to, if not the same as, 10 gig. So it's not a break-the-bank thing, and it's much easier than putting in two 10 gig ports. If you do an internet search of recommendations from different Ceph vendors, or VARs putting Ceph systems together, what you find is that on average most of them are saying you want a 10 gig NIC for roughly every 15 hard drives in an OSD node. So 25 gig is a great solution for a lot of hard drives. And these high-performance networks are available today from multiple vendors; it's not only my company. At 25 gig alone, there are at least four other vendors providing products today. And you're going to have even higher-speed networking starting later this year: we'll be introducing 200 and 400 Gigabit Ethernet solutions.
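As a rough illustration of that 10-gig-per-15-hard-drives rule of thumb, here's a minimal sketch that scales it to a few example node sizes; the drive counts are just examples, and the rule of thumb itself is the only input.

```python
# Rough sizing sketch based on the rule of thumb quoted above:
# roughly one 10 GbE NIC per 15 hard drives in an OSD node.
GBPS_PER_HDD = 10 / 15  # implied network bandwidth needed per hard drive

def nic_gbps_needed(num_hdds: int) -> float:
    """Approximate NIC bandwidth (Gb/s) implied by the rule of thumb."""
    return num_hdds * GBPS_PER_HDD

for drives in (15, 24, 36, 60):  # example OSD node sizes
    print(f"{drives:>2} HDDs -> roughly {nic_gbps_needed(drives):.0f} Gb/s of network")
```

By that arithmetic, a 36-drive node already wants roughly 24 Gb/s, which is the point about 25 gig being a good fit for dense hard-drive nodes.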
So with these faster wires, we also need to look at the architecture of the network. In Ceph, there are two logical networks that can actually be two physical networks: the public network, where the clients are connected, and the cluster network, which just connects the OSDs. And there's a lot of performance needed between the OSDs, because there's a lot of traffic there for replication, recovery, and rebalancing, as well as the heartbeat that makes sure the OSDs are all up and running. In fact, if you look at the gains you get by making those two logical networks physically separate, you can see that if you've got 10 Gigabit Ethernet on a single network, with no cluster network, you're using over 50% of it just for OSD-to-OSD traffic. If you separate them and put in a faster network like 40 gig, you get a huge boost in your overall networking capability, and you free up the public network for client reads and writes to the OSDs.

There's a lot of traffic on that cluster network for replication and erasure coding, and I'm going to show you what I mean by that. A read operation is pretty simple: the client goes to the OSD and reads the data, so there's no extra cluster traffic in that scenario. But if you look at a write, assuming you're doing three-way replication, you get two more writes that occur on that cluster network. Or, if you have it all on one network, for every write you're actually tripling the amount of data that crosses the network. So by segmenting it, you're going to see an improvement on the client side in a loaded system.

And if you look at recovery with replication, you see a large impact. Here are some different network speeds and the time it takes to recover from an OSD loss, for a two terabyte, a 20 terabyte, and a 200 terabyte OSD. You can see that with 10 Gigabit Ethernet and two terabytes, it's going to take you half an hour to recover from a lost OSD. And that's at full bandwidth; that entire 10 gigabits is being used to get to that time. If you've got other traffic, like you would on an unsegmented network, it's going to take a lot longer. And remember that time is risk, right? Because if you lose two more OSDs, or if you're only using two-way replication and you lose another one, you're going to have data loss. You can see that if you've got a 200 terabyte OSD, it's going to take you almost two days on a 10 gigabit private cluster network doing nothing but that to recover the data. But if you go to just 25 gig, it drops to under a day, and you can get it done in just a few hours with 100 gig. So you really want to fence off that client network from the cluster network, and that's going to give you a lot of performance and a lot more reliability.
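Those recovery numbers are essentially capacity divided by link bandwidth. Here's a minimal sketch of that arithmetic; it assumes the cluster link is fully dedicated to recovery, which is the best case mentioned, and ignores disk, CPU, and recovery-throttle limits.

```python
# Back-of-the-envelope recovery-time estimate: time ~= data to re-replicate
# divided by usable cluster-network bandwidth. Assumes the link is fully
# dedicated to recovery (best case); real recovery is also limited by disks,
# CPU, and Ceph's recovery throttles.

def recovery_hours(osd_terabytes: float, link_gbps: float) -> float:
    data_bits = osd_terabytes * 8e12         # TB -> bits
    seconds = data_bits / (link_gbps * 1e9)  # bits / (bits per second)
    return seconds / 3600

for tb in (2, 20, 200):
    for gbps in (10, 25, 100):
        print(f"{tb:>3} TB OSD over {gbps:>3} GbE: ~{recovery_hours(tb, gbps):5.1f} hours")
```

That reproduces the shape of the chart: about half an hour for 2 TB over 10 gig, nearly two days for 200 TB over 10 gig, under a day at 25 gig, and a few hours at 100 gig.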
So the easiest way to do that, if you look at a typical network architecture where you've got the core, the distribution layer, and the access layer, and usually the core is at a higher speed, nowadays probably mostly 40 gig with 10 gig to the clients, is to simply add a switch in the rack where the OSDs are. Now your OSDs are on a cluster network independent of the client network. And you can now also easily change the speed of that cluster network, because it's not part of the overall network infrastructure and it's not going to cause bottlenecks higher up in the core or distribution layers. And it's not very expensive. This is a little half-width, 1U switch that we sell, but there are multiple vendors of switches at these speeds as well. At 40 gig, this product would cost you just over $5,000 for 16 ports. So $5,000 to hugely decrease your recovery time and improve overall reliability. And if you wanted to put in 100 gig, you could do it for under $10,000.

Let's now look at erasure coding. In this case, instead of making copies, we're using a special algorithm that takes the data and breaks it up into small pieces spread across multiple OSDs. The advantage is that instead of needing 3x the storage, you only need one and a half times the storage. But there's a price for it, and another reason to have that separate cluster network, which is that there's a lot more traffic. There's a lot of small-message traffic as the shards, the little pieces that the algorithm breaks the data into, are sent out and distributed across the different OSDs. It's less traffic than you would see with three-way replication, because it's only 50% more than that one write, but there's still extra traffic, and a lot of small messages, so latency is important. The read operation, though, goes the other way: with replication you're only asking the OSD for the read and getting back the data, but here you've got to reassemble data that's spread across the different OSDs, so it creates more traffic on the cluster part of the network. A good reason to have it independent. And I don't have a slide for it, because it's rather complicated, but when you have a failure here and have to recover, think about the recovery mechanism and the traffic: you not only have to decode the data from the OSDs that remain, but then you have to re-encode it onto the replacement OSD, or if there are fewer OSDs, you have to redo the calculation. So that can be very heavy traffic on the cluster network as well.

The other downside to erasure coding is that it's a very heavy load on the CPU, because the CPU has to do that calculation, and it's a complicated calculation on all the data in order to put it into the format needed for erasure coding. One way to get around that is our NICs that have offload engines for erasure coding. The way this works, at least with our solution, is that when the data is sent to the Ethernet adapter, the adapter does the calculation in hardware and distributes the data across the nodes, thereby offloading the CPU from those calculations. And everybody knows that Ceph loves to use CPU, so that's a good thing. Here's how we're implementing it: there's a module in Ceph that does the erasure coding, and what we're doing, and it's in development right now, is creating a replacement for that module that simply uses this offload. So it's just a matter of switching that module. By the way, if anybody has questions, don't hesitate to interrupt; I should have said that earlier.
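To make that capacity and traffic tradeoff concrete, here's a small sketch comparing 3x replication with an erasure-coded layout; the k=4, m=2 profile is only an illustrative choice that yields the 1.5x overhead mentioned above, not a tuning recommendation.

```python
# Sketch of the capacity and write-traffic overhead of 3x replication versus
# erasure coding. The k/m values are illustrative (k=4, m=2 gives the 1.5x
# capacity overhead mentioned above), not a tuning recommendation.

def replication_overhead(copies: int = 3) -> dict:
    # Every object is stored `copies` times, and on a write roughly `copies`
    # times the object size crosses the network (client write plus replicas).
    return {"capacity_factor": copies, "write_traffic_factor": copies}

def erasure_overhead(k: int = 4, m: int = 2) -> dict:
    # The object is split into k data shards plus m coding shards; stored
    # data and write traffic both grow by (k + m) / k.
    factor = (k + m) / k
    return {"capacity_factor": factor, "write_traffic_factor": factor}

print("3x replication:", replication_overhead())
print("EC k=4, m=2   :", erasure_overhead())
```

The erasure-coded write sends less total data, 1.5x instead of 3x, but it goes out as many small shard messages, which is why latency on the cluster network matters more in that case.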
So the next area I want to talk about is flash. I think everybody realizes that flash storage, SSDs, is becoming super popular across data centers. In fact, many data centers are going all-flash because, believe it or not, it solves the problem of reliability. Disk drives have all these moving parts, and although flash was initially expected to wear out and have all these problems, it hasn't turned out that way, because of the wear leveling and all the special software on flash that distributes the load across all the NAND chips. So many data centers are just saying, hey, I'm going all-flash, because the maintenance on my hard drives is very expensive. And secondly, they're doing it because you don't have to worry about matching up the storage to the performance of the applications anymore, because all your storage is fast.

So anyway, flash is becoming super popular, but that comes with a price, because it puts a big load on the network, and here's why. First of all, the Y-axis on this chart is logarithmic, but the change in performance just between hard drives and SSDs is 100 times. That's huge. And the change between SSDs and the new persistent memory that's going to come out over the next five years is another 100 times. So there's a 10,000 times improvement happening in storage in a 10-year period. To understand the magnitude of that, think about how far it is from here to St. Louis. Google says it's about 1,000 miles and 18 hours. Now think about, hopefully a lot of you are from Boston, the distance from here to Boston College. It's roughly 10 miles, 15 minutes according to Google, versus 18 hours. That's the magnitude of the difference in performance just between hard drives and SSDs. If you add persistent memory in here, that's like 1,500 feet. So, 15 minutes to Boston College, or how long does it take you to walk 1,500 feet?

So this is putting a huge load on different components in the whole ecosystem, especially the networking side, and here's why. This chart shows how many hard drives it takes to fill a 10 gig link, which is the red line, a 40 gig link, the yellow line, and 100 gig, the blue line. You can see it takes almost 25 drives to fill a 10 gig link, and hundreds to fill the other wires. If I just switch that SATA hard drive to an SSD, so go from spinning disk to NAND, now it's just two, and nine will almost fill a 40 gig link. Now, there's a newer technology for SSDs called NVMe, which was a redesign of the interface to SSDs, because they initially came out with the legacy hard drive interfaces, and those were slowing them down. If you switch to NVMe drives, one of them overflows 10 gig, two of them overflow 40 gig, and four of them almost fill a 100 gig link. So if you're buying SSDs for their performance and you're not upgrading your network, you're wasting some of your money. You're getting performance locally, but if you're remoting that storage, you're not getting the full value of those high-performance SSDs.
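Here's a minimal sketch of that saturation arithmetic; the per-device throughput figures are assumptions picked to roughly match the chart being described, not benchmark results.

```python
# Rough "how many drives fill a link" comparison. Per-device throughputs are
# assumed illustrative figures (roughly what the chart implies), not
# measured results.

DEVICE_MB_PER_S = {
    "HDD (SATA)": 50,    # mixed-workload figure, well below streaming peak
    "SSD (SATA)": 550,
    "SSD (NVMe)": 3000,
}

def drives_to_fill(link_gbps: float, device_mb_s: float) -> float:
    link_mb_s = link_gbps * 1000 / 8  # gigabits/s -> megabytes/s
    return link_mb_s / device_mb_s

for device, mb_s in DEVICE_MB_PER_S.items():
    for gbps in (10, 40, 100):
        print(f"{drives_to_fill(gbps, mb_s):5.1f} x {device:<10} saturate {gbps:>3} GbE")
```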
So this is some testing that's been done by many companies. It's a little bit old now, a couple of years ago. They did testing on Ceph systems, changing the disks, the SSDs, the networks, the CPUs, and the operating systems, and you can access it on the internet, or you can go to our booth and they can show you how to get to it. What you can see is the difference that just adding some SSDs makes to the performance, if you also improve the performance of the network. So you can see on the far one here, if I can get the pointer to work, in this one you had a mixture of SSDs and NVMe, and then here, taking the hard drives out, you increase the performance just using the NVMe SSDs. And you can see that the more flash you add, the faster the performance you get. So it's important to increase the performance of your network if you're using SSDs. Here's some data from Quanta on the same subject. In their test facility, they were able to show almost a seven times improvement in performance by using a faster network with an SSD implementation. And here's some more data from multiple different vendors; you can see different network speeds, different numbers of SSDs, and the performance they got. These are all straight, all-SSD implementations of Ceph.

The last subject I'm going to talk about for improving performance is a technology, or a protocol, called RDMA. We've talked about how to improve performance by having faster wires, re-architecting the infrastructure of the network, and taking advantage of SSD performance. Now we're going to talk about a change in the protocol running on those wires. And this isn't a new technology. It's been in the market for many years, but in the HPC world, the world of supercomputers. It started out almost 20 years ago now. It's embedded in InfiniBand technology, the protocol that runs over InfiniBand wires, and it's now a dominant part of the HPC market. In fact, if you look at the top 500 supercomputers in the world, you can see it's the navy blue line; it's the dominant interconnect between those supercomputers. Supercomputers these days aren't the big circular Crays that they were when we were kids. They're now a cluster of PCs or RISC processors all connected together with a very fast, low-latency network, and the protocol those networks run is RDMA.

If we look at the regular storage market and the three different areas, object, block, and file, you can improve the performance of the common protocols in those areas just with flash or high-bandwidth Ethernet, as we talked about. But you can also use RDMA technology, and it's been used in the Ethernet world for many years, in a technology called RoCE, or RDMA over Converged Ethernet, in the segments that needed high performance. So for example, in block there's a protocol called iSCSI for networking block storage, and there's a version of it called iSER, which is iSCSI over RDMA. In file, Microsoft has a technology called CIFS, or SMB, I think, is the new name for it, and there's a version called SMB Direct which works over RDMA for higher performance. NFS over RDMA is for the NFS file protocol. And then on object, the technology we're going to talk about is Ceph over RDMA. Now, I said it was a niche for high performance, but that niche is becoming mainstream, because that very high-performance SSD interface called NVMe that I talked about a few minutes ago is now being put into a protocol called NVMe over Fabrics, so that it can be transmitted across a network. That standard came out, boy, almost a year ago now, and it includes what's called a binding in the standard for transport over RDMA.
And the new persistent memory technologies, the ones that are again a hundred times faster than SSDs: there's been work underway for the last year and a half on how to put those across a network, and all of that is focused around the RDMA protocol. So because of that, RDMA technology is going to become much more mainstream in the data center over the next few years.

So what is it and how does it work? It stands for Remote Direct Memory Access, so it's the remote version of DMA. DMA is how you move data inside a computer without sitting in a software loop moving a word at a time: there's a hardware engine that you give a pointer to memory here, a pointer to memory there, and a count, and you kick it off and it moves the data without the CPU being involved. RDMA is the remote version of that. You tell the RNIC, the adapter that supports RDMA, the local memory location, the remote memory location, and the count, and it moves the data without any involvement of the CPU. That's different from how normal traffic goes through a TCP/IP stack, because there the CPU is involved, and the CPU is controlling the transport layer, which is the layer that takes care of making sure the data gets there, recovering from errors, those sorts of things. With RDMA, that's all handled in the hardware of the RNIC. So that's where you get very, very low latency and high bandwidth, and you also get your CPU cycles back, which, like we said earlier, Ceph likes to use.

Actually, one more thing on RDMA over Ethernet. Initially, when it was in those niches, the earlier implementations required a technology that came from Fibre Channel over Ethernet, called PFC or Priority Flow Control, to be implemented on the switches. That gave it the flow control it needed to keep it from overloading the network, because it pushes so much data. You can still use PFC with RDMA over Ethernet, and it does improve the performance, but it takes more effort, because you have to configure the switches. Now, as the transport layer has been improved, with no change to the protocol, so it's not a change in the RoCE protocol, just a better implementation of the embedded transport layer on the adapters, the need for Priority Flow Control has gone away, and you no longer have to have special settings on your network to run RDMA over Ethernet. Go ahead.

So, on this diagram, it looks like RDMA does not involve the kernel at all, and the application is directly talking with the hardware; I'm just trying to understand this picture a bit better. The user memory needs to be pinned for the hardware to access it, right? So who's taking care of all that?

So there is a software interface, an API driver for the RNIC, that has commands to take care of those issues.

Okay, so the API is a kernel API for RDMA?

Yes.

So the kernel is still involved then?

True.

Okay. The picture doesn't show that; it's a little confusing.

Yeah, it's a simplified picture.
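As a purely schematic illustration of that flow, including the memory registration and pinning just discussed, here's a sketch that uses made-up helper names; it is not the real verbs API (actual applications go through libibverbs or librdmacm), just the shape of the sequence.

```python
# Schematic sketch of a one-sided RDMA write. The class and method names are
# hypothetical stand-ins, not a real RDMA API; actual applications use the
# verbs interface (ibv_reg_mr, ibv_post_send, ...) from libibverbs.

class SketchRNIC:
    """Stand-in for an RDMA-capable NIC, just to show the sequence of steps."""

    def register_memory(self, buf: bytearray) -> dict:
        # In a real RNIC driver this pins the buffer so the hardware can DMA
        # into it, and returns a key the remote side can use to address it.
        return {"addr": id(buf), "rkey": 0x1234}

    def rdma_write(self, local: bytearray, remote_addr: int, rkey: int) -> None:
        # The RNIC's hardware transport moves the bytes and handles delivery
        # and retransmission; neither CPU touches the data path.
        print(f"hardware moves {len(local)} bytes to {remote_addr:#x} (rkey {rkey:#x})")

nic = SketchRNIC()
local_buf = bytearray(4096)
nic.register_memory(local_buf)                     # 1. pin/register local memory
remote = {"addr": 0x7F00DEAD0000, "rkey": 0x5678}  # 2. peer's address/key, exchanged at connection setup
nic.rdma_write(local_buf, remote["addr"], remote["rkey"])  # 3. post the transfer; hardware does the rest
```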
So, looking at what you can gain from doing RDMA: this is Microsoft SMB Direct, and basically you can double the performance of SMB by using RDMA. You can look at a demonstration they did on YouTube, at Microsoft's conference, I think it's called Ignite. At that Ignite conference a couple of years ago, because this technology has been out for a while, they did a live demonstration, which you can watch in that YouTube video, where they turned TCP/IP and RDMA on and off live, switching back and forth. When they did that, the bandwidth I showed before of course doubled, but the latency also halved when you went to RDMA, and the CPU utilization dropped by over 30%. So that's a software-defined storage application, like Ceph, using RDMA in production today.

If we look at Ceph over RDMA, this has been an ongoing project with multiple companies involved. Our company, a company in China called XSKY, Samsung, SanDisk, and Red Hat have been contributing to this effort. It was first released in beta in the Hammer release, back in June, almost two years ago now. It provides RDMA on both the public and the cluster networks, and it did have high performance, and I'll show you some of the numbers in a minute, but it had limited scalability. A lot of that had to do with the memory requirements that you brought up a minute ago, and I'll address that in a second. More recently there have been lots of updates; they're being pushed upstream in the community constantly, and the focus has been on increasing the scale and the performance. We've also implemented a version that no longer requires pinning all of the memory. One interesting thing about RDMA is that you have these memory locations on two different sites across the network, and you have to pin down that memory, and you set up pieces of memory for all the different connections you have. What was implemented recently was a way to do that dynamically, so that you don't have to pin all that memory up front. Only the parts that are needed are pinned; it's basically a dynamic pinning mechanism. Question?

Yeah, so does this require a single layer 2 network across the whole Ceph cluster, or does it work when there's routing?

Yeah, that's another very good question. I didn't go into the history of RoCE, but there are two different versions: RoCE v1, which came out many years ago, and RoCE v2, which came out about four or five years ago. RoCE v1 is kind of obsolete now, I mean, it's not used very much; most everybody uses RoCE v2 now. The difference is that RoCE v1 was layer 2, and RoCE v2 has an IP/UDP layer, so it's routable. Thank you for the question, good question.

So you can go to the Mellanox community pages, and they walk you through how to configure Ceph to use RDMA.
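As a rough sketch of what that configuration involves, the two key pieces are switching the messenger type and naming the RDMA device; the option names below are from memory of Ceph in that era, and mlx5_0 is only a placeholder, so check the actual guide for the authoritative settings.

```python
# Sketch of the ceph.conf fragment for enabling the RDMA messenger, printed
# from Python to keep the examples in one language. Option names should be
# verified against the Mellanox community guide; "mlx5_0" is an example RNIC
# device name, not a requirement.

rdma_settings = {
    "ms_type": "async+rdma",                # use the async messenger over RDMA instead of TCP
    "ms_async_rdma_device_name": "mlx5_0",  # which RDMA device the messenger should bind to
}

print("[global]")
for key, value in rdma_settings.items():
    print(f"{key} = {value}")
```

Note that the messenger type generally has to match across the daemons and clients that talk to each other for RDMA to be used end to end.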
And here are some of the performance numbers for that. We measured it in two different ways: one for raw performance on the same setup, and the other to see the CPU savings. We could save multiple cores on both the client and the OSD and still get 44% more performance, and if we used the same number of cores on both sides, we could get almost a 60% performance improvement. Like most things with Ceph, you get different performance depending on the workload. The best performance we've seen is where you've got a high-IOPS workload with small block sizes; there we could see performance improvements of up to 3x using RDMA, with very low latency.

So, to kind of prove to everybody that this isn't just slideware, we have lots of customers around the world in multiple different industries, from financial to cloud to education and more. And if you want to know more about using our products with Ceph, we have a booth over in the marketplace, and I'm also open to answering questions.

To summarize the ways to improve Ceph: one thing to make sure you do is use faster networks. 10 gigabit is not enough, especially if you've got more than 15 hard drives or you're using SSDs. SSDs will give you higher performance, but make sure you improve the network as well. Having a separate cluster network is a way to get better reliability in your system and to improve the performance. And then, if you want to go even further into the turbo area of performance, you can look at RDMA for Ceph. Any questions?

So it's my understanding right now that RDMA in Ceph is still kind of upstream development, is that correct? Or is it fully supported in any of the GA versions of Ceph?

It is not fully supported in any of the GA versions today.

Any idea when that will happen?

I think we're hoping to have that happen later this year or early next year.

And then the other question was about PFC in RoCE. Was there a specific adapter that no longer requires PFC, like ConnectX-4 or ConnectX-5, or is that just kind of universal across all the adapters at this point?

So, to give a little background on our adapter product line, you obviously know it well, but ConnectX-3 was our 10/40 gig product that came out probably four-ish years ago. ConnectX-4 was the 10, 25, 40, 50, 100 gig product that came out more recently, and ConnectX-5 is the product that just came out. Each one of those products has a faster ability to recover from problems that it sees in the network. All of them will recover, but more and more of that transport layer has been improved in the hardware of the adapters going forward. So ConnectX-4 can definitely do it in a RoCE environment without PFC, but you need to use a function called ECN, Explicit Congestion Notification. And with ConnectX-5, you don't need anything. But one thing to keep in mind: if you're trying to get super high performance like we're talking about here, you can't just run it over a 10 gig network in the middle. You have to either over-provision or implement some type of congestion management in order to get the highest performance levels. Nothing comes for free. But it does work; you just lose performance.

Yeah, very good work. I have a couple of questions. The first thing is, I didn't see IP over IB numbers. Did you compare with that scheme?

Yeah, so you can definitely run this with InfiniBand. There's really no difference in the API interface, and you'll see faster performance with InfiniBand, depending on the wire speed, because InfiniBand has lower latency.

My question is, how much difference is there between the RDMA solution and the default one running over IP over IB?

Oh, I see. Okay, so you're just running IP on it. You know, there may be some numbers for that, but I don't have them.
I suggest you stop by the booth, though, over in the marketplace, because there's a marketing guy there named John Kim who has organized a lot of the testing, and he would know if we've tested that.

Okay, yeah. The other thing is, you mentioned the memory consumption. I'm wondering, do you support UD or DCT, rather than just RC? Do you have any idea?

Yeah, it has to be RC. So I'm not a super InfiniBand expert; my focus is on the storage, but I have done it in the past. The type of connection over InfiniBand, or over RDMA in general, where you know that your message has been sent and is guaranteed, it does that, and it also handles the plain messaging, the UD part of it. So this new dynamic allocation works across all the different types of RDMA.

Okay, thank you.

You're welcome.

Question regarding the ConnectX erasure coding support: what kind of throughput or performance improvement can you expect using these types of adapters versus doing it in the CPU?

So the ConnectX-5, as I said earlier, is the one that supports that, and it's just come out, and we haven't released any performance numbers on it yet. So I guess I probably can't tell you in a public forum. If you want to stop by the booth this afternoon, we can discuss it more. Sorry about that. Other questions?

Yeah, I think probably a lot of people are still running FileStore-based Ceph clusters, but with the change to BlueStore, and not necessarily needing a separate journal anymore, there are obviously performance implications. How do you see that feeding into this dramatic increase in IO capability from these devices?

So I'm not 100% versed on the differences in the load on the network, but if the load on the network is the same, meaning that you're not reducing the amount of data that's moving across the network with this change, you should see no difference.

Anything else? Great, well, thanks a lot. Thank you.