So good afternoon. I'm Brad Spires, I'm here with Ryan Meredith, and we're going to tell you a bit about Micron's capabilities around OpenStack, and specifically Ceph, as it relates to flash and how you can optimize it for your workloads. I spent the first 20 years of my career on Wall Street building technical systems to solve business problems, and one of the last things I was asked to do was be the tech lead of something called Project Greenfield, our software-defined infrastructure. I wish I'd had someone like Ryan on my team then, because Ryan has spent the last two years focused full time on how to tune OpenStack, and specifically Ceph, to get the most out of flash. So I'm really excited about some of the key insights we can share. My teams had to worry about tuning themselves: which type of SATA, which type of SAS, which type of NVMe to choose and why, and how to configure the software to get the most out of it. Ryan has done a bunch of that work for you. We hope that, not only with the solutions we'll offer today but going forward, you'll consider Micron as a partner, because we can help you with these things as you start to think about persistent memory, or even 3D XPoint. And of course, when it's time, you can also talk to us about archive, because even that can be done well with SSDs. So with that, I'm going to turn it over to Ryan to tell you about the Ceph solution we have.

Thank you, sir. So hi, everybody. I'm Ryan Meredith, and I work for Micron. You might first ask: why does Micron care about Ceph? I work for a team within Micron called the Storage Solutions Engineering team. We're based in Austin, Texas, we have our own big fancy lab to work in and a bunch of equipment, and it's our job to do real-world application performance testing against a whole host of different applications. The reason Micron is looking into this is that, as a manufacturer of DRAM and solid-state storage, we represent a large portion of the cost of most storage servers, so we figure we're in a pretty good position to do some of this testing for you and show you the cool stuff that we do. Within the team, I work specifically on Ceph; I've got team members who work on vSAN, Storage Spaces, Hadoop, Spark, MySQL, and so on down the line, so there's a whole bunch of different expertise within our group. What I'm going to talk about today is the Ceph testing I've done on three different architectures, the first two being POCs and the third a full reference architecture. I'll show performance details on that third reference architecture, and then share some of the things we're going to test going forward. I've also included lots of cool pictures of octopuses, or octopi, so if you get bored of listening to me speak, you can look at the cool octopi.

So I'm going to start with the performance comparison of Micron-powered Ceph architectures. Two years ago I set out with the equipment in my lab and a handful of Micron drives to figure out what we could do with Ceph: how to make it work, how to tune it, how to make it run. These are the three architectures. The first one was an all-SATA POC. It used ten of our M510DC SATA drives per storage node.
It had the Intel E5-2690 v3 processor, so pretty high-end processors, 256 gigs of RAM, and 40-gig Mellanox networking, running Ubuntu 14.04 and Ceph Hammer. Moving on from there, I did the same exact test but added Micron's S650DC SAS drives as journal devices. We also began our partnership with Red Hat at that time, so that was on RHEL 7.2 with Red Hat Ceph Storage 1.3.2, which is still Hammer. And the latest is a full reference architecture that was actually just published yesterday morning, so it's fully available online right now. That one uses ten Micron 9100 MAX NVMe solid-state drives — 24 terabytes per 1U node — with the E5-2699 v4, the highest-bin Intel dual-socket processor we could put in there, 256 gigs of RAM, 50-gig networking, and Red Hat Ceph Storage 2.1.

So, comparing those three solutions on 4K random reads and 4K random writes — this is FIO running against RBD. The numbers are per storage node, because we used different numbers of storage nodes in each test; to compare them directly, we're looking at individual storage node performance. The SATA solution did 125,000 4K random read IOPS per node. By going to RHEL and adding SAS journals, we went up a little bit on reads — that's also due to me learning what the heck I was doing and tuning it a little better (there's a flavor of that tuning sketched below). And the latest one, with our NVMe drives, does 287,000 4K random read IOPS per 1U storage node. The random writes increased by much more than that. Our first result was pretty sorry — I don't know if any of you have done SSD testing with Ceph, but when you run 4K random write tests you start to cry a little, because it looks horrible at first. Adding the SAS journal drives doubled it, which was a great result, and going all-NVMe basically tripled our previous performance, so we're getting 60,000 4K random write IOPS per storage node.

Now I'm going to focus on the third line there, our all-NVMe Ceph architecture. This was a partnership with Supermicro, who provided the 1U servers we used, and it ran Red Hat Ceph Storage 2.1, which is the Jewel release. The specific hardware configuration is the Supermicro Ultra Server 1028U-TN10RT+ — a very sexy name for a server — with two Intel E5-2699 v4s. Those are 22-physical-core processors; there are two of them, so 44 physical cores, and with hyperthreading that's 88 logical cores per storage node. 256 gigs of RAM. We used two Mellanox 50-gig single-port cards, because the server had two x8 slots and we wanted to maximize network throughput. And then ten of our Micron 2.4-terabyte 9100 MAX NVMe solid-state drives. These are the fastest NVMe drives we make — the things are screamers — and the endurance on them is three drive writes per day, so they're resilient for a solid-state drive and pretty high capacity. For monitors, we used three 1028U monitor nodes; they were asleep the whole time. For networking, we used two 100-gig switches, one for the client network and one for the storage network — that's the model number for those on the slide. On the software side, we used Red Hat Ceph Storage 2.1 (Jewel 10.2.3) on RHEL 7.3. The only other pieces of software we put on there were the Mellanox OFED driver, 3.4.2, and the switch OS, Cumulus Linux 3.1.2.
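For flavor, here is a minimal sketch of the kind of Jewel-era FileStore and journal tuning knobs involved in a setup like this. The values below are illustrative placeholders, not the settings actually used — the published reference architecture contains the real ceph.conf:

```
# Hypothetical ceph.conf fragment for a Jewel/FileStore cluster.
# All values are placeholders for illustration only.
[osd]
osd_op_threads = 16                   # more OSD worker threads for fast NVMe
filestore_queue_max_ops = 5000        # deeper FileStore queue
filestore_max_sync_interval = 10      # seconds between journal-to-data flushes
journal_max_write_entries = 1000      # batch more entries per journal write
journal_max_write_bytes = 1073741824  # allow up to 1 GiB per journal write
```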
And the deployment tool we used was ceph-ansible, which comes with Red Hat Ceph Storage and is awesome. If anybody here works on ceph-ansible, thank you so much — you made my life a lot easier.

So, the performance testing methodology — this is how I tested. I used FIO against RBD for block tests and RADOS Bench for object tests. I had 12 Supermicro 2028U load-generation servers, connected over 40 gig, and I kicked off tests on one or more of those 12 servers at the same time, whatever gave the optimal performance for each test. All of the results I'm going to share are averages over 15-minute test runs; those runs were repeated at least three times and the average results recorded. The data set size on all of the tests was five terabytes of data on a 2x-replicated pool, so 10 terabytes of total data. The reason I went with that size is that I had one terabyte of total system RAM across the storage nodes, and I wanted a sensible ratio of RAM to data — about 10% of the data can fit in RAM at any given time. (A sample FIO job file in the spirit of this methodology appears at the end of this section.)

In testing the reference architecture, one of the important things was drive scaling: seeing where maximum performance is reached as you scale up the number of drives in the system. So we tested with two drives per storage node, four drives per storage node, and ten drives per storage node — eight, 16, and 40 drives total — and you can see the scaling up there. The other thing I did was keep the number of OSDs in the system as close as possible across configurations. I found that 8 to 10 OSDs per node, with Red Hat Ceph Storage 2.1 and NVMe drives, basically maxes out the CPU on 4K random workloads. So with two drives per node I used four OSDs per drive, which gives you 32 OSDs total; with four drives it was two OSDs per drive, so again 32 total; and with ten drives there was one OSD per drive. All of those used 20-gig journals co-located on the same drives.

This is the 4K random read performance we reached. The 4K random read IOPS graph there is a bit of a hockey stick: two drives per node was limited by both the drives and the CPU. At four drives you reach 1.13 million 4K random read IOPS, and at ten drives you reach almost the same thing, 1.15 million. What this really shows is that at four drives per node, you can already maximize the CPU utilization of the system. The little graph underneath is a time series of CPU utilization during the 4K random read tests, and you can see they all top out close to 100%. For 4K random writes it's a pretty similar story: two drives to four drives scales up, and from four drives to ten drives you effectively flatline — you're using all your CPU. The other interesting thing is that with two drives per storage node, the test is more drastically drive-limited: in the CPU-percentage graph underneath, the orange line is the CPU utilization with only two drives per node, and you can see it sits well under 90% while the other two are flatlined at the top. So what this shows is that with Red Hat Ceph Storage 2.1 and 4K random workloads, you can really hammer these CPUs.

Next we moved on to object performance. One note there: with an all-NVMe system like this, the focus really was on block performance — block is what you would use this for.
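To make the block methodology concrete, a minimal FIO job of the kind described might look like the following, using FIO's built-in RBD engine. The pool name, image name, and queue depth here are illustrative assumptions; the exact job parameters are documented in the reference architecture:

```
# Illustrative FIO job: 4K random reads against a pre-created RBD image.
# Pool/image names and iodepth are assumptions, not the RA's exact values.
[global]
ioengine=rbd          # FIO's librbd engine; no kernel RBD mount needed
clientname=admin      # Ceph client whose keyring is used to authenticate
pool=rbd              # pool holding the test image
rbdname=fio-test      # RBD image created beforehand with 'rbd create'
direct=1
time_based=1
runtime=900           # 15-minute runs, per the methodology above

[4k-randread]
rw=randread
bs=4k
iodepth=32
```

Swapping `rw=randread` for `rw=randwrite` gives the 4K random write case; scaling up came from running jobs like this from multiple load-generation servers at once.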
Object tends to be served pretty well by standard Ceph solutions, so object was tested here with RADOS Bench running straight against the Ceph cluster. And I totally understand that this is basically the theoretical maximum you would ever get from an object store, because the RADOS Gateway isn't taken into account here. For object reads, you're maxed out on network: you're getting 20 gigabytes a second even with two drives per node, so you've saturated your 50-gig network with object reads, and that's really all there is to say there. CPU utilization is very low, there's not a lot going on — you're just using up all your network. With four-megabyte object writes, you actually do see scaling as the number of drives increases: you go from 1.8 gigabytes a second with two drives per node up to 4.6 gigabytes a second with ten drives per node. Underneath is the object-write drive latency, which shows that latency on the individual drives during those tests is extremely spiky. I think that has to do with having the journals co-located with the OSD data; you see a lot of that interference, which means you're not really seeing the maximum performance of these drives.

So, a summary of performance. Four 9100 MAX NVMe drives per storage node is your optimal IOPS per node; if you go past four, you reduce latency and add a little bit of IOPS, but at that point you're basically adding high-performance capacity. Red Hat Ceph Storage 2.1 can saturate two E5-2699 v4s with 8 to 10 OSDs, given proper tuning and sufficiently fast drives. On the network side, the 4K random read throughput will saturate a 10-gig link — it comes in a little over 10 gigabits, so we'd recommend at least 25 gig for block workloads on a system like this. The 4K random write throughput came in a little under 10 gigabits, so it could be serviced by a 10-gig link on this system. For object performance, reads are always network-limited, so get more network if you want more object reads; I'd assume that going to 100 gig would roughly double the performance we saw, and you probably still wouldn't use all your CPU. Writes are drive-limited; they can saturate a 25-gig link, so they're pretty close to needing 50 gig as well. And again, the large-block write behavior is partly a symptom of large objects being written to journals co-located with the OSDs. One thing I did do in the meantime was test with Kraken and BlueStore. I haven't spent much time tuning the 4K side yet — that's a whole different thing I'm still working on — but immediately, right out of the gate, 4-meg object writes almost doubled in performance, because you don't have the write penalty of writing everything four times. And across all of the object reads, CPU utilization was low.

A few more notes on the specific platform we used. On the Supermicro 1028U, the way it's laid out, CPU one has six NVMe drives attached to it, and CPU two has four NVMe drives plus both of the NICs, so it's not a balanced system. I tried pinning CPU processes and I tried NUMA pinning, trying to figure out how to get everything onto the right nodes, but doing that really didn't produce an increase in performance.
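For reference, the pinning experiments amounted to something like the following — a hypothetical sketch, since the exact commands aren't given in the talk, and the device name, OSD ID, and node numbers are invented:

```
# Find which NUMA node an NVMe drive's PCIe slot hangs off
# (sysfs path varies slightly by kernel version):
cat /sys/block/nvme0n1/device/numa_node

# Run that drive's OSD bound to the same node's cores and memory:
numactl --cpunodebind=0 --membind=0 \
    /usr/bin/ceph-osd -f --cluster ceph --id 0
```

On this platform, none of that beat simply leaving irqbalance on, as noted next.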
So in this particular case, good old irqbalance did the job: all of the performance here was with irqbalance enabled and NUMA tuning ignored completely. Also, 50 gig is the fastest NIC you can put in an x8 slot, and these servers had x8 slots, so we couldn't use 100 gig — but really the only place you would see a performance improvement from 100 gig would be 4-meg object reads.

So that's a quick-and-dirty look at the performance results we got with 4K block. As a quick side note, all of this data and a whole bunch more — probably more than you want to read — is available in the reference architecture we posted yesterday to Micron's site. You can see how we did it: the ceph.conf, the scripts, everything you need to configure it and do the exact same thing we did.

The next part is future testing. Having done all-NVMe, what are we looking at going forward? There are really two pieces of technology that stand out in Micron's portfolio right now. The first is NVDIMMs — non-volatile DRAM. They come in 8-gig and 16-gig capacities and fit in a standard DDR4 slot, and they have a little bit of NAND in them and, effectively, a battery. When system power shuts down, they destage whatever is in the DRAM to the NAND, and when the power comes back on, they copy that NAND back into the DRAM. These can be used as block devices, so for Ceph you would create journal devices out of these NVDIMMs and get absurdly fast read and write performance out of them. Testing in an HP server, I got over a million 4K random write IOPS against one of these block devices at five microseconds of latency — they're just absurdly quick. So we're looking forward to using these within Ceph. Even with the Jewel release we can do it with small journals, two to four gigs, and then as BlueStore becomes GA we can use them for BlueStore, because it requires much less space. It's basically the perfect use case for NVDIMMs (there's a rough configuration sketch at the end of this section). The other piece of technology we have coming out is the Micron 5100 SATA SSD, our new flagship SATA drive. It goes up to eight terabytes in capacity using our 3D TLC NAND, which will let 1U storage nodes go up to 80 terabytes per U. The architectures we'd like to look at with it are all-SATA, if you just need capacity; SATA with an NVMe journal; and, most interestingly, SATA with an NVDIMM journal. Those are the tests we're looking at going forward. I'm going to hand it back to Brad — thank you very much.

Thanks, Ryan. One of the things I found when I was on Wall Street is that I would come to a conference like this, see a presentation like Ryan's, and get super excited. The techie in me would say: this is exactly the solution I want to go buy. Then I would get back, sit down at a management meeting, and everything would change, because I would typically get detractors saying, do we really have to have all NVMe? We just bought flash arrays — wait a minute, why would you need to jump like that? So we started to put this together. As you partner with someone like Micron, who makes the full spectrum of solid-state storage in every flavor, as well as persistent and non-persistent memories, you can start to take a look at what the difference really is.
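As referenced above, a rough sketch of the NVDIMM-journal idea: on a Jewel-era cluster, you would point a FileStore OSD's journal at the NVDIMM's block device. Everything below is an assumption for illustration — the device name depends on how the NVDIMM namespace is configured (for example with ndctl), and the journal size echoes the two-to-four-gig figure mentioned above:

```
# Hypothetical Jewel-era OSD with its journal on an NVDIMM block device.
# /dev/pmem0 is an assumed name; data lives on the NVMe drive.
ceph-disk prepare /dev/nvme0n1 /dev/pmem0   # data device, then journal device
ceph-disk activate /dev/nvme0n1p1           # bring the OSD up

# In ceph.conf, keep the journal small enough to fit on the NVDIMM:
# [osd]
# osd_journal_size = 4096   # MB; illustrative value
```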
So I'm just curious: how many people in the audience use hybrid arrays today? Anyone use all-flash arrays today? For all-flash arrays, you're two and a half times faster than a hybrid array. Then, as you start to move to the left on your screen, you look at direct-attached devices — what kind of difference would that make? At least when I talked to my management, there was a magic number at 100x. That was when I suddenly got someone's attention, because at two orders of magnitude I could deliver a completely different business capability. Some folks who've come and talked to us on the floor have said, we can get Ceph to run for some of the lower-performing storage use cases — but what about my databases? How do I really push this? Well, you can get two orders of magnitude by jumping to direct-attached storage, and it can go further. I should back up: for all-flash, you can still get a factor of 40 by going to direct-attached. And we didn't talk about SATA today, we talked about NVMe — so what is that difference like? If we move to NVMe, even if you have all-flash today, you're looking at an additional factor — and this isn't 480 percent, this is 480 times faster. What I found was that people were happy with storage as long as I could give it to them instantly and at very large capacity, which is essentially the solution Ryan has put together. And beyond what we talked about today, when Ryan starts talking about crazy fast, what does the future look like? As we move further to the left, we have 3D XPoint, which we've brought out in partnership with Intel, and even NVDIMMs. If we compare a SATA solution with NVDIMMs, that's an additional factor of 3,000. So when you start to think about how fast your logs could go, there's three orders of magnitude.

In summary, we'd like to think we've done some of the tuning for you. We realize that everyone's problem is different; we'd really appreciate the opportunity to partner with you, understand some of the challenges you face, and share the lessons we've learned collectively as a team — both the things we showed today and the things we can look at in the future, whether that's 3D XPoint, NVDIMMs, or, if you like, our archive SSDs. Thank you very much. So, are there any questions? If you can, go to the microphone. Thank you.

As part of your future testing, are you planning on testing with RDMA, either over Ethernet using RoCE, or maybe InfiniBand, that kind of thing? Is that set up for you?

Yes, absolutely, 100%. In the system we used — Red Hat Ceph Storage 2.1 on RHEL 7.3 — it wasn't fully baked yet; it's not supported in that release. But the second it is, it will be in all of our solutions, and it will invalidate these results, because it'll be way faster. So absolutely, RDMA is on the map and will be used in all of our testing going forward, as soon as it's supported by the partners that provide the software (a hypothetical sketch of the configuration follows below). And outside of Ceph — this is the reason for the comment about it being set up — you should know that just last week, Micron announced an NVMe over Fabrics solution.
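For context, in the upstream async-messenger RDMA work, enabling it is a ceph.conf setting. This sketch uses the upstream option names and an assumed NIC name; it is explicitly not supported in the Red Hat Ceph Storage 2.1 stack tested here:

```
# Hypothetical ceph.conf fragment for Ceph's experimental RDMA messenger.
# Upstream (Kraken-and-later) option names; NOT supported in RHCS 2.1.
[global]
ms_type = async+rdma
ms_async_rdma_device_name = mlx5_0   # RNIC device name is an assumption
```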
And on that note: look for SolidScale — Google "Micron SolidScale." Great question, thank you.

Next question: Are there any reference deployments for the solution? And in terms of performance — a lot of people have concerns about Ceph, especially about latency. These latency numbers look pretty good, but in general Ceph latency is reasonably bad, whether block or object, so how were you able to tune to these numbers? Also, I didn't see 4K sequential reads and the like, just 4K random reads. So, a two-part question: the first part on reference deployments, the second on performance — especially rebuilds on Ceph clusters slowing everything down.

Yeah, so I'll refer you to the slide I just put up, which points to the reference architecture, and it includes many more tests than the ones presented here. I don't believe I did sequential tests, but I did a 70/30 4K random mix as well as quite a few others, with the CPU information, network information, and latency down at the drive level. There's a lot of good material in there showing exactly how we got where we got and what the test results were across a number of scenarios, so I think a lot of your questions will be answered by it. And if there's a test in there we didn't do, or something customers see all the time, we're happy to add it. I work in a test lab and I get to build these cool things, but it really helps when customers come to me and say, hey, I see 64K random write workloads — what does it look like then? We can always add tests like that. And on your second question, about reference deployments: we'll have to give Ryan a little bit of time — we announced it just now, we published it just yesterday. Operators are standing by.

Next question: Just two quick ones. First, deployments traditionally use 3x replicas instead of 2x, so maybe you could say why you chose 2x for your tests. And second, I saw you shared average latency, but in my experience Ceph tends to have some really bad outliers — maybe that's because I'm used to rotational media instead of all-flash — so could you share any 95th-percentile or 99th-percentile numbers, anything like that? That would be interesting. Thank you.

Yeah. For the replication question, the PC answer is that SSDs are much more reliable than spinning disks, so you don't need 3x, you only need 2x. The real answer is that it gets prohibitively expensive if you do 3x, so pretty much anyone testing Ceph with SSDs is going to do 2x. One really interesting thing coming in Luminous is the ability to use erasure coding for block pools, and once that's an available feature, I think it'll be used by most of us setting up SSD Ceph clusters. On the outliers: in the reference architecture I have time-series data on the latency of the drives themselves, and it's fairly flat. I did not include 99.5 or 99.9 percentile latencies — I have the data, it's just not in the reference architecture, because it already got way too long and boring. But that's definitely something I have and can compile.

Okay, next question: So, one U, two sockets — that helps, correct? Does this mean the assumption of one U with two sockets changes when you can go to two servers per one U?
Would you even need a full U, since it's maybe only going to run two drives? That's a good question — the architecture question is interesting here, because you're right, going from four to ten drives you don't really see a performance improvement. Our buddies at Supermicro released, near the end of our testing, a device called the BigTwin, which is a 2U, four-node chassis, all NVMe. Each node — each blade, if you will — has two sockets and takes six NVMe drives in the front. So you can basically build a two-socket, four-to-six-NVMe node without losing a bunch of capacity or a bunch of drive bays, and that's actually referenced in the architecture as a possible alternative. Because yes, going from four to ten drives you really just add capacity, so you're losing out on the extra performance you're paying for with NVMe.

That was essentially my question. So the CPU is becoming the bottleneck at this point, and it seems like we're heading toward just two drives saturating the CPU, and it getting worse from there. There was a talk last year by Samuels from SanDisk about getting the CPU and memory out of the way — do you have any thoughts on that?

From my perspective, and from Micron's in general: we're a hardware manufacturer, so our approach is to brute-force the thing, which is really what we were attempting to do here — throw the fastest RAM and the fastest drives at these solutions. As far as tuning and changing the software stack, to be honest with you, we're not there yet. We've dipped our toe in the Ceph pool, and we've got some software developers on staff who can help work on things like that, but at this point we're just getting started on building compelling Ceph solutions.

I would add to that, though, that last week we announced we're working on a DRAM that can do sorting within the DRAM itself. So we're beginning to think about different functionality — you could think of that as an offload out of the CPU. There are a number of things under discussion and consideration, and these are the sorts of things we could talk about as a partnership. Good question.

Next: I'd like to second the request for 3x replication results, just because that's what most people run in production. And erasure coding results would be interesting too.

Yeah, they would be. Really, the reason we didn't do erasure coding in this one is that the object test was kind of a side test — something I figured needed to be in the report, but not really the main use case for an architecture like this. But absolutely, erasure coding will be used the second it's available from a block-device standpoint.

I did test — let me repeat the question: he asked whether I changed the memory allocator, tcmalloc versus jemalloc, to see if there was a difference. In my testing, the default tcmalloc that comes with RHEL worked just fine; jemalloc didn't provide any discernible difference in performance, so going through the extra steps to get there didn't seem worthwhile. I know that had been a major problem in the past, though, so I definitely took a look at it.

In terms of price points: Ceph is usually seven to ten cents per gig, so with an NVMe solution, what pricing are we talking about? Well, I'm just a simple engineer —
— so if you're interested in pricing, we definitely have sales staff here who can get you that information. That said, in my fabulous reference architecture there is some pricing guidance, in that it shows a dollar-per-IOP comparison of the different solutions based on pure MSRP of all the parts. So you can see, going from two to four to ten drives per node, what your dollar per IOP looks like across those configurations.

Next question: On one of the slides you presented, you went from two SSDs to four and then on up. Ceph's best practices recommend a single OSD per device, but there was some research done by Intel, and they published a paper saying they were unable to saturate a single SSD with a single OSD — I forget what the size limitation was, but they recommended breaking each SSD into four partitions and putting four OSDs on each device. So it would be interesting to compare Intel's recommendation against your results, where, even up to ten devices, if I understood you correctly, you were using a single OSD per drive?

So, in my testing, with ten drives per node each drive had a single OSD on it, with a 20-gig journal co-located with the rest of the data. On the four-drives-per-node test I used two OSDs per drive, and on two drives per node I used four OSDs per drive, and I found that at those levels I maxed out the CPU. I know that in previous versions of Ceph — Hammer specifically — you absolutely needed to cut up SSDs, because the individual OSD processes could not push enough work through them. But with Jewel and the current release, we were able to use fewer OSDs, which helps with complexity and deployment and all of that.

So even with this setup, you were actually network-bound, not drive-throughput-bound, right? Well — CPU-bound on the small-block piece, network-bound on the 4-meg object reads, and drive-bound on the 4-meg object writes. Okay, thank you. Yeah, no problem.

Do you have any numbers for a single VM running a single-threaded application, and how many IOPS it can do? I think that's the major limitation of Ceph in the current release. So, I know from testing that — I had 12 load-generation servers pushing load against my cluster, but for 4K random write I/O I ran an FIO process with one thread at queue depth one, and I was able to get 22,000 4K random writes out of a single client; I then had to scale that up to get to the maximum numbers. For reads, I honestly have no idea — I don't remember what that ended up being. Thanks.

Going once... All right, well, thanks, everybody. Really appreciate it, and have a good day. Thank you.