All right, we'll get started. We're a little bit short on time here. First of all, welcome, and thank you for coming. My name's Matt Tangvald, and I work for NetApp. For the last few years, we've been working on how to use our SAN platform, our scale-up SAN platform, with third-platform workloads, and one of them is Ceph. We've recently published some reference architecture documents, so we wanted to go through what we found in the process. We'll go over a quick introduction of the work that we've done and why this matters. Then we'll spend some time talking about how we set up our lab with a traditional Ceph deployment, what we refer to as white box; in our case, we used a tier-one server vendor, but essentially no intelligent storage. Then we'll cover how we modified it to actually use enterprise storage building blocks, and what we found between the two. If you have any questions, you can shout them out. These lights are really bright, so I don't know if I'll be able to catch you if you raise your hand.

So first of all, what is Ceph? Anyone here running Ceph in production today? Anyone running a proof of concept, experimenting with it? OK, yeah. Generally when we see Ceph, the core value props really come from the fact that it is a unified, software-defined storage platform. Today in production, you have the ability to run block workloads as well as object workloads, and you can scale the namespace almost infinitely by adding more nodes as needed, so you don't have any capacity limitations. File is in the works; it's been in tech preview for a little over a year, but right now it's not fully supported. So really, it's about having that on-demand capacity scale with the right IO interface for your workload.

So then the next question is, why does NetApp care? Well, we were the first enterprise storage vendor to be part of the OpenStack Foundation, we had the first Cinder drivers, and we were co-creators of Manila, the file service. And you can see, and this is actually six-month-old data from April in Austin, that almost half of all block storage deployed inside of OpenStack today is on Ceph. So it's one of those things where we absolutely need to understand the technology in order to be a trusted advisor to our customers and partners. And you certainly couldn't ignore it, since roughly 50% of what's out there is Ceph.

So then that begs the question of, why do you care about what we found? Well, one of the interesting things with Ceph is that it can dynamically scale, and it can also dynamically heal itself using the CRUSH map technology. However, nothing is free. We set up an eight-node cluster, and we're gonna go into detail, and we pulled two pieces of media out of it. What you're seeing there is simultaneous reads and writes; we can get into all the details. But when you remove that media, where the red line is there in the center, we saw a significant drop in performance, which is Ceph doing its job: it's healing itself and rebalancing the cluster. That's a two-hour time window you're looking at there, so even over an hour and a half later, we're still dealing with the cluster rebalancing. Depending on what your needs are, if you were to deploy this as archive, that may not be a big deal. But if this was an image repository and you were an online merchant, this may not be acceptable.
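For reference, the load we were driving during that two-hour window was a sustained mix of reads and writes against the cluster. Here's a minimal sketch of that kind of load using fio's rbd engine; the pool name, image name, block size, and 70/30 read/write mix are illustrative assumptions, not the exact parameters from our tests.

```python
# A rough sketch of a sustained mixed read/write load against a Ceph RBD
# image, similar in spirit to what we ran while pulling drives. All names
# and tunables below are hypothetical.
import subprocess

subprocess.run([
    "fio",
    "--name=mixed-rw",
    "--ioengine=rbd",                  # fio drives I/O directly against RBD
    "--clientname=admin",              # CephX client to authenticate as
    "--pool=rbd",                      # hypothetical pool name
    "--rbdname=bench-img",             # hypothetical pre-created image
    "--rw=randrw", "--rwmixread=70",   # assumed 70/30 read/write mix
    "--bs=64k", "--iodepth=32",
    "--time_based", "--runtime=7200",  # a two-hour window, like the chart
], check=True)
```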
And we found a lot of cases where folks actually over-provision their clusters so that the lowest state, that failed state, is their minimum bar; but that means they may have to buy additional servers and storage to make sure they don't go below a threshold. Well, what we found is that when we used enterprise storage, there's almost no performance degradation. I'd really love to tell you it's zero, but it's about one or two percent. It's the same size cluster, the same eight OSD nodes, but almost no degradation in performance with simultaneous media failures. And we'll spend some time going into that. I just wanted to cover that at the front end, because why this matters is delivering deterministic performance with your Ceph cluster.

So what does our cluster actually look like? We actually have side-by-side clusters. I would call this probably a small to mid-size cluster; eight nodes is kind of the minimum you'd want to run with. From a capacity standpoint, it's about 200 terabytes of actual usable data, which means roughly 600 terabytes raw, and we'll get into the specifics. But it's a very classic configuration: eight servers, and we'll go into what the storage configuration is for each of them. We have monitor services running; we actually don't run those on dedicated physical servers, we run them on top of the OSD nodes as a service, which is a very common methodology. And of course, we've got client IO.

So what's inside each of them? Well, we have a pool of near-line SAS drives. We used near-line SAS in this case, instead of serial ATA, because those are the same types of drives we're gonna use in the enterprise storage, and we didn't want the drives to be any part of the decision; we wanted it to be as fair as possible. The same goes for the SSDs that were used as well. You'll notice that we kept the classic ratio of five OSDs to one SSD. That's a very common ratio, and we'll show you how that can actually change when you use enterprise storage. So this is a pretty typical OSD configuration: SSDs for high-performance journal writes, with slower near-line SAS magnetic drives for actual capacity.

One of the things that doesn't get discussed often is exactly what is going on when you write in Ceph. Ceph is very, very good at ensuring your data is properly written; in fact, it's fully consistent. What that means is that every time you write data, there's no acknowledgement back to the client server until the journal entry is written and the data entry is written for as many copies as you have in the cluster. For our cluster, we chose a replication factor of two (two replicas in addition to the primary), so there are three total copies in the white box version. That actually means there are six writes that have to occur before we can acknowledge to the host server that its one write has been completed. This is typically referred to in enterprise storage as write amplification. You may have heard that term with SSDs, where it comes into play with rebalancing and correction for cell failures, but it also plays a key role here in Ceph, and we'll show you what we've been able to do with that.
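To make the write-amplification arithmetic concrete, here's a minimal sketch; the only inputs are the numbers from the talk (a journal write plus a data write, per copy).

```python
# Back-end writes Ceph issues before acknowledging one client write:
# each copy is written twice, once to the journal (SSD) and once to the
# data store (NL-SAS).
def backend_writes(total_copies: int) -> int:
    return total_copies * 2

print(backend_writes(3))  # white box, three total copies -> 6 back-end writes
print(backend_writes(2))  # enterprise storage config, two copies -> 4 writes
```

So, what happens when there's a failure in the white box? It's just a standard server running disks; again, we showed you what those disks look like. I've got essentially a standard server, and inside it are the disks, the CPU, and the RAM.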
Well, when you have a media failure, we actually have to spend time rebalancing, or recopying, the data to recover from the failure. So if we lose one copy, we need to rebalance to get back to those three copies. There are a lot of questions about why it's three copies. The easiest way to think about it is that you want one copy to cover a drive failure; it's not a question of if drives fail, it's really about when. But then there's also the case of an entire node being disconnected from the network, which could be a catastrophic motherboard failure, someone tripping over a cable in the lab, or a network outage.

Now, in the case of Ceph, we have two networks. We have an exterior 10-gig network for our data to come in, but we also have a 10-gig back-end network that carries these rebalancing activities so they don't consume bandwidth on the primary IO network. However, rebalancing does use CPU and RAM cycles, and that's what you're actually seeing when that performance dipped down: part of the servers are now dedicated to rebalancing instead of processing IO.

So what happens if you change from the standard methodology, white box with no intelligent storage, to using enterprise storage? Well, the first thing we need to talk about is the common wisdom, documented for quite some time: do not use RAID with Ceph. In fact, if you have, say, a RAID controller in one of the servers, set it to pass-through so your disks are raw, or if you can't do that (not all RAID controllers can), make each disk a single-disk RAID 0, so the OSD capacity is the same but you're not getting any protection from the controller. What we've found is that this may not really hold. There are cases where using RAID underneath can deliver performance benefits as well as scaling and cost benefits, and we're gonna go into that in more detail, but we wanted to acknowledge the common wisdom up front. And in fact, not only do we acknowledge it, we've been working with Red Hat on this; we actually started the conversation before Inktank was purchased by Red Hat, and we've continued it.

The most interesting thing is that as we tried to set up that first Ceph cluster you saw, we realized how limited our experience was. We have very smart engineers who know how to test traditional workloads, both Windows and Linux, with things like databases on enterprise storage, but deploying Ceph was rather difficult, so much so that we had to enlist Red Hat's help through professional services to train us and to set it up, deploy it, and optimize it. And we knew we needed to do that with a white box cluster before we could proceed to this next step. But we believe, and Red Hat would agree (we'll talk about the documentation we're working on with them), that there are cases where it really does make more sense to have RAID underneath.

So what does it look like instead? Well, in this case, we picked a standard, highly available, enterprise-grade storage building block: E-Series. I promise this is not a product pitch, so this will probably be one of two slides where you're gonna see actual product names. We've got a high-density 4U enclosure with 60 drives, SSDs and spindles, and we actually expand it with another 4U 60. And this system is highly available, and by that I mean everything in it is redundant.
Redundant controllers for multipath failover, redundant power, redundant cooling; and SAS, compared to serial ATA, is also dual-ported, so we have redundant paths to each drive as well. And instead of the three nines you see with a traditional white box server (I think you probably know from Tuesday's keynote, where we talked about nines, that three nines means measuring downtime in hours per year), our system is five nines. Five nines is about five and a quarter minutes of downtime a year. As the keynote presenter mentioned, that's not even really enough time to have a cup of coffee, but that's key to the value prop we're able to deliver.

In this case, we wanted to be apples to apples, so we connected to the servers using Serial Attached SCSI, a direct connect. And you might ask, what does that look like when you do it? Well, it's actually quite basic. The system we used, the E-Series, has eight SAS ports on it, and since we're going to have failover, that really means I need two ports per server across the two controllers; eight divided by two means I can only have four servers directly attached. There are ways to scale it with Fibre Channel, InfiniBand, or iSCSI with switches, but we used a direct-attach configuration. So of our eight servers from before, four of them are connected to one E-Series pair and four of them are connected to another.

Now, what happens in this case? Well, instead of having the servers process a media failure, we can have the enterprise storage handle it on the back end. We refer to this as a transparent recovery, and in this case we have hot spares, so there is idle disk available to do this, which means it's real-time: once a drive fails, we start a rebuild, and Ceph is unaware that it's happening. And there are a few things we get with that. First and foremost, because disk failures are already covered for me, I don't need three copies anymore; I need two, so I can set the replica count to one (one replica plus the primary). Because we're using RAID, and we'll go into this in a bit more detail, it's not exactly two copies; it's about 2.5 in our case with our RAID configuration, because there is parity overhead. But still, when you think about scaling capacity in a Ceph cluster, we're now at 2.5 versus three, so it's more efficient to scale with enterprise storage. That's one benefit on top of deterministic performance.

Now, you can't just expect this to work. You actually have to spend time, and right now we have to do it manually, ensuring that our OSD trees are set up so that only the right four servers are connected to one E-Series, one enterprise storage platform, and that the OSDs are balanced for failover between them. That is a manual process. We'll talk at the end about some of the work we're doing with Ansible; we were the first enterprise storage company to have Ansible modules, they're actually posted in the Ansible extras repo on GitHub, and moving forward we can automate this configuration and provisioning as part of Ceph installation. But we did have to do it manually, and in fact, not only did we have to do it manually, we had to go through and view the OSD tree and verify that the right capacity with the right OSDs was in fact delivered.
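As a rough illustration of what that manual OSD-tree setup involves, here's a minimal sketch using stock Ceph CLI commands; the bucket names, host names, pool name, and rule id are hypothetical, and this is a simplification of the idea, not our exact runbook.

```python
# A sketch of grouping the eight OSD hosts under two CRUSH "chassis" buckets,
# one per E-Series controller pair, so the two copies never land on the same
# array. All names below are hypothetical.
import subprocess

def ceph(*args: str) -> None:
    subprocess.run(["ceph", *args], check=True)

# One bucket per E-Series pair ("chassis" is a stock CRUSH bucket type).
for pair in ("eseries-a", "eseries-b"):
    ceph("osd", "crush", "add-bucket", pair, "chassis")
    ceph("osd", "crush", "move", pair, "root=default")

# The four hosts attached to each pair go under that pair's bucket.
for host in ("node1", "node2", "node3", "node4"):
    ceph("osd", "crush", "move", host, "chassis=eseries-a")
for host in ("node5", "node6", "node7", "node8"):
    ceph("osd", "crush", "move", host, "chassis=eseries-b")

# Replicate across chassis rather than across hosts, then drop the pool
# to two total copies (primary plus one replica).
ceph("osd", "crush", "rule", "create-simple", "across-pairs", "default", "chassis")
ceph("osd", "pool", "set", "rbd", "crush_ruleset", "1")  # id via `ceph osd crush rule dump`
ceph("osd", "pool", "set", "rbd", "size", "2")

ceph("osd", "tree")  # eyeball that capacity and OSDs landed where expected
```

That does beg the question: wait a minute, before, you had 20 disks in each white box server, but now you're saying you only have five OSDs, in RAID configurations?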
Well, as it turns out, each OSD is a five-disk RAID 5, of which only four disks' worth is usable capacity. We have five of those per OSD node, and each OSD has four times the single-drive capacity, roughly 12 terabytes of storage, so we end up with the same number of data spindles available for IO and the same capacity. Because of some configuration constraints, you'll see that we actually couldn't get the capacity exactly equal, but we had to go through these steps to ensure that we configured this properly.

So what did we find? Well, I showed you the highlight at the beginning, because I didn't know how long you were willing to stay: when we have dual media failures in our cluster, there's almost no performance degradation, compared to the degradation you'd traditionally see in a standard cluster. When we go through and look at it, there are some interesting facts. We ended up with more usable capacity, because of the way we ended up having to provision it, but we still have those eight server heads in there. And we know that we get about 25% better scaling as we add capacity, at roughly 2.5 effective copies compared to three. But one of the other things we were able to do here, if you look at the ratio of SSDs, is that the five-to-one ratio, which you really do not wanna go over in white box, ended up at 11.3 to one in our configuration.

So there are gonna be questions about whether this costs more, and whether it impacted the cluster's performance, and we've actually looked at both of those. When we look at the cost, the bottom dark blue bars are the cost to acquire the hardware, and as you would expect, buying enterprise-grade storage with enterprise warranties is more expensive than buying white box. So on day one, it is more expensive to run Ceph on enterprise storage. However, when you factor in the cost of software, maintenance, power, and cooling, in the first year our cluster is on par. And as you scale over time, because our system is five nines reliable, and our media tends to be more reliable because of our testing and procurement processes, you can actually see that the benefit increases. So the first year, day 365, is on par, but scale that out to years two, three, four, and five, and it's less expensive in this particular configuration to run enterprise storage than white box storage. That was a bit surprising; I did not expect that. And we can talk about it more afterwards, because I'm sure there are people in here who do not believe this. A lot of people inside of NetApp didn't believe it. I didn't believe it at first. We actually had to spend a lot of time working on this with Red Hat as well.

Well, what about performance? Our goal was to ensure that we had roughly equivalent performance between these two configurations. That was a base rule we set out with, and so we wanted to measure it. When we look at what our system can deliver from a throughput perspective, you'll see that we do have the product names up here; I put them up there simply because I want to make sure we're representing the product we measured performance on. You can see that there are cases where the white box is slightly better and cases where E-Series is better, but in general, there's no major deviation in cluster performance from a throughput perspective. And when we look at it from an IOs-per-second perspective, we found similar results.
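As an aside, the capacity parity between the two configurations is easy to sanity-check. Here's a minimal sketch of that arithmetic; the 4-terabyte drive size is an assumption inferred from the rough totals in the talk, not a number from the slides, and the talk's figures are approximate.

```python
# Capacity sanity check for the two configurations described above.
# DRIVE_TB is a hypothetical NL-SAS drive size; the talk's totals are rough.
DRIVE_TB = 4.0

# White box: 8 nodes x 20 drives, three total copies.
whitebox_raw = 8 * 20 * DRIVE_TB                 # 640 TB raw
whitebox_usable = whitebox_raw / 3               # ~213 TB usable

# E-Series: 8 nodes x 5 OSDs, each a 4+1 RAID 5, two total copies.
data_spindles = 8 * 5 * 4                        # 160 data drives, same as white box
eseries_raw = 8 * 5 * 5 * DRIVE_TB               # 800 TB raw, parity drives included
eseries_usable = data_spindles * DRIVE_TB / 2    # ~320 TB usable

print(whitebox_usable, eseries_usable)           # more usable capacity on E-Series
print(eseries_raw / eseries_usable)              # ~2.5 effective copies versus 3.0
```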
Now, there is a bit of variance, and that has to do with RAID striping and the like, so it's not exactly the same. But we believe, working with Red Hat, that this is equivalent performance between these two configurations. You may ask, what about latency? Well, with latency, it's roughly the same story. Some cases are a little better, some are worse, but there's no case where we're way out of range, where we've increased latency significantly by adding enterprise storage. That was a fundamental belief for a lot of people: they did not believe you could add an external storage enclosure, cable all that up, deal with the RAID overhead, and not have additional latency. So from our perspective, we feel like we met our goal of equivalent base performance in an optimal state.

So really, at the end of the day, what we've figured out is that by using enterprise storage with Ceph, we can deliver deterministic performance, and we can optimize the price around it so that it is not significantly more expensive, and in cases over time, less expensive. That's not exactly what any of us expected to find when we went out and performed this, so it's actually very compelling from our perspective. And at the same time, we can scale compute and storage independently, and we're more efficient at scaling capacity, because we have two replicas on RAID instead of three raw replicas.

So what are we working on next? All the work you saw was on Red Hat Ceph Storage 1.3. Ceph Storage 2.0 launched at the end of August, and we're in the middle of upgrading our test infrastructure to go through and retest. We don't expect to see major differences, but of course we wanna update it. The thing that's unique about this type of testing, which is very different from what you'd see from a standard enterprise storage company, is that we're not gonna run every single configuration, exhaustively test everything, and then publish a report. We tested with 1.3, we worked with Red Hat to make sure that our tests and our optimizations were correct and that we weren't distorting or misconfiguring Ceph, but that's just the first iteration. So 1.3 is the first round; all of the collateral I'll talk about is 1.3-based, and we'll have 2.0 hopefully by December of this year. As I mentioned, we're in flight setting that up.

If you'd like to find this information, these slides will be posted, but we've also published a four-page solution brief that covers all of these facts, including the cost, performance, and benefits, up on NetApp's public website. You'll notice this isn't behind a firewall or secured; you can directly find this documentation. And then we have about a 50-page reference architecture document, what we refer to as a technical report, or TR, that we've also posted. It'll show you exactly what we tested, how we tested it, and, if you wanted to do it yourself, how you would do it. Red Hat has reviewed those collaterals with us, and they have blessed us to put their logo on them, so we're in the process of updating based on some minor pieces of feedback and putting their brand on there. Those versions will be posted on their site before the end of the year as well, and I think that's really important to note.
I can come up here and tell you that I can do whatever I want with my storage box in this new on-premises cloud world where everything's supposed to be white box, but if I don't have the support of the vendor, of the actual creator of the technology, that doesn't really mean much, because if you pick up the phone to call them, they're like, we don't know what you're talking about; we told you no RAID. So we're actively working through what this looks like and how we can promote it together, but we agree that there are cases where this makes the most sense to do.

So with that, are there any questions? I'm gonna come down so it's easier to see. Just yell them out. None? Yes; so the question is, hey, what happens if your entire E-Series dies? Let's say someone cuts all of the cables that go into it, or pulls both power cords at the same time. Well, that's why we've kept two copies in the cluster, and we've balanced those copies to ensure that copy one is on one set of enterprise storage (controllers; we call them controller pairs) and that copy two is fully, discretely on the second one. So we can actually cover that failure, a catastrophic enterprise storage system failure, because we have another copy in the cluster. Now, that of course will impact performance. I mean, at that point it's a major catastrophic failure; it's equivalent to unplugging half of the servers. But the cluster is still functional, and that is one of the test cases we're looking at getting data on after we go through our Ceph 2.0 testing.

Yes; so the question is, hey, what happens then? Well, you're gonna have to copy the data. But again, the interesting thing is we're doing one copy of data, one bulk transfer, whereas in certain cases, depending on how you balance your Ceph cluster, losing one server means you can actually lose two copies of data, which can be challenging. So yes, there would be a bulk rebalance. Now, there are ways, and we're looking at this as well, to capture the delta change so we're not having to do an essentially raw copy of the data. But yes, a catastrophic enterprise storage failure would require a full copy. Then again, the downtime on our system is less than five and a half minutes a year. Any other questions?

Yeah, so, Dynamic Disk Pools. Interesting fact: Sage Weil, and I really appreciate you all being here considering he's presenting down the hallway on the future of Ceph right now (really unfortunate scheduling challenge), created the CRUSH algorithm, and we actually use the CRUSH algorithm within the E-Series. We refer to it as Dynamic Disk Pools, and we can balance internally. We did a significant set of testing. The challenge we found is that while it's functional, and to be clear, as long as you're connecting block LUNs your cluster will function, whether it's performant or optimally configured is the question we really answered with this first round of testing. But why wouldn't you use Dynamic Disk Pools? Well, it's a compound effect. If you're using CRUSH horizontally, which essentially means we're dynamically rebalancing across the eight nodes, and then you dynamically balance up and down within the E-Series, then depending on how the IO is written, we ended up with variable latency, and we didn't like that.
So there were cases where it was actually significantly faster than RAID 5, and there were other cases where it was slower because of latency, just depending on where the E-Series Dynamic Disk Pool balance was versus the CRUSH balance in the Ceph cluster itself. So we did look at it; that is not a recommended configuration.

That said, why did we pick 4+1 RAID 5? Well, we wanted to start with the premise of, what if we just make each OSD bigger? Because OSD size can be a limitation: especially as drives continue to get bigger, you may not wanna put 20 ten-terabyte drives in there, because of the time it takes to recopy. So the base premise was, what if we made each OSD bigger; can Ceph actually consume larger OSDs? Because we not only made them bigger, we made each OSD five times faster, since there are five drives being read and written at the same time. You could easily do this with RAID 6; that's also in our analysis, and we're looking at how far we can extend the size of them. Any other questions?

Yeah, so: hey, we've heard of customers that have seen variability in block implementations of Ceph, and they're trying to make sure they have an SLA on its performance. That is an inherent challenge with using the CRUSH technology. We even see it if you just use an E-Series with raw DDP; there is a bit of variability, though not as much, because it's not distributed. And in fact, when we had all this data ready to show Red Hat the first round of it in April, we really didn't know what they were gonna think, because they'd helped and consulted with us, but we hadn't met with the Ceph team directly and said, what do you think about it? And when they looked at it, they were very excited, because for customers that do need SLAs, that do need deterministic IO in block clusters, we can address that. And that's exactly what you're seeing.

One of the things we wanna show in the next round of testing as well is that when you finish rebuilding, the performance comes back up, but you haven't replaced the failed drives in the Ceph cluster yet. So then you replace the failed drives, you rebalance again, and the performance goes down. So there's this sort of sawtooth motion in the performance. What customers are doing today to address variable performance is over-provisioning the cluster. What I mean by that is, if we knew that the cluster performance goes down by, say, 30% with multiple media failures, then I'm gonna build the cluster big enough so that even with media failures we meet that minimum performance; to hold a floor at 70% of peak, you'd need roughly 1/0.7, or about 43%, more capability. So you essentially have to over-provision the cluster to do that, more servers and more disks, which is what we're addressing with this. And we haven't exactly calculated all that out; it's another iteration of testing we wanna do. Any other questions? Nope.

All right, well, I'll be up here, and I can talk to you all. Please let me know if you have any questions. Again, my contact info is on the slide, so you'll be able to find it; it's also in the documentation we published on our website. I really appreciate your time. Thank you very much, and safe travels home.