We'll wait a minute or two, since we saw a couple of people still filtering in. We can start with some introductions, though.

You bet. So I'll start. I'm Randy Bias. I'm on the OpenStack Foundation board of directors, and I'm also a well-known figure in the OpenStack community; I've been a part of OpenStack since its launch in 2010, so I guess we're five years into the journey now. My company, Cloudscaling, was one of the early pioneers in OpenStack. It built some of the earliest OpenStack deployments, had two major customers in AT&T and Walmart, and was acquired by EMC last year. So thank you, EMC. For those who don't know, before I was doing the whole cloud thing I was a long-time infrastructure guy. I've been doing infrastructure, security, storage, networking, automation, all that stuff, for a very, very long time. An uncomfortably long time: almost 25 years now. And then I'll let Jeff introduce himself.

You bet. Hello, my name is Jeff Thomas. I am a systems engineering leader for ScaleIO. I handle our pre-sales and make sure that our installations get done properly, most of the time.

Great. So first of all, it's great that we have such a great turnout here to see this today, and we're really excited to show it to you. For those who aren't familiar, there was a blog post that I did talking about the architecture of Ceph and some of the problems I saw with it. So go ahead and go to the next slide.

When we decided to talk about this, actually do a live demonstration, and show you ScaleIO's performance versus Ceph and the rationale behind it, it was a little bit funny, because people were giving me a little bit of grief. They're like, "Oh, you're validating Ceph," and I'm like, I don't need to validate Ceph; it's already validated, right? Something like 62% of OpenStack deployments, according to the latest user survey, are using Ceph. See, there I go, marketing Ceph, my competitor's product, right? That's how I roll. But the point is that EMC has been doing block storage for a long time; we're the leader globally if you look outside of OpenStack. And so I thought it made sense to really bake these things off. We're the startup, I hate to say it. We're the little guy in OpenStack: only five percent of OpenStack deployments use EMC storage. Part of my job is to change that, so we're here today to show you why.

So, what's interesting, go ahead to the next slide, what's interesting is that if you are thinking about block storage, most people who want block storage are looking for performance, because performance matters, right? Tier one, tier two, tier three: you care about performance if it's tier one block storage. And so I think the challenge is, how do we determine why Ceph is so dominant? I think there are a couple of different reasons why Ceph is so dominant in OpenStack. The first is that it's multi-protocol: you get block, you get object, you get file. That's fine, that's great. The second is that it's open source and it doesn't have a lot of peers. It used to compete with Gluster, and then Red Hat bought Inktank, so now they don't compete, or something.
I'm not sure, but the point is that if you look at the actual data from the OpenStack user survey, the number one block device driver in Cinder is Ceph: not the Ceph file system, not Ceph object, but Ceph block. And I think that's interesting, because most people are expecting high performance from block storage, right? But it's also interesting because even the Ceph folks will tell you that it's not really designed to be a high-performance system. They'll tell you that what it's really supposed to be is a multi-protocol system that gives you block, object, and file in a single system, and that it's open source and it scales. Okay, which is fine.

And so the blog post, and part of what we're trying to do today, is really to talk about why being multi-protocol is actually a problem. That's sort of the storage unicorn, right? You see this a lot today: customers come in trying to solve their data center problems and they say, "Well, I want one piece of software that does this and this and this and this, and it's all going to do that perfectly well, and all of it's got to be really high-performance and inexpensive and easy to manage." Those things just don't exist. So single-purpose tools are sometimes better. If you're out in the forest and you're camping, say you're on Mount Fuji and you're camping, do you want a Swiss Army knife, or do you want a set of tools that will help you build a shelter for the night? It depends, right? If you're going to be there for one night, the Swiss Army knife is probably good enough. If you're going to be there for a week, you probably should have a belt full of tools.

I think I covered a bunch of this. So these trade-offs, that's the key thing to look at today: how does Ceph achieve being multi-protocol? Because block, object, and file are all a little bit different, they're all trying to accomplish slightly different goals, and so some compromises have to be made. We want to show you what the result of those compromises is. So we decided to do this bake-off, and we're going to continue to look at this stuff.

And this isn't to say Ceph is bad, okay? Like I said, Swiss Army knife: Ceph is appropriate for certain things. If I was going to build a small system of, say, five servers in a rack, and I didn't care about scaling it past five servers, and I didn't want to deploy EMC ScaleIO plus EMC ECS plus EMC whatever, then I would prefer Ceph; it would make sense at five servers. But I don't care about five-server deployments. I care about five-racks-and-up deployments, because that's the only way I think you get value out of cloud. That's my particular bias.

All right, with that I'm going to hand off to Jeff, who's going to start running us through this. There will be lots of time; we've got a few points where we're resetting the environment, so there'll be plenty of time to ask questions, to poke fun, or whatever it is that you need to do. All right.
Thanks, Randy. So just to reiterate, we are going to be doing a block bake-off. Ceph does many things; we're going to be concentrating on the RBD portion of it.

When we look at the way Ceph works compared to ScaleIO, I think a lot of you probably have a good notion of how this works, but for those of you who don't: Ceph is traditionally deployed in a two-layer, client-and-server environment, where you've got clients that run the Ceph client, and on the back-end servers you've got the RADOS infrastructure, which lets us scale out and gives us the ability to put as many servers as we want underneath. So when a VM goes to write a block, it writes to that Ceph client, the Ceph client pushes it down to RBD, it goes through the object translation layer, through the Linux file system, and down to disk.

For ScaleIO we have a similar approach. We have the notion of a client and a server, and that client installs in the hypervisor, inside a Linux operating system. We can deploy in both this two-layer configuration and a hyperconverged one. For the tests we're going to do today, we'll deploy in two layers so that we have very much a like-for-like comparison between Ceph and ScaleIO.

As you can see as we build this out and look at the data path that a particular block takes as it's written: a VM writes a block to a virtual disk, the Ceph client receives that block, it gets sent to RBD, RBD maps it to RADOS objects, and that gets written to the file system and down to disk. So RADOS is the common layer, the object layer, that Ceph uses to allow multi-protocol heads, and it allows for the scaling out: I can keep adding servers, and RADOS is the layer that lets me have more and more servers as I scale out. It knows where all the servers in the cluster are, and as a block request comes in through the RBD gateway, which is the block device gateway, it gets mapped into that object layer. And it's not a one-for-one mapping; the objects that Ceph uses are fixed sizes, I think a megabyte or something like that. Help me out? No? I don't mind if you chime in; that's why you're sitting up front, huh? Okay, all right, no worries. I don't want this to be inaccurate, and I appreciate any corrections.

All right, so when a ScaleIO VM goes to write to a virtual disk, the ScaleIO data client receives that block and forwards it out to the ScaleIO data server, and that goes down to disk. So the main point here: a lot fewer steps to get from the VM that's doing the writing down to disk. Yes, it's all done over IP in both configurations. The question was whether this is all done over a TCP/IP network, and the answer is definitely yes: with both the Ceph solution and the ScaleIO solution, IP protocols are used.

From a deployment perspective, as I introduced a minute ago, there's a ScaleIO data client (SDC) as well as a ScaleIO data server (SDS). We're going to run our test today in this two-layer configuration, where we've got storage servers that are just advertising out disk and clients that are consuming it, but we can run this in a very hyperconverged way, and that's how most of our customers actually deploy it in production.
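(For reference: a minimal sketch of how a Ceph block device like the /dev/rbd0 used later in the demo is typically created and mapped with the kernel RBD client. The pool and image names and the placement-group count here are illustrative assumptions, not the presenters' exact configuration.)

    # Minimal sketch, assuming the kernel RBD client and a pool named "rbd".
    ceph osd pool create rbd 128          # create a pool with 128 placement groups
    rbd create rbd/vol0 --size 81920      # 80 GB image, matching the demo's volume size
    rbd map rbd/vol0                      # exposes the image as /dev/rbd0 on this client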
We've chosen to use Amazon instances here in Japan to build out these tests. We have a management network that we use to communicate with the machines, as well as a storage network where the back-end storage traffic goes. The five client machines are running both the RBD client and our SDC client, and these are c4.large compute-optimized instances. On the server side we have five servers each for Ceph and ScaleIO, and we're going to start off with a test running two devices on each of those servers.

Now, part of the reason we've chosen to use Amazon is that we want to make sure we're giving the exact same resources to both Ceph and ScaleIO. The devices we're using are EBS volumes that each have 3,000 provisioned IOPS. Provisioned IOPS means we're guaranteed by Amazon to get 3,000 IOPS out of each of those devices. If we look at the kind of performance that we should get, what we hope to show you today is that ScaleIO can exploit all of that underlying performance.

So this first test we're going to run is five hosts with two devices each, so I end up with ten of those 3,000-IOPS devices, which gives me an IOPS potential of 30,000 IOPS. We're going to run a second test where we scale up that configuration by adding devices to those same five servers, and then we'll do another one where we add five more servers: a scale-out test with both. We're going to run a 50/50 read/write workload; it makes it a little easier to understand what's going on if we run just a single workload in the short time period we have today. How big are the blocks? We're going to use a 4k block size.

If we look at what we should expect in terms of IOPS potential, there's a notion of both front-end IOPS, which the client sees, and back-end IOPS, which happen on the storage. ScaleIO uses a two-copy protection scheme, and we've got Ceph set up the same way, with two copies as well. So we'll expect to see 20,000 front-end IOPS out of the ScaleIO system, and as we run the additional tests we'll scale that up as well. And we should see linear scaling then, right? You bet, absolutely.

So for the first test, there are 80-gig volumes on each of those five clients. We're going to kick off an FIO job and see what happens. And this is just a single-threaded FIO on each instance, basically? That's correct.

All right, so we can take a quick look here at the Ceph health. Is caching enabled? We're not doing caching, because we're effectively using SSD devices and caching would slow us down. Does the Ceph config have the journal and data on the same disk? Yes. So we can see here our Ceph configuration: we've got five machines, each with two devices. Are those also SSDs? Right, these are the EBS SSDs, EBS volumes with 3,000 guaranteed provisioned IOPS. What's underneath there may be a whole bunch of physical disks, it may be a single SSD; Amazon abstracts that away from us, but we'll see on a per-device basis that we're getting those 3,000 IOPS each. They didn't tell us we'd literally be in the hot seat today when we did this.

All right, so we kick this off. You can see some iostat data up here.
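(To make the expected numbers concrete, here is the back-of-the-envelope arithmetic for the three test configurations, a sketch based on the figures quoted in the talk: 10, 20, and 40 EBS volumes at 3,000 provisioned IOPS each, two-copy protection, 50/50 read/write mix.)

    # Back-end potential = devices x 3,000 provisioned IOPS.
    # With two copies, each front-end write costs 2 back-end IOs and each read costs 1,
    # so for a 50/50 mix: back_end = 1.5 x front_end  =>  front_end = back_end * 2 / 3.
    for devices in 10 20 40; do
        backend=$(( devices * 3000 ))
        frontend=$(( backend * 2 / 3 ))
        echo "devices=$devices back-end potential=$backend expected front-end=$frontend"
    done
    # devices=10 -> back-end 30,000, front-end 20,000 (test 1)
    # devices=20 -> back-end 60,000, front-end 40,000 (test 2)
    # devices=40 -> back-end 120,000, front-end 80,000 (test 3)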
It may be tough to see in the back. This scinia device, that's my ScaleIO device; rbd0, that's my Ceph device. If we look at a ceph -w, we can see over here on the end the number of IOPS we're getting. So we're getting roughly, let's say, six thousand IOPS sustained out of there; it's bouncing up and down a little bit. Let's take a look at the ScaleIO GUI.

What was that we were seeing? Were we seeing six thousand IOPS on a single system, or was that...? This is the back-end Ceph report of how many IOPS are being done across the entire cluster. Across the entire cluster? So that's the load being driven by all five machines. Okay, so you're roughly getting a thousand to eleven hundred IOPS per machine there. Yeah, if you do the averaging.

If we take a look at the ScaleIO side, we can see we've got twenty thousand front-end IOPS. We were expecting twenty thousand, and we're getting that full twenty thousand out of it. We can look at the back end as well, and I realize these numbers are going to be a little tough to see, but on the back end we're driving that full thirty thousand IOPS. So we've given ScaleIO the ability to use thirty thousand IOPS and it's using that entire thirty thousand, and on the front-end client side we're seeing the expected twenty thousand IOPS. So we're getting everything we expect out of that.

Yes? What is the FIO-reported IOPS inside the client? So, when the jobs finish we can go in and look at those reports, but what we're really trying to demonstrate here is the back-end IOPS that we're getting. If we look at the average of these numbers along here, it's roughly five or six thousand IOPS on the back end in total. If you add up the read and write throughput and divide by the ops, you get roughly a 2k block size; I think that's a problem with the Ceph back-end reporting. You're running 4k, but the back-end report is effectively reporting at a 2k block size. I don't know if you've looked at the detail of the ceph -w output or not, but it's really not reliable; you cannot rely on this data to tell you the IOPS, so you have to look at FIO. Okay, well, we'll have that report here in a second, right? Yep. And you're seeing the Ceph back end, and we'll report the FIO numbers after that and show them. I can kick off a separate FIO job on it and do it. Okay, so if the Ceph tool isn't reporting the right IOPS to us, then from a client perspective we'd have to go to all five of them, run it, and look at them all individually; we're trying to show it centralized here. We have all of this running in our booth as well, and if you want to come down later we can drill into all of it; I'm certainly happy to do that. Yeah, and I'm happy to have you help us tune Ceph if you like as well. Absolutely, no problem with that.

I mean, the reality here is that this isn't about "Ceph is bad." That is not what this is about. Ceph has a place. The argument we're making is that if you want high-performance distributed block, then a pure tool, a tool that's designed for that end to end, is ultimately better for the high-performance use case. And if you don't want high-performance block and Ceph is good enough for your use case, power to you, man. Do it. All right.
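(For anyone reproducing this, the cluster-level and per-client views being read off the screens above can be pulled with standard tools; a minimal sketch, using the device names reported in the talk, which may differ on another install.)

    # Cluster-wide back-end op/s as reported by Ceph (the "ceph -w" view discussed above).
    ceph -w
    # Per-device front-end r/s and w/s on each client, refreshed every 3 seconds
    # (rbd0 = Ceph volume, scinia = ScaleIO volume, as named in the demo).
    iostat -x rbd0 scinia 3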
No problem. Well, see, the thing is that if you look at the majority of OpenStack block drivers, 62% of OpenStack block driver deployments are using RBD, right?

Yeah, and actually I've talked to a lot of these customers, and the majority of them are using Ceph as block storage only. They're not using it as multi-protocol, which is where it shines. That's its strength, being a single unified multi-protocol system, and in that case ScaleIO can't touch it.

Yeah, but what about the cost? There's no cost difference between these two systems; there's no significant pricing difference between them. And this isn't about doing a product pitch. If you want a product pitch, come by the EMC booth and we're happy to show you the numbers and talk to you about ScaleIO cost versus Ceph if you like. That's not what we're doing here. What we're doing here is trying to examine the difference between a storage system that is designed for a specific purpose and a storage system that is designed to handle multiple purposes. This isn't even a Ceph-specific issue; it just happens to be the thing in the OpenStack community that's most widely deployed, so it's easy to talk about. I have this problem inside EMC: I've got the Elastic Cloud Storage guys, who build an object system, wanting to put block on top of it, and I'm like, no, please don't do that. So this is about purpose-built tools versus multi-purpose tools, and why one is better than the other for particular kinds of use cases.

You were going to make a point on the front-end IOPS: the iostat display shows both read and write IOs per second to the device, for rbd0 as well as scinia, and the output actually shows all five client outputs every three seconds. So you can see that it's three, four, five hundred read and write IOs per Ceph client; multiply that out and it's about 6,000 IOs per second.

Hey, I've actually tried the same experiment on Ceph, although that was on hard disks with three machines, not SSDs, and I noticed that Ceph could max out the IOPS that the disks could do; I've done this myself. Did it make a difference here that it's SSDs with way more IOPS? Certainly. Each device in Ceph is going to spin up a process, and you can only push so many IOPS through each of those processes. If you've got disks that only give you, say, at best 150 IOPS each with spinning disks, it's going to have no trouble keeping up. But when you start to get into devices that can do five or ten thousand IOPS apiece, you're going to find that you can't exploit all of those IOPS with Ceph. ScaleIO doesn't max out on CPU; ScaleIO at full bore doesn't usually use more than 20%.

But there's something here you need to keep in mind. One of my guys used to tell me this all the time: there are lies, damned lies, and statistics. All of this has to be seen in context, right? The numbers vary based on whether it's sequential or random, how big the block sizes are; all of that stuff matters.
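(As a rough sanity check on the per-client figures just quoted; the per-client rates are the approximate values read off the screen, not exact measurements.)

    # Roughly 500-600 reads/s plus a similar number of writes/s per Ceph client,
    # times 5 clients, lines up with the ~6,000 total op/s shown by ceph -w.
    echo $(( (550 + 550) * 5 ))   # ~5,500 front-end IOs/s across the five clients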
So this isn't about trying to make either one of these systems look particularly better than the other by jamming it up a certain way. It's just to try to get as close to apples-to-apples as you can for the block use case, even recognizing that Ceph, being multi-protocol, isn't inherently designed for block but is used predominantly for block within the OpenStack community, and dominantly so: 62% market share. Any of us would wish we had that in our companies for our products.

What about 64k block sizes, random? We're not going to do that during this test, because we had to pick one workload to go through; we certainly can do that. We tried to pick a workload that would be a standard virtualized workload, so most of the time you're seeing either 4 or 8k blocks. If you really want to tax a storage system you can throw lots of massive blocks at it; we chose to use a block size that's more common out there in the real world. Okay, I'm going to hold you there, because I do want to get through our material. I've got you too, you're in the queue. It hasn't dropped any IO yet.

So one of our intentions here as well, for those who are really curious and want to do this themselves, is to try to generalize a lot of this, because getting these things up and configured is not trivial (and somebody other than Randy Bias had better do it); storage systems can be complex to configure. So what we're hoping to do is have a set of recipes, built with automation tools, and get them up on GitHub so that it's easier to deploy these clusters. Then you could run them on Amazon yourself, play with them, and test it more easily without spending a huge amount of time getting it all set up. That's one of my intentions, and if you watch the Cloudscaling blog, which I am still blogging on (it doesn't seem like it sometimes), there'll be some stuff up there soon.

All right, so we did the reconfiguration: we took our five servers on each side and reconfigured them with four devices instead of two, created volumes, and assigned them out to the clients. If we look at Ceph, we see we've got five machines with four devices each now. So we're going to kick off that same test, and again, this is a 50/50 read/write workload, 4k blocks, 32 IOs in the queue depth. So every three seconds we're getting all five of them reporting back on the FIO side, and we take a look at ceph -w.
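(For reference, a hedged sketch of what one per-client FIO invocation with those parameters might look like. The runtime and the libaio engine are assumptions; in the demo the Ceph and ScaleIO jobs run concurrently on each client, one job per device.)

    # 50/50 random read/write, 4k blocks, queue depth 32, one job per device per client.
    # Device paths are the ones named in the demo; runtime is illustrative.
    fio --name=ceph-rbd --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
        --rw=randrw --rwmixread=50 --bs=4k --iodepth=32 --numjobs=1 \
        --time_based --runtime=300 --group_reporting &
    fio --name=scaleio --filename=/dev/scinia --ioengine=libaio --direct=1 \
        --rw=randrw --rwmixread=50 --bs=4k --iodepth=32 --numjobs=1 \
        --time_based --runtime=300 --group_reporting &
    wait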
We see we're getting a little bit more performance out of it this time. Are you running the clients concurrently? Yes, they're both pushing load at the same time; that's why we used compute-optimized instances to run the load. So there are two FIO processes running in parallel on each of those five clients. We did run them separately several times and got the exact same results, so we decided to bring them together.

Kernel client or librbd? It's a normal mount, the kernel client. Let me repeat the question, or you could have him repeat it if he's got the mic; I'll repeat it. So, with OpenStack the kernel client isn't really used; in fact very few people use the kernel client, it's a very minimal use case, mainly because it doesn't do any of the caching. Most people use librbd inside OpenStack, and with FIO there is a native RBD driver as well, which you can use, and most people will use that with the client-side cache as well. I'm from Red Hat, so I'd generally be interested in you rerunning this with that and showing a more practical use case. Well, we can certainly do that, but really the intention of these tests is not to show that one does this many IOPS and the other does that many; it's more about how much of the underlying infrastructure you can actually exploit. So if we change the parameters, our numbers may change a little bit, but I would think the relative difference between them is going to stay the same.

A short detail question, please: you said IO depth 32; is that one thread with IO depth 32, or four threads with eight each? It is one thread with 32. Thank you. One per client per server, yep. What version of Ceph are you using? Hammer. That's the latest? Yes, I believe so. Are they writing to different disks? They're writing to two entirely different sets of servers; we're not running Ceph and ScaleIO on the same servers. The back ends are separate; the clients are, yeah, running on the same machines. And that's why rbd0 is the Ceph device and scinia is the ScaleIO device, and we're getting the output from all five of them there; every three seconds I'm putting that output to the screen.

Jeff, I just want to check on time: we have about 10 minutes left, and I want to make sure you get to the end. We're starting to reconfigure the last test here. All right.

So, you mentioned that you're going to start publishing this on GitHub so we can go and look through it. There are some scaling issues the OSD daemon has with saturating SSDs, so you actually need to run significantly more OSD processes per device if the device is capable of over about a thousand IOPS. Got it, and it'll be on GitHub, so you can help us fix that. He jumped the line. That's right.

So, how did you compare your results for Ceph with other published tests of Ceph in similar configurations? I know there is a Fujitsu test from early on, before Giant, and an Intel test; how do those numbers stack up? I'm actually not familiar with those tests. Well, when Randy first published those results, I noticed that you reported something like 11 milliseconds latency on the Ceph side, and the Fujitsu study, in a very similar configuration (32 queue depth and so forth, 4k random IOPS), was around 2.5. So you might want to check how it compares; maybe there's some difference in configuration that you missed. Was Fujitsu using Hammer or Giant?
Okay, all right, I'll take a note to go check that out. Now, I will freely admit that the performance engineers who did that were ScaleIO performance engineers, and if you know my reputation, I'm going to try to do this as fairly and as honestly as possible. So if they missed something, I'm more than happy to correct it and make it public. Yeah, you've got the same problem I do, man: there's only so much time in the day. So you call me on it and I'll go fix it. All right, that's fair.

You ready? Almost. Okay, one last question. Is ScaleIO split-brain safe, fault tolerant? Yes. But you guys are doing mirroring, right? It's two-way replication, yeah. So the client writes to the SDS, the daemon process that controls all the disks on a particular host. As that write goes down to the primary SDS, simultaneously, as it's committing it to disk, it forwards it to another SDS, which will then commit it to disk, acknowledge back through the primary, and then ack the client. So it's more of a two-stage write commit than, I'd say, synchronous replication. But without quorums you can't be split-brain safe. Quorum is definitely built in. In addition to the client and the daemon process that controls the DAS, there's clustering software that controls the health of the system, and it's what you interface with in order to do configuration. We're almost there.

Okay, the other thing to keep in mind is that ScaleIO has been optimized for this high-performance use case, and so it's got this notion of protection domains. You don't have an infinite-sized pool with ScaleIO; you do it in sort of these little sub-clusters called protection domains, which might be a rack, a quarter rack, whatever size you want. What we're optimizing for there is very fast disk rebuilds; that's why it's only two-way replication, and we're optimizing so that we don't get a cascade failure. If you have a failure within a single protection domain, it shouldn't cascade over and cause a problem in the next protection domain. So it's a little different. If you're really smart, what you would do is actually put a protection domain on a single switch, because the chance of a single switch failure is basically nil; you might lose a port, but then you lose that set of disks in that box, big deal. So the odds of that split-brain scenario actually go down with a well-architected system, which isn't to say that everybody architects their systems well. And even within that protection domain, we can have different types of media within the hosts and create pools, and your fault domain really is each pool within that protection domain.

Hey Randy, just a quick question about profiling: in true OpenStack community fashion, is it possible for us to run these ourselves and start comparing them amongst the community? Because right now your license says that we're not allowed to publish those results. Is there a way we can work with you to do that? Yes, there is. What's the best way for people who want to run this? For right now, because of my legal team, please contact me and I will help you get it done. Don't worry about what your results are; I will fight the battle with my legal team.
Okay, you publish them, you do me the reciprocal favor of being as fair as you can, and I'll get it passed through the legal team for you. Now, I did go to bat with them and I did try to get it done, but there was a high level of resistance, and EMC is a business that's trying to become a different kind of business but is still somewhat set in its ways. So right now I just have to do it one request at a time. I apologize for that; hopefully once we get it done once or twice I can just get them to change the end-user license agreement, which is asinine anyway. Randy, what's your Twitter handle again? @randybias, seriously. It's really hard to remember.

All right, so we just kicked off this final test: 10 machines, each with four devices, and you can see there we're getting more IOPS this time out of Ceph. If we go over and look at ScaleIO... Why is it scrubbing? Can you go back? Yep. We added the additional machines to the cluster and created those volumes; you'll see this calm down in just a second. So we need to wait for it to calm down before we can run it again, as soon as this run finishes. But we can see we're getting, you know, there's 14,000; it's jumping around a little bit, to your point, with the scrubbing going on there, but I'd say we're averaging probably closer to 10,000 with this run across those numbers. So, I know what a scrub is on ZFS; what's a scrub on Ceph? It's rebalancing the references to the storage on disk. Right, all of the Ceph LUNs and images were deleted, then 20 OSDs were added and volumes were provisioned, so obviously there is a little scrubbing going on, but we intentionally deleted the devices so that... Right. So I kicked it off there again, and we can see the scrubbing stopped. The variability is still there a little bit, we're going up and down, but in general the numbers are higher. If we average those out, again we're looking at around 10,000.

When I look at the ScaleIO side, I'm getting 50,000 on the front end and about 70,000 on the back end, and if you remember back to my presentation, I said I was expecting to get much more than that. The reason we're not getting as much on this ScaleIO run is that we're not pushing on it that hard, so I can run another FIO job, push it a lot harder, and drive that full 120,000 IOPS on the back end. Can you actually push it to drive 120,000 to the disks flat out? Yep.

Jeff, we've got about one minute left. Okay. A couple of quick things. First, what is your ratio between reads and writes? 50/50. And do you do any caching on the ScaleIO side? No, there is no caching at all in this test. We certainly can configure the system to do that, but when we use SSDs or high-IOP devices, putting cache in front of them actually slows them down, so we turn off cache on the ScaleIO side.

So if anyone has additional questions, we are running the same demo live in our booth. I have ScaleIO t-shirts for everybody who answered a question and everybody who's from Red Hat, so please feel free to come down to make sure you get the right size. If you didn't get one the first time, we'll be sure to get you another one. But thank you very much. Yes, thank you.