Today we're going to talk about scale-up block storage, how Amazon does it, and how to do a scale-up block storage design. I'm Eric Windisch. I'm a principal engineer at Cloudscaling and a lead on OpenStack development. I've been an OpenStack developer since Cactus. I've been doing platform as a service since 2002, infrastructure as a service since 2006, and I've done engineering for...

We're going to start with Amazon's architecture, or at least what we can infer and what we know about it. Most of this information we've gotten from the disclosures on failures that have happened and the analysis that people have done on them. So you have a control plane and storage. They're separate and loosely coupled, so you can lose the control plane or the storage but not both, although in practice they've actually lost the storage and then realized that the control plane trying to control broken storage can kind of fail on them too, so they would sometimes have to also shut down the control plane when they've had failures. And then EC2 connects back into that storage.

They use distributed storage, and it's not entirely clear if it's an actual DHT, but you could have a distributed network like this. By being distributed, there are certain guarantees that you can or cannot have in a distributed system. There's this thing called Brewer's theorem, also called the CAP theorem, which states that it is impossible for a distributed system to simultaneously provide all three of the following guarantees: consistency, availability, and partition tolerance. So Amazon, by having a distributed system, through whatever mechanism their system is distributed, if it is truly distributed, can only get two of these three. And that means that for storage, block storage I mean, you have to have consistency; that's clear. This is a problem.
You absolutely have to have consistency, which means you have to decide between availability and partition tolerance, and both of those affect your reliability. Which means that in a distributed block storage system, you cannot have reliability. It's impossible, well, unless somebody disproves Brewer's theorem anyway. And you can kind of fake having these other things, so you can get two of these three and kind of fake the third one, but you can't be guaranteed the third one. So because EBS doesn't have reliability, users of EBS have to presume that it doesn't exist. People use snapshots: Amazon provides the ability to take snapshots of EBS volumes as a form of reliability, and the only way to get that reliability is to use them.

So, cloud failures. They happen; they have happened. Amazon has had these problems occur already: the Reddit posting about how Reddit went down from an EBS failure, other people, Imgur, other cloud providers. I don't know if Netflix went down or not, but plenty of people did. So what really happened? You can read all this stuff, but the reality is, stuff happened. I toned that down, by the way.

Service providers fail. Hardware does fail, but operators fail, and software fails, more often than the other things. So, the Amazon failures: somebody put in a bad DNS record one time, and another one of the major EC2 failures was a bad router config that was accidentally uploaded. Those were the longest Amazon outages; actually, I think one of them was something like 24 hours for some affected customers and zones. I actually took the names out for these particular outages, but you have 24 hours and 72 hours for some of these. And people make mistakes, right? People make the software, and the software has bugs, or operators push bad configs.

Quickly, let's just go into scale, because we're going to talk about scale-up quite a bit in this talk. Instead of scaling up and making boxes bigger, you want to scale out and make more boxes. You don't want pets, you want cattle. If you have a problem with one of your cows, you shoot it; you don't treat it nicely like your dog and, you know, pet it and feed it nicely. Because again, we don't have reliability; like we already know, according to the computer science theory, we cannot have reliability. So we can't treat our things like pets. We have to presume that they're just going to die, and we need to have small failure domains, not big failure domains. We can't have everything fail all at once, because it will fail.

So, the Cinder architecture, somewhat at a high level, from the 50-foot perspective: it kind of looks a little bit like Amazon. You have a control plane, it's loosely coupled to the storage, and your compute nodes, what would be EC2, or in this case
nova-compute, talk to your storage. We have a control plane, and all these pieces here are scale-out; it's designed for scale-out. But then what we do at the bottom is plug into different block storage solutions, and in Cinder these are mostly controlled by various vendors, and we have 18 different architectures in which you can actually deploy Cinder. So Cinder isn't a solution: Cinder is 18 different architectures and 18 different solutions. And really it comes down to these, because they all provide one of these three things: NAS back ends; distributed file systems, and that's actually a little picture of Ceph there; and block storage generally, so SANs, iSCSI, Fibre Channel over Ethernet, or regular Fibre Channel, for instance.

So before I talk about the storage back ends in more depth, I really want to just talk briefly about the control plane, because it has implications for which storage plugin you choose and how you do your storage back end. And not all the plugins may actually work with a scale-out control plane, it turns out, or at least not a highly scaled-out control plane.

So here we have a scale-up pattern; I don't know who actually does this. This is: you have a really big server controlling a really big storage system in the back end, and you can lose either of those really big pieces. Something that was introduced in Grizzly was this idea of a multi-backend pattern, where you can have your really big control plane, or several really big control planes, control several really big storage systems. They said, well, we realize we can't have this one really big storage system here, so we're going to add multiple really big storage systems and manage them through Cinder; we're going to have this one box that can orchestrate them, and then we can make multiple of those boxes that control multiple of these big giant storage systems.

It becomes very expensive, first of all, because most of these storage systems are expensive systems, although not always. And you limit your ability to scale, because now you have six of these storage systems and three control plane nodes, and to prevent one of your control plane nodes going out, to actually scale up that control plane node and get additional reliability, you have to make it very complex. You get split-brain problems if you try to control one storage node from one machine and then the other; it just becomes nasty. You pretty much can't; it's a very hard problem to solve, and Cinder doesn't actually do this. So to really do the scale-up pattern, you can't even do it with Cinder at all. To fix this, we're going to have to talk a little bit about the storage back end, and we're going to show you how it fails and how it breaks.
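Just for reference, the Grizzly multi-backend pattern I just mentioned is configured in cinder.conf along these lines. This is only a sketch: the backend names, volume groups, and the choice of the LVM driver are invented for illustration, not something from the slides.

```ini
[DEFAULT]
enabled_backends = bigarray-1,bigarray-2

[bigarray-1]
volume_driver = cinder.volume.drivers.lvm.LVMISCSIDriver
volume_group = cinder-volumes-1
volume_backend_name = BIGARRAY

[bigarray-2]
volume_driver = cinder.volume.drivers.lvm.LVMISCSIDriver
volume_group = cinder-volumes-2
volume_backend_name = BIGARRAY
```

The scheduler then routes a request to one of these back ends through volume types keyed on volume_backend_name, but nothing about that makes any individual back end a smaller failure domain, which is the point here.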
So, scale-out; scale-out is what we just showed. And just to back up a little bit: that pattern we were showing a couple of slides ago was really a scale-out-then-scale-up pattern, because even if you scale the control plane, there are still really big back-end storage boxes that you're controlling, that some vendor provided you. So you can get a failed storage back end, transport failures, failed servers, and now your front-end box has to be aware of that and understand it, and stuff goes down.

Most people try to fix this by doing HA pairs. And when you talk to an HA pair, you're not talking to a system, you're not talking to the active system in that pair; you're talking to the HA pair, you're talking to the HA system, because the HA system itself can fail. Randy Bias and myself have both given talks about how this design is generally broken, because Pacemaker, or whatever system you use to actually provide the failover, fails more often than the machines it's trying to protect from failing. We have lots of in-depth analysis and discussion about that on our blog and in some of the other slide decks we've prepared that are linked there.

Just to give more depth on how bad this is: it's really complex. You don't necessarily know where your data is, because you get inconsistency between the HA pair nodes. You get more split-brain problems, because not only do you have split-brain problems between the cinder-volume agents and the storage back end, you then get split-brain problems between the different HA nodes. And Cinder still doesn't support this. So this adds network complexity, and this is just to get high availability for one of those back ends; you'd have to do this for each of those back ends. This is what somebody told me is a NetApp system; I haven't verified that, but I've been told.

So going back to Brewer's theorem: when you do HA, you decide to actually have really high availability and get reliability, and you get that through partition tolerance and availability, which means you actually give up on consistency. That's why you get these split-brain problems: you trade away consistency, and that's actually really, really important for storage. It's really scary that people decide to give that up.

So we have a solution, or what we recommend that people do, and that is to do Cinder like we do Nova. In Nova you have some way of doing high availability for your API, and that can be HAProxy; it doesn't have to be. Generally, I'd say there's a single point of entry; you can use load balancers or anycast or whatever, ECMP, through API servers into schedulers, and in Nova those lead to compute nodes, and in Cinder those are the volume services. So Cinder is designed like this now. It scales out; you just have to deploy it in a way that actually leverages this, as opposed to deploying in a way where you cannot leverage it.

And because all these machines are communicating point-to-point, it is important that we actually communicate point-to-point. So, at least at Cloudscaling, we've been deploying with ZeroMQ, which is distributed and brokerless. The way that RabbitMQ is used in OpenStack at present is as a completely centralized broker, and by doing that you introduce points of failure: machines can fail to communicate with each other because they fail to communicate with Rabbit, because messages go through Rabbit. So instead we have the machines talk directly to each other, so that this pattern is actually direct communication between these nodes.
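Just to illustrate the brokerless idea, here is a toy sketch with pyzmq. This is not the actual OpenStack ZeroMQ RPC driver; the port, the message shape, and the loopback address are made up for the example.

```python
import zmq

# Brokerless messaging in miniature: the receiving node binds a socket, the
# sending node connects to it directly, and no central broker sits in the
# message path to become a shared point of failure.
ctx = zmq.Context()

receiver = ctx.socket(zmq.PULL)        # e.g. what a cinder-volume host might do
receiver.bind("tcp://*:5571")

sender = ctx.socket(zmq.PUSH)          # e.g. the scheduler side
# In a real deployment this would be the peer node's address, not loopback.
sender.connect("tcp://127.0.0.1:5571")
sender.send_json({"method": "create_volume", "args": {"size_gb": 10}})

print(receiver.recv_json())            # the call is handled on the receiving node
```

If Rabbit were in the middle, both sides would instead connect to the broker host, and losing that host, or being partitioned away from it, breaks every conversation at once.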
So we scale things together: we put our storage with the cinder-volume service, so it is slightly more tightly coupled to the control plane, but only to the extent that the cinder-volume service is attached at the hip to the storage it's controlling. This is simple, it's deterministic, and we're not going to lose all of our data. If you have a distributed file system and you had some major bug in your architecture, you could lose all your data, or you could have problems that, well, I just don't want to get into. Here we just can't lose all of our data; we might lose a box, but we're not going to lose everything.

And even though these leaf node systems can lose data, that's not really important, because we already assume that there's no reliability in our application and in our use of this back end, and we use snapshots to hedge against it, which is what everybody does in Amazon already anyway, and what everybody using cloud is, or should be, doing. And we decide that consistency and partition tolerance are more important than anything, more important than availability, because if it goes down we just restore from a snapshot. That's how snapshots save us, and we just don't really care if we ever lose a node.

Actually, that was a lot of slides, and maybe I went through them too quickly, so we have plenty of time for Q&A. We also have a blog; that's where you'll be able to download this and find the links, and hopefully, if this presentation was recorded, you'll be able to view it again as well and point people to it. I'd like to open the floor for questions.

So the question was how you mitigate the impact of doing snapshots, like frequent snapshots. Well, people are going to use snapshots anyway. If you're doing an EBS-style solution and you're providing block storage like this, and people are approaching it from the perspective that they're using Amazon today, then they're already going to be doing this. So no matter what your architecture is, if you use one really big back-end storage box, you're going to have customers doing just the same number of snapshots as they would be doing with this. It's just that maybe you're not designing for it. You have to design for it regardless.

So with this design, actually, one way that our design differs from Amazon's is that because we're not using a distributed file system, it's actually really, really fast. On all the machines it's essentially direct-attached storage at the far end of it, and it's really fast, so we can do reads, and random reads, incredibly quickly. We're using ZFS with L2ARC and dedicated ZIL devices, so we essentially have hybrid SSD and hard disk storage. It's really, really fast. This solution is fast enough to do all the snapshots that you need, and in fact we've run into more problems with getting performance out of the software that does the snapshots than out of the actual storage solution itself. It's really hard to make the software fast enough, especially if you're using something like Python, to actually do the snapshots fast enough.

Over there, thank you. So, storage is not on the same nodes. The question was how we deal with resource contention, I believe you're saying, with this being on the same nodes as the compute; and this is scaled out in parallel to the compute nodes. So there are dedicated storage nodes, and compute connects into them, but storage is not bundled with compute.
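To give a rough picture of what one of those leaf storage nodes could look like, a hybrid pool with an L2ARC read cache and a dedicated ZIL log device is built something like this. This is purely a sketch: the device names, the raidz2 layout, and the pool name are invented, not our actual configuration.

```python
import subprocess

# Hypothetical hybrid pool for one storage node: spinning disks for capacity,
# one SSD as the L2ARC read cache, and a mirrored SSD pair as the dedicated
# ZIL/log device for synchronous writes. All device names are placeholders.
subprocess.check_call([
    "zpool", "create", "tank",
    "raidz2", "/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde", "/dev/sdf", "/dev/sdg",
    "log", "mirror", "/dev/sdh", "/dev/sdi",
    "cache", "/dev/sdj",
])
```

The point is only that the SSDs accelerate reads and synchronous writes while the spinning disks provide capacity, which is where the hybrid performance comes from.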
Over here? Yes. So, how do we deal with upgrades if we have it tied at the hip like that? Well, first of all, we try to avoid upgrades if we can. No, so, actually, you can upgrade the control plane pretty painlessly without affecting the back-end storage. If you really have to update the back-end storage, then you could bring a node down and back up, and through proper management of timeouts and configuration (we're using iSCSI for transport) it can actually survive a fairly quick reboot if you configure it correctly and you had to do one. But the reality is we can do most of our upgrades without reboots and without bringing things down in a drastic manner.

Keith: is there any notion of data locality in the Cinder scheduler? So, we have that in our Nova scheduler. I believe at present we're not doing data locality in Cinder, but we already have that filter for Nova, so it wouldn't be impossible for us to consider it, or for someone implementing this to build it into their system, especially if you design your network architecture a certain way, as we have. Then you could very easily do something similar to host aggregates, but for Cinder scheduling.

Right, so the question is about performance guarantees when you create Cinder volumes. I actually had a performance slide that I took out; maybe I should have put it back in. As I was saying to the fellow earlier, I think this is pretty quick; the way we've designed this has proven to be incredibly performant. SLAs are something that are set by the deployer and user of the service, so if you were, for instance, a public provider, or even a private provider to internal tenants, you would provide that SLA to your users and your tenants. When you design the system you can get an idea of how fast it is, and then you would essentially make whatever promises you need to the tenants.

Right there. That's a really good question. So, which backend driver do you use in Cinder for this to work, and is this a Cloudscaling thing? So Cloudscaling uses Cinder, really all of OpenStack, pretty much out of the box. We do make various fixes that we roll into our packages, and we make sure those are put in upstream in parallel as we develop them. Presently we are using a ZFS plugin that is, at present, only within our product, but you could equally use this pattern with LVM. You may not get quite the same performance in doing so, but it would work just the same with the pattern I just described; LVM would do everything I just described, maybe not quite as well or as performant, that's all.

Yes, back there. So the question is whether we perform the snapshot in band or out of band. The way we do it is: because we use ZFS, we can create the snapshots on ZFS, and creating a snapshot in ZFS is very quick, an almost instantaneous operation. Then, after we create the snapshot, we can copy it up into our Swift cluster. So the copy to Swift would be out of band, but the creation of the original snapshot that it's based on would be somewhat in band. It's based on extents, and it's a copy-on-write file system, which is what lets you do this; it just starts creating new blocks. It's kind of like Git, in a way.
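Roughly, that flow looks like the sketch below. It is only a sketch: the dataset name, container, endpoint, and credentials are invented, and a real implementation would stream the snapshot rather than buffer it in memory.

```python
import subprocess
from swiftclient import client as swift

# Snapshot the ZFS volume (near-instant, copy-on-write), then copy it out of
# band into Swift. All names and credentials below are placeholders.
dataset, snap = "tank/volumes/volume-0001", "snap-1"
subprocess.check_call(["zfs", "snapshot", "%s@%s" % (dataset, snap)])

# Serialize the snapshot; this is the out-of-band part and can run whenever.
data = subprocess.check_output(["zfs", "send", "%s@%s" % (dataset, snap)])

conn = swift.Connection(authurl="http://swift.example.com/auth/v1.0",
                        user="tenant:user", key="secret")
conn.put_object("volume-snapshots", "volume-0001/%s" % snap, contents=data)
```

The zfs snapshot call is the in-band part; everything after it can happen later without holding up the volume.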
So No, no, no, so the snapshots are not stored in so the question was if the snapshots are stored in the cinder control plane and They're not so those snapshots move up into Swift so they're stored in Swift and In this architecture, you would not want to put Swift on top of cinder So you would see you create the snapshots the same way that you create snapshots So the question is if you create the snapshots in cinder and the answer is that you create the snapshots in the same way that you Create snapshots with EBS and we actually At least we support the Amazon APIs So you would use the same tools and commands and API requests that you use in EC to To get those things into Swift and s3 Back there question is about Seth why we're not using it so I don't want to say anything bad about the design of Seth or distributed file system that I'm actually a Bit of a distributed systems guy myself Sage is pretty cool Amazon appears to be using a distributed system We are already assuming as I said earlier that we don't have the reliability so we could you know trade against that Reliability, I just don't know if it's necessary. So it's a there's a lot of complexity in doing distributed file store and if We're effectively distributing the data just in a different way and what I think is a much simpler way So I question in this case if we would really need it And if you're another another thing to be considered as if you used Seth For what we're describing here, you'd have to have to set deployments one for storing Your objects which are your snapshots and another for storing your data because if you don't trust the Reliability of your data storing and you and you can't when you use something I'm gonna say you can't use it when you use a distributed data store if you want consistency then you need to do the snapshots so Seth would have to either be consistent or reliable and If you can deploy it both ways and you'd have to have two deployments one which is one way and one which is the other way But if they both provide the same level of guarantees and this didn't it both all of your sender or all of yourself Sits on one side of Brewer's theorem, then you can't pull all of your data there. You have to have your data stored on both sides Over here on the moose file system We haven't tried to use it Do you mean do you have more specific questions about that? Right so the questions if we have experience with moose as a distributed file system and Also noting that nobody seems to be using it So we prefer to use things that are known well trusted and that We know we can make work Have the least level of risk and this is what I would consider to be a safe approach using Things that people have used for years and have worked reliably for with for people for years Like that. Okay. Let's answer the first question first First question was Oh, yeah, so about the ha so you're asking if we advocate using ha on These back-end systems, so could we essentially do this direct attached thing and then still do ha and I would say that's when you again start reintroducing split-brain problems and I would advise against it Now you could do that But I do believe that you start introducing Because you now introduce a consistency problem when you want to have consistent storage and every time you do HA pairs with storage You are going to give up consistency and I think that's a bad idea and if you're doing the snapshots, then you don't need it So why introduce the complexity? 
Then your second question was about what distribution and packages we use: ZFS on Linux, or do we use OpenSolaris? The answer is that we actually have quite a few people at our office, including myself, who have used OpenSolaris and its derivatives over the years. We started with a PoC that was based on a Solaris-derived kernel, and we moved off of that; after we finished the proof of concept we switched to Linux, and we presently use Ubuntu with the ZFS on Linux packages.

Over here first. The question is: in Grizzly there's a capability to snapshot into Swift, and are we using that? I am not certain. We're actually deploying on Folsom, so I believe we're doing our own thing to get it in there, but to be quite honest I'm not entirely certain; I wasn't the engineer who implemented the snapshots, so I'd have to confirm that. I believe there was another question across the aisle earlier, back there.

So the question is why you can't just have more copies with a distributed file system, as opposed to putting snapshots on one side and block storage on the other. The answer is that if you make multiple copies, you no longer have consistency, and consistency is really important. If you have multiple copies, how do you keep consistency? Either every time you make a write you update every copy, or you have stale copies, and then you no longer have consistency.

So the question is what kind of hardware is underneath this, say, is it a NAS? These boxes, the actual physical machines, are, for us anyway, machines that are part of our HCL. We're semi vendor-agnostic, so we work with various vendors and we certify hardware to work with our software. For instance, we support Quanta and Dell hardware. I'm not sure what the full HCL is, but we have an HCL of vendor hardware that we support, and these are x86 boxes with a bunch of disks attached.

So the question is how you do recoveries, assuming that if you make snapshots, you then recover from them. Users that deploy on the cloud, if they lose their data or they lose their instance, recover from a snapshot. Your tenants do that; you just make sure the data is available for them to recover from.

Back here in the middle again. Can you repeat that? Yes, so the question is whether you can create a new volume from an existing snapshot, and yes: the idea is feature parity with the way Amazon does things, and to support the things that they support, such as that, recovering from a snapshot.

Back there. Do we plan on releasing that ZFS code at all? Possibly. Pretty much everything we do is open source; we do quite a lot of open source development. Joseph Gordon here in the front, and myself, are very active core developers and core reviewers, in his case of OpenStack, and we do contribute a lot of code. I'm not certain when that code will be released, and I believe part of the issue is that only in our latest release did we actually port it over from the OpenSolaris-based stuff to Linux. We wanted to wait until we actually had the Linux-based one before we released it, because the OpenSolaris-based stuff was something we figured not only we weren't going to use, but probably nobody else was going to use either, so we didn't want to release that.

Any other questions? We're almost out of time anyway, so thank you very much.