Welcome to our talk on resiliency and performance engineering for OpenStack at enterprise scale. First, a little bit about us. I'm Jason Venner, the Chief Architect for Mirantis, and I've been involved in most of the large-scale enterprise deployments to date. Hi, everybody, I'm Nathan Trueblood, and I've been playing around with distributed systems, distributed systems at scale, and enterprise computing for quite a while. We're excited to be here and tell you a little bit about our experience.

Mirantis, as you can see, has a large enterprise customer list. As part of our enterprise focus, we're also partnering with most of the people who make the equipment that's necessary for enterprise operations, and we are major contributors to OpenStack. As part of that focus, we've initiated a project we call the Wrecking Crew. Our belief is that unless you break something every possible way, you can't build stability, so we've created a crew to break everything, so we can deliver you rock-solid OpenStack. We want to improve reliability at scale and demonstrate publicly how to do it, because the more the community and the users understand, the more OpenStack adoption we'll see. We want to demonstrate real enterprise applications, push the scale and performance numbers as large as we can, and share all of the best practices we learn while we open source everything we do.

I see a lot of the enterprise engagements, and I talk to a lot of the CIOs and VPs about what they're looking for from their perspective. They're looking for cost savings, operational efficiency, an open platform, flexibility and choice, and the ability to innovate and compete. They have to deliver something they can run their business on, so it has to be rock solid. The typical enterprise person says: I need to deploy and manage my apps, my ops team has to be able to manage what I'm delivering, and everybody has to meet their SLAs. In most enterprises there's also somebody overseeing everything to make sure they meet compliance requirements, explain costs to the CFO, and provide capacity in near real time rather than sixteen weeks later, which is the norm.

We have a lot of material, so I'm going to whip through it until we get to the stuff that's actually interesting. Really, enterprise adoption is about increasing the velocity of value creation, because every enterprise we know is under some competitive threat and needs to change to compete effectively. So to be successful in an enterprise, you need to think about your technology, and you need to think about your culture, because ultimately your culture is where the true value comes from, and the technology needs to enable the cultural changes that let you achieve the high velocity of innovation you're looking for. You also need to plan for OpenStack clouds that don't live in isolation: they have to be part of your environment, and they have to integrate into your ecosystem. Because OpenStack is a control plane, you need to pick the plug-ins: the SDN plug-in, the storage plug-ins, the compute plug-ins, and everything else in your environment, to fit your workloads and tailor it all to your requirements.
You also need to pick your hardware, because these choices are circular dependencies, and plan for failure, because we're building for a horizontally scaled world where most everything can fail and will. For true enterprise reliability, you need to actually fail over on a regular basis, because if you don't verify your failover, it's already failed. We're planning for our SLAs, and we need to plan for our clusters. Most enterprises start with very small clusters, then take the deployment plan and the operational policies they built and try to run those on their production clusters, and that's not going to work; there's a very different ecosystem around production deployments versus pilots. We also need to think about friendly environments versus things that are exposed to the public internet, and be prepared to tailor your cloud to your workload if you need to.

Let's talk about hypervisors. We live in the KVM world because we're mostly open source, but vCenter is an ideal hypervisor if you're running legacy applications where you want vMotion and all of those lovely attributes for things that can't be treated as horizontally scaled. Containers are ideal for very dense packing and/or for very high performance applications; the goal with the container hypervisor is that applications that would normally run on bare metal become part of your SDN environment and part of your storage virtualization. And there's Hyper-V for the Microsoft world.

The most important thing you're going to have to decide on is your SDN solution, because this determines how you lay out your physical network topology, and it's really a function of your use cases. If you're dealing with a public cloud where most of your access is north-south, you will choose one style of SDN; if you're dealing with applications that have large numbers of east-west flows in your cloud, you may choose another. The other interesting piece is that with SDN, everything about your application deployment and your network infrastructure changes, and it's going to take the people who normally support your networking a while to catch up; you're going to work with them through some very painful moments before they become partners in your success.

Mirantis has been a big fan of HA for OpenStack and HA for applications; we pioneered a lot of the HA for OpenStack. (We're testing failure right now. See? Fail over. We didn't test beforehand. We'll switch back... okay, here we go.) You should expect that you're going to be doing cold starts in your environment on a regular basis. Your applications need to be horizontally scaled and ultimately span multiple failure domains. And if your applications are running behind Heat or an autoscaler or some other automation infrastructure, they're going to beat the crap out of your control plane, and you need to build your control plane to handle the load they're going to impose. Ultimately, once you're running your production apps on your cloud, you need to do rolling updates, and you need to sustain failures. I'm a firm believer that if you have a service pool where you need five servers to provide the function, you need to run seven to provide the capability, so you can have one down for maintenance and still have resiliency for a second failure; that also handles some burst loading.
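That need-five-run-seven rule is easy to capture in tooling. Here's a minimal sketch of the idea (our own illustration, not a tool from the talk): size the pool as required capacity plus a maintenance reserve plus a failure reserve, then check what survives losing two nodes.

```python
def pool_size(required: int, maintenance_reserve: int = 1,
              failure_reserve: int = 1) -> int:
    """Servers to run so `required` are still up with one node down
    for maintenance and one unexpected failure (need 5, run 7)."""
    return required + maintenance_reserve + failure_reserve

def surviving_capacity(deployed: int, required: int) -> float:
    """Fraction of required capacity left after losing two nodes."""
    return max(deployed - 2, 0) / required

if __name__ == "__main__":
    need = 5
    run = pool_size(need)                       # 7
    print(f"need {need}, run {run}")
    print(f"capacity after maintenance + one failure: "
          f"{surviving_capacity(run, need):.0%}")  # 100%
```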
From a storage perspective, Swift is an ideal tool for geo-replicated data. Block storage is still very much a local data center operation unless you're using a commercial storage technology that handles replication, and if you're running serious production applications, you need storage replication across geographies.

The most important thing for enterprise success is recognizing that you're going to be crossing a lot of silos in your environment, and if they're not working as a team, it'll be a disaster. We've already talked about the network core; the release management team needs to roll in your software developers and your architects. Who else do we need? Compliance, for example, and your security and audit teams. We've been working through this with some of our customers, and sometimes these people come in very late in the game, when you're getting ready to go live, and the time between when they engage and when you go live is less than the time they need to actually understand what you're doing, so they can make sensible choices, let alone come up with a compliance and audit plan. You spend most of your time explaining what you're doing and why it's different. So in many ways the rest of the organization may not have caught up to the velocity with which we're delivering software and infrastructure, especially at the enterprise.

We whipped through that because it's really motherhood and apple pie; now we're going to delve into some of the cultural issues, which are much more interesting. I want to talk to you a little bit about culture: what it takes to create a mission-critical culture within the enterprise, and how we need to start changing, both within the enterprise and within the OpenStack community, to deliver mission-critical, rock-solid OpenStack. This is a journey; it's not something we can just flip a switch and deliver. So when we think about mission-critical OpenStack, the question is: how can we support this? How can we evolve what we're doing in that direction?

One of the great ways to think about this is to borrow from other mission-critical situations that we're already familiar with and where we already know there are best practices. I have a picture of a 747 cockpit here. There's a ton of switches and gauges and dials, and somebody said to me earlier this week, "I think there are probably more switches and dials in OpenStack than that." So this doesn't even begin to cover the kind of complexity we're dealing with in OpenStack. Let's look at how some other industries manage complexity, what they've done in their cultures, and how that has played out. Put simply, what a lot of these industries have in common is checklists. How many of you have heard of The Checklist Manifesto? A couple of people. The Checklist Manifesto is a book written a few years ago by Dr. Atul Gawande, who was trying to understand why there are so many failures in surgery and in medicine, and what can be done. You go in for an operation, and people end up with complications that really could have been avoided; people are missing the simple stuff, the easy stuff. So they developed a very simple 19-point checklist that they applied at eight different hospitals when they were going to do surgery.
They were trying to understand: if we applied something as simple as a checklist, borrowing from these other industries, could that really improve the results and reduce the failure rate in surgery? Quite shockingly, with just a simple checklist at key points in the process, they dropped complications by 36% and dropped the death rate from those complications by half. It was pretty stunning, and at first they really couldn't quite believe the results.

So one thing you might ask, and this happened in the medical field too: okay, maybe that works for surgery, but isn't OpenStack too complex for checklists? A lot of the doctors rejected the idea, and I think the same thing happens when we think about operations. We've all had a ton of education and training and worked for years with all this technology, and when someone like me, or this doctor, comes along and says what you really need is a simple checklist, everyone rolls their eyes and says that's not really the answer here; we're dealing with very complex systems, and you can't reduce my job to a checklist. But my favorite part of his study is that when they surveyed the doctors about the effectiveness of checklists, 20% of them said it was a waste of time and they didn't need it. But when they asked those same doctors, if you were getting surgery, would you want a checklist? 94% said yes. So everybody wanted a checklist for their own surgery but didn't necessarily think it was the thing for them.

Thankfully, we're not doing surgery; we're building OpenStack and trying to run it in production for the enterprise. It's not quite a life-and-death experience, but for many of you who may have been through a service outage at scale, it can feel like a near-death experience if you're involved in it, so I think it's a worthwhile comparison.
There are lots of ways we can apply checklists in OpenStack. We can apply them at the requirements level. I experienced this with Hadoop, when we were running 40,000 servers and 500 petabytes, and we had a lot of development teams coming up with really amazing stuff. They would come and say, I've got this incredible new service, let's put it on the production grid, and our DevOps team would come out and say, well, here's the checklist: have you thought about your HA strategy? Have you thought about how we're going to monitor this thing? Three or four very simple points: HA, monitoring, upgrade, and rollback. Pretty basic stuff, but of course the engineers were so focused on delivering the new service that they weren't yet sensitized to thinking about what it would take to put things into production. This is the point I'm trying to make about culture: as a community, and as people implementing OpenStack in the enterprise, we really need to think about how we create that production-oriented thinking.

There are a bunch of different places we can apply checklists, and I want to go back for just a moment and cover these. I mentioned requirements, and I'll delve into the layers at which you can apply checklists in more detail, but stages are also a really important place to think about checklists when running things at scale. By stages I mean pause points: all right, we tried it small, now let's pause and see if we've got everything right; okay, we're rolling out to our staging cluster, let's pause before we go to the next step. So there are lots of places where you can apply checklists as a way to raise the level of production mentality and reduce the failure rate.

I covered some examples of good pause points when going from an idea to putting something into production, and you can think of each of these as good checkpoints to stop, think, and make sure you've done your homework with respect to production requirements. Again, it seems pretty basic. The whole idea here is not to make a huge list of a hundred things you need to do, but to make sure that all the experts who have this knowledge have stopped for a moment to ask, did I cover this, before moving on to the next phase. In the surgery example, it was amazing how many silly things got missed, and a lot of the time you had incredibly brilliant people facing very complex situations with a lot of chaos coming at them. If this sounds like any of you working on production deployments, it's the same thing, and it's when all that chaos is happening that you miss the basics that can really get you into trouble later.

Other examples of where you can employ checklists: thinking about what you're going to do about power, thinking about redundancy in the physical infrastructure, thinking about whether you've got the topology right and whether you've verified that all the plumbing is in fact working. So there are ways to think about this both in terms of what you need to check, the critical points, as well as the places where you need to pause before you go to the next stage. And then of course there are all kinds of things within OpenStack itself that you can do to validate that you've got it right, and we're building tooling all the time to do this. The good news is it's software and we have automation, so a lot of these things we can take care of with tooling.
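As a purely hypothetical illustration (not tooling from the talk), here's what encoding that four-point production checklist as an automated stage gate might look like, so a service can't be promoted until every item has a named owner's sign-off:

```python
# Hypothetical sketch: a stage gate built from the HA / monitoring /
# upgrade / rollback checklist described above.
PRODUCTION_CHECKLIST = [
    "ha_strategy",    # how does the service survive node loss?
    "monitoring",     # which metrics and alerts say it's healthy?
    "upgrade_plan",   # how do we roll out a new version?
    "rollback_plan",  # how do we get back if the upgrade fails?
]

def gate(service: str, signoffs: dict[str, str]) -> None:
    """Refuse promotion unless every checklist item has a named owner."""
    missing = [item for item in PRODUCTION_CHECKLIST if not signoffs.get(item)]
    if missing:
        raise RuntimeError(f"{service}: blocked at stage gate, missing {missing}")
    print(f"{service}: checklist complete, promote to next stage")

try:
    gate("incredible-new-service", {
        "ha_strategy": "alice",
        "monitoring": "bob",
        "upgrade_plan": "carol",
        "rollback_plan": "",   # no owner yet, so the gate blocks promotion
    })
except RuntimeError as err:
    print(err)
```

The point isn't the code; it's that the pause point becomes impossible to skip in the chaos of a launch.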
And of course, as Jason mentioned earlier, understanding what your use case is, and what the SLA expectations are for that use case, is another critical place to make sure you're validating what you're doing: that you're getting the response rates you're expecting, and that you're constantly revalidating. Let me interject one thing: OpenStack is a means for an enterprise to provide services to its users. Everything is focused on what you need to do to provide those services reliably to your users; everything else is noise.

I've been talking about checklists in terms of software and technology, but I think one of the big learnings, when we talk about culture and mission-critical thinking, is that a lot of the time checklists are a forcing function: a way to force communication across the silos that Jason mentioned earlier, making sure that all the stakeholders have actually communicated and know what's going on before you put something into production. Not only to force communication, but also to create a tighter team of people working toward the common goal of delivering something into production. An interesting example of that: we were tuning a large cluster for a company, and we'd set up all of our tuning parameters, and then the DC ops people said, you can't use these tuning settings because they're going to exceed our floor-space power planning. So we had to detune the cluster by about 15% to keep the power utilization within tolerance. Nothing like the surprise of all your racks going out in the middle of peak load.

So checklists aren't about getting a recipe or getting the easy stuff done, especially when we're talking about communication. Checklists are there to facilitate communication when the inevitable bad things happen. When the chaos happens and you're having the near-death experience because the cluster is down and your environment is down, that's where you want to employ checklists: to think about what you need to go through to make sure you've covered the obvious stuff, and that you've checked in and made sure the environment is rolling the way you expect. As I said, when you think about checklists, they're not recipes; they're not do this, do this, do this. They tell you not how to do something, but what you need to do and in what order.

One of my favorite checklists from the aircraft world is what to do when the engine fails on your single-engine plane. It's a very short checklist, five items, and the very first item on it is: fly the airplane. That seems kind of ridiculous, but if you think about it, in the chaos of "the engine has died, how do I restart it, oh my god, I'm going to die," people tend to forget: oh yes, I'm still flying an airplane, perhaps I should do something about that first. How many pilots in the audience? Not many. For the pilots, you already know all about what I'm talking about with respect to checklists; the question is how we can borrow that and apply it to mission-critical OpenStack.

The other thing about checklists is that they aren't static; they're a tool for continuous improvement and for improving communication. Something went wrong, you missed it: update your approach. And the other critical point about checklists is to plan for failure more than success. In the aircraft example, there are a couple of checklists that cover the normal situation and a hundred that cover what to do when something goes wrong.
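To make that detuning example concrete, here's a back-of-the-envelope sketch, with made-up numbers rather than the customer's actual figures, of checking a tuning plan against a per-rack power budget:

```python
def detune_factor(node_peak_watts: float, nodes_per_rack: int,
                  rack_budget_watts: float) -> float:
    """Fraction of planned peak power to shed to fit the rack budget
    (0.0 means the plan already fits)."""
    planned = node_peak_watts * nodes_per_rack
    return max(0.0, 1.0 - rack_budget_watts / planned)

# Hypothetical numbers: 20 nodes drawing 500 W peak against an 8.5 kW rack.
factor = detune_factor(node_peak_watts=500, nodes_per_rack=20,
                       rack_budget_watts=8500)
print(f"detune by {factor:.0%} to stay inside the power envelope")  # 15%
```

A check like this on the pre-production checklist is exactly the kind of cross-silo communication that keeps the DC ops surprise from arriving at peak load.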
It may seem like putting more work up front into planning what you're going to do in production, but if you do it right and work across the teams, you're actually going to save yourself time and make this easier and repeatable. You can't afford not to do this; otherwise you'll never deliver reliable features to production that stay up and running.

So, to summarize: I'm not going to give you a bunch of checklists today. As we develop them within Mirantis, we'll share them publicly, and I'll give some examples in a minute of where we're applying a checklist mentality to the tooling we're developing for OpenStack. But think about what your checklists are and what the critical areas are where this needs to happen. Keep it simple: as I said before, a checklist is not a to-do list. If it's a hundred lines long, everyone is going to ignore it; we live in a very complex, interrupt-driven world, and it has to be short. Consider the stages, and the key communication points you need to think about when you're putting something mission-critical into production. And, as the Wrecking Crew says, never stop thinking about failures. Plan to fail: fail over, see what happens, and make sure you've done that already. Develop your checklists while you're doing that, not after everyone has had the near-death experience of having their entire environment fail.

I'm going to change gears in just a second. If you haven't read the book, and I only saw one hand out there, read it. It's a really short, really compelling read; I was definitely inspired by it in thinking about how we can deliver mission-critical OpenStack. Check out his book; you can read it in about three hours, it's not super dense.

All right, now we finally get to some meat: how we're applying some of this thinking. Jason talked quite a bit about the considerations you need to make from a technology perspective, and I've been talking about what you need to think about from a cultural perspective, so now let's put those two things together in some ongoing case studies we're doing with our partners. Part of the message we want to deliver to all of you is that mission-critical OpenStack is not something Mirantis can deliver on its own; it's not something any of us can do alone. Part of the beauty of open source is that it's going to take an entire community to deliver the kind of rock-solid, mission-critical experience the enterprise needs. That's why we're working with partners and customers, and why we want to work with all of you on raising production quality, on raising OpenStack to be the mission-critical infrastructure we all want.

So let's talk a little bit about what we're doing with Flextronics, who is working with us on an ongoing basis in their cloud labs. One of the questions we wanted to answer was: with the current state of tooling, what's the experience of trying to configure and deploy mission-critical OpenStack at a scale of about 60 nodes? The idea was to walk through a complete deployment with the current tooling, configure for HA, configure for mission-critical operation, and see what that experience is like and where we see some gaps.
And this was largely him, one of our product managers, doing it, not the core guys we send out to customer sites to tackle anything. Yes, so I had the advantage, or disadvantage I suppose, of not having the same technical depth as Jason and not knowing where all the gray areas are in OpenStack when deploying for production.

So what did we deploy? Flextronics was kind enough to provide 54 nodes of their Wolfpack 1U compute units, which we deployed across a couple of racks, along with 6 nodes of their Kenya 2U storage units. We had a lot of switches in there: 10 Edgecore switches, 5 per rack. The idea was that this is, maybe not at large scale yet, but not unlike what customers are deploying today. This is a bit of a spaghetti diagram, but it shows the two racks, the 10 switches I mentioned, and 5 networks. We deployed all of this according to the Mirantis reference architecture, for the networking as well as the underlying infrastructure. Briefly, we had a network for PXE booting, a network for public VM IPs, 10-gig networks for management and for the private east-west traffic, and of course a 10-gig network for storage. As I mentioned, we deployed all of this based on our reference architecture, and we'll make these materials available later. I know this is an eye chart for the audience, but it represents a bit of checklist thinking: we worked hard to think through the physical infrastructure of the network, the logical infrastructure of the network, and so forth, and for this particular example we deployed Neutron with VLAN segmentation.

As for the software components: we deployed OpenStack Icehouse, and we did the deployment using Fuel. We also deployed some additional capabilities: Zabbix for monitoring the infrastructure, Rally for testing and benchmarking against the cluster, and Ceph for block and object storage on the storage nodes. On top of that, at least in our first stab at this, we used Sahara to generate workloads on the cluster. A lot of the workload generation out there today is still in its early days, and spinning a VM up and down doesn't necessarily simulate a real-world workload, so we used Sahara with a Hadoop workload to generate something that looks more like what you would expect. Sahara is Hadoop as a service, for those not familiar; it lets you deploy Hadoop clusters in VMs very quickly and easily.

So what did we learn from standing up the cluster and doing this benchmarking? I think of this as food for checklists for later. We started small; we didn't throw for the touchdown, to use a sports analogy, and try to stand the entire thing up and make it all work in the first shot. Well, truth be told, we did try that, but it failed miserably. We all do this: oh, it'll be fine... and then it's two days of moving cables around. That leads to the next bullet point, which is: debug the physical infrastructure. Again, some of this is motherhood and apple pie, but we can't stress enough the importance of getting it right before you move on to the next step. We liken it to the Leaning Tower of Pisa: if you want to build an upright structure, you have to verify each floor before you build the next one. That's right, and some of this is slightly different thinking from what you do in a development environment, where typically the deployment is smaller and the environment doesn't have the same kind of requirements.
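As a rough illustration of what "verify each floor" can look like in practice (our own sketch, with hypothetical address ranges, not the actual lab's layout), this is the kind of script we mean for confirming every node answers on every network segment before layering OpenStack on top:

```python
# Hypothetical sketch: verify every node is reachable on every network
# segment before building the next layer of the stack.
import subprocess

NETWORKS = {                      # network name -> per-node address pattern
    "pxe":        "10.20.0.{}",
    "management": "192.168.0.{}",
    "storage":    "192.168.1.{}",
}
NODE_IDS = range(1, 61)           # 54 compute + 6 storage nodes

def ping(addr: str) -> bool:
    """One ICMP echo with a one-second timeout (Linux ping flags)."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "1", addr],
        stdout=subprocess.DEVNULL,
    ).returncode == 0

failures = [
    (net, pattern.format(node))
    for net, pattern in NETWORKS.items()
    for node in NODE_IDS
    if not ping(pattern.format(node))
]
for net, addr in failures:
    print(f"UNREACHABLE on {net}: {addr}")
print("floor is solid" if not failures else f"{len(failures)} problems to fix")
```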
Once we got the physical infrastructure solid and verified, we started layering in more complexity. We used Neutron with VLANs eventually, but before we tried to troubleshoot all of the switch configuration, we started with GRE, which is much simpler, just as a way to verify all the plumbing before we layered on additional complexity. Once we had all that going, we could confirm the health of OpenStack itself. If you remember, I mentioned that a lot of this checklist thinking we have an opportunity to create through automation: Fuel, for example, once you've deployed an environment, lets you run a complete health check of it. So here's an example of using automation to run through a set of steps to confirm the health of the environment before you go on to the next one.

Then, once all of that was good to go, we ran Rally to benchmark the environment and make sure things were working properly. Now we have a cloud, now we have a structure, but we're still not done: we need to verify that when we spin up VMs, they perform as we expect, and that the hypervisor performs as we expect. Jason has a set of tools he's written, and this particular example is not rocket science, but writing scripts to generate VMs and generate traffic between VMs is another way to check things off in the environment as you go: okay, yes, this is production-ready, it's mission-critical. And then of course we started simulating workloads with Hadoop as a service, using TeraSort to generate a lot of traffic on the network. There's also Hadoop HiBench, another tool that includes TeraSort, as a way to put stress on your infrastructure at the application level.

Once all of that was going, it was time to confirm the high availability of what we'd built. So let's go ahead and start failing services. We deployed our controllers in an HA configuration, so let's simulate failure: start shooting services, start shooting controllers, and verify that things continue as expected. As I mentioned with the health check, after we deployed the HA configuration we needed to make sure the environment was configured the way we wanted and that everything was set up for HA, and then we could simulate component failure. We started this journey of confirming resiliency using the human chaos monkey: me and Jason killing things ourselves. Going through killing off key services and verifying that everything keeps working, all of this becomes food for checklists and food for automation later.

For the moment we were a physical chaos monkey, but in the future perhaps a chaos gorilla is something we can take up within the OpenStack community. I know there are a few projects out there exploring the use of the actual Netflix Chaos Monkey as a way to simulate failure and cause chaos in the infrastructure on an ongoing basis. If you think about what Chaos Monkey is doing, it's really trying to cover all those failure scenarios and automate them. I mentioned earlier the aircraft example: for normal scenarios, maybe there's one or two checklists; for the failure scenarios, maybe there are hundreds we need to think about and have a set of checklists to go through.
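Here's a minimal sketch of that human-chaos-monkey step turned into a script. The host names, service list, and Keystone endpoint are assumptions for illustration, and it presumes SSH key access to the controllers; it kills one random controller service, then checks that the control plane still answers:

```python
# Hypothetical chaos-monkey sketch: kill one controller service at random,
# then verify the control plane still answers.
import random
import subprocess
import urllib.error
import urllib.request

CONTROLLERS = ["ctrl-01", "ctrl-02", "ctrl-03"]      # assumed host names
SERVICES = ["nova-api", "neutron-server", "glance-api"]
KEYSTONE = "http://192.168.0.2:5000/"                # assumed endpoint

def kill_random_service() -> tuple[str, str]:
    """SSH to a random controller and kill a random OpenStack service."""
    host, svc = random.choice(CONTROLLERS), random.choice(SERVICES)
    subprocess.run(["ssh", host, "sudo", "pkill", "-9", "-f", svc])
    return host, svc

def control_plane_alive() -> bool:
    """Did the API endpoint answer at all (any HTTP status below 500)?"""
    try:
        urllib.request.urlopen(KEYSTONE, timeout=5)
        return True
    except urllib.error.HTTPError as err:
        return err.code < 500     # 3xx/4xx still means the API answered
    except OSError:
        return False

host, svc = kill_random_service()
print(f"killed {svc} on {host}; control plane alive: {control_plane_alive()}")
```

In an HA deployment the expectation is that the load balancer routes around the dead service and the check still passes; if it doesn't, that's food for the checklist.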
We can do all of that if we start to build up an automated chaos gorilla, or whatever we decide to call it, and we'd love to chat with some of you about what you may be doing in that area, or about participating and helping us as we go further along and publish these as they evolve. I'll switch over to Jason. First, I want to thank Flextronics for their ongoing participation in providing the environment and the expertise as we do this together; again, we can't do this alone, it takes a community, and we appreciate Flextronics. We also worked quite a bit with Big Switch, and Jason can talk a little bit about that.

Big Switch had a 16-rack configuration, which was a nice, large OpenStack environment for us to play in, and we ran TeraSort on a 75-node Hadoop cluster we provisioned through Sahara. I'll talk a little more about some of the details, but the baseline, with everything working perfectly, was a TeraSort run of about 7 minutes 22 seconds. Then Big Switch started simulating failures of the spine switches, the core switches, across the cluster during the run. Ultimately, networking is the heart of your OpenStack cluster; if your network isn't stable, forget it, you can just go home. And the application didn't even notice they were injecting failures across the cluster, and this happened with a background load of about 42,000 VMs on the cluster. It was quite an exciting result, and we're looking forward to publishing more.

A very interesting thing came out of this: the cluster had a mix of server types for the hypervisors, and we didn't really think about that. We assume a flavor is a flavor is a flavor, that all m1.smalls or m1.tinys or m1.larges are exactly the same. They were in fact exactly not, and it took us a day or so to figure out why we had a 3x variation in the run times. We used a small tool from a little startup called Megafind that let us run quick infrastructure analytics on any given VM, and we saw over a factor of two variance within the flavors depending on the hypervisor. Once we re-ran the jobs all on the appropriate CPUs, we had very stable times, which was very nice. It's another example of food for checklists: by running this tool and running the analytics against the infrastructure, you're able to anticipate problems beforehand. Up until this, I would run sysbench and a few other things across all of the hypervisors, and I'd normally expect quite a spread in performance, but I didn't have any quick and easy way to run it across the whole set.

Let me throw in one more story. My friend David from Symantec is here. They have a fairly large cluster, and they deploy at the rack level, so working together we built a very extensive rack-level validator as part of our checklist. When a switch discovery happens in the spine, the ToR is booted with a bootstrap image, we power on the boxes, and we validate the switch ports and the cabling plan by looking at the DHCP traffic and the MAC addresses on the server ports. We bootstrap an image on every server, verify the full cabling plan, and burn in the entire rack before we release it for use in the OpenStack cloud or some other purpose. Having that baseline means everything else is simpler, because we know that floor is solid.
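As a sketch of the core idea behind that rack-level validator (purely illustrative, with a made-up cabling plan format; the real tool is far more extensive), the essential check is comparing the MAC address actually observed on each ToR switch port against the cabling plan:

```python
# Illustrative sketch of a rack-level cabling check: compare the MAC
# address observed (e.g. from DHCP leases) on each ToR switch port
# against the cabling plan for the rack.
CABLING_PLAN = {         # switch port -> MAC we expect to see there
    "tor1:eth1": "52:54:00:aa:00:01",
    "tor1:eth2": "52:54:00:aa:00:02",
    "tor1:eth3": "52:54:00:aa:00:03",
}

def validate_rack(observed: dict[str, str]) -> list[str]:
    """Return human-readable mismatches between plan and observation."""
    problems = []
    for port, expected in CABLING_PLAN.items():
        seen = observed.get(port)
        if seen is None:
            problems.append(f"{port}: no server seen (dead link or unpowered?)")
        elif seen.lower() != expected.lower():
            problems.append(f"{port}: expected {expected}, saw {seen} (miscabled)")
    return problems

# Example: eth2 and eth3 were swapped during racking.
observed = {
    "tor1:eth1": "52:54:00:aa:00:01",
    "tor1:eth2": "52:54:00:aa:00:03",
    "tor1:eth3": "52:54:00:aa:00:02",
}
problems = validate_rack(observed)
for problem in problems:
    print(problem)
print("rack validated" if not problems else "rack NOT released")
```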
There's a birds-of-a-feather session for the enterprise at 11 am in room 241, and there's the OpenStack enterprise mailing list. If you're interested in participating in the Wrecking Crew, you can mail Emmatob Shaw, and we'll publish further updates on the Mirantis blog. And of course you can also come talk to us, because as we've been saying all along, we can't do this alone; it's critically important to OpenStack that this be a community effort. We hope at some point to have a large-scale performance and scale testing lab that people can gate things through and use as a CI gate. And there are a lot more talks here at the summit that fit within what we're thinking about with the Wrecking Crew. I know it's another bit of an eye chart... what happened, maybe they pulled out the hook... okay, there we go. A lot more talks in this same vein, so please go check them out, and thank you. I'll put this slide back up in case some of you want it. It looks like we have a couple of minutes for questions. All right, well then, if any of you are interested in helping or want to learn more, come see Jason and myself after the talk. Thanks a lot for your attention, and we'll see you at the rest of the summit. Hope everyone has a great summit.