Hello everyone. We're here to talk about the enterprise readiness of OpenStack. As I was thinking about this presentation, I thought about being a father with screaming children in the back of the car on a road trip, arguing about all the problems and frustrations they're having in the back seat. We've been on a long journey with OpenStack. I know it's only been five years, but it's felt like a long journey, and there have been a lot of perennial frustrations we've been hearing from people adopting OpenStack. So we're thinking about this in terms of sitting in the car with the kids yelling and screaming, "Are we there yet?"

Some of the questions that have been asked about OpenStack include things like: it's impossible to deploy; it just doesn't work at scale; upgrades are really painful; it's going to fail when I need it most; it takes an army of people to operate; I can't even run my legacy workloads on this thing. Those are the kinds of things we hear from enterprise customers all the time.

So we wanted to introduce ourselves. I'm Kenny Johnston. I'm a product manager with the OSIC partnership between Rackspace and Intel. My paycheck comes from Rackspace, but I'm an OSIC-er through and through.

I'm Mike Postel. I'm a director at the OSIC engineering site in San Antonio. I work at Intel, and I work directly on the OpenStack Innovation Center out of the Rackspace building. It's pretty cool; if you haven't been there, you've got to check it out.

So, a little bit about the OpenStack Innovation Center. There are really three tenets, and maybe you've heard this, but we'll repeat it. First, we're inspiring the next generation of OpenStack developers. We're mostly targeting upstream developers, but we also have a significant amount of operations built into that; we think that's necessary for a full understanding. We've trained,
I think, over 200 people now. We've also hired a bunch of people within Intel for OSIC, and we work together with Rackspace to make upstream contributions.

Do you want to hit on the cluster? Yeah. Another component is the cluster. For those of you who attended the keynote, when they were talking about that massively distributed multi-cloud application, a significant portion of the compute power needed to do that came from the OSIC cluster that we donate. It's also available for the community to use, whether that's a project team or an enterprise considering OpenStack. They can get access to an up-to-date OpenStack environment, or bare-metal resources to test and deploy OpenStack on their own. So check out osic.org if you're interested in getting access to some of those resources. It's a really cool option.

And finally, with the people we've trained, along with the Rackspace experts, we're doing a lot within the community, with 100% upstream contributions across many of the different projects.

And we should say that the target for us is really enterprise readiness: solving this problem of "are we there yet," the frustrations we hear continually from enterprises who have considered using OpenStack and walked away, or tried it and failed, or tried it and are just really frustrated and struggling to gain adoption within their organization.

So we wanted to go one by one through these kid complaints, the first one being "it's impossible to deploy." As we talk through these, there are going to be some where we think there's been some history here.
This is largely a solved thing: we think there's less of a pain point in terms of enterprise readiness, and we'd give it a green check. There are going to be others where that's less so and there's more work to be done, and others where, frankly, we fully acknowledge the pain point is still there.

So, "impossible to deploy." Some history, for those who have been around OpenStack for a while: it used to be that there was little in the way of deployment tooling. It was a bunch of projects, and you had to go assemble them all yourself. The analogy that's been thrown around within the community is that OpenStack was like a room full of Legos, and you were told to walk in and build the Millennium Falcon, or a castle, or some other complex Lego structure, without any instruction guide. That history is actually far behind us now. There's a series of deployment tools available, whether from specific vendors or from OpenStack projects, that enable you to do deployments that are very robust. I think anyone considering deploying OpenStack by hand should first take a look at the deployment tools available in their preferred orchestration language, whether that's Salt, Puppet, Chef, or Ansible. There are all sorts of methods for deploying OpenStack. Sometimes it can be confusing because there are so many and it's hard to choose one, but there are lots of ways to deploy OpenStack where you don't have to get into the nitty-gritty of deploying each and every individual service.

Mike, do you want to talk about what we've done? Yeah. In OSIC, I really like the Lego analogy. You've got to have that guide in order to build the right thing, right?
So we continue to focus on the install guide, and there are several different things we're doing there that we'll talk about, really to help operators choose the right tool for their deployment. We also did a lot of work in the configuration process, really centralizing the options. When I say "we" I mean us, the community, all of us. With those options centralized, it's really easy to see: is this an advanced option, or a standard option? So if it's my first deployment, maybe I stay away from the advanced options. Then there's the feature classification matrix, to give you a better understanding of the features when you're deploying: how will this impact specific areas within your cloud?

And finally, we did this novice install exercise. I think we're at eight cycles of it now: we take novice installers, people who are very new to OpenStack, hand them the install guide, and say, "Go build a cloud." The first one took 40 hours. Through the iterations and the improvements to the install guide, the most recent one was at six hours, so I think that's an 85% reduction. Some really good stuff in there.

And on the novice install, it's important to point out that we're talking about starting from bare metal. I just have machines racked and cabled in my data center, and I end with a deployed OpenStack. That includes provisioning the host OS, preparing the networking, and deploying OpenStack, so that I have Horizon up and can run the Tempest tests to verify. That's pretty impressive, to be honest. It's been a lot of work. We try to take a UX bent to it, asking: what is the experience of an operator here, and where can we solve the pain points?

So we've put a lot into deployment. Again, if I'm grading this one, I yell at the kids and say, "Stop complaining."
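As a quick sanity check on the novice-install numbers quoted above, going from 40 hours down to 6 hours is indeed an 85% reduction:

```python
# Novice-install durations quoted in the talk, in hours.
first_run = 40
latest_run = 6

# Relative improvement: (old - new) / old.
reduction = (first_run - latest_run) / first_run
print(f"install time cut by {reduction:.0%}")  # → install time cut by 85%
```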
This is largely a solved thing; there are lots of tools available to help you.

The next one is "it doesn't work at scale." The history here has been interesting. There were times, maybe two years ago, when people would say the control plane of OpenStack just can't scale beyond 200 nodes. You'd hear about things like RabbitMQ not scaling, or the database not being able to scale to the size needed for large cluster deployments. In that time the community has responded: all of the deployment projects I've talked about have very scalable control planes, and many if not most of them have been proven at 500-plus node scale. So again, if you'd considered OpenStack in the past and thought it didn't meet your scale requirements, we think we're now in the range a standard enterprise needs. Five hundred to a thousand nodes is the aspirational scale where anyone would want to know: yes, if I deploy this thing, it can get there. We're largely in that space, and part of that is a lot of validation that the OSIC team has been doing.

Yeah, and we're really seeking some proof points as well when we talk about validation. I'll just add, before we go to that, that the OSIC cluster we talked about did scale to 1,000 nodes using OpenStack. And in the testing within OSIC, and we're also seeking additional support here, we're doing a lot of testing around scale: where do things fail, what do we find, where do things break?
We're also making some upstream contributions around scale, whether that's cells or regions or federation. For those of you who don't know, those are the three methods within the community to scale OpenStack: you can scale out with cells, scale out with regions, or scale with federation. Another thing we're working on is providing more robust guides for when to use which option, and what the best practices are for each. So that's future work the OSIC team is doing to keep pushing the scale boundaries of OpenStack. And I should point out that it's not just OSIC: you heard a stream of presenters in summit sessions and in the keynotes talking about OpenStack at truly massive scale, far more than a thousand nodes. Some of that is not easily enterprise-deployable, but much of it is, and we've tested 2,000 nodes that were very easily deployable.

The next one is "upgrades are painful." This is a picture of an asphalt machine repaving a road. The old pain point was that, essentially, if you wanted to upgrade OpenStack in the earlier releases, you had to repave your entire environment. I can speak from my company's experience: we'd have to ask customers to migrate entire workloads to a new cloud, decommission an entire environment, and just move to a whole new set of physical infrastructure. In those days, that seemed crazy.
Obviously that is not a recipe for success in any piece of software. It's also, in our belief, kind of an existential threat to OpenStack itself. What the community and all these developers are working on is about bringing innovation and newness and updating a piece of software, and if you look at the people who are using your software and realize they're scared, or have too much pain, or are incapable of getting to that newest release, all of that is for naught. We saw this in things like the user survey: you would see long tails of people, sometimes nine-to-twelve-month windows, before they adopted new releases, and long tails of people still on earlier releases. We're starting to see improvement in that. If you look at more recent user surveys, though it's a stacked-rank graph, the graph has been slowly moving toward the right.

Part of that was efforts by individual project teams; it was largely on each individual service to improve the way it was architected so that upgrades were less painful. The community has gone about organizing this effort through something called tags, and if you're not familiar, there's a web page on openstack.org called the Project Navigator. It's a way to view these tags: a way to look at a project like Nova or Manila or Ironic and see what the project does, along with information about the project, some of which is about its attributes.
One of the tags is "supports non-destructive upgrades," another is "supports rolling upgrades," and there's a new tag about zero-downtime upgrades that Mike will talk about. The community has responded, I think, pretty robustly. One of the first projects to get rolling-upgrade status was Swift, with Nova right behind, and we've now seen a number of other projects follow down this path. So we're improving upgrades significantly. In newer versions of OpenStack, you can rest assured that while there might still be some downtime in a couple of core services, the vast majority of your control plane is going to remain online during an upgrade. It will be far less painful than some of the horror stories you might have heard in the past. And Mike can talk about the work we've been doing in OSIC.

Yeah, in OSIC, you mentioned Nova and Swift; we've really focused as well on Keystone, Glance, Cinder, and Neutron, making sure those can get to that status too. All of those now do non-destructive upgrades, which you were hitting on. From a rolling-upgrade perspective, we're actually focusing a lot on Neutron now, and there's a big effort there. Also, when we talk about zero downtime, that's different from the rolling-upgrade tag. I don't want to get too far into the definitions, but with a rolling upgrade you could still have a small bit of downtime as you go from one service version to another while they're running at the same time. We're trying to get to zero, where we don't have that interruption at all. So we want to continue to push on upgrades. I agree with you that this is a detrimental thing for OpenStack, and I'm glad the community is focused on it, and we're glad to be able to help there as well.

So, "it will fail when I need it most."
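Before moving on: the rolling-upgrade pattern Mike just described can be illustrated with a minimal, self-contained sketch. This is not OpenStack code; the node names and release names are made up for illustration. The point is the invariant: controllers are drained, upgraded, and re-enabled one at a time, so the pool serving API requests is never empty.

```python
# Illustrative rolling upgrade: drain, upgrade, and re-enable one
# controller at a time, so the service pool is never empty.
nodes = {"ctl1": "mitaka", "ctl2": "mitaka", "ctl3": "mitaka"}
in_service = set(nodes)  # nodes currently serving API requests


def rolling_upgrade(target="newton"):
    for name in list(nodes):
        in_service.discard(name)  # drain this node from the load balancer
        assert in_service, "pool drained: this would be downtime"
        nodes[name] = target      # upgrade the node while it is offline
        in_service.add(name)      # put it back behind the load balancer
    return nodes


print(rolling_upgrade())  # all nodes upgraded; the pool was never empty
```

A zero-downtime upgrade, by contrast, also requires that mixed-version nodes can serve the same requests side by side while the loop runs.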
This is really about control-plane reliability. If you think about the promise of OpenStack, anyone moving from a more traditional infrastructure platform to a cloud infrastructure platform like OpenStack has to make a mental shift, the same cattle-versus-pets analogy, for those who are familiar. We have to shift from "I am relying on my infrastructure, my data plane, to be reliable" to "I am relying on my control plane." My ability to provision a new virtual machine is what makes my application reliable, because my application is handling that reliability and provisioning new infrastructure if something goes wrong. So with a cloud application on an infrastructure platform, my application can no longer scale or remain reliable if the control plane is not available.

So we're really talking about the reliability of the OpenStack control plane, and at first it wasn't that reliable. Again, think about some of those issues with RabbitMQ, or other services failing, and hosts going down. But this is one that I think gets a green check. The response has been: if you look at any of these deployment projects available in OpenStack, they all deploy an HA control plane. Most of them deploy into some sort of container structure, so you can share control-plane nodes among services while keeping some isolation, which means less interruption or inter-service conflict that might cause reliability issues. Those are the kinds of innovations the community has responded with, and I think you can look at almost any one of those deployment projects and rest assured that it's going to deploy a reliable OpenStack control plane for you. And where we know there are failures,
that's again where OSIC is focusing at this point, more from a testing perspective: understand those failures, file the findings and bugs upstream, and go from there.

Yeah. One of the specific things around failure testing that we're doing is building a third-party CI to test that when you deploy OpenStack it can respond to specific failure events, whether that's high I/O traffic on your control nodes, an entire control node going down, or a specific service failing on one of your control-plane instances. That's the kind of thing where we're taking it to the next step: not just saying we're comfortable that all of these tools do it, but putting actual mechanisms into the CI/CD process, so we're gating and making sure that no new code regresses your ability to maintain that reliability under certain failure scenarios.

"It takes an army to operate." For those who can't read it, the letters on this keyboard say "quit now." We often hear from people who are considering adding OpenStack to their suite of enterprise infrastructure platforms that they're scared they're going to lose their weekends, that they're going to be on call 24/7, that they're going to get the pager-duty call at 2 a.m. and have to respond to this thing. They're just not comfortable that it will operate long-term. Some of that, I think, has to do with the fact that it's new, but it also has to do with the fact that, unlike other infrastructure platforms you've probably run in an enterprise environment, there's not a great set of operator tooling here, right?
A lot of it is just "hack at it and make it work." If you think about other tools you've used, there are really robust GUIs that help you operate them. One of the key metrics we look at when we talk to customers and users about this is the VM-to-operator ratio: how many operators do I have to have if I have, say, a 10,000- or 20,000-VM deployment? In OpenStack, that ratio is not that great today, to be completely frank. This is one of the areas that I think we all have some concern about. There is no great cloud-modeled, open-sourced tool for doing operations on OpenStack, something OpenStack-native that takes the cloud mental model into account. There are some fledgling initiatives in various parts of the community, but I don't think any of them would claim to be fully there, robust and reliable for operator deployments. And of course there are a lot of downstream initiatives: lots of the vendors, and if you work with a managed service provider or a distribution vendor, many of them have their own tools that do provide this kind of robust capability. But as an upstream-focused organization in OSIC, we would really like there to be this kind of tooling in the community, and it's something we've put out as a challenge to the community, to see if they can come up with it.

The last one is "I can't run my legacy workloads." This is the perennial pets-versus-cattle problem, right? In the OpenStack sense, though, it wasn't really just about pets versus cattle in the beginning.
We were talking about sub-90-percent data-plane reliability reported by some operators, and this is, again, years ago. But the approach the community has taken has been really spot-on. They've basically said: we want availability in an environment, and we want the bugs fixed, to the point that operator-induced failures no longer require downtime. Think about it: if you're operating a fleet of OpenStack and you have to apply a zero-day patch for a security vulnerability, that could require you to reboot every single compute host in your environment. Should that mean every single VM, and every single cloud-native application, has to respond to a VM failure and spin up new VMs? That seems onerous for applications looking for reliability. So one response we've seen is the effort to increase VM availability to the point where it can survive operator-induced failure. The secret sauce that we in OSIC, and Rackspace and Intel, have all been contributing to is improving live migrate. If you have a really good success rate and a good deployment architecture for live migrate, you can get around some of those operator-induced problems.

Yeah, you hit on live migrate and the operator-induced piece, and the next level is really to detect when that failure is happening and then go fix it on the fly. We want to get into that space with the community as well, so we're kicking off some efforts there, alongside the continuing ones within the community.

And again, I know there's lots of conflict about how reliable OpenStack should be. Should it be pet-ready? Frankly, the position we're taking is that it should be as reliable as we can easily make it with operator tooling.
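The detect-and-migrate idea can be sketched in a few lines. This is a toy illustration, not OpenStack code: the host and VM names are invented, the `capacity` limit is an assumption, and the list append stands in for what would really be a live-migration call (e.g. through the Nova API). When a host is flagged as failing, its VMs are moved to the least-loaded healthy hosts rather than destroyed.

```python
# Hypothetical inventory: host -> list of VMs it currently runs.
hosts = {
    "compute1": ["vm-a", "vm-b"],
    "compute2": ["vm-c"],
    "compute3": [],
}
healthy = {"compute1": False, "compute2": True, "compute3": True}


def evacuate_failing_hosts(hosts, healthy, capacity=4):
    """Move VMs off unhealthy hosts onto the least-loaded healthy ones."""
    for src, up in healthy.items():
        if up:
            continue
        for vm in list(hosts[src]):
            # Pick the healthy host with the fewest VMs and spare room.
            dst = min(
                (h for h, ok in healthy.items()
                 if ok and len(hosts[h]) < capacity),
                key=lambda h: len(hosts[h]),
            )
            hosts[src].remove(vm)
            hosts[dst].append(vm)  # stands in for a live-migration call
    return hosts


print(evacuate_failing_hosts(hosts, healthy))
```

After the run, `compute1` is empty and can be patched and rebooted without any application noticing.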
If you know that a node in your environment has a problem or is about to fail, you should be able to detect that and migrate workloads off, as opposed to just immediately destroying it in your environment. That's the mental model we're taking there.

So, just to wrap up, I'm looking at our enormous traffic signal here. It's a little confusing, but Kenny hit on each of these items as we went. We've given a green in the deploy space: there are really several tools to choose from, and we talked about improving the install guide. There's still more to do here, but we gave that one a green. In scale, there have been efforts and we've definitely seen improvements: we've seen our cluster get to a thousand nodes, and there were other success stories we talked about, but there's more to come, so we gave this one a yellow; the kids are right here. Feel free to add more, Kenny, when you're back with your drink.

Yeah, on upgrades, we want to communicate that upgrades are far less onerous than they have been in the past, but there's still work to do. Then in reliability, there are multiple tools in that space; we talked about that quite a bit. Operations? Yeah, that one's definitely red: the providers provide that tooling, and that's where you're really going to get that support, or you have your own tools. And then finally, pet-ready.
We just talked about that one, with live migration and then auto-migration. So, from our perspective, that's the score we gave on those six items the kids were screaming about in the back. There's some real reality here, but we also want to show that there should be confidence in several of these areas, with significant improvements coming as well.

Yeah, we've made a lot of progress. There's been a history of fear and doubt about OpenStack's enterprise readiness. There's been a lot of analyst research about OpenStack being ready, and it's obviously being adopted by enterprises all around the world. There are still pain points, but we think we're on a great path to solve them, and through the community, and with OSIC's help, we're going to get there shortly. So that's all we had; we're happy to take questions. Looks like we have a mic, but I don't know if you have to walk all the way up here to use it.

Good question. Yes, there are, so let me tackle that one and repeat it. The question was: we talk about scale, and we kind of put the boundary at 500 to a thousand, but aren't there people who want more than that? Is that really where we want to end with scale?
You know, I'm kind of limiting myself to the enterprise use case. Again, there are certainly vast enterprises, we've heard keynotes from Walmart and others, that are using this at very massive scale, but I think of a typical enterprise workload as being in that range. Rackspace obviously has a public cloud that is many times that range. But we were trying to target a barrier to adoption for enterprises today, which was that OpenStack wouldn't reach that scale. If you think about the broader market of enterprise use of OpenStack, that's really the sweet spot we were trying to target: making sure we could hit that scale.

I have a frog in my throat, but I don't disagree. And cells is out; my understanding is that multi-cell support is in Newton, so it's complete, but frankly we haven't seen a lot of people use it, and within OSIC we haven't really tested it. That's one of the things we will be doing: testing the different methods of scale, including regions, cells, and federation. Does that answer your question?

[Audience member] The same applies to Microsoft and so on. I mean, I think we are there already; we are comparable to these enterprise solutions, so to say. And that means we should continue to push the message. It's important, because many people don't get it so far, but I think we are ready there.

Yeah, that's a really good point, and just to read it back for the recording: the point was made that, compared to other enterprise software available, OpenStack already scales far beyond what many of those products are capable of. Not that we should be satisfied with that, but we shouldn't be so concerned about scale if we're already surpassing some of those other scale metrics. Any other questions? Cool. Thanks everyone, we appreciate you coming.