All right, I guess that's started. Hello everyone, I'm Nate Becktold. I'm an enterprise architect at EBSCO Information Services, and I wanted to give this talk because we've deployed OpenStack, we've been live with it for a little while now, and we have an interesting story: we're kind of your average large enterprise, not at the scale of a lot of the hyperscale companies you see presenting at summits. So I wanted to go over the practical issues an enterprise our size encounters getting to live with OpenStack, and how we wound up solving them.

So, EBSCO: we are a discovery service provider. We provide online research content, mostly to libraries and academic organizations. If you've gone to a college or university, there's a good chance you've used our product. We also serve electronic journals and ebooks. From the back-end perspective, we serve about 300 million back-end searches per day, so we do serve a fairly reasonable amount of content.

So, what did we need? It started around two to two and a half years ago.
We started looking at infrastructure automation solutions, and it's the typical story: there was interest in public cloud, interest in self-service provisioning environments for the development teams, and in full-stack automation. So we kicked off the typical investigation effort. At that point in time, what we wanted was really just to focus on increasing the productivity of our teams, and to focus on automation. We wanted to lower our costs by going with open-source solutions. And one thing we were really looking for was a solution that integrates well with other products and allows tools to integrate with it through really good quality integrations. What I mean by that is, there are a lot of solutions out there whose idea of an integration is a plug-in to a user interface. When you're talking about automation, you're looking at APIs, and those types of integrations are almost useless. So we were trying to find a good solution to abstract our underlying infrastructure, and we wound up settling on OpenStack.

So why did we settle on OpenStack? At that point in time we were looking very strongly at public clouds (we still are), and OpenStack gave us a very easy-to-consume API that had a very nice methodological alignment with AWS. You don't have really good API compatibility, but from a methodological standpoint the objects line up: an instance aligns to an EC2 instance, a volume aligns to EBS, and the same kinds of actions are possible in both environments.
So you're solving a lot of the same problems in both. A key component for us was abstracting our underlying infrastructure, because if you try to automate, say, a traditional virtualization provider, what you wind up with are a lot of leaky abstractions. You get a lot of changes in the hardware configurations, or say the datastore or storage configurations, that bubble up to consumers, and whenever you change them, it affects your automation. So we wanted a stable platform that could really abstract the underlying implementation away from the people building automation, away from our development and operations teams. We wanted a standard interface for getting compute, network, and storage.

Another key point for us was that when something integrates with OpenStack, it integrates a lot more consistently than with a lot of other solutions. If you look back at the traditional virtualization side, you'll find things with VMware support, or support for other hypervisors, but whether you can actually leverage that solution in your data center is another question. How your VMs get IP addresses, whether you use an IPAM solution or DHCP, all of that can vary from environment to environment. So if you're looking for an end-to-end solution, you often don't get as much out of the box as you do with a solution like OpenStack.

We also wanted to build an infrastructure-as-a-service platform that was fit for our live services, one we could hand out to a diverse set of teams with confidence in the level of project-level isolation, so that they wouldn't be able to interfere with each other. It's a big tenet for us to be able to say to our consumers: if you break it, it's our fault or it's somebody else's fault; it's not your fault.
Don't be afraid to use the product. I'd much rather tell them that than hand them a product along with a large page of best practices and things they should never, ever do, or else some admin is going to hunt them down and say, "You shouldn't be doing that." Nobody likes that. No one wants to use a product like that.

So what's our current scale? These numbers are a little bit old now, but we have three separate OpenStack clouds: one backing our development resources and two backing our live environments. At that point in time we had approximately 1,259 running instances; that was a snapshot, and obviously it goes up and down all the time. One interesting metric: since we made OpenStack generally available, we've had almost five hundred thousand instances created and destroyed. We find that a very interesting metric to track. About 68% of the workloads are concentrated in development environments, because that's where we have a lot of development teams; they've had access to the environment the longest, and they spin up a lot of infrastructure to support their development efforts. Around one-third of our virtualized workloads are currently on OpenStack.

So, our design philosophy. We really wanted to build a platform to run our production applications. What we mean by a platform is that we're providing a platform service; a significant part of it is OpenStack, but really the platform encompasses all of the tools and integrations you need to bring an application from development and host it in a live environment.
That brings in a whole bunch of other toolsets as well. We wanted a solution that was multi-tenant at its core, so we could safely give a development team and an operations team different projects on the same physical infrastructure and not be concerned about them creating security issues or affecting each other in any way. All the tools that were part of our platform had to be highly available, and they needed to be production grade. "Good enough for development, but not for production" is not an acceptable permanent state in our case, and there's a lot of reasoning behind that philosophy: developers and people building automation don't want to code to a stack they can't use in live. You get your value in live, so if you start making solutions that kind of work in dev and then transition to a completely different solution in live, that's not a good strategy. So we basically said: everything we put into this platform, everything we put into OpenStack, every project, has to be something we at least have a path to live with. We didn't want to put any toys in it; we wanted to make sure all of our offerings were built for general purpose and customized as little as possible.

We provide a menu of infrastructure offerings. Once you get OpenStack out there, you always wind up with people saying, "I want this specific configuration of cores and memory, and you don't have it in your flavor model; can you please build me that flavor?" If you go down that path, what you wind up with is a lot of different flavors, flavors with application names in them, and all sorts of crazy things you don't want to have to support. So we drew a hard line and said everything we put in here should be general purpose. If you ask for a flavor we don't offer, and we can find a good general-purpose use for it,
yes, we'll gladly add it. But we won't add one with your application name in it. The same goes for all of our storage offerings and everything else we've done through OpenStack.

The other part is we wanted a solution with good safeguards that would encourage experimentation. Development against any kind of platform is a lot easier when your developers can actually experiment on it. They don't need to worry about bringing down the entire environment, and it makes them go a lot faster, because they don't need to fully research everything they're going to do; they can experiment, try things, and find out what works, and that gets you a much faster feedback loop.

So, what does our current architecture look like? I'm going to go into this in a little more detail later, but right now, for our monitoring solution on top of OpenStack, we're leveraging Zabbix. For an operations dashboard we use Rundeck; we use it to power a lot of the automated jobs we built to operationalize OpenStack. For some of our metrics dashboards we use Grafana. We use all the core OpenStack components you see in the presentation, and we also use a product from Avi, a software-defined load balancer, for all of our load balancing within OpenStack.

And here's some of what we learned. We had quite a few problems to solve, and they fell into a bunch of broad categories: skills and training, selecting vendors and integrations, the actual deployment of OpenStack, adoption, and productionizing it to get it to live.

So, the first part, and you'll probably hear this all throughout the summit: OpenStack skills are very hard to hire. They are very hard to hire. And if you can't make a direct OpenStack hire, you need somebody with very good Linux administration experience. That seems to make sense, but what usually winds up happening in a lot of organizations like ours, at least initially,
is that your virtualization team ends up taking over the OpenStack POC, and you get a lot of VMware skill sets but not necessarily Linux administration, so that can be a gap that has to be plugged. And inexperienced administrators on top of OpenStack can do large amounts of damage. When you're running as admin, the safeguards are off, the safeties are off, and you can do terrible things. So you need to understand the system very well to be an admin.

How did we get around this? We decided that through our POC we would develop a core group of subject matter experts who would be very hands-on, become experts, and train the rest of the team. We decided very early on (we actually attempted to go the other way originally), and our advice is: don't waste learning opportunities through over-reliance on professional services. If you want to use professional services, it's a good idea to use them for architectural guidance, but it's not a good idea, at least in my opinion, to have them come in and actually install OpenStack for you. You lose a very big learning opportunity for your team, and ultimately they just get handed this thing over the fence that they don't really understand how to operate, or how it was installed to begin with. Yes, letting them do the installation the first time means it's going to be a little bit harder, but everything they learn is going to be directly transferable to their job of administering it.

When you look for new candidates, look for people with strong Linux, networking, virtualization, and Python skills. Don't look for direct OpenStack experience; if you get it, great, but we've had much better success looking for people with a strong foundation and then training them on OpenStack. Give your team the opportunity to experiment and learn how OpenStack works. People learn by doing, and if you don't give them a safe sandbox environment to
experiment in, they're going to have to experiment on real environments and potentially cause issues.

And vendor support: if you go with a vendor and get support, that's going to lower the amount of expertise you need to get to production. You always have that phone to call if some issue arises and you don't know how to deal with it, so you can get to production sooner and with fewer people. At least, that was the approach we took.

Vendors and integrations. There are a lot of vendors who integrate with OpenStack today, tons of vendors, with varying degrees of quality in those integrations, and there are lots of established vendors. Ultimately, whatever vendor you pick, with a direct or indirect integration with OpenStack, you need to approach it with the philosophy that everything under your platform is going to give your development and operations teams the tools they need to deploy and manage a highly available application.

As for our preference: we strongly prefer products that integrate with OpenStack's multi-tenancy model. It makes it really easy to get integrations.
It makes it really easy to get integrations So and a lot of them will align with the OpenStack project model So their existing OpenStack login will get them access to all of these other services that they need that's a really clean and nice model Focus on vendors who are building for cloud natively rather than trying to integrate and tack it on to a product that wasn't built for it Look at areas everywhere to improve your stack reevaluate all your product decisions There's high value when a product integration is done correctly under OpenStack It can work really well and increase your adoption And also you're never going to know how good an existing vendors integration is until you actually try it There's a lot of hidden landmines with missing support or API capabilities things like we've encountered sender drivers it didn't support snapshotting functionality and from your develop from a developer's perspective if you were relying on snapshots it's a breaking API change So you're not going to really know How good the integration is until you actually get it up and kick the tires So a good case study we when we first approach we said all right, let's let's integrate with our existing load balancing system The existing vendor they had kind of a limited open stack knowledge a to bare bones implementation But we tried to integrate it under Elbas We encountered issues we encountered a lot of issues We actually encountered a bug in their product that wound up a kernel panicking it many times And when we went to go approach support, we got this interesting answer back For now to avoid failover. I would recommend to program the open stack not to delete IPs So what that told us OpenStack wasn't really a first-class citizen for this vendor They really didn't even know what it was from their support perspective Or the fact that the code that was running there was their Elbas driver In our case Elbas v1 at that point in time. 
It was very limited, and we didn't think it would cover very many production use cases at all. And fundamentally, our load balancing product, when we weren't using LBaaS, really didn't support safe multi-tenancy. It was really hard to give somebody access and say, "Hey, you can't break it; you can only break your own things," because there were all these shared resources where, if people weren't knowledgeable enough, they could cause an outage. It was a prolonged evaluation, ultimately resulting in rejection after about six to eight months, and we went back to the drawing board.

This is when we brought in Avi, and this was a really nifty product, because it was built for cloud from the get-go. Our installation process was basically: deploy the controllers, point them at your OpenStack, and it provisions and manages all the load balancers underneath automatically. That allowed us to move really fast. We got a basically production-grade load balancing solution into our development environments and sent a general availability announcement to all of our teams, "Hey, everyone with access to OpenStack, you've got access to load balancing now," within a week of actually purchasing the product. So we were moving really fast.

Its multi-tenancy model aligns with OpenStack's project model. There's a tenant inside the Avi product that aligns with the tenant inside OpenStack, so when we give somebody an OpenStack project, they automatically get the ability to do everything they need from an infrastructure and load balancing perspective. There's no giving them a separate login, or granting access or permissions to different systems; it's just baked in. And that was really powerful.
The other part is that this product has a very strong insight and analytics module that the development teams really enjoyed, and this helped us: it made people want to move to the new platform, to move to OpenStack and under Avi, because it was naturally incentivizing. They really liked the functionality they were getting. So that was a good example of dethroning an existing vendor in favor of one that approached cloud as a first-class citizen.

All right, problems to solve on the deployment side. Deployments of OpenStack take a long time, and they tend to be very complex. The story is getting a lot better than it used to be, but for someone approaching it for the first time, it is still a very complex problem. And a lot of functionality in OpenStack isn't ready for production, and it's not always obvious what is and what is not, especially to a newcomer.

From our experience, when you do your deployment, one of the biggest things to do is align all of your resources, all of your manpower, across the storage, networking, and data center teams. Make sure that supporting this installation and troubleshooting its issues is a top priority for everyone in that tier, for anyone allocated to this project, because OpenStack requires really tight integration with all of your infrastructure components, and otherwise you get a very slow troubleshooting feedback loop. Basically, when you start encountering issues with, say, networking, or storage access, or an account that doesn't work, if you have to go back and submit a ticket, that is going to make your deployment take a very, very long time. People are going to lose context while they sit there waiting for the storage team or the network team to respond to that ticket. So it's going to directly affect the time it takes you to deploy, and it's going to directly affect the quality of the product at the end of
the day, because it's hard to keep your mind in the right place when you get those constant interruptions.

You need to understand which deployment choices are difficult to change afterwards, and make sure you get them right. There are certain ones, like core networking drivers, some storage decisions, or whether you want an SDN platform, where you really get one chance to do it right at deployment, and if you want to undo that decision, you have to redeploy a whole OpenStack cloud. There are others that are easy to change afterwards. So it's important to identify which ones are going to be very hard to change, and make sure you've done your due diligence on them before going forward. And assume it's going to take you multiple tries to get a production-ready configuration; you're probably not going to get it on your first go.

On the adoption side: adoption is one of the most critical elements of success in any private cloud. I'd strongly recommend having a really close relationship with your early adopters. You're going to help them by giving them access to something they want, and they're going to help you increase the resiliency of your deployment. Go up there, speak to them regularly in person, help them understand OpenStack, help them learn it, and be open when they tell you about problems. That will help when they say, "I tried to provision some instances yesterday and they failed." Now your team can go in and troubleshoot why they failed: was it because somebody restarted nova-compute at a certain time, or maybe there's an issue in your messaging tier? That feedback early on, and that close relationship, will help you build a much more resilient product at the end of the day. Basically, do the opposite of what a lot of people do: have them tell you every little hiccup the system has early on. That will help you a lot.
Get the deployment into users' hands as fast as humanly possible. There's only so much you can do in a POC environment; once it starts taking real workloads and you have real people on it, that's when you start seeing how well it's working and what you need to fix.

And this one was very important for us: don't stall getting into production. Teams do not want to code to an API that they cannot use in production. Yes, we got a lot of early adopters, but then we had a lot of people sitting there waiting, saying, "This is really interesting to us, but until you can give me a path to production with this, I'm not going to commit any development time," and that is going to limit your adoption. Merely having the environment available in production will increase your adoption rate considerably. From our perspective it was almost exponential growth from the second we put it into live; I think our environment started doubling approximately every six months from that point on, literal exponential growth. And early feedback is really critical to this process.

So, productionization. OpenStack provides a lot of building blocks, but some assembly is required to actually build a product out of it that's ready for production. Monitoring and common operational tasks are really not solved out of the box. OpenStack gives you all the little Lego bricks, but somebody or something on your side has to actually piece them together and make them work.

One of the biggest things we found successful was monitoring OpenStack by actually using OpenStack: provision instances, provision volumes, attach them to instances, exercise the functionality. OpenStack is a complex system, and trying to approach it from the bottom up and figure out what the impact of an error is, is really hard. When you have that top-down monitoring, when you say, "All right, an instance is failing to provision," you know what the impact to the
customer is, you know what the impact to your users is, and now you have something you can troubleshoot a lot more easily. That's where we had our most successful monitoring: that tier was what would always pick up issues and help us troubleshoot. OpenStack is really complex, and finding the effect of a failure is a difficult problem. For adoption, it's important that you find these issues before your users do. If your users start finding these problems, and you don't know about them until they're reported to you, they're going to lose faith in the resiliency of the product.

Another important bit: automate, completely automate, common operational tasks. In our case that means things like taking a compute node or a control node out of service, restarting OpenStack services, and some elements of patching. Any common operational task: automate it. In our case we used Rundeck as the single pane of glass that people go to to run these tasks. That lowers the barrier to entry for people administering OpenStack, and it helps plug the gaps and assemble those Lego bricks into a cohesive story.

This one might be a little controversial, but I would say OpenStack HA is complex, and it is needed for all environments. In our development environment, originally, we did not do an HA installation, and what we found is that a lot of tasks and troubleshooting that involved restarting OpenStack services or testing out configuration were almost always disruptive.
It is almost impossible to make a non-disruptive change when your environment isn't highly available. Also, you want to make sure you have an adequate testing environment for any changes you make, because an HA environment is a lot more complex than a non-HA environment, and they behave very differently. So from our perspective, we said everything has to be HA from the ground up, and that includes development; we treat our development environment exactly like a production environment.

So, what did we actually do? What did our process look like, beginning to end? We started out on Havana. We did a prototype, the usual DevStack all-in-one machine: learn the basics, validate the direction. Keep it a disposable environment, because it will probably go down and you will probably trash it, so make sure anyone using it understands that. In our case, we did blow it up many times. Then we transitioned to what we called our interim environment, where we broke apart compute and control and started getting experience with a distributed environment, got feedback from our users, and determined the desired configuration. Then we went to a highly available environment on Juno, which we treated exactly like production: we announced it was generally available for all development workloads, and started determining the tasks we needed to complete to actually take this thing into production. And then we finally went to production after we figured out the problems we needed to solve.

But what actually happened is we had this big delay, which I talked about earlier in the presentation, between dev and production. There were a lot of reasons for it. One issue was that a critical team member left, and we spent too long looking for somebody with an OpenStack skill set to backfill that position, so we lost a lot of time. We wound up going with somebody with strong Linux administration experience, because we had already lost too much time
looking for somebody with OpenStack talent. We also figured out that additional work had to be done on monitoring and operations before we were confident having this host production workloads, and a lot of the required skill sets weren't part of the OpenStack team at that point in time.

So our solution was to create a focused squad. We kicked off basically a six-week effort with a cross-functional team, everyone we needed to get the job done. The requirement was that anyone on the team had to focus 100% on this project, no matter what else was going on. Our director had a good quote: "Set your email to out-of-office if you have to." It had to be the top priority for all of the members, and the focused effort was incredibly efficient. The feedback loops for troubleshooting were dramatically reduced, there were very few blocked tasks, and when tasks were blocked, they weren't blocked for very long. It gave us a higher-quality implementation at the end of the day. I think it took us several weeks to a month from first install to get the first dev/QA environment out there, and we had both of our live environments done, I think, in a couple of weeks after that, and that was with the enhanced productionization and additional documentation tasks. So that focus was incredibly important.

What the focused squad did was create a reliable monitoring solution based on Zabbix, and we wrote a Python framework to execute checks against OpenStack; that's the part that does the instance provisioning and reports on health.
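As a rough illustration of what such a top-down check might look like, here is a minimal sketch. This is not our actual framework; the `compute` client and its method names are hypothetical stand-ins for whatever SDK wrapper the checks use:

```python
import time

def check_instance_provisioning(compute, image, flavor, timeout=300, interval=5):
    """Top-down health check: provision a real instance, wait for it to go
    ACTIVE, then always clean it up.  Returns a small health report that a
    monitoring system such as Zabbix can ingest."""
    started = time.time()
    server = compute.create_server(name="healthcheck", image=image, flavor=flavor)
    try:
        while time.time() - started < timeout:
            status = compute.get_server_status(server)
            if status == "ACTIVE":
                return {"healthy": True,
                        "seconds": round(time.time() - started, 1)}
            if status == "ERROR":
                return {"healthy": False, "reason": "instance went to ERROR"}
            time.sleep(interval)
        return {"healthy": False, "reason": "timed out waiting for ACTIVE"}
    finally:
        # Never leave healthcheck instances behind, even on failure paths.
        compute.delete_server(server)
```

Reporting the time-to-ACTIVE alongside pass/fail also gives you a latency trend to alert on, not just a binary up/down.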
One thing we had to do was this: if an instance provisioning fails, it's generally important to identify which part of the workflow failed. You have a lot of compound actions; for example, to test volumes, to test Cinder, we have to provision an instance, provision a volume, and attach it. But if Nova is having issues and you can't provision the instance, we don't want to report that as a Cinder failure. So we put a lot of effort into making sure we returned the correct failing component in these compound monitoring tasks.

We created automated recovery for issues we discovered in the dev/QA environment, things like automating compute node evacuation on a failure. It's essentially one API call to evacuate a compute node when it fails, but something has to make that determination and something has to make that API call; it's not going to be OpenStack. So we had to put that into Zabbix. And we automated some of our failed-service recovery workflows: every now and then an agent or a service in OpenStack would either lose communication with the message queue or get into a bad state and need to be restarted. We automated those restarts and made OpenStack a self-healing environment.

We also increased visibility into the environment and tried to make it as public as possible with Zabbix and Grafana, automated common operational tasks with Rundeck, and deployed all of this infrastructure.
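The failure attribution for those compound checks can be sketched as a small helper. Again, this is an illustrative sketch rather than our actual code, and the component and method names are hypothetical:

```python
def run_compound_check(steps):
    """Run an ordered list of (component, action) steps and attribute any
    failure to the component whose step actually raised.  If Nova can't
    boot the instance, the check reports a Nova failure, not a Cinder
    failure, even though the volume steps never got a chance to run."""
    for component, action in steps:
        try:
            action()
        except Exception as exc:
            return {"ok": False, "failed_component": component, "error": str(exc)}
    return {"ok": True, "failed_component": None}

# Hypothetical step list for a volume-attach check; each step is tagged
# with the OpenStack component it exercises:
# steps = [
#     ("nova",   lambda: compute.boot_instance(...)),
#     ("cinder", lambda: volumes.create_volume(...)),
#     ("cinder", lambda: compute.attach_volume(...)),
# ]
```

Tagging each step with its component up front is what lets the monitoring output say "Cinder is unhealthy" rather than just "the volume check failed."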
These weren't pre-existing systems: we deployed Zabbix, Rundeck, and OpenStack, and did all of this within the six-week time frame.

So, tracking success. This one is really difficult. It's critical to getting ongoing commitment, but it's really hard to track how successful your OpenStack deployment was against plan, and whether OpenStack was the right solution for the company; it's really hard to get a metrics-based analysis behind that. What we wound up doing was counting some KPIs: how many instances are running, how many resources are allocated to the environment, the number of teams leveraging OpenStack. One interesting thing we found was that the number of instances created and deleted was a very useful metric, because it wound up being an indicator of whether teams were actually using the environment correctly. Were they provisioning elastic resources, were they provisioning ephemeral systems, or did we have a lot of long-standing VMs where they were just using it for self-service provisioning?

And I think that's pretty much it. Does anyone have any questions?

I'm sorry, I can't hear. How big is our team? Our team is between three and four people; I'd say three people really proficient on the OpenStack side.

For storage we use NetApp. We found the NetApp integration was actually pretty high quality; it worked very well out of the box. We have a lot of NetApp, and we just stuck with it.
So that's our Cinder provider. We also went the route of using a shared mount for instances, and we do support live migration; that's backed by NetApp as well. That was one of those things we didn't really want to do, but we did it to get faster adoption. We put it forward and said live migration is best-effort rather than a guaranteed solution; we do it to minimize impact on the environment, but it's not guaranteed.

Thank you very much.