Hello, everybody, can you hear me? All right. Good. Good morning, how's everybody doing? That's about the level of energy I expected for 9 a.m., so that's all right. I would say you're a lovely audience, but I can't really see you very well because of the lights. Actually, I should mention, you probably saw at the door we're giving out raffle tickets, so we'll be giving away an Amazon Echo at the end of the presentation; stick around for that. Today we're here to talk OpenStack, so I'll let Karthik take that away.

Thanks, Jeremy. Hi, folks, good morning, and welcome to the session on turnkey OpenStack. We know you have a lot of other sessions to attend, so thanks for taking the time to come here. As Jeremy mentioned, we will be giving away an Amazon Echo at the end of the session, so please stick around for that. We've been working on a project called Project Caspian for the last year and a half, and we'd like to show you what we have done so far. It's about making OpenStack turnkey, not just from a setup and deployment standpoint, but from ongoing operations as well.

So this is the agenda for the session today. I'll give you a few words about what problem we are trying to solve and why we thought it was important for us to solve it. I'll show you a few demos of the project that we have been working on; there's a big unveil at EMC World next week, so you'll see and hear more from us then. We'll also talk about the trade-offs as we went through this effort, and then Jeremy will take over and talk about design and challenges. Sounds like a plan.

Okay, so why did we choose to go down this path? We realized that many organizations do not have the time or the appetite to put all of these things together. We have heard repeatedly from many organizations that they're very interested in OpenStack, but it's just incredibly complex. Design takes weeks if not months, and even at the end of the day you do not have a very stable OpenStack that's scalable, that you can upgrade, and so on and so forth. So I think many vendors are trying to solve this problem, as are we. An oft-quoted statistic is that organizations spend almost 70% of their budgets just trying to keep the lights on. So it is important for us to solve not just bringing up the OpenStack environment in a very simple and elegant manner, but also making sure that you are able to operate it with the same ease of use. In short, many organizations are really looking for an easy button.

So when we started out this project we had two goals. The first one was: can we get to an OpenStack cloud in less than a day? What that means is, could you get push-button deployment? Could we make your maintenance operations very simple and easy, like changing the personalities of servers or replacing parts? Can you have your users start using the environment almost immediately? And can you do all of this in a very easy manner? Simplicity is always at the forefront of everything that we try to do with this project. Internally, we call this the iPhone experience, and that guides us as a design principle in terms of how we want to build this. The hardware comes in, it's wheeled in, and it's plugged into the network; how quickly can we get the OpenStack environment up and running? So that's goal number one. Goal number two is about the whole life of running this entire cloud.
It's going to be running for three, four, five years, or hopefully longer. Standing up the cloud itself may not be that big a deal from a time perspective, but like I said, we've seen organizations repeatedly struggle with even getting to that stage. So that's goal number one. Goal number two is: while that's all great, can you make it enterprise-ready? Is it secure? Is it highly available? Many organizations want to start small and scale out; they don't have big budgets to go after setting up multiple racks, so they really want to start small and scale out. Can you build in multi-tenancy out of the box, monitoring across the entire stack, and single-pane-of-glass management? And at the end of the day, upgrades and patches; patch management is very critical. You have tons of new projects coming in at a pace that many IT organizations are not traditionally used to. How do you take advantage of that? How do you build all of that in from a design perspective and make it very simple for the IT admin?

So let's see what that looks like. What I'm going to do is show you a series of demos. We're going to show you how the OpenStack environment is deployed, how it is monitored, and how it's elastically scaled out and brought back in. When the system is wheeled in, what a customer sees is a system with no personality. All we have is a few control plane nodes, and everything else is empty. You see about 17 of these servers, but nothing's really set up. So an IT admin would then go in and say, I'm going to stand up my OpenStack cloud, and what he's going to do is just click a few buttons and say deploy OpenStack. It's really as simple as that. When that happens, in the background we configure everything: all the OpenStack projects, Nova, Neutron, and also the ScaleIO software technology that we use for storage. And all of this happens within a few minutes. You see that we have Kibana integrated for log management. You'd also see, I don't know if it's clear on the slide, but the four-node system in real time takes about four minutes. And what happens after that is:
The admin, sorry, just one second, can then go ahead and start adding users to this cloud; we have support for Active Directory and LDAP. Sorry, I'm having some technical glitches. Okay, so the admin can then go in and start creating tenants within the organization, import users from Active Directory and LDAP, and at the end of the day the admin is able to provision the cloud for end users to take advantage of. And then the end users can start using Horizon and start deploying workloads and so on and so forth. So that's step number one.

Now, once that is done, in steady state the admin can go in and monitor the entire cloud. We have monitoring built in top to bottom, right from your OpenStack environment all the way down to the hardware. You can take a look at the infrastructure at various levels: the storage, the network, the compute nodes. You can also take a look at the OpenStack environment from an instances standpoint; what you see as cloud compute is really the compute environment powered by OpenStack. So that's what the admin can do. We also have the ability to monitor what the environment looks like for every single organization, so you see an account tab there, and the video will show that in a few seconds: you can actually drill down to a particular organization and see what their environment looks like. It's very powerful; the admin can drill down to the depths and figure out what's going on.

Once that is done, let's assume that the operator wants to scale out the cloud. All he has to do is the same thing: just select a few nodes and then expand the cloud. You can do this non-disruptively. There are no scripts or configuration things to do; all you have to do is go back to the environment, select a few nodes, and voila, you have an expanded system. So as you can see, from four nodes we go up to six nodes at the end of this process. And the same with shrinking the cloud: if you want to get it down to a few nodes, all you have to do is live migrate those VMs, and because we have ScaleIO storage, which is powering this entire environment, we can live migrate the VMs over, take the nodes out, and elastically bring it back down. Do you like that? Okay.

As part of this effort, and Jeremy will talk a lot more about the design and show you under the covers, a lot of conscious decisions went into how we built this entire thing. We focused on simplicity and robustness, and if you want that, you cannot also have full flexibility; it doesn't work that way. So we chose ease of management and installability over customization: you get what comes with the system. Again, going back to the whole iPhone experience, it is exactly like that. You get a few choices in terms of the hardware you select, but you do not get a lot of different components that you can put in. That means we have a curated set of OpenStack projects that we support. We have a bar that we set fairly high: it has to be stable, it has to be well adopted by the community, and those are the ones that go in as we build this thing out. And of course a predefined network topology. A lot of what you saw is enabled by very conscious design choices, and I'll just highlight two of them, and Jeremy will take over and explain more. The first one is the use of our software-defined storage, ScaleIO.
That enables you to start small; we run it in a hyperconverged configuration, so you can start small and scale out. But probably the most important one is the use of Docker containers for running everything: our control plane is all containerized, our entire OpenStack environment is containerized. And Jeremy will talk about that and more.

So, as Karthik mentioned, we're running everything in Docker containers, and it's not just the OpenStack services, but a lot of the custom services we had to build to make this an enterprise-ready solution. What we were really shooting for there was eliminating the conflicts and dependencies between services. So we have huge flexibility around upgrades; they're a lot smoother because we're using the containers. We have base OS upgrades separated from service upgrades, and we can upgrade individual containers independently, and we'll talk more on that. But I think using Docker in general has also helped us iterate more quickly over the project and complete it in a faster time.

Notably, we're using Ansible for lightweight container management. The choice there is kind of the same reason many people might choose Ansible: no extra imposed architecture on the platform, no agents, everything is simply over SSH, and it's easier for our team to set up and script. We'll talk more about these as we go along.

For ScaleIO, we'll talk a little more about the components in detail later, but ScaleIO in general is just software-defined storage: you can pool together and leverage the disks in your infrastructure, and you're basically creating a scalable, elastic, resilient virtual SAN at a fraction of the cost of a traditional SAN. So that's the reason we're using it, plus it's redundant, reliable, fault-tolerant storage, especially the way we have it set up with the containers; we're getting to that in a bit. And as Karthik mentioned, in order to create a tested, tried-and-true, reliable system, one of the sacrifices was a little bit of the flexibility on the hardware, so it's a predefined hardware configuration.
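Going back to the independent-upgrade point for a second, here is a rough sketch of what upgrading a single service container can look like with the Docker SDK for Python. The registry, image, and container names here are placeholders, not Caspian's actual ones:

```python
import docker

client = docker.from_env()

# Upgrade one service without touching the base OS or the other
# containers: pull the new image, then recreate just that container.
client.images.pull("registry.local/caspian/glance", tag="2.0")

old = client.containers.get("glance-1")
old.stop()
old.remove()

client.containers.run(
    "registry.local/caspian/glance:2.0",
    name="glance-1",
    detach=True,
    # Docker restarts the container automatically if it ever crashes.
    restart_policy={"Name": "always"},
)
```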
So it's predefined hardware configuration so Just kind of how I want to finish this last half of the presentation We're just going to talk about containerizing open stack Talk about some of the challenges we hit there and talk about you know scale I owe the software to find storage and then at the end will mention kind of for miscellaneous challenges that we had to address along the way because of design So containerizing open stack This slide just kind of highlights basically what what we mean there what we're doing so We basically have the base Linux OS and we have you know any number of images and this this isn't representative all the images that are possible But you know, we have load load balancer image rabid MQ, you know UI log, you know custom logging containers and things so there's a lot of stuff here that's going on Notably here for the open stack images we have We do those in kind of a unique way where we actually pass in a parameter at the time that the container will start up So if you want like a glance container from your open stack image, we say, you know passing a roll to a boot Script that basically will create a glance container with those services running in it now Importantly to note here is because many of our containers are actually running multiple services We're actually using supervisor D to kind of monitor and maintain those which is actually nice because It's allowing a little bit of self-healing the way we have it set up because we actually have it configured to restart services if they Went down for any reason and by the same note. We have Docker Set up so that we'll restart any Container that if it happened to crash for any reason So as Karthik mentioned we we kind of set the bar high on what projects we wanted in we wanted in, you know We're initially the the the most reliable projects here and One of the things that Docker also is going to allow us to do in the future is more easily pull in those other services that We want to add, you know as time goes on and also using the containers allows us to Support enterprise ready features such as ability to transfer nodes You know if you have a note here and you wanted to transfer all the services away to another node That's much easier to do the way we have things set up especially with Ansible orchestrating that But just to kind of walk through some of this so you know we have Rabbid MQ clustered mirror cues database is my sequel glara Their my sequel containers are all active masters and as Karthik also mentioned we're relying on scale IO for you know back in for the Glants of Cinder and Nova and we'll talk about some of that in a bit as well and Neutron Open V switch VXLan I should mention at this point though that when when we started out this project initially You know we were focused on the kilo release at the time But we're actively implementing working to implement metaka at current and I mentioned that because some of the things That will hit on some of the challenges you'll note us a little bit more geared towards kilo so This slide is just we'll walk through it a bit But kind of laying out the landscape of the architecture in terms of nodes and then what containers look like and where they are But once again, this is not representative of all the containers You would see because there's things here like UI logging We're using elastic search and things like that so that more containers there But you know to fit on the slide that that I think this gives you a pretty good sample of what we're doing And 
As Karthik mentioned, we set the bar high on what projects we wanted in; initially it's the most reliable projects here, and one of the things Docker is going to allow us to do in the future is more easily pull in the other services we want to add as time goes on. Using the containers also allows us to support enterprise-ready features such as the ability to transfer nodes: if you have a node here and you wanted to transfer all the services away to another node, that's much easier to do the way we have things set up, especially with Ansible orchestrating it. Just to walk through some of this: we have RabbitMQ clustered with mirrored queues; the database is MySQL Galera, and the MySQL containers are all active masters; as Karthik also mentioned, we're relying on ScaleIO as the backend for Glance, Cinder, and Nova, and we'll talk about some of that in a bit as well; and Neutron with Open vSwitch and VXLAN. I should mention at this point that when we started this project we were initially focused on the Kilo release, but we're actively working to implement Mitaka at the moment. I mention that because some of the challenges we'll hit on are geared a little more towards Kilo.

This next slide, we'll walk through it a bit, but it lays out the landscape of the architecture in terms of nodes, what the containers look like, and where they are. Once again, this is not representative of all the containers you would see, because there are things here like UI and logging (we're using Elasticsearch and things like that, so there are more containers), but to fit on the slide, I think this gives you a pretty good sample of what we're doing. Of note here, you'll see that containers are pretty much in triplicate; that's for fault tolerance and redundancy's sake, and we have everything behind load balancers, making sure there are no single points of failure with services. Now, what you see here in green are the platform nodes. You can think of it as seeding an environment with three platform nodes, like a control plane or control nodes: it's basically where things like the Nova controller and Neutron controller are going to be running, and from there we allow you to spin up any number of compute nodes. I should also point out on this slide that there are these ScaleIO components we'll get into in a bit, like the ScaleIO MDM and gateway; those are also spread across multiple platform nodes, again for fault tolerance and redundancy's sake. Also worth noting is that the ScaleIO SDS, which we'll talk about briefly in a moment, is on the compute nodes, because typically the disks are on the compute node side, and so we're pooling together the disks from the compute nodes; the SDS, as you'll see, is what manages the back-end IO operations. But first, before I switch over to the storage and ScaleIO, I want to talk about some of the challenges we faced along the way going down this path of containerizing OpenStack.

Starting out, we knew about these four things that would be challenges going in, so it wasn't like any of these were surprises. But to walk through them: first, we needed a way to manage configuration and service metadata. Think back to that Docker diagram I showed, where you pass in the role to say this image becomes a Glance container or a Nova container. One of the things that happens in that process is that there is a container that's basically a metadata service; it has a REST API that we talk to when the container comes up, and the container pulls its configuration metadata from it. So for instance, if you were going to run a Nova controller container, it's going to talk to that metadata service and get configuration items such as the CPU allocation ratio, memory allocation ratio, and ScaleIO plug-in configuration, and it's going to set up the services.
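As a rough illustration of that configuration pull, here is what the fetch at container start might look like; the endpoint, port, and key names are assumptions on my part, not the actual Caspian API:

```python
import requests

# Hypothetical metadata-service endpoint; the real URL and schema
# were not shown in the talk.
METADATA_URL = "http://metadata-service:8080/v1/config"

def fetch_config(role):
    # A nova-controller container would pull items like
    # cpu_allocation_ratio, ram_allocation_ratio, and the
    # ScaleIO plug-in settings before starting its services.
    resp = requests.get(METADATA_URL, params={"role": role})
    resp.raise_for_status()
    return resp.json()

config = fetch_config("nova-controller")
print(config.get("cpu_allocation_ratio"))
```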
Second, dynamic node inventory. One of the things with Docker containers is that you have to control, or at least know, where they are, so one of the things we have to do is track where the containers are on any given node. Another thing we have to know is, for the nodes themselves, what they look like: what kind of disks do they have, what size are they, which ones are taken by an OS, and which ones can actually be used by ScaleIO. That information is stored in another service, which can also be talked to via a REST API, to keep that dynamic node inventory under control.

That leads me to the third challenge: with the containerization, we needed a way to pass custom variables and this dynamic inventory into Ansible. One of the ways we did that was to have a container that is a REST API wrapper around Ansible. That also helped us separate out the concept of the platform or management plane versus the compute plane, and it allows us to programmatically execute the playbooks that we need: if we need to upgrade the services, we can launch an upgrade playbook; if we need to transfer services or add more nodes, we can run the appropriate playbooks.

And the fourth point: typically with any of the containers, there's probably something you want to persist to disk, or have survive if you brought a container down and back up. So one of the challenges is that you want to do that, but you also want to support those enterprise-ready features, such as the ability to transfer all my services from node A to node B; it's a little bit of a balancing act. One of the ways we meet that challenge is by using ScaleIO, and I'll give you two quick examples. For Glance in particular: as I mentioned, we have the services in triplicate here, but only one Glance container at a time is going to be running the Glance services, and we have a clustering mechanism to make that happen. The important point is that there's a ScaleIO volume mounted to the particular container that's actually running those Glance services. Now, if that container crashes and one of the others intelligently takes over, or we want to transfer the active one somewhere else, the volume just needs to get mounted to the other node. It's the same with Nova live migration: we're facilitating Nova live migration by using ScaleIO for the same reason. If you want to migrate a VM from node A to node B, the volume can simply be mounted to the other node.
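Going back to that third challenge for a moment, here is a minimal sketch of what a REST wrapper around Ansible can look like; the routes, inventory path, and playbook names are hypothetical, not Caspian's actual ones:

```python
from flask import Flask, jsonify, request
import subprocess

app = Flask(__name__)

@app.route("/v1/playbooks/<name>", methods=["POST"])
def run_playbook(name):
    # The dynamic inventory and any custom variables come in with the
    # request, so playbooks like upgrade.yml or transfer_services.yml
    # can be executed programmatically.
    extra_vars = request.get_json() or {}
    cmd = ["ansible-playbook", "-i", "/inventory/dynamic.py",
           "/playbooks/%s.yml" % name]
    for key, value in extra_vars.items():
        cmd += ["-e", "%s=%s" % (key, value)]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return jsonify({"rc": result.returncode, "log": result.stdout})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8081)
```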
So, I've talked about ScaleIO components here, and you're probably still wondering what some of this stuff is that you've seen on these slides, so let me briefly cover it in case you're not aware. You probably saw the ScaleIO SDC on the slide: you can think of it as a client, simple enough. The ScaleIO Data Server (SDS) you can think of as a server, but it's really just performing the back-end IO operations, and the SDC talks to the SDS. The Metadata Manager (MDM) is kind of like the brain, because it's managing the configuration, the volume mapping information, error handling, things like that. And the gateway, simply think of it as a REST API. Having said all that, this slide lays out what that actually looks like in terms of the Nova compute and Cinder services: we have our ScaleIO plugins talking to the REST gateway, the SDS, like I said, is performing the back-end IO, and the SDC talks to the SDSes for the volume mappings. The real challenge here was just doing the legwork of getting ScaleIO implemented, getting the drivers created, and things like that.

Now for some more of the real challenges that I want to talk about before we reach the end: security, upgrades, monitoring, and Keystone v3. We'll touch on a few points in each of these.

From a security standpoint, you want to only allow access from your VMs to certain things that are running on those platform nodes I talked about, especially custom services that VMs have no reason to talk to. One of the challenges there is that you have to understand how Docker networking plays a part in this. If you're not careful, you could find yourself in a situation where you have a container in Docker (bridge) networking mode running on the same node as a Neutron network container, and you realize, oh, the VMs can still access this even though I put in my iptables rules to disallow access to a port that was open on that container. The reason for that is just how Docker's bridge networking mode works and how its iptables rules are configured. The same thing doesn't happen if you have a container in host networking mode on the same node as the Neutron network container. So it's just something to keep in mind: you have to understand those inner workings a little before you start trying to use iptables to disallow access.
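To make that networking-mode distinction concrete, here is a small sketch using the Docker SDK for Python; the image name and port are placeholders:

```python
import docker

client = docker.from_env()

# Default (bridge) networking: Docker publishes the port via its own
# NAT rules, and traffic to the container traverses the FORWARD path
# rather than the host's INPUT chain -- so a plain
# "iptables -A INPUT ... -j DROP" rule on the node does not actually
# block a VM from reaching this published port.
client.containers.run("custom-service:latest", detach=True,
                      ports={"9090/tcp": 9090})

# Host networking: the container shares the node's network stack, the
# port is an ordinary host port, and normal INPUT rules apply to it.
client.containers.run("custom-service:latest", detach=True,
                      network_mode="host")
```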
Another thing is authorization issues in OpenStack. Think back to what I mentioned earlier: when we initially started the project, it was focused on the Kilo release. Like I said, we're actively working with Mitaka now, but at the time, in Kilo, and I'll call out Horizon here a bit, there was this hard-coded admin logic that Horizon used to do things like know when to display the admin page for a user that logs in. There wasn't really any concept of a project admin, a domain admin, and things like that. So one of the things we had to do was go in and do the work of building those concepts into the policy files, so we can have more clearly defined roles for access.

So, upgrades. What are the things you want to do when you're trying to create an upgrade service? Well, some of them are here: you want to minimize service disruptions, obviously a big one; another is intelligently upgrading, so you don't end up in a situation where you trip over your own feet while you're upgrading, and we'll get into that in a bit; and fault-tolerant upgrades. For the first one, minimizing service disruptions, Ansible is helping us there, because we're using things like the ability to serialize the actions of a playbook across certain containers, so we can do rolling upgrades of services and you don't really notice anything. Upgrading the containers intelligently is more about knowing, across so many services and containers, what the interdependencies are: Nova needs MySQL, you have Keystone, you have these custom metadata services and all this other stuff that has to be upgraded, right? You also have the upgrade services themselves, and their containers, that may need to be upgraded, and then Ansible itself. So it's about knowing all the inner workings so you can upgrade efficiently without doing something weird like upgrading the upgrade service in the middle of your upgrade.

For fault-tolerant upgrades, a couple of things we're doing there. Like I mentioned, our upgrade container is once again running in triplicate, but only one upgrade container is actually doing the upgrade at a time; the other two are in a mode watching to see if that one is still up. The way that happens is that the container doing the upgrade is writing to an etcd back-end key-value store; if the others see that it's not updating, they will take over and finish the upgrade. Simple clustering kind of stuff. The other thing we're trying to do there is, let's say worst case, all of those upgrade containers went away: having a fault-tolerant workflow engine so that you could recover from exactly the moment where the upgrade stopped. Very important.
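Here is a minimal sketch of that heartbeat pattern, assuming an etcd v2-style HTTP API; the key name and timings are made up:

```python
import time
import requests

# Hypothetical key name; assumes the etcd v2 HTTP API.
ETCD_KEY = "http://etcd:2379/v2/keys/caspian/upgrade-leader"
ME = "upgrade-container-1"
TTL = 30

def try_acquire():
    # prevExist=false is an atomic create-if-absent, so only one of the
    # three upgrade containers becomes the active one.
    r = requests.put(ETCD_KEY, params={"prevExist": "false"},
                     data={"value": ME, "ttl": TTL})
    return r.status_code == 201

def heartbeat():
    # The active container keeps refreshing the TTL while it works; if
    # it crashes, the key expires and a standby takes over the upgrade.
    r = requests.put(ETCD_KEY,
                     params={"prevExist": "true", "prevValue": ME},
                     data={"value": ME, "ttl": TTL})
    return r.ok

is_leader = False
while True:
    is_leader = heartbeat() if is_leader else try_acquire()
    time.sleep(TTL // 3)
```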
Monitoring. Monitoring is definitely a challenge, because we're saying that we want to own everything from the hardware up to the OpenStack services themselves: network, PDUs, everything. So you're thinking about everything that can go wrong with a system and then saying: this is a failure, or this could indicate an imminent failure, or this is maybe more of a performance problem. That kind of thinking gets you into severity: what is absolutely critical, what is somewhat critical, what can wait and be fixed a little later. And then you also have alerts triggering off of that, right? Having all this integrated into a single pane of glass, like Karthik mentioned, means having a dashboard that's meaningful and easy to understand, so you know exactly what's wrong without having to guess. Then you integrate that with a back-end support chain, deciding what kind of failure or alert we would actually phone home about, and integrating that with a phone-home mechanism. And once you start down that path, you think, well, now I want to say, if something happens, what is an initial diagnosis, or what is an action that could be taken immediately when an alert is triggered, so support can jump on it right away.

Another thing I'll hint at here is trying to just log everything, so there's this wealth of data captured all in one place, a single pane of glass. When I say that: I mentioned we're using Elasticsearch, Logstash, and Kibana, and Kibana is that single-pane-of-glass UI we were demonstrating. So that's a challenge: getting all the logs in and keeping them stored, but also being able to filter on things. You want to go in there and say, OK, let me see all the error-severity or info-severity messages; I want to filter by host, by container type, by container type on a specific host, and so on. We want to slice and dice the logs, because even as a developer, just having one place to go, instead of having to log into a hundred nodes to find one log you thought was there, is very powerful.

So finally, Keystone v3, and this one is again geared a little more towards Kilo: services not understanding domain scope. Again, I'll call out Horizon as an example. Horizon basically knew of a domain as a kind of namespace, you could say, where Jeremy at Coke and Jeremy at Pepsi could be distinguished. It could also talk v3 to Keystone at the time, and it generally understood v3 tokens, but it didn't understand domain-scoped tokens. So just understanding the landscape at the time was a challenge. Along the same lines, not everything was v3 all the way; some things, like the Nova-to-Neutron internal communication, were still v2. Understanding how far things were integrated with v3 at the time was definitely a challenge.
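For reference, a domain-scoped token request against the standard Keystone v3 API looks roughly like this; the endpoint, user, and domain names are placeholders:

```python
import requests

# The "scope" block asks for a domain-scoped token rather than a
# project-scoped one -- exactly the kind of token Horizon did not
# understand at the time.
body = {
    "auth": {
        "identity": {
            "methods": ["password"],
            "password": {
                "user": {
                    "name": "jeremy",
                    "domain": {"name": "coke"},
                    "password": "secret",
                }
            },
        },
        "scope": {"domain": {"name": "coke"}},
    }
}
resp = requests.post("http://keystone:5000/v3/auth/tokens", json=body)
token = resp.headers["X-Subject-Token"]
```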
For the sake of time, we unfortunately can't talk about everything in the project. If there's something you specifically wanted to know or drill down into, we'll obviously have a chance for you to come to the mic and ask a few questions in a bit, but more than likely, to get the level of depth you probably want, it would be good to stop by the EMC booth; that's a good chance to talk with a couple of members of the Project Caspian team, who should be stationed there at all times. It's a really good chance to get questions answered, and even if Project Caspian didn't sound cool to you (I personally think it's cool, but I worked on it, so I might be a little biased), it's worth stopping by the booth just to pick people's brains and understand how working with OpenStack, or putting things in Docker containers, actually worked out. Of course, you can also stick around for the additional sessions today and pick and choose what you want to do. Importantly, I would say stay tuned for EMC World next week, because there will definitely be big announcements about Project Caspian.

Actually, yeah, we should do questions first, I guess, before the raffle. Come up to the mic; that would help the recording. I don't know if that's a good sign, or... that's fine, no one has any questions. Like I said, if there's a burning question on your mind, going to the EMC booth is the best chance to talk with someone in depth, more than you could probably get in a couple of minutes up here anyway. I guess with that, you want to... yeah, why don't you? Okay, so we're raffling off the Amazon Echo, so I'll just call out the number; if it's yours, come up here. 970316. 970316. Oh yeah, there you go. Congratulations. Oh yeah, I guess I should check it. All right. Yeah, okay. Thank you so much.