 Hello, everyone. Welcome to today's session on Rackspace Private Cloud, powered by Red Hat, backed with four-nines availability. My name is Jeff Ekstrom. I'm a Cloud Evangelist with the Certified Cloud and Service Provider program at Red Hat. To my right is Nick Jarosimatos. He is also a Cloud Evangelist in the same program. And behind me is John Fulton, an OpenStack engineer with Red Hat, and Kent Wolf, also known as Superman, an OpenStack engineer with Rackspace. Today, I'm going to be talking about our definition of enterprise grade as it relates to code, to best practices for implementation, and to support. Nick is going to talk about the reference architecture for the Rackspace Private Cloud, powered by Red Hat, or RPCR for short. John is going to talk about some of the coding changes that Red Hat had to make on their side in order to make this product a reality. And then Kent is going to talk about some of the implementation decisions that Rackspace made, as well as some of the next things that are coming up in the product partnership between the two companies. So first, let's talk about what makes our enterprise-grade OpenStack distribution. The first thing is a longer product lifecycle. There's a rapid pace of innovation within OpenStack, and that's great for agility. It's great for trying new things. But when you get to the enterprise, it's not always the greatest thing around. One of the things that the enterprise needs is stability. So we try to slow the pace of innovation. We try to productize the upstream code, similar to what we did with Linux. But we do it in such a way that the enterprise doesn't sacrifice stability in the name of agility. So we're trying to find a happy medium there, to make this much more consumable for the enterprise. At the same time, there's a deep integration with Red Hat Enterprise Linux. That co-engineering between OpenStack processes and the Linux kernel is critical to OpenStack functionality. 
And Red Hat has been one of the leading contributors to both the OpenStack and Linux communities for years, as has Rackspace. And that feeds into our expert support. In addition, we have a robust partner ecosystem: hardware, software, ISVs, systems integrators, just across the board. And it really allows Red Hat to produce a very integrated solution stack that's certified, hardened, and proven to work. Thanks. And then what are the best practices for implementing enterprise-grade code? They really start with high availability of OpenStack components. So the underlying components, such as the databases, message queues, and APIs that underpin the systems, as well as storage. If any of you were here earlier for the Ceph talk, you'll have seen the importance of scalability, performance, IOPS, those sorts of things within the type of storage that supports your guests within OpenStack. And then finally, the applications. The way that they're architected and integrated within OpenStack has to keep in mind the nuances of things like stateless applications and elasticity and things of that nature. At the same time, the environment should be scalable. The way that the hardware is procured, packaged, and laid out, as well as the actual OpenStack services and the services that we use to build and deploy the zones, all have to be able to scale quickly, efficiently, and with good utilization. You need automated management. One of the things that you'll see as OpenStack zones grow to scale is that the complexity really goes up exponentially. And an automated solution is very necessary in order for you to be able to manage this environment and do it right, and do it the way that the enterprise expects, with stability and reliability. Zero-day hot patching and hotfixes, of course, really come down to this: we can't have customers sitting around waiting for a solution when there's a problem; we need to be able to deliver on it as soon as possible. 
And that, again, falls into software assurance. And being upgradable: one of the things that I've seen as a recurring theme around here is rolling upgrades, live migrations, those sorts of things. Being able to service the zones, being able to service the environment, without impacting the applications that are running and without impacting availability. Next slide. And then finally, what is Fanatical Support? So this is Rackspace's industry-leading four-nines OpenStack API uptime. There's a dedicated team of engineers, proactive health monitoring, a 15-minute response time, and a dedicated account manager. And one of the things that you can see when you look at how Rackspace wraps Fanatical Support around Red Hat's OpenStack distribution is that a lot of those best practices come into play. You see an alignment between the four-nines API uptime and the best practice of high availability. You see the need for expert support coming out in their dedicated support team. You see the need for automated management in the way that Rackspace provides health monitoring and operations of your OpenStack cloud. And all of these elements come together into Rackspace Private Cloud powered by Red Hat, which is really, as the slide says, a prescriptive and managed private cloud deployment, where we take enterprise-grade code, we wrap enterprise-grade support around it, and we basically present OpenStack in a way that is consumable as a service for the end user, lowering a good deal of the learning curve and really making the end user experience simpler, would be a good way to put it. And so next, to talk about the reference architecture that was built to underpin the service offering, is Nick. Thanks, Jeff. So RPC powered by Red Hat, the reference architecture. It actually consists of two physical firewalls, which are Cisco ASAs. 
We use two load balancers, which are F5s, for load balancing the APIs themselves. Network, control plane, storage traffic, and OpenStack management are all segmented to provide quality of service. We use four dual-processor dedicated Haswell servers for the control plane. One of those is actually the installer node, so it runs director and the undercloud components. And the other three are actually controller nodes, which are clustered. We have two dual-processor Haswell servers, which are dedicated compute nodes, and five dual-processor Haswell servers, which are dedicated Ceph storage nodes. Now, this is the minimum size. This can scale dramatically larger, but this is the minimum reference architecture that you could actually acquire. For the Red Hat OpenStack Platform 7 enterprise subscriptions, we're a little bit flexible compared to other managed services, in the sense that you have the ability to bring your own licensing, or, if you choose, you can purchase your licensing directly through Rackspace. We do support block and object storage, including Swift. Rackspace supports optional integration with third-party storage arrays. I know a lot of managed services don't really want to integrate with third-party storage arrays, and we're actually open to that. So if there's an interest in doing that, we'd be more than glad to actually have that discussion. We use Red Hat Satellite for providing OpenStack package management for seamless updates to end users. And then our rack architecture, management, and monitoring services are deployed in a scalable and highly available manner to prevent single points of failure. So we actually will spread out between multiple racks. So if for whatever reason you lose one rack, or half a control plane, or whatever, you still maintain availability of services and things along those lines. Thanks. So Nick mentioned the undercloud. 
I'm going to talk more about that and what was changed in the undercloud to make this possible. So the undercloud that we use is OSP director. OSP director is our product for deploying and managing OpenStack. It's based on TripleO, OpenStack on OpenStack. It uses PXE to prepare the overcloud nodes. So when that happens, nodes are DHCPed onto a management network for administration by the OSP director. DHCP is fine for provisioning, and you're going to want that when you're doing PXE. But Rackspace had a requirement that if a node reboots, it should come back online, even if director is offline. So I want to clarify that there are other static IP assignments. I'm just talking about the management network. So you could, in the heat template for director, configure several static IPs. They just had a specific requirement for the management network to have static IPs. So the solution is that when the system boots to be provisioned, DHCP is used to hand out an IP address, and then the system is installed and imaged. But during the configuration later, the deployed node's ifcfg file is modified to have a static IP. Next slide. So these were the changes that were made upstream to support this. os-net-config is a Python program that, on the overcloud, modifies the deployed node for its network configuration. So it could have a variety of networks in there, and you can get pretty complicated networks based on your needs. So this program was changed so that detailed exit codes would be an option. I've linked here to the actual upstream patches that we made to support this, for reference. I'm not going to go through the code review, but those are the details, if you want to see them. os-net-config was also changed to reach the metadata server after it was assigned a static IP, the risk being that I'm changing the IP I'm using to talk to the undercloud. 
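To make the end state concrete: after configuration, the deployed node's management interface is no longer on DHCP. A minimal sketch of what such an ifcfg file looks like; the interface name and addresses here are invented for illustration, not taken from the talk:

```ini
# /etc/sysconfig/network-scripts/ifcfg-eth0 on a deployed overcloud node.
# During PXE provisioning this interface used BOOTPROTO=dhcp; after
# deployment, os-net-config rewrites it with a static management IP.
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none          # no DHCP: the node comes back even if director is offline
IPADDR=192.0.2.21       # example management-network address
NETMASK=255.255.255.0
```

In practice os-net-config generates files like this from a YAML network description, where the interface entry sets `use_dhcp: false` and lists static addresses.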
So finally, the third change was in tripleo-heat-templates, which is our project that's basically a set of heat templates to deploy OpenStack. So those are the three changes for this particular feature. Another important change for Rackspace was the external load balancer. So normally, our default option is that we have three controller nodes, and they use HAProxy. And there are virtual IPs that a user would connect to for the API services of OpenStack. And then HAProxy would load-balance that traffic, sending it to the active services. And Pacemaker controls this. That's still our default option. However, for Rackspace, it's important to decouple the load balancer from the controllers. And therefore, they use an external load balancer. So director was modified to support this. And this resulted in three changes to tripleo-heat-templates. The first of which: I mentioned HAProxy controlled by Pacemaker; when you do a deployment, that now becomes optional. So this is a patch to tripleo-heat-templates so that you can simply say in your heat template, I don't want to use this, I want to disable that. And then, if you're going to configure a load balancer to talk to OpenStack, you need to know the VIPs, which are the IP addresses that people will connect to to access the services, and of course the IP addresses of the controller nodes that it will proxy the services back to. So you need to know that beforehand. And these changes allow you to specify that in the heat templates so that you can do it. So six changes in total occurred to support this, based on their needs. Thanks. Thanks, John. As you can see, the Rackspace private cloud reference architecture powered by Red Hat leverages a dedicated pair of F5 load balancers in an HA configuration. And that's a key component of our four-nines API uptime. And we've done this for a few reasons. 
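As a rough sketch of what those changes enable on the deployment side, an environment file passed to `openstack overcloud deploy` can disable the managed HAProxy/Pacemaker balancer and pin the VIPs and controller addresses the external balancer needs to know in advance. This is illustrative only; the parameter names follow the OSP director external-load-balancer documentation of that era, and the addresses are made up:

```yaml
# external-lb.yaml: illustrative environment file for tripleo-heat-templates
parameter_defaults:
  # Don't deploy the built-in HAProxy; an external F5 handles balancing
  EnableLoadBalancer: false
  # Pre-assigned virtual IPs that the external load balancer will serve
  PublicVirtualFixedIPs: [{'ip_address': '10.1.1.10'}]
  InternalApiVirtualFixedIPs: [{'ip_address': '172.16.20.250'}]
  # Pre-assigned controller addresses the balancer proxies requests back to
  ControllerIPs:
    internal_api: ['172.16.20.11', '172.16.20.12', '172.16.20.13']
```

The key point is that both sides of the proxying relationship, the VIPs and the controller IPs, are fixed up front so the F5 can be configured before the overcloud exists.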
The first reason is we wanted to decouple the virtual IPs from the control plane. This essentially allows us to mark a controller node pool member down in order to proactively patch your environment with minimal to no downtime. We also wanted to offload SSL traffic onto the F5 load balancer, which provides us with higher throughput, as we can leverage the SSL accelerator module on the F5. And lastly, we have a team of dedicated network specialists who have expertise in configuring the F5 hardware load balancers and are an escalation point for any of our network issues. So in order to back up the environment and proactively monitor the environment, we've taken advantage of the heat templates along with Puppet manifests. So what we do is we can specify per-node configurations using the system UUID, which effectively writes a UUID-keyed JSON Hiera data file with per-node specific data that we can then use to assign additional IPs and properly install our monitoring and backup software. And then Puppet applies it. Next slide. So what's next for our partnership? We're working with Red Hat to expand the way we test future versions of OpenStack. And one of the ways we've managed to do this is we're working with Red Hat to configure Distributed Continuous Integration, or DCI for short. And this essentially allows us to test pre-development and pre-GA release versions of OSP in order to ensure it works in our environment with our configurations. We've also worked with Red Hat using the OSIC cluster. We've taken away some key configuration and performance tuning parameters from that. And lastly, we'd like to continue to increase integration between Red Hat tooling and the Rackspace Private Cloud powered by Red Hat offering. Does anyone have any questions? What was the primary driver for removing HAProxy and putting it on a physical load balancer? So we're able to decouple the IPs from the control plane. OK, but what did that really buy you? What did it buy us? 
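The per-node mechanism just described can be sketched roughly as follows: director supports passing per-node Hiera data keyed by the hardware system UUID (as reported by dmidecode), which Puppet then applies on the matching node. The UUID and the hieradata keys below are invented for illustration, not from the talk:

```yaml
# Illustrative heat environment fragment: per-node hieradata keyed by
# the physical system UUID, consumed by the Puppet run on that node.
parameter_defaults:
  NodeDataLookup: |
    {
      "32E87B4C-C4A7-41BE-865B-7A7B88A08F67": {
        "rax::backup::extra_ip": "192.0.2.55",
        "rax::monitoring::enabled": true
      }
    }
```

Because the key is the hardware UUID rather than a hostname or provisioning-time IP, the per-node data survives re-imaging of the same physical box.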
It buys us the ability to do rolling updates without impacting the control plane. So we can mark a controller node down. We have two more in the rotation that are serving API requests as we patch and test the patching on the disabled controller member. So I think it's also a simplification of the environment. As things are changing, HAProxy and Pacemaker and keepalived and things along those lines become deprecated. They change between OpenStack releases and things along those lines. So by removing those components and actually moving them directly into the load balancer, you're kind of guaranteeing that as you're moving through the next iterations of OpenStack as it continues to release, there are certain components that you don't have to necessarily consistently change. I also think another portion of that is the introduction of director, which ties into some of that. It helps with the simplification of that as well. I was just going to say it's an external load balancer. It could be hardware. It could be software. So we did a lot of testing of this just with HAProxy on a separate box. Could you use the mic, please? Are you doing your SSL termination at the external load balancer or at the services layer? We're doing SSL termination at the external load balancer for the public URLs. And does your security allow for that? The traffic from the load balancer to where the actual services are running is basically in the same environment. So it's only external traffic. This is probably going to be incredibly difficult for most people to see, just because of the size of it. I apologize in advance. But this is a general design from an architecture perspective that specifically shows the aggregate layers and then how the ASAs and load balancers are actually configured, as well as all the individual nodes. So the public interfaces, the service interfaces, and then what they're using for DRAC. 
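For the rolling-update flow described above, draining a controller out of rotation is a one-line operation on the balancer. On an F5, it would look something like the following tmsh sketch; the pool name, member address, and port are hypothetical, not taken from the talk:

```shell
# Drain one controller out of an API pool before patching it.
# "user-down" stops the F5 from sending it traffic; the remaining
# two members keep serving API requests, preserving the SLA.
tmsh modify ltm pool keystone_api_pool members modify \
    { 192.0.2.11:5000 { state user-down } }

# ...patch and verify the controller, then return it to rotation:
tmsh modify ltm pool keystone_api_pool members modify \
    { 192.0.2.11:5000 { state user-up } }
```

With HAProxy and Pacemaker on the controllers themselves, the equivalent drain would have to be coordinated on the very nodes being patched, which is part of what the external balancer simplifies.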
We'll make this available online so that you can actually consume it easily. But if you want to speak more about it... So I have a question here. Sorry for the interruption. You guys are running on x86 Haswell server hardware. I understand Red Hat also has other architectures they're working on. Is Rackspace or Red Hat working on any other hardware, ARM specifically, beyond Haswell, for the same kind of software deployment? So we're obviously exploring all types of different hardware, right? ARM is continuing to grow, system-on-chip, things along those lines. But there's really nothing at this point that we could actually comment on, unfortunately. Thanks. Any other questions? OK. Just one more. Sorry. So working with Red Hat, we've been able to increase performance. So for example, just testing, we deployed 188 instances simultaneously in about a minute and a half with the performance benefits. Does that answer your question? Yeah, no problem. So are you saying the F5 load balancer introduction is not for performance reasons, or is it more for simplification reasons? You say both. Can you repeat that? I'm sorry, I couldn't hear you. So I was wondering, the reason you put in the F5, is it for performance or more for the simplification aspect of it? For performance and simplification, yes. OK, on top of that: when you're growing, like you said, the example has two compute nodes. As we grow, what components stay fixed, I guess? Or do you have to grow everything, scale up even the F5 as well? Like, I think the model has only two compute nodes. I just want to make a distinction. I mean, the load balancer is for the APIs, and it keeps the APIs up with their SLA. Once the API gets a request, it will deploy a job. So if I say give me an instance, it will put it on a compute node. And then how many compute nodes you have is a separate matter. So I just wanted to decouple those. Are you asking about how many compute nodes? 
Yeah, I think your question is specifically, when you scale out and you add more compute nodes, how do the controller nodes scale and things along those lines, correct? Meaning, as the compute nodes grow, are they going to keep up? So that really depends, from an architecture perspective, on what the customer is looking for. So if they want segmented fault domains, I mean, there are a lot of variables that you have to take into account there. So that's kind of a tricky question to answer without actually being presented with specifics. So if you come in and you say, I need 100 compute nodes and 200 Ceph nodes and I want to divide it up into four different fault domains, how would that be architected? It's going to be completely different from customer to customer. In general, the controller nodes would be scaled accordingly, right? So I guess when you reach a high number of compute nodes, maybe you'd add another controller node, and then, as you said, it depends. Thank you. If I were a customer of Rackspace's, would I see the overcloud and undercloud, or would the undercloud be abstracted away? What is presented to Rackspace's customers? Yeah, I mean, what we usually do is we provide you with the credentials to leverage the overcloud. But if you'd like, we can give you access to the undercloud; it's just that if something breaks due to your actions, we're not responsible for the SLA at that point. So you mentioned you do PXE boot your compute nodes and assign IP addresses through the DHCP mechanism. But when a server boots up, you have this ifcfg file, which goes and gives the server, the host, the same management IP address. I want to understand, what is the trigger for that? How do you trigger it? Like, the host has failed; do you have some sort of orchestrator or something to PXE boot the server and assign the same IP address? So, Ironic has a documented procedure for provisioning the system. 
So, it's the basic Ironic, TripleO flow: Nova triggers Ironic, Ironic triggers IPMI. IPMI turns the server on. The server comes up. It wants to DHCP an IP address to begin. Am I missing something? No, I get that part. Okay. So when that server went down and you had to rebuild that compute node, how do you assign the same IP address to it? Oh, okay. So, that's kind of the point of how it works. The ifcfg file of the server that was deployed has a static IP address assigned already in it. So, when it comes up, it simply follows what's in its ifcfg file and uses that IP address. Are you saying if we were to replace a node, or? So, let's say your compute node gets corrupted, whatever happened; you have to rebuild that compute node and basically give it the same IP address which was assigned the first time. Yeah. So, at that point we would just provision another compute node, and the only IP that we really care about at Rackspace to define statically is the management IP. So, in OSP 8, you can actually statically set IPs for several networks, but we're just going to do it for the management network. And that's a different question. Can you elaborate more on your rolling upgrade? How does it work? Like, you upgrade your databases, upgrade the control node, how does it work? Yeah, so upgrades are done by the undercloud director node. So, essentially what we'll do is we'll mark one of the controller nodes down that isn't actively running any of the services, and we'll patch that controller node, we'll make sure everything is fine, and then we'll re-enable that controller node and then we'll move on to the next node. And how do you associate a compute node with a controller node? You have a three-controller-node configuration. Let's say you upgraded one, and you then load balanced and basically upgraded the next one. But how do your compute nodes understand to talk to the upgraded one, and at what point do you go and upgrade the compute nodes? 
So, the compute nodes talk to the virtual IPs on the F5. So, the F5 wouldn't direct traffic to the controller node that's down. So, essentially it's a rolling upgrade. Yeah. For your reference architecture, do you guys have your configuration files available anywhere, with the key-value pairs for tuning that you've arrived at through experience? We don't have that publicly available. However, there is a project that the Red Hat team open sourced called Browbeat, and that will actually provide you with some of the parameters we're using. Okay, thank you. Yeah, no problem. Browbeat, yes. Actually, the guides are back there. Joe and...