So, my name is Greg Swift, and I've been at Rackspace for about five years. I actually used to work with Joel at the US Courts before coming to Rackspace, and I'm excited to hear they're doing such good work with OpenShift now.

These are the things I'm going to talk about: a quick overview of Rackspace, what we do, what we needed, how we're getting there, and the kinds of things we learned along the way with OpenShift.

First off, Rackspace is really about managed services and providing fanatical support. It seems like we sell a lot of individual products, but at the end of the day, that's what we're trying to provide, and we do it over a huge breadth of products. Pretty much if you want it, we try to provide it. This can lead to some problems for us, because it means we're a collection of hundreds of IT departments, all highly skilled, highly intelligent people trying to make things happen as fast as possible inside their own domains. So you've got the AWS guys over here, the Azure guys over there, and they're all just running their own way. We end up following a lot of good practices, a lot of best practices. If you want an example of a best practice, come to us; we'll find one of our thousands of people who knows that practice really well. But what that means for us as an enterprise is that internal use of products gets a little more difficult. Switching from one operations team to another can feel like changing companies, and one of the groups I work with supports several hundred apps, so they're switching companies every 15 minutes sometimes when something big is going on. And compliance time can be a mad rush because of the 200 different variances that accomplish the same thing.

So what we needed was to be able to come back and say: for the internal things, the things that are not our support bread and butter or the services we provide out to our customers, how do we solve those problems? We realized that best practices needed to find their way up to a standard practice, the commonality we all need to be following. We needed to get to where all the people who don't need to be managing the entire stack have a good option for somebody else to do it for them. And then we had to accept that not everybody's going to get their problem solved. You're still going to have that 10% running off to the side, and that's not necessarily a bad thing. That's where innovation can happen; sometimes it's just the cost of doing business.

An important thing to remember for most companies is that it's really easy for a team to grab a credit card, jump on AWS, and sprint down the line. But in a year and a half, where's that product? Who's maintaining it? Who's taking care of it? Who even knows how it got deployed? Maybe that person was really smart and got scooped up by another company, and now nobody knows how to run that app. When you're working for a company, that's who it's about: making sure the product is good for them and not hurting the coworkers who are part of that company. We can go further as a company together instead of faster apart.

So, our goals. Developers are the SMEs, so let them be the SMEs. Let them know about prod, let them know how prod runs.
We were trying to get to a point where we could just say developers have access to prod, even in a compliant environment. We had to implement some significant controls to make that acceptable, but it is possible. Get operations out of that path: make it so that dev teams, or product teams at this point, don't have to worry about a standalone operations team, and let that team run other things, like OpenShift or the logging or the monitoring, off to the side. Our goal for ourselves was also to simplify fleet management; the less variance we have at that level, the better. And then maintain compliance objectives, which is a really fancy way of saying that whenever PCI comes around, we can give them that report a lot quicker with a lot fewer resources. And then actually move faster, because the trick about going further together is that once you get to a certain point in that race, you're actually going faster as well. You've got to find that point, but you will get there.

So how are we getting there? For IaaS, we decided to utilize one of our largest IaaS products. Rackspace does several, as I had on a slide earlier. The one we went with was Rackspace Private Cloud powered by VMware, one of our larger products. It was an easy win: we have a lot of internal support for it, we've got a lot of experts on it, and we've been providing that product for almost the life of the company. The only real problem then became how to stay ahead of demand, because everybody needs a place to put their stuff.

So then our first pass at, well, I thought I'd updated the top of the slide for a nice little pun, but apparently not. Our first pass at a PaaS actually started about two years ago. It was an in-house app written in Ruby called Maestro, built on top of Marathon and Mesos, and it was intended to be very Heroku-like: buildpacks, curls to the APIs, those kinds of things. It worked for the most part, but with developer churn we didn't have a team supporting it after a year, and once we did start getting more resources into it, the question remained: well, maybe going to OpenShift is a better idea.

So we went on our second pass, and we started building out an OpenShift environment. We're working on our third region right now. We started off with 1.4 and upgraded to 1.5. That was unfortunately a painful upgrade for us, primarily because of logging and some custom changes we had internally, so we haven't gone to 1.6 yet; we're about to try that out. Storage was a little bit of a hiccup for us as well. We started with GlusterFS, but Elasticsearch did not like it for the aggregated logging. I didn't see anybody else complaining about that, so I don't know if it was something we were doing, but we still occasionally run into issues with it, and we're going to just move Elasticsearch outside of Gluster.

Within three months, we'd gone to production for non-critical workloads. People had deployed a couple of small production apps, and we were pretty happy with it. Jenkins very much became a top consumer, both in number of instances and in actual resources; I think we had a minimum memory footprint of four gigabytes for their app. But the successes: we had a new ticketing API, at a demo stage right now, that was able to get all the way out to production within a couple of months with minimal operations involvement, which has been great.
Our QE team several months ago migrated over their testing for our internal identity system, and they saw 15 million requests from that testing suite within a couple of days. The guy who implemented it was very happy and impressed with that; he's sitting over there somewhere. So right now we're at a couple hundred projects. Half of them are sandbox playgrounds, and about 15% are CI/CD projects. We've only got one customer-facing production system on it right now, but we've got several production services. Technically, you're not production as far as I'm concerned unless you tag your project as prod, so there might be more than that, but when I run a query, those are the only ones that qualify.

So, some lessons learned. These are just some points I thought would be nice to share as we run through the things we ran into, especially if you haven't done this before.

It took a while to fully learn this lesson. In the Ansible inventory, you've got the LB nodes, and because the routers are similar to the LB, and because they both run HAProxy, it's real easy to think they're the same thing. They're not at all. By default, the LB is pretty much an uncontainerized HAProxy that runs on that LB node; if you don't have LB nodes, it's expecting you to handle load balancing externally, on an F5 or something like that. The routers, on the other hand, are pods running HAProxy that run on any nodes matching your router selector, which defaults to the infra region. By default there's an infra region, and if you don't have schedulable nodes in that infra region, say you just have your three masters and they're all set to unschedulable because that's what the instructions tell you to do, you're never going to get anything running. It took me a week to figure out that's why those router pods weren't coming up. Once you add nodes into that infra region that can be scheduled on, they will come up.

Where I actually ran into the problem was that I had two nodes, but the default for the router replicas is, I think, five. With only two nodes, it just was never coming up. Once we went in there and shifted that down to two, everything was fine. So now, in our host inventory, there's a nice big comment section that says: make sure the router replicas are no more than the number of nodes in the router region. We set aside a separate region for the routers. Right now our two primary LBs, which run the uncontainerized HAProxy, are also running the router. I'd like to change that at some point and keep them completely separate. I think it would be easier to manage over time because you'd have that clear distinction of what they are.
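To make that concrete, here's roughly what those pieces look like in an openshift-ansible hosts file. This is a minimal sketch with hypothetical hostnames, not our actual inventory:

```ini
[lb]
# Uncontainerized HAProxy balancing the masters. Leave this group out
# and openshift-ansible assumes you balance externally (F5, etc.).
lb01.example.com

[nodes]
master01.example.com openshift_schedulable=false
# Router pods can only land on schedulable nodes matching the selector.
infra01.example.com openshift_node_labels="{'region': 'infra'}"
infra02.example.com openshift_node_labels="{'region': 'infra'}"

[OSEv3:vars]
# Keep this no higher than the number of schedulable nodes matching the
# selector, or the router deployment will sit there never coming up.
openshift_hosted_router_replicas=2
openshift_hosted_router_selector='region=infra'
```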
Quotas: one of the things I'm happy we did was start off with quotas from the get-go. Every project you create gets a pretty minimal default quota. We don't put a high barrier on requesting a higher quota, except that we prefer to only grant one if you're following our conventions for naming and such, to prove it's not just your personal playground. But even if it is your personal playground and you want a higher quota, we're likely to give it to you. We just want to keep a lid on things; we're not trying to be overly restrictive.

The one thing we didn't include from the get-go, or rather tried and backed out of, was resource limiting: CPU and memory, which you can add into the quotas. Instead, we just restricted the number of items you could have: the number of pods, the number of storage claims, things like that. When we added the resource limiting, anybody who went to load a new app failed, because the QuickStart templates don't have any default resource requests, and if your template doesn't request the resource, it fails. Between laziness and the time pressure of everything going on, we said, okay, we'll revisit that later, because it means we have to go edit all of the templates that came with OpenShift to include those requests. We've got a story in our backlog to implement that everywhere. And it does work; we played with it a little on something I'll be getting to in a second.

But let's talk about the resources. I didn't even want to try to fit this into a reasonable slide, so I just dumped what's in our inventory file. We try to be pretty verbose in our inventory file about why each setting is where it is. The kubelet args are where you pass a lot of arguments down to the individual nodes, to what's running out on your workers. Pods per core, I believe, actually defaults to 10; I don't remember why we hard-coded it here, but the value is based on the documentation and the size of the nodes, the number of pods they can handle. Our nodes are intentionally pretty small.

Then there are the garbage collection thresholds, high and low. The local image repository on each of the nodes takes up a certain amount of disk. The high threshold is where garbage collection kicks in, and then it tries to clear out until usage is lower than the low threshold. Fairly easy. We had these at 90 and 80 originally, I think, and what happened was somebody would come in with a big image that needed more than 10% of disk, and they'd get kickback errors when they got scheduled to a node, because there wasn't enough capacity even though the node had 10% free. It took us a little while to figure that out, and basically we just made the garbage collection a little greedier. We've still seen that error once or twice, but it's very rare now. Definitely something to keep an eye on.

The other lesson: our first major incident came from nodes starting to OOM-kill on us, which was decidedly not fun. We didn't have system-reserved defined, or the cgroup driver. I'm not 100% sure the cgroup driver actually has to be in there. Reading through the docs, we thought all three of those bottom settings needed to be there, but the bottom two actually break origin-node when they're present. For what it's worth, I left them on the slide so you can see them: don't add those, because they will stop origin-node from working. Basically, the goal is to reserve an amount of memory on the system so that OpenShift doesn't OOM-kill itself, which is what happened to us. That was lots of fun.
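Since the slide doesn't reproduce well here, this is roughly the shape of that stanza in an openshift-ansible inventory. The numbers are illustrative placeholders, not recommendations; tune them to your node sizes:

```ini
[OSEv3:vars]
# Arguments passed straight through to the kubelet on every node.
# pods-per-core caps pod density on intentionally small nodes.
# The image GC thresholds are set lower (greedier) than the defaults so
# a large image can still land on a node without kickback errors.
# system-reserved holds memory back so the node itself doesn't get
# OOM-killed by its own workloads.
openshift_node_kubelet_args={'pods-per-core': ['10'], 'image-gc-high-threshold': ['80'], 'image-gc-low-threshold': ['60'], 'system-reserved': ['memory=1Gi']}
```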
This next part is where we actually played with those resource limits inside our quotas. Red Hat has put together this awesome set of resources, and they've gone around the country, and probably the globe, giving free workshops you can come to. I totally thought it was going to be a sales pitch, but I went in and we got to sit down and do all-day labs. It was amazing; it was so much fun. The content is all out on public GitHub, and it's fairly easy to kick off your own version of it. I run it internally, and it's up and running now, with a special quota set aside for it that has resource limits. At the road show they did in San Antonio, we blew out their reservations, and halfway through the day we overloaded their system. With that in mind, when I went to do the big version internally at Rackspace, we made sure we had pretty good resource quotas in place before we let people on, and we were able to handle a good 100-plus people, about what was at the road show, without affecting any of the production workloads on our main system, which is where we ran it. It was pretty good.

Internally, we've got several teams working on using Helm to manage things, basically trying to provide more composable templates for reuse. It's not fully embraced for everything yet, mainly because Helm is single-tenant at this point, but there is work upstream to change that, and it does appear this is eventually going to be more of a thing. We've seen a lot of success from the teams that have been using it. If anybody wants to talk to one of the people using it, come find me and I'll introduce you.

Community: Diane's mentioned this several times. If you're not on the Slack, get on the Slack. It's very quiet and very lonely; we want you there, we want you talking. It is technically the preferred channel. OpenShift Ansible on Gitter is one of the more active ones I've been a part of, for whatever reason; that's just where there ends up being more talk. It'd be nice if it were on Slack instead. And the Stack Overflow tag for OpenShift Origin is a good resource as well, at least a good place to ask your questions. I've been told it's a place people actually try to keep up with, and I've seen answers go up on things fairly quickly.

So, some random finishing-off notes. One of the things we didn't really pay attention to until the last second on our first production deploy was the SDN's network. We had deployed using a 10. network. Well, we already use 10. everywhere inside: we're a big hosting provider with big private networks, and we're using pretty much all of 10/8. I had deployed all of our POCs using 172., and then I went to do prod and we deployed it with 10. The night before, my coworker goes, hey, that doesn't look right. So I did the horrible thing and spent the night rebuilding the entire thing from scratch, which was lots of fun. So definitely, if you're on private networks, make sure you account for overlap with the SDN inside OpenShift. It can lead to some weird wonkiness here and there.
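The ranges in question are set at install time through the inventory. Here's a sketch of the relevant knobs, with example values picked to dodge a 10/8 internal network; the CIDRs are placeholders, so check the defaults for your release before borrowing them:

```ini
[OSEv3:vars]
# Overlay network that pod IPs come from; the default sits inside 10/8,
# which collides with networks a lot of hosting shops already use.
osm_cluster_network_cidr=172.16.0.0/14
# Network the cluster-internal service IPs come from.
openshift_portal_net=172.30.0.0/16
```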
Another thing that bit us in the long run: we deployed using openshift-ansible, obviously, then went to extend the cluster and had no idea what hash we'd deployed from. When we ran the playbooks again from whatever was current, things acted weird, because the cluster wasn't exactly in the state they expected. We eventually got to a point where we know what hash we're working with, but definitely keep track of that. It's also helpful for things like: okay, we deployed this data center, now I'm going to deploy that one, just to make sure you're in sync. Theoretically you can stay current on a release branch, but that theory doesn't always work out.

We've also been working on handling a lot of our post-deployment changes, like adding quotas to things, using the Ansible oc module, so that it's a lot more programmatic instead of somebody just doing oc create over a bunch of files in a repo. It's at least a little more automatic, even though the result is the same.
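As a rough sketch of that idea, assuming the oc module that shipped with Ansible around that time; the host, token, and quota numbers are made up for illustration:

```yaml
- name: Ensure the default project quota exists
  oc:
    state: present
    host: openshift.example.com      # hypothetical master
    port: 8443
    token: "{{ admin_token }}"       # credentials supplied elsewhere
    kind: ResourceQuota
    name: default-quota
    namespace: some-project
    inline:
      apiVersion: v1
      kind: ResourceQuota
      metadata:
        name: default-quota
        namespace: some-project
      spec:
        hard:
          pods: "10"
          persistentvolumeclaims: "5"
```

Kept in a repo and run through Ansible, the same change gets applied the same way every time, instead of depending on whoever happens to be typing oc create.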
And then I left that last line in the wrong spot. So anyways: Greg Swift, and those are the ways to get a hold of me. Thank you. Sorry, I'm just getting over having a cold.

So I loved the shout-out in this to the road show stuff, because that's been one of the things we've used to get people started really quickly, and it came out of the evangelist team. It's great that you're taking advantage of it, and I hope other people will too. Does anyone else have any questions for Greg? While he's still standing, there's one over here.

What's the strategy around Origin?

So, Rackspace historically has been about building up internal knowledge. We explore a lot of options and avenues over time, and we have conversations with Red Hat here and there. A big part of it was just getting the ramp-up started, knowing that we were going to be building out expertise on it internally and hoping to be contributors back to the community. We haven't quite gotten to that point on the development side on the back end, but we have submitted a few pull requests back to openshift-ansible, and we try to be helpful in that community as well.

Is there another question for Greg before we let him go back home and go to bed?

Oh no, I'm better now.

You're better now? Okay, good. He's been quite sick for the past couple of days. If not, there's one, oh, there's a couple, wait a minute.

I was just wondering how many clusters you're running.

Right now we have two online, and we're about to do the third.

Okay, so are both serving production workloads?

The first one is fully prod. The second one is only not prod because we're in the middle of a moratorium, and there was no reason to release a new production system in the middle of a moratorium.

My question's actually for Diane. Do you know if there's a road show planned for 2018?

Yeah, I don't think it's up on the website yet, but we can get it up there, and as soon as it is, we'll send it out on the Commons mailing list and the Slack channel. There's another question over here.

If anybody wants help setting up the road show for their own use, feel free to hit me up on Slack, because it did take a little bit of poking.

You mentioned you're running this on VMware. Did you do any specific tuning?

No. In fact, I noticed the other day that there are tuned profiles for OpenShift, and I still haven't applied them, but I definitely want to go back and do that. We didn't do any different tuning for the OpenShift nodes than we do for normal Linux nodes on VMware.

I was more referring to the VMware layer. Did you do any tuning there?

No, we don't tune it any differently than we normally do. And I'm not the VMware expert on my team.

Okay, thank you.

All right, there's another question. Hang on a sec.

On your two clusters, are you planning on keeping them in sync with projects and deployments, or will they be out of sync, heterogeneous?

So we want to look at having the federation layer going on, but basically we're taking the approach of: kind of like if you go use AWS, they don't sync your projects between regions; you still go deploy your stuff to them. So basically not synced. We will worry about making sure templates and all those other things are there, but we're not helping anybody make sure their application is deployed across multiples. The expectation for an internal consumer is that it's just like consuming a cloud service: you're responsible for deploying your app where you want it to be, and if that's multi-region, that's your responsibility.

Any final questions for Greg? He will be here this afternoon and through all of KubeCon too, so please reach out. And I will set up my laptop at the reception this evening while we're all drinking beer, and anyone who wants to get on the Slack channel, I will sign them up, so come find me. All right, thank you very much, Greg.