All right, we've got a lot to get through in the next thirty-odd minutes, so let's get going. "Build It Yourself" is really a story about our journey to OpenStack at my organization. We're going to go through the history of why and how we chose to go at it mostly on our own, without a system integrator doing our deployment, and the skills and tips I generally feel people are going to need to be successful in a do-it-yourself OpenStack environment.

About me: my name's Cody, and I'm a principal SysOps engineer at Puppet. It used to be Puppet Labs; we rebranded a couple of weeks ago. I've been there almost six years now. I started in professional services, went through business development, and I still do a fair bit of that, but I've been in the SysOps department for quite a long time now, a good four and a half years. I came out of the university space, maintaining a whole lot of different, discrete, odd academic applications, including the university's virtualization and storage platforms. Up until very recently I was the primary maintainer of the legacy systems at Puppet: our legacy vSphere implementation, our storage platform, and at this point I'd call our public cloud providers legacy too. Right now I'm focused almost wholly on the internal and external proliferation of OpenStack usage at Puppet, and on our upstream community activities as well.

One of the big things I believe about being successful with OpenStack inside an organization, when you're just a small group, is that it can't be just about the money or the licensing costs. You can still derive financial value, in real numbers, out of time saved by end users and time saved on the back end, not having to interact with a system you don't like or can't troubleshoot. Team identity was really important for us to be successful here, so we had to look very closely at how it aligned with our organization. We looked at the team, the way we were built, the history we had with other applications, and compared that to Puppet's actual company values, to see whether the infrastructure we had in place at the time was at all congruent with those values. A lot of things just didn't match up. Puppet has an almost poem-like value statement, and there are three phrases in it that really stood out as not aligning with our infrastructure: rapid improvement, user value, and collaboration. All three were completely absent from our old system.

This all really boils down to whether, at the end of the day, you're proud of the systems you're running. I joined the SysOps team and inherited a vSphere implementation that we were dependent on. I say that a lot: people ask me what we run and I go through that whole sentence, "well, I inherited it." If I'm saying that all the time, it's obviously not a system I want to continue using and improving inside my infrastructure.

We have a small implementation team that did the OpenStack deployment; it's part of a bigger team that spans enterprise applications, websites, and more. It's a fairly diverse team, but I have "automation enthusiasts trying to enable a more effective work environment" up here on the slide, and that really boils down to being hackers.
We kind of forget this phrase, because everyone has changed its meaning over the last several years, but it really is about a culture of overcoming the limitations of your environment and your services. It doesn't matter if you're SysOps, DevOps, or developers: this is what you are and this is what you're trying to do inside your organization. It brings everyone together and it transcends all those other titles we've come up with over the last several years. And we do this because we love it. Even after we leave technology jobs, we're still going to be hackers.

So, general tendencies inside the team. We're very wary of implementing anything that isn't majority open source. That really doesn't have anything to do with cost or ideology around open source; it has to do with security and confidence. We want to feel safe that we can identify a problem and tell our customers how we solved it, and at the end of the day, if we have to, we'll write code to fix the problem ourselves. That's all it's about: digging in, being able to drill through everything. Because of that value, we're not afraid to dump entire portions of infrastructure off the edge of a cliff. If something doesn't fit into the infrastructure, we will cut it off and replace it. We've actually done this with a SAN implementation. We had three years on a support plan; after less than a year it became more expensive to maintain on a day-to-day basis than to put it in a closet. I believe it still has about a quarter of a year left on the support contract, and it hasn't been plugged in for just over two and a half years. We literally unplugged it from the power when we last moved offices and it was never plugged back in. It was a grand time, actually.

We like flexibility, so we build over buy. We're very agnostic; I'll bring that up again later. It's a big thing when it comes to doing OpenStack on your own: being dogmatic is going to put more obstacles between you and success. Really good examples of this throughout: we switched from Debian to CentOS for this project, and we'll abandon Puppet if we have to and go directly to Bash and Python; we've been known to use some Fabric as well to get around automation issues.

A little bit of history. We've had quite a few failures along the way. There have been many, many years of organic infrastructure inside Puppet Labs. Puppet, sorry. Still hard to get used to. AWS ended up being a failure because it turned into a giant Wild West: everyone had unbridled access to everything. When you're an automation company writing tools to manage Amazon, it just so happens that people can very easily wipe out everyone else's work with a single command. That was pretty rough. We really needed a real tenant system. Vanilla vSphere went the opposite direction. We tried to set up some automation to manage permissions, but permissions were so difficult that we were always wrong, so nothing ever worked. We could not convince end users to use the SOAP API, and we couldn't convince them to use either vCenter or the console. The only API usage was by engineers trying to create new tools so people could use them.

One of the more successful efforts, though it eventually failed, was a developer who, in his spare time, created some CLI shims on top of VMware. We created a whole bunch of pre-provisioned VLAN tenants for him, everyone interacted with it, and it was working. He tried to hand it off to SysOps at a very bad time in our organizational and team history, and it had some bugs, bugs that no one was willing to fix. Eventually it degraded and we ripped it out.

The most successful endeavor has been a RESTful API; if you search Puppet's GitHub repositories for vmpooler, you'll find it. It was actually part of quality engineering's effort to get over a scaling issue we had with vCenter. At any given time inside our vCenter cluster today we have almost a thousand idle VMs, plus many more running tests; we cycle thousands of VMs a day, because our test matrix is fairly large for the number of platforms we support. When test runs spiked, vCenter would just lock up, because it could not provision VMs fast enough, and that generally had nothing to do with the underlying disks: it was the vCenter database. So our quality engineering team created this thing that just builds pools of pre-provisioned VMs. Then our developers found out it didn't have any authentication on it, so they just started grabbing virtual machines. In fact, they would deplete pools and we would fail CI runs. The tool had no CLI until recently, no UI, and no tenant system. So it's not really for everyone, but it has worked.
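To make the pooling idea concrete: vmpooler itself is a Ruby service, so this is not its code, but a minimal Python sketch of the same pattern, with hypothetical names and pool sizes, shows why checkouts were instant and why unauthenticated grabbing could drain a pool and fail CI runs.

```python
import itertools
import queue

# Minimal sketch of the vmpooler idea (the real tool is a Ruby REST service;
# names and sizes here are hypothetical). Each pool holds pre-provisioned VMs,
# so a checkout is a dequeue, not a slow clone through vCenter.

class VMPool:
    def __init__(self, platform: str, target_size: int):
        self.platform = platform
        self.target_size = target_size
        self.ready = queue.Queue()
        self._ids = itertools.count()

    def replenish(self):
        # In the real system this clones from a template in the background;
        # here we just fabricate hostnames to stand in for ready VMs.
        while self.ready.qsize() < self.target_size:
            self.ready.put(f"{self.platform}-{next(self._ids)}")

    def check_out(self) -> str:
        # Raises queue.Empty when the pool is depleted -- exactly the failure
        # mode we hit when unauthenticated users drained pools and CI runs died.
        return self.ready.get_nowait()

pools = {p: VMPool(p, target_size=10)
         for p in ("centos-7-x86_64", "win-2012r2-x86_64")}
for pool in pools.values():
    pool.replenish()
print(pools["centos-7-x86_64"].check_out())  # instant, no vCenter round-trip
```

The expensive work, cloning from a template through vCenter, happens in the background replenish loop ahead of demand, so the vCenter database never sits in the provisioning hot path.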
So we went ahead and named our OpenStack project, so we'd know this thing would come to completion. We named it first and then backronymed it; that's why it's kind of a silly name. But we did the whole works: project plan, user research, we even escalated project health to upper management so that, if need be, we could leverage them to keep the project moving and push it through to completion.

The goals of the project were twofold: end-user and back end. It boiled down to a safe, flexible environment where people could try and experiment with new things. This was to be a research cloud, and that's what we built; all of our features were driven toward that. Since we're a company that's very used to having our own infrastructure, it had to be close and it had to be fast. Latency had to be low, pipes had to be big, disks needed near-zero latency. Those were all very important things in our build-out, and one of the reasons, on top of the lack of a nested tenant system, why public clouds weren't a good fit for us.

The automation has to be on the user side and on our back end; we need to be able to automate the whole stack. In fact, we'll generally look at a piece of software and ask: can we automate it with Puppet? Is it trivial to automate with Puppet? If yes, then it's a good choice. If there's just no way we're ever going to put the time into automating it, that piece of software doesn't go into our infrastructure. And it needs to happen on both ends, because our users use a lot of Vagrant, plus other tools like Heat and Murano, to bring up various application stacks.

The open source tooling requirement has to do with the way our team does on-call rotation. We need a common set of tools across everything that we do; our websites use a common set of tools to troubleshoot failure.
We wanted to be able to use those same tools to troubleshoot our OpenStack cloud, because after this implementation team of pretty much one person, kind of two, is done, the cloud eventually goes into a general SysOps on-call queue covering people who really aren't OpenStack domain experts; they deploy our websites or manage Jira.

So, the numbers, really quick. Like I said, one full-time person, plus two floaters, including me, who could bounce in and out as legacy infrastructure priorities changed. It did take 20 months, which seems like a long time, but that's a little misleading: it's measured from "we're doing this" to production. The primary person on this project, his only introduction to OpenStack was Swift at a previous company, where he did the implementation with SwiftStack. He was previously a Windows admin dropped into our *nix-only shop; we have exactly one Windows server in our entire infrastructure, managed by our IT team, and that's for our phones. He'd only been on the team a couple of months, and he also didn't know Puppet, so in those 20 months he had to learn all of that too. The automation, I have 4,000 lines up there, is also a pretty misleading statistic, because it doesn't include the ten upstream modules, and a lot of it is things like filling out parameters; but all the other automation, load balancer setup, monitoring setup, we've included all of that. We did have to rebuild or backport 37 different packages, and maybe one more when I get back next week, because we have another bug.

Okay, so the initial issues we had in getting the project finished. Knowledge was a big one; I'll probably bring this up a couple more times. No one had ever done this before, and OpenStack is a fairly complicated, dense application. We had the tools to automate it and the ability to deploy the entire stack without knowing very much about the components underneath, but that's really against what we believe in inside our SysOps team: we don't like automating away our understanding of how things work. I know this can be a little controversial. I've talked to people about it before, and the idea goes: you would never deploy OpenStack into production without automation, so why not just learn it from the perspective of automation? Two reasons: humans write automation, and automation generally doesn't prevent things from failing. It may be able to heal, but if something is failing constantly you're going to have to go fix it yourself, so you need to know those underlying components.

We had bugs. I have one link up here. We found that most of the bugs were already fixed, but we had to go get the code ourselves and backport it. The one up there is all about SSL termination at a load balancer; there's a thread on the dev list about it. I think Neutron is the only one I couldn't find a fix for; we patched and rebuilt Neutron ourselves to solve the problem. But the projects that have fixed it have all fixed it in different ways with different parameters, so if you want to go down the route of securing your infrastructure at the load balancer to offload the needed SSL work, which also makes troubleshooting a lot easier, you will need to follow that link. All of those bugs have been related to load balancing, plus a handful related to Neutron and our SDN choice. The workarounds were pretty much all due to that load balancing and SSL issue. They're very obscurely documented, because they're not technically supposed to be needed. The place I found most of them was by following the rabbit hole that is that link, which has attached to it all the associated reviews that bring up the various parameters you need to fill out to get around those issues.
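Nearly all of these boil down to the same root cause: a service sitting behind a TLS-terminating load balancer receives plain HTTP, so it happily generates http:// URLs in its responses unless you tell it to trust a forwarded header. The real knobs vary by project (some use oslo.middleware's proxy handling, others per-service options like Nova's secure_proxy_ssl_header), but conceptually the fix is a shim like this simplified WSGI sketch, which is illustrative rather than the actual oslo code:

```python
# Simplified sketch of what the various SSL-offload fixes amount to: trust the
# load balancer's X-Forwarded-Proto header and rewrite the WSGI scheme, so the
# service builds https:// links even though the backend request arrived as
# plain HTTP. (oslo.middleware ships a real version of this; this is not it.)

class ProxySchemeMiddleware:
    def __init__(self, app, trusted=True):
        self.app = app
        self.trusted = trusted  # only honor the header from a trusted LB

    def __call__(self, environ, start_response):
        proto = environ.get("HTTP_X_FORWARDED_PROTO")
        if self.trusted and proto in ("http", "https"):
            environ["wsgi.url_scheme"] = proto
        return self.app(environ, start_response)

def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    # Anything that derives URLs from the request scheme now sees https.
    return [environ["wsgi.url_scheme"].encode()]

if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    make_server("127.0.0.1", 8080, ProxySchemeMiddleware(demo_app)).serve_forever()
```

Most of the parameters you'll dig up in that rabbit hole of reviews are some variation on wiring this behavior into a particular service.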
HA was actually one of our issues too, but for different reasons: we decided to go all in and make this the most HA application we had internally. We got really, really frustrated with vSphere's inability to be HA. There was a single vCenter endpoint, that's all we could get, and we couldn't do maintenance without causing outages. Because of that choice, HA added new monitoring requirements and a new dependence on Puppet. Previously, Puppet just managed configuration; now, in our infrastructure, Puppet dynamically builds and destroys clusters. So if we don't have good Puppet hygiene and we break Puppet, we can actually take out infrastructure.

We've been really happy with the monitoring changes, and with using OpenStack Rally. Rally has been fantastic for us. It's the first time we've been able to do functional SLA testing for any of our virtual infrastructure, and it's made us all very confident in the platform we've delivered to people. I would say it's a requirement for anyone going down this path themselves.

This was also our first production Python app. Our team generally prefers Python over Ruby, so we do a lot of Python scripting, but scripting is a whole lot different from a production Python app. We found very quickly that Puppet was installing libraries in different ways for different reasons, overriding things and taking out the OpenStack cloud. We fixed those quickly.

So, my first tip, now that we've gotten through the history of our platform and the issues: make simple choices. Remove complexity as much as possible. Some of these are going to feel counter to that, but I'll explain. I mentioned being dogmatic versus agnostic earlier with a lot of our goals, and it also becomes one of my recommendations, because the more dogmatic you are about the implementation details, the more complication you're going to add to the deployment. Our switch from Debian to CentOS had nothing to do with any strategic desire to run enterprise Linux. I mentioned those 37 packages: we knew we were going to have to rebuild packages, and when I looked at the Debian package build pipeline I would have to stand up, we decided RPM was just going to be a whole lot easier. We had more expertise and it was simpler, so we switched, because we knew it would take a lot of the complexity out. We were also generally a Postgres shop, but we ditched that for Galera as well.

Learn from the ecosystem. That's both the external OpenStack ecosystem and your internal one. We were originally going to do something different and grand and original in our OpenStack implementation: we had gone down the path of designing a shared-nothing architecture, load balanced and HA based on anycast routing and equal-cost multipathing. As we traveled further down that path and kept evaluating the way we were doing things, we realized that long term, when we weren't around, the barrier to maintaining it was going to be higher; it was going to depend on networking expertise. So we started looking around for alternatives, and we found that the Puppet code we were already using to manage websites had a very robust, dynamically generated HAProxy implementation that gave us a very rich set of metrics. We plugged that in in less than a day. One time, when I was inserting myself back into the team for a while, we were able to switch the entire platform over nearly instantly, directly to HAProxy.
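To give a flavor of what "dynamically generated" means here: our real version is Puppet code, not Python, and the hosts, ports, and certificate path below are made up, but the shape of the idea is simply to take the current node list for each API service and emit a health-checked HAProxy section.

```python
# Illustrative only: render an HAProxy section per API service from the
# current list of controllers. Our actual implementation is Puppet code;
# the hosts, ports, and cert path here are invented for the sketch.

SERVICES = {
    "keystone_public": {"bind": 5000, "nodes": ["ctl1:5000", "ctl2:5000", "ctl3:5000"]},
    "nova_api":        {"bind": 8774, "nodes": ["ctl1:8774", "ctl2:8774", "ctl3:8774"]},
}

def render(services: dict) -> str:
    out = []
    for name, svc in services.items():
        out.append(f"listen {name}")
        # TLS terminates here; the backends speak plain HTTP (see earlier).
        out.append(f"    bind *:{svc['bind']} ssl crt /etc/haproxy/cloud.pem")
        out.append("    option httpchk")  # health-check every member
        for i, node in enumerate(svc["nodes"]):
            out.append(f"    server {name}-{i} {node} check")
        out.append("")
    return "\n".join(out)

print(render(SERVICES))
```

Because every member carries a health check, HAProxy's stats output gives you those rich per-backend metrics essentially for free.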
And we did that in a couple of other situations too. Like I mentioned, Galera and MySQL made things a lot easier as well. There are already public modules in the ecosystem for it, and really good OpenStack support and documentation around it. Because we picked MySQL, I can hop into various Slack and IRC channels and go, hey, I'm having a database locking issue, and people go, do you have this set? Oh, no. Okay, fixed. That's made it extremely easy.

So, this next one: do your research and set expectations inside your organization. Don't deploy all these different services because you think they might be super cool in the future; you can phase them into your infrastructure fairly easily later. We actually piloted many of them, backed off on quite a few, and deployed only the most valuable ones for production. We'll just ramp up as we go. Like I said, don't do things you find out have no value. There's a very good chance many organizations don't need SDN at all, and flat DHCP provider networks will serve you just fine. Ours is a research cloud, so we have an SDN, because the networking requirements aren't dictated by our networking needs, they're dictated by our customers' networking needs. We need to be able to design a network that a customer might have inside their infrastructure and model it, to design new applications or solve problems. We'll have two more clouds deployed by the end of the year: one for driving our CI pipeline, and the other likely for mundane infrastructure things like enterprise apps and websites. Those are most definitely not going to have an SDN; our network does not change that much.

This one's weird: active/active HA. The reason I call this a simple choice is that applications that support active/active HA were generally built that way; they understand how to be in those states. For an active/passive HA model, you're generally having to build automation, failure-prone automation, that has to detect failure and do failover. In an active/active situation, things are generally more aware of these failure modes, and you're just going to sleep longer at night, not have to get up and fix a degraded state in the middle of the night because you might lose the other node by morning. It makes the whole design and maintenance process a lot easier. We were able to accomplish this: our SDN portion was the last place where we were going to have to do an active/passive implementation, and in the course of the deployment, the SDN we chose, Midokura's MidoNet, removed the last components that stood between us and a fully active/active HA system.

Pick mature automation. Try not to go fully alone and build all the automation yourself.
I would also recommend, if you're doing this with a small team and a limited set of resources, that you don't go out and grab the fancy new thing that demos really well. Find a tool that's proven in the marketplace and actually driving real workloads. This is going to sound self-serving because I'm from Puppet, but the Puppet OpenStack project basically made this successful for us. It's been around since about May 25th, 2011; I picked that date because it's the first commit date on puppet-nova. We were the first non-core automation tool in the Big Tent, so we've been driving ahead quite quickly on a lot of the new advances inside the OpenStack community. The CI is ridiculously robust: I think we test against at least six different versions of Puppet, including Puppet 4, we do functional testing, light functional testing, and run Swift on every commit. It's just very powerful. It's all commercially supported downstream by Mirantis and Red Hat; our PTL is from Red Hat and many of our core members are from Mirantis. We have a very strong operator community there as well, with Time Warner Cable, Puppet, and the HP Enterprise Cloud that's going to drive a good portion of OpenStack Infra, which is also built with Puppet. So it's very strong, very mature. I would happily tell you it's the most mature, because for us it really was. We only had to fork one module, for a one-line change, and the change has already been pushed back to master. We did not have to do any custom coding around these modules to get a full HA solution. The modules are also released in lockstep with the OpenStack projects themselves.

So, to go along with solving these issues and accomplishing these goals, here are the skills a small team is going to have to go get. Know Python. You don't need to be an expert by any means, but you need to understand object-oriented programming, you need some algorithm knowledge, and you need to figure out how to find Python documentation, because you will be patching bugs. You'll need to be able to sift through the code, and you'll need to be able to talk to people on IRC in an intelligent manner about what the code's doing incorrectly so that you can get it fixed. Luckily, like I mentioned, our team switched largely from Ruby to Python in about the last year, so this was an easy on-ramp for us.

I've brought up packaging quite a few times. You do not want to be patching code post-deployment. It's fraught with issues: tracking which versions you have in place, Python load paths getting incorrect, and eventually just making your upgrade process really, really difficult. It doesn't matter what you pick, just make sure you know it. I don't even care if you use Docker; Docker is a packaging format at its core. Just learn how to do it, learn how to build the pipelines, even if you're sourcing someone else's packages. We ended up sourcing RDO packages, but we still rebuilt them. Because of that, you also need to be able to host your own package repositories, and that includes your own private registry when it comes to Docker.

Packet tracing and dumping. This literally had nothing to do with our SDN; it was all about tracing packets between the various APIs, how Nova was asking for images, or how things were coming through the load balancer. Wireshark, tcpdump, ngrep: all of these were a great help. In fact, when you do SSL termination at a load balancer, this becomes even easier. If you're able to do that inside your infrastructure, and I know some security models don't allow it, but if you can see plaintext on the other side of your load balancer, you've cut out a lot of the overhead needed to do troubleshooting and packet tracing.

You'll need to learn some SQL. Not everything is available from the API. All of our usage and trending data, which we use to track how many users are active and how many vCPUs we're using so we can trend those ratios, we're getting directly from the database. It's about four lines of SQL, a couple of SELECT statements, and we get all that data. There was some usage data we could get from the API, but it couldn't be formatted in a way that let us actually understand it well enough to plan purchases of new hardware. Also, at some point you're going to have a request failure, and that failure is going to leave errored rows in the database; trying to clean them up through the API will likely just produce more errors, so you'll need to jump into the database and clean it up. I know that doesn't feel ideal to some people, but coming from other platforms, this is a dream for me. The fact that once things were broken, all I had to do was run mysql and I could fix nearly anything was quite advantageous. It was borderline miraculous.
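For a sense of what those few lines of SQL look like, here's an illustrative Python version that queries the Nova database directly. The schema details (the instances table and its vcpus, vm_state, deleted, and project_id columns) vary by release, and the connection values are placeholders, so treat this as a sketch of the approach rather than our exact queries.

```python
# Illustrative version of the "four lines of SQL" usage trending described
# above. Nova's schema varies by release, and the host/credentials are
# placeholders. Requires PyMySQL (pip install pymysql).
import pymysql

QUERIES = {
    # vCPUs currently allocated to live instances
    "active_vcpus": "SELECT COALESCE(SUM(vcpus), 0) FROM instances "
                    "WHERE deleted = 0 AND vm_state = 'active'",
    # distinct tenants that own at least one undeleted instance
    "active_projects": "SELECT COUNT(DISTINCT project_id) FROM instances "
                       "WHERE deleted = 0",
}

conn = pymysql.connect(host="db.example.com", user="readonly",
                       password="...", database="nova")
with conn.cursor() as cur:
    for name, sql in QUERIES.items():
        cur.execute(sql)
        print(name, cur.fetchone()[0])
conn.close()
```

A read-only database user is a sensible precaution here; the point is trending ratios over time, not writing anything.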
All right, this last one is very obvious: learn OpenStack. I've drilled this pretty heavily: learn it before you automate it. On top of that, to help you do this learning, get some early adopters. Don't ramp everyone in the company on at one time. We had the option to just enable everyone's users on day one; we chose not to. We wanted people we could bring into the environment slowly, make sure they weren't having problems, and let ourselves identify what breaks, how to fix it, and how often, before we threw everyone at it and everyone throws up their arms, says "OpenStack sucks," and walks away. That has been very valuable. Our primary early adopters were our application orchestration team, who have been developing our application orchestration product inside OpenStack. It actually understands and lives inside OpenStack, and they take into account all the little networking bits they need to make our products work, at a scale greater than they could have managed before OpenStack. So they're getting value, and we're getting value, because they're slightly more technical than other users: they can give us access to VMs when they lose networking, help us out with API issues, and they're generally more willing to accept failure in certain situations, which lets us learn and educate ourselves better.

Okay, so that really is the end. I'm about three minutes faster than I thought I'd be, so we do have time for questions. If anyone has any, thank you, and please use the mics. All right, great. Well, see you guys later.