All right, let's see — we're about 10 minutes in, and I want to make sure we have some time for questions, so we'll get started. I'd like to welcome everybody to day four. You've all sat through quite a number of presentations already today and this week, and hopefully you had a bunch of the whoopie pies so you're properly loaded up with sugar. Today we're going to be talking about OpenStack at scale inside of NetApp, and how we've orchestrated our engineering environment with the use of Puppet.

My name's Seth Orgosh. Within NetApp I'm what's referred to as a solutions architect, which means I get to go talk to lots of customers and find out how they're using our gear and how we can help them solve their problems. With me today is Cody Herriges — did I say that right? Herriges. There you go. Cody is from Puppet, and we'll be sharing the presenting responsibilities. Don't be too afraid of the business title: up until January I was a principal engineer, and I was actually the person who led the team that did our OpenStack build-out with Puppet.

All right, so just a quick timeline of events at NetApp. Back around 2014, we decided that we needed an engineering cloud environment. One of my peers likes to say that shadow IT comes from a place of "no," and we wanted to avoid that problem. We wanted to be able to say yes to the engineers who are out building our code and developing our applications, but we needed to do it in a scalable fashion, and we needed to provide what they needed, when they needed it, so that we could meet more aggressive timelines. At the time, we were on a 12-to-18-month waterfall schedule for our core operating system, ONTAP, and we realized we needed to shorten that timeline — today we're on a six-month cadence. To enable that process, we needed to institute a change. So we introduced OpenStack in August of 2014. We brought in Puppet automation shortly thereafter, started using it for automating updates, and now we're deployed across multiple sites on multiple continents, deploying lots and lots of instances.

So let's talk a little bit about what the global engineering cloud, or GEC, is. ESIS, the overarching infrastructure services group, runs the global engineering cloud. As I said, it's multiple sites throughout the U.S. plus Bangalore. We've got nine R&D labs and about 5,500 users at any given moment. We have eight team members supporting OpenStack within the GEC, out of about 100 to 120 total. The GEC refers to themselves as customer zero, meaning they go out and create environments that are very similar to what our customers are doing out in the field. We're living the same processes and the same experiences our customers are living, and we're of course using our own technology to help drive those solutions.

So the GEC is an internal private cloud. It provides a one-stop portal — our end users don't go directly to Horizon. They go to a customized portal that uses ServiceNow as a back end. They log in with their SSO credentials, select a number of criteria — the types of virtual machines they need, how many, sizes of disks, and so on — and choose from a couple of tiers of storage for the back end. Once they hit go, the system takes care of provisioning. Now, not everything is in OpenStack.
The global engineering cloud actually has multiple hypervisors behind it. Until recently, we were primarily VMware. Our CEO decided that it didn't make sense to keep writing checks to VMware, especially given who owned VMware, and for the same reasons that a lot of our customers want to go to OpenStack and use open source software, we decided that building an OpenStack environment made sense from a financial perspective. It also made sense because we could be more dynamic — we could make changes faster than if we had to wait for the next 12-month release from VMware.

So, at a glance: we've got about 42,000 total seats available within the GEC, based purely on processors and memory, and a 15,000-VM capacity for OpenStack itself, so it's grown fairly large within the GEC. On a daily basis we've got about 5,300 virtual machines running, and most of those virtual machines turn over each day. 42% of our virtual machines are now running on KVM. That's up significantly — we've grown over 50% on KVM since last year, mostly at the expense of VMware and some at the expense of Hyper-V — and we expect to double the number of users running on KVM over the next year. Our Bangalore site is the first to be completely KVM, so we've done away with VMware and Hyper-V in Bangalore.

This entire infrastructure is built on what we call FlexPod. For those who aren't familiar with it, FlexPod is our partnership with Cisco: a reference platform of validated architectures that lets us deploy infrastructure more rapidly. It's a repeatable process, and it really helped us move quickly from the time we started. We're running RDO Liberty, with FAS on the back end as well as E-Series, and we're starting to put in SolidFire, along with Cisco Nexus and UCS compute. From an automation perspective: Puppet, of course — we're using the open-source version of Puppet — plus Jenkins and Git.

We already covered why OpenStack, so I'll skip forward. This is a bit of the architecture. We followed a crawl-walk-run methodology, very similar to the customers I talk to. In phase one, we did an all-in-one deployment, because we wanted to learn what this OpenStack thing was and how you manage it. We dropped everything into an all-in-one and quickly learned some lessons. We learned that Keystone and Horizon were chatty and were negatively impacting the controllers, so that became the first issue we had to overcome.

In phase two, we split those out and moved them onto their own hardware, but then we hit the second problem: now Horizon and Keystone were single points of failure in the environment. That wasn't good. We could not afford to have our engineers sitting idle while we were fixing services or doing work on the infrastructure; we needed them busy writing code for the next release. So in phase three, we introduced HA services: we set up multiple Keystone and multiple Horizon instances and put them behind load balancers. But now we had another single point of failure, and that was the controller. With the controller being a single point of failure, we again had to rethink how we were doing things. So phase four, which we titled architecture refinement, built this out: at the top, we've got load balancers sitting in front of Keystone and Horizon as well as the Galera DB; we've got controllers with their own DB and MongoDB, plus compute, grouped into a region; and down at the bottom, we've got NetApp storage.
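To make that load-balanced design a little more concrete, here is a minimal sketch of fronting two Keystone nodes with HAProxy in Puppet. It assumes the puppetlabs/haproxy module; the hostnames, addresses, and ports are illustrative, not the GEC's actual configuration.

```puppet
# Install and manage the HAProxy service itself.
class { 'haproxy': }

# Virtual IP and port that clients hit for the Keystone public API.
haproxy::listen { 'keystone-public':
  collect_exported => false,
  ipaddress        => '192.0.2.10',   # hypothetical VIP
  ports            => '5000',
}

# The real Keystone nodes behind the VIP; 'check' enables health checks,
# so a failed node is taken out of rotation automatically.
haproxy::balancermember { 'keystone-nodes':
  listening_service => 'keystone-public',
  server_names      => ['keystone01', 'keystone02'],
  ipaddresses       => ['192.0.2.11', '192.0.2.12'],
  ports             => '5000',
  options           => 'check',
}
```

The same pattern would apply to the Horizon and Galera listeners described above.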
Then we built out multiple regions, and the reason we built out multiple regions was twofold. First, we wanted to avoid that single point of failure — we didn't want any one region or any one controller bringing down the entire environment — and breaking out into regions was very helpful there. It also meant we could group like compute nodes together, which was critical for being able to do live migrations, which in turn fed into non-disruptive operations and upgrades. So I'm going to turn it over to Cody for why Puppet.

Thank you. So before we get into how NetApp was successful going through their different phases, I want to make sure I cover the value-adds of Puppet with OpenStack, so you understand what Puppet brings and how you can actually be successful in these deployments. Puppet is generally positioned as an automation and situational-awareness tool. It's the idea of being able to discover what you have in your infrastructure, create business logic and a set of code abstractions that define what that is so you can secure it, control it, and keep it compliant, and then be ready to modernize as quickly as possible. Because you have all that code in place and you know what your applications look like, you can migrate them into OpenStack, into a public cloud, or into containers, all using generally the same set of code. It all boils down to the thing we're most famous for: a common language that enables that across all these distinct types of infrastructure.

It looks like what I have up on the screen now. It's a strongly model-driven language that lets you declaratively define infrastructure — you don't define the process, you define what things are supposed to look like. It uses this common language across everything: we support many different kinds of infrastructure, you use the same language for all of them, and it hardly varies from this. There's some flow control and some variables, and we'll look at that in a moment. The example up here is a very simple Docker installation: the Docker engine, pulling an image down into your local cache, and running an application based off of that image. I use Docker here because we'll see a couple of other examples later, and I just wanted to illustrate that the language doesn't change, no matter what kind of infrastructure you're managing.
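The slide isn't shown here, but a comparable Puppet Docker example would look roughly like this — a sketch assuming the garethr/docker (now puppetlabs/docker) module; the image and container names are illustrative, not the actual slide content.

```puppet
# Install and run the Docker engine.
class { 'docker': }

# Pull the image into the local cache.
docker::image { 'nginx': }

# Run an application container based off of that image.
docker::run { 'webapp':
  image => 'nginx',
  ports => ['80:80'],
}
```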
This is the general workflow and ecosystem of how Puppet does its configuration. We get content from the world — almost literally. It comes from the Forge, from GitHub, from other sources. There are over seven million lines of Puppet code publicly available, and that doesn't count the stuff people have behind their firewalls that you never get access to; there are thousands of Puppet modules out there. That code comes in and is managed by the Puppet master, which compiles a set of configuration for each node. The nodes pick that up, and they can be anything from bare metal all the way through an infrastructure-as-a-service platform, including storage and networking. NetApp has been a long-time community member and has developed several modules for their actual hardware platforms as well, so you can do this automation end to end. All of that information is bundled back up in a report and an inventory and sent to a piece of technology alongside Puppet called PuppetDB. That's key to a couple of things we're going to talk about in a moment: all of that information is queryable.

So what are the specific value-adds? I'll go through the top three, but I want to start with the one on the bottom, because I spoke before on our experience of doing OpenStack from scratch at Puppet. We have a research cluster internally that we run as a production service for Puppet. We had to repackage and backport a good half of Kilo in order to get it up and running, so one of my tips back then was that you need to learn how to do packaging — how to take source files and build packages. Well, the number one thing we learned from our upgrade to Mitaka was that Puppet is still not going to get you around Python interdependencies. Most of the complex orchestration we had to do was to get around the fact that upgrading one Python component would take out other parts of OpenStack, and that added a fair bit of complexity. I'll recommend two things really quickly: Clark Boylan did a lightning talk this week — it's recorded — on virtualenvs, and there's public code that lets you manage these components inside virtualenvs so you don't have those cascading dependencies, or you can do some lightweight containerization of the components.

So, the Puppet OpenStack modules: this is an entirely community-driven project, and it's ridiculously mature. I picked that somewhat arbitrary established date because it's the first commit on the Puppet Nova module, so it's coming up on a birthday soon — it's been around since 2011. I'll show you what the CI matrix looks like in a moment, but the community has 40-plus active repositories. I was talking to the PTL the other day and asked what the big things were over the last cycle, and he said, mostly this stuff just works, so we refine it and make it more secure. I like to say it's in Snow Leopard mode — I love that analogy because it takes me back to the days when OS X was just working on refinements. They've made a lot of changes inside Keystone to make sure the way services communicate is more secure.

How am I doing on time? Okay, we're doing good. So here's the CI as of about a week ago. It's not just a big list of every project tested once — this CI matrix shows that the community tests different combinations of services, so you actually know whether certain components of OpenStack are independent of each other. There's everything here from testing Cinder with RBD in one combination to Swift and iSCSI in the next one over. It's very, very thorough. When I was an operator, this was probably my favorite part about working with the community, because I always had high confidence in what was being deployed. It covers the public ecosystem testing as well as OpenStack's own testing, so you know Tempest has run against it too. They're truly validated stacks on every commit.

Okay, so here's one of the big positives in our experience: we went from Kilo to Mitaka.
We skipped a release — we skipped Liberty — when we did our upgrade at Puppet, and it just worked. All the deprecations, all the migrations of configuration, are built into the modules. Like I have up there: we skipped a release and were able to go from single-domain to multi-domain Keystone, v2 to v3, without really doing much of anything. We ran Puppet once, and that was about it. I added two lines to our Nova class to manage the Nova API database, and that was it, besides some stylistic cleanups that were my own personal opinion on code style.

I also had to put one Neutron hotfix into our code. Historically, over the last couple of cycles, OpenStack has been improving its SSL termination. There was one piece that got left out of Mitaka and didn't land until Newton — the fix was in Oslo, but it hadn't been exposed as a configuration option in Mitaka — so we just implemented that with Puppet. Since we could do it with Puppet, we didn't have to rebuild a whole bunch of packages this time around, so we were pretty much vanilla RDO at that point as far as packaging goes.

Here's the other big thing: Puppet knows about everything it manages, and you can look that data up. There's a lot of other stuff around OpenStack, besides OpenStack itself, that needs to be managed and configured for OpenStack to just work. To configure our RabbitMQ cluster, we do a dynamic lookup against PuppetDB for all those components: we ask PuppetDB to return all OpenStack nodes that are controllers and in production, it gives us that list of host names, and we hand it to our RabbitMQ class so that it builds out the cluster. You'll notice, once again, that now that we're managing RabbitMQ instead of Docker like my earlier example, the language doesn't change.

And here's another example of the same thing: looking up the IP addresses of all the controllers in the infrastructure so that I can tell Horizon where all of our memcached servers are. We have memcached running to store session tokens for all of our Horizon dashboards, so that if we lose a Horizon dashboard, the ones that remain — the ones people fail over to — can pull their session keys from memcached and users don't have to log in again. It's just there to provide a more highly available experience for the users of Horizon. As new nodes check in, we get new IP addresses, and as Puppet runs, it fills that out and tells Horizon there are more memcached servers available. And as that data gets purged from PuppetDB when you remove nodes, the list shrinks. You don't have to go in and do any static configuration. So those are the key points of using Puppet together with OpenStack, from my experience.
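The slides with that code aren't reproduced here, but the pattern being described looks roughly like this. It's a sketch that assumes the dalen/puppetdbquery module's query_nodes() function and a hypothetical profile class name, not the presenter's actual code.

```puppet
# Ask PuppetDB for every node carrying the controller profile (the real
# query also filters on the production environment).
$controller_hosts = query_nodes('Class[profile::openstack::controller]')

# Hand that list to the RabbitMQ class so cluster membership tracks
# PuppetDB: new controllers join as they check in, removed ones age out.
class { 'rabbitmq':
  config_cluster => true,
  cluster_nodes  => $controller_hosts,
}

# Same idea for Horizon: look up controller IP addresses and point the
# dashboard at every memcached instance for shared session storage.
$memcache_ips = query_nodes('Class[profile::openstack::controller]', 'ipaddress')

class { 'horizon':
  secret_key      => 'not-the-real-secret',  # hypothetical
  cache_server_ip => $memcache_ips,
}
```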
So I'm going to hand it back over to Seth to talk more about NetApp. Thanks, Cody. All right — Cody talked about how Puppet actually works and how you put things together within a Puppet environment. I'm going to talk a little bit about how we're using it in the GEC and how we're rolling things out.

The first step when we want to stand up an OpenStack environment is to stand up our FlexPod. We stand the FlexPod up, we create VLANs for the instances, we create FlexVols for Cinder and for Nova, and then we FlexClone and assign a boot LUN. All of our compute nodes — all of our infrastructure, all of the servers — boot off of LUNs, and this is done without consuming any additional disk space: we're just cloning the boot LUNs and handing them out as part of assigning a service profile to the UCS blades. Everything is non-persistent and can be reassigned at will.

At that point, we feed the node into Puppet, and it gets its role from Puppet. We've got a number of roles defined within our manifests — web, load balancers, Keystone, the Galera DB, et cetera — so we assign a role to the new piece of equipment, and from there Puppet does its job; I'll show roughly what one of those roles looks like in a minute. It goes off and configures everything for us, and in a matter of minutes we're able to stand up a new OpenStack infrastructure. In the Juno timeframe, it took less than 90 minutes to deploy a 45-node infrastructure — a fairly rapid deployment process.

The next thing we had to do, going from Juno to the next release, was drive non-disruptive automation. Because we're global, there is no really good time to take an outage — and quite frankly, the days when you could schedule an outage for an environment are pretty much over. Non-disruptive operations was one of the keystones, one of the founding features, of the GEC; we needed to provide a seamless user experience. And if you recall, earlier I talked about why we did regions — this was one of the reasons. We wanted no single points of failure, and we wanted to be able to roll through upgrades.

The first thing we upgrade is the shared services, and we do that serially. We upgrade each of the Keystone and Horizon instances one at a time, so that at all times at least one of them is up and running. That way, when users are still hitting the portal and asking for new instances, we're not dropping any of those requests — we're just servicing them from fewer nodes. The next step is to upgrade the controllers, and again we do this region by region. We start with region one and upgrade its controllers, which means region two, region three, region N are still available to service requests. Remember that in this case, all the regions are within one site. And finally, we upgrade the compute, and this is the piece that takes the longest — not because the compute upgrade itself takes long; Puppet makes that very easy. The biggest consumer of time is live-migrating virtual machines off of nodes, moving them around between compute nodes so that we can upgrade them in a serial fashion. That was, again, one of the reasons we had to do regions: we needed to group our compute together so we didn't have disparate CPU resources, which made it easy for live migration to succeed. The key is, by the time we're done with this — and I've got a slide that talks about how long it actually took in each of these sites — there was zero interruption to service. We never had a virtual machine down, and we never had a time when engineers couldn't request resources.

Under the auspices of globalizing OpenStack, this all started in our North Carolina data center. That's our largest engineering site, and it's what we call site one. In site one, we have five regions, 75 compute nodes, and around 6,000 VMs, and that's where we keep our core Puppet.
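For context, the role I mentioned a minute ago is typically just a Puppet class that pulls in a set of profiles, with nodes classified onto roles. Here's a minimal, hypothetical sketch of that pattern — the class names and host-name patterns are illustrative, not the actual GEC manifests.

```puppet
# A role is a thin wrapper around the profiles that make up one kind of box.
class role::openstack::controller {
  include profile::base
  include profile::openstack::keystone
  include profile::openstack::galera
  include profile::openstack::controller
}

class role::openstack::compute {
  include profile::base
  include profile::openstack::nova_compute
}

# site.pp: a freshly booted node matches on its hostname, picks up its role,
# and Puppet configures everything on the next agent run.
node /^ctl\d+/  { include role::openstack::controller }
node /^comp\d+/ { include role::openstack::compute }
```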
All the development of the Puppet manifests and the YAML data that define what Puppet is going to do starts there. Each of the remaining sites also has its own Puppet master that pulls from the core Puppet, so we've centralized the work and then distributed it out to each site, keeping it local so agents don't have to pull across the network. We've got North Carolina, California, and Bangalore, and each of those sites has multiple regions — although Bangalore only has one, just because it was the smallest and, at the time of this slide, had just been stood up.

I said we had a slide on the time to upgrade. Bangalore, being the smallest site, took about an hour to do a complete upgrade — and this was going from Juno to Liberty. So about an hour in Bangalore, an hour and a half in California, two hours in site three at RTP, and about four hours in site four. Again, that last one is our biggest deployment: 6,000 VMs and 86 nodes, four hours to upgrade end to end. We budgeted about double the time for each of these upgrades, just to be safe and to set expectations properly.

As far as how frequently we do these upgrades, the cycle right now is this: the release candidate for a new version of OpenStack comes out and we put it into development. When the GA is released, we wait about a week and really start banging on it to see if it's something we want to go to. If it is, about two weeks after GA we move it into a staging site — that's where we prove out that the entire process works and find the things that need to be modified in the manifests, the things we need Puppet to change for us to make the upgrade transparent. Then, about six weeks after GA, we hit the go button and start pushing it out to the different sites.

So, some lessons learned. The Kilo-to-Liberty update was a lot easier than earlier versions — earlier versions were just hard — and that's a testament to OpenStack maturing. With each release it's gotten easier and more stable to do these live upgrades rather than a rebuild. It was really important for us to set expectations with our community; if you don't set expectations properly, you're going to have unhappy consumers. In each of these phases, we've taken a crawl-walk-run mentality. I would say we've fallen down a couple of times, gotten back up, and rolled that experience back into the process to make it better. Most of the falling down and getting back up happened in the staging environment, so we haven't inconvenienced the end-user community too much. But it's important to have those experiences, internalize them, and figure out what changes you need to make so the experience is better each time. It is still complex — there are a lot of moving parts — but it is improving. And not to push any kind of agenda, but we are a storage vendor, and we feel the capabilities of NetApp storage really did benefit us, just in the non-disruptive nature of the storage and the easy scaling.

So, some advice — not that I wasn't giving it before. Make sure you test in a staging environment before you hit deploy; there's nothing worse than blowing up a production environment and not being able to recover it quickly. Read through the release notes. I can give you one example of something that maybe wasn't all that clearly documented at first: we found a change in the live migration settings that was only allowing us to live-migrate one virtual machine at a time. Obviously, with 6,000 virtual machines, that would have taken an eternity. We uncovered that in the staging environment and were able to make the change.
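For a sense of what that kind of fix looks like once you find it, here's a minimal sketch of managing a nova.conf option through Puppet, using the nova_config resource from the puppet-nova module. The option shown, max_concurrent_live_migrations, is one plausible knob for live-migration concurrency; it may not be the exact setting the GEC changed, and the value is hypothetical.

```puppet
# Raise the number of simultaneous outgoing live migrations per compute host.
nova_config { 'DEFAULT/max_concurrent_live_migrations':
  value => '4',   # hypothetical; the upstream default is 1
}
```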
As far as the team goes, remember the GEC in total has about 120 folks working in it, and only about eight or nine of them are actually supporting OpenStack. That means the folks in each of the local sites maybe aren't experts in OpenStack, so we needed to mentor them: let them shadow the folks in RTP who had the most experience with OpenStack and with pushing out these changes, mentor them, and then let them do the work in their local sites. And finally, because this is international, be kind to those people. Just because it's a convenient time in North Carolina doesn't mean it's a convenient time in Bangalore — take that into consideration.

So, next steps. What we talked about was GEC version 1; we're currently working on version 2. One of the changes we're planning is a move away from region-based HA. We did not initially go to availability zones because they weren't really mature at that point. They have matured now, so we're going to move to an availability-zone model, which will let us manage each site as one OpenStack deployment instead of each region being its own OpenStack space. We're going to continue to group our compute nodes within the availability zones, again to make live migrations easy: older compute will go in AZ1 and newer compute into AZ2, so we don't run into problems there.
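As a rough illustration of that grouping, here's a sketch of defining host aggregates tied to availability zones in Puppet. It assumes the nova_aggregate type from the puppet-nova module; the aggregate names, zone names, and host lists are hypothetical.

```puppet
# Group older, CPU-compatible compute nodes into one availability zone...
nova_aggregate { 'older-compute':
  ensure            => present,
  availability_zone => 'az1',
  hosts             => ['comp01', 'comp02'],
}

# ...and newer compute into another, so live migration stays within
# like-for-like hardware.
nova_aggregate { 'newer-compute':
  ensure            => present,
  availability_zone => 'az2',
  hosts             => ['comp11', 'comp12'],
}
```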
We're also going to add the ability for consumers to provision additional volumes themselves. Today, when you set up your request, you'd better have chosen everything you wanted, because if you didn't, you have to go back and start over. One of the things our engineers wanted was to be able to add storage after the fact, and that's one of the things we're going to add to the dashboard. And finally, we're investigating integrating some additional OpenStack projects. Manila is absolutely happening — Manila is file shares as a service, and that ties back into the previous point. Database as a service with Trove: we've had some requests for that, so we're thinking about it. And Kolla: we're thinking about allowing our consumers to request their own OpenStack environments and providing that through Kolla.

Finally, other resources and collateral. Earlier today, one of my peers, Kevin, did a great presentation on the use cases for the GEC — why the GEC, what we use it for, why we wanted to do it. I highly recommend that once it's posted to the website, you go back and take a look at it. Also, the session on going from DevOps to planned ops — managing to maturity — was a really great session that I sat through and really enjoyed. And netapp.io is our landing page, our one-stop shop for anything open source. All of the deployment guides, blogs, and access to the Slack community are there, available to customers and non-customers alike, so please feel free to go check it out.

And let's see — final key takeaways. This is really just summing everything up. The biggest thing is to have a good foundation and then set your expectations properly. If you do those two things, you're going to have a much more successful deployment. And I think we got it in under the wire. Any questions? All right, everybody's still in a lunch coma?

Well, we're doing live migrations within the regions. We've grouped the compute together in the regions to ensure that live migrations would work. The other purpose behind the regions was that we never wanted a single point of failure with the controllers, so having a controller within each region, and another region with another controller, allowed us to pseudo-provide HA that way.

No — all the services for a region were within the region, with the exception of region zero, which was Horizon and Keystone. That sat in its own region at the top, behind load balancers, because those services could easily be made HA; some of the other services didn't have those capabilities in the earlier releases.

Any other questions? Anything for Cody? Thank you very much. Everybody get home safe, and hopefully this was worth your time.