Test, all right. Let's go ahead and get started. Welcome to the session, and welcome to the last day of the summit. Hopefully you've been having a lot of fun and learning some stuff along the way. No doubt you went to at least a couple of sessions on containers; I hear there may have been a few of those this week. This is Building an Engineering IaaS Cloud to Develop Enterprise Software at NetApp. That's quite a mouthful. Essentially, this is the journey we took building our private cloud internally at NetApp for our engineering group, and how that's been helping us with our DevTest process. My name is Kevin Lambrite. I'm a senior cloud architect in our engineering shared services group. Let's see if we can get this to work.

All right, real quickly on the agenda: I'm just going to do a brief history of the private cloud that we've built and a little bit of how we came to OpenStack and put OpenStack in that environment. We're really going to talk about use cases around this cloud and how it's helping to transform our DevTest process. Then we're going to shift gears a little bit and talk about a new service that we just introduced two weeks ago: our software-defined storage on demand. Then I'll give a brief demo and talk quickly about where we're going in the future.

All right. So real quickly, a little bit about the ECIS organization. As I said, this is the engineering shared infrastructure services group, essentially our internal IT organization for engineering. We build hardware, software and virtualized solutions for our engineering community. Did I get this thing to work? We'll go manual. We currently operate in nine R&D labs throughout the world and serve a user base of a little over 5,000 people. And of the roughly 130 people that work in this organization, it's a relatively small team of eight people that supports OpenStack, both on the architecture side as well as the ops side. We are part of what we call our customer zero program. That's essentially, like I said, building services that use our own hardware and software products. We'll often take a lot of pre-release software before it's out there, deploy it in our data centers much in the same way our customers would, and try to build out solutions architecturally the way our customers would.

So back in 2013, we set out to build a private cloud to help accelerate our engineers and make them more efficient, essentially. Prior to this, it was your standard story: if you wanted a VM, you had to file a ticket with IT, you had to give justification for why you needed it, and you had to wait two weeks, and maybe you would get one, maybe you wouldn't. So back in 2013, like I said, we set out to build this private cloud. It was initially built on VMware and Hyper-V, with capacity for somewhere between 500 and 1,000 VMs. Then in 2014, we added OpenStack to the mix. We also put together our own portal that spanned all of those hypervisors. Essentially, as an engineer, you go to this portal asking for services, asking for compute on demand, and you don't know which hypervisor that's going to be served up from. So today, we're at about 42,000 total VM capacity. And of that, our KVM OpenStack mix makes up about 36% of the overall, by hypervisor and host count, not necessarily by VM count. So at any given time, we've got roughly 5,300 active VMs on OpenStack, with capacity for 15,000, and that's changing.
It's increasing considerably; a year ago at the Austin Summit, our percentage of KVM OpenStack was roughly 15%. So year over year, we've more than doubled that, and we're on track to be at roughly 80% of our hypervisor count by the beginning of next year. So we've made a significant investment in OpenStack and we're increasing that investment considerably.

At the heart of this is our converged infrastructure solution that we put together with Cisco, called FlexPod. So it's Cisco UCS compute, Nexus top-of-rack switches, and then our own storage in the form of FAS, E-Series or, more recently, with the acquisition of SolidFire a year and a half ago, a SolidFire FlexPod solution. And we are running the Red Hat community version of OpenStack, RDO. We're sort of in transition right now: the slide says Liberty, but we're transitioning to a V2 architecture, and as part of that we're upgrading to Mitaka and then Newton. So we still have some nodes that are back on Liberty, but most of it is either on Mitaka or Newton at this point. In this session I don't go into a lot of detail on how we've done automated deployments and how we do seamless, non-destructive automated upgrades; there's a session at 1:30 this afternoon, a joint session with Puppet Labs, that is exclusively focused on a deeper dive into the architecture of this and how we have completely automated everything with Puppet.

So why OpenStack? Like I said, we built this cloud back in 2013 and it obviously started providing a lot of efficiency for engineers. They really started to like it, so we needed to scale this out, and we wanted to use OpenStack, to be quite honest with you, as a way of avoiding vendor lock-in. Same story as a lot of other companies. To be quite frank, we also wanted to reduce our enterprise license agreement with VMware. Additionally, NetApp has had quite a commitment in the OpenStack community since 2011, being Gold members of the OpenStack Foundation, and we have quite a few developers developing drivers for OpenStack, so we really wanted to start using it internally as well and build this out.

So I'm not going to dive deep into the overall architecture, but I did want to show this slide, sort of where we are. Like I said, we're in transition. We went to essentially a region-based architecture, and a region in this case is not necessarily a geographical region; a region for us is a failure domain. We have 15 compute nodes in each region, and we wanted to cap it at that, mostly because this is what our ops team was familiar with in our VMware environment, so we wanted roughly a similar architecture as we started to build out OpenStack. This was not a day-one architecture; there was a journey to get there, and the session this afternoon goes more into what that journey looked like. The other thing is we also moved Keystone and Horizon into their own separate region, because putting them all in the same stack was causing performance issues. So we've got one controller node per region and, like I said, 15 compute nodes.

All right, so let's really get to how we're using this thing. Why does it matter? How is it helping our engineering groups? As you can imagine, developing an enterprise storage product has a lot of challenges; it's fairly complex.
In this case I'm talking about our flagship storage operating system, ONTAP. We're a portfolio company, so the things I talk about here, the processes, are really for the bulk of the people in engineering that are working on ONTAP. We've got 25 years of code development. The code's not the same as it was; over 25 years we've obviously added a lot, and a lot of things have been refactored over time, but we still have that legacy with us. 26 million lines of code, 1,000 dev and QA engineers working on this across multiple releases, and traditionally long release cycles, anywhere from 18 to 24 months. Three major releases ago, in our 8.2 release, which was roughly five years ago, we had 64 branches for development. And you can imagine there's a lot of code in these different branches that never got merged together, never saw each other, with their different dependencies, for months at a time. So when it did get merged, not only would there be a lot of merge conflicts, but that would lead to a lot of bugs and, again, perpetuate this lengthy release cycle. As we had more features with more complexity, it snowballed and got even worse.

So obviously something needed to be done, and the very next major release we went to a single development branch and were able to reduce our cycle time down to 15 months, and we're still working on that; we're at about a six-month release cadence now. Let me talk about how we got there, and you're going to say, duh. Essentially what this boils down to is we went to a continuous integration test model that we introduced in 2012, and today you would say, of course, everyone does CI/CD, that's a no-brainer. But you have to understand where we came from. Like I said, 64 development branches, things being checked in all over the place. So that was a pretty major transformation for us.

Essentially what happens is we kick off a build every three hours and we run two hours of continuous integration tests. What really drives this whole thing is what we call a virtual test bed, and at the heart of that is our ONTAP simulator. That's our software-defined version of ONTAP that engineers use for development and test every single day, with a couple of clients thrown in to help drive load, because this is a storage system. We need to be able to scale those out very rapidly. If we tried to do this on hardware, it wouldn't work, not only because it would be very hardware-intensive, but because we couldn't scale as rapidly as we need to. And one of the reasons we need to scale is to be able to satisfy these two-hour tests running across 20 different teams.

But it's not just the continuous integration tests. That's sort of the bread and butter, and it's interesting in and of itself, but what really allows 1,000 people to work in this single code line is auto-bisect and auto-heal. Essentially, when a test does fail (and we've hardened the tests; we had a whole process for how we hardened the tests and how we introduced them into the code line), we start bisecting the code line around the check-ins from that window and spin up a whole other set of virtual test beds to go off and run tests against those. So it's an iterative process, but we need to be able to do this very quickly, and that's why I said we could only do this in a virtualized environment.
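To make that loop concrete, here is a minimal, hypothetical sketch of the auto-bisect and auto-heal logic. The callables build(), run_ci_tests() and spin_up_testbeds() are stand-ins for the real build system and virtual-test-bed tooling, not actual internal APIs; only the bisection bookkeeping is shown.

```python
# Hypothetical sketch of auto-bisect/auto-heal.  `checkins` is the ordered list
# of check-ins that landed since the last known-good build: the build without
# any of them is assumed good, the build with all of them is known bad.

def find_offending_checkin(checkins, build, run_ci_tests, spin_up_testbeds):
    """Binary-search the failing window for the first check-in that breaks CI."""
    lo, hi = 0, len(checkins)                  # invariant: [:lo] good, [:hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        testbeds = spin_up_testbeds()          # fresh simulators + load clients
        image = build(checkins[:mid])          # build with the first `mid` check-ins
        if run_ci_tests(image, testbeds):      # passes -> culprit is later
            lo = mid
        else:                                  # fails -> culprit is at or before mid
            hi = mid
    return checkins[lo]                        # first check-in whose build fails

def heal_line(checkins, build, run_ci_tests, spin_up_testbeds):
    """Back out offending check-ins and re-verify until the line is green again."""
    while not run_ci_tests(build(checkins), spin_up_testbeds()):
        bad = find_offending_checkin(checkins, build, run_ci_tests, spin_up_testbeds)
        checkins = [c for c in checkins if c != bad]   # remove it, rebuild, rerun
    return checkins                                    # line is healed
```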
And then we find the offending check-in, we remove that check-in, do another build, spin up more virtual test beds, and run it until we get a pass. Then the line is healed and we're good to go again, right? Now, I would love to stand here today and say this whole thing is powered by OpenStack, it's fantastic, it's really saved our entire process. That's not the case today. Maybe in a year or two we'll get there. Like I said, we introduced this new process back in 2012, and we didn't even have an OpenStack environment until late 2014. So as our global engineering cloud has grown, and as OpenStack has taken over more and more of it, this is an area that's ripe for rearchitecture.

All right, so now let's talk about the test side. And I say test, not QA, because this includes both developers and QA. We've driven a lot of test process back into our development organization, with unit tests, the continuous integration tests that I was talking about, and our developers are also starting to write and run a lot more automation as well. So this is an area that is much more entrenched in, and served up by, our global engineering cloud, and OpenStack is a big part of that.

Up on the upper left is what we call our shared compute services portal. This is the portal that engineers can go to and just get compute on demand. We're building a storage system, not applications, so what they typically need is just load-generating clients, or clients that have their tools baked in, either for development or test, or a test environment baked in so they can just go ahead and run their automated tests. And the reason I say flexible virtual compute here is because not only can they go there and get templates that are already predefined, of which we've got hundreds, they can layer their software and tools on top of that and then push that back into the global engineering cloud for reuse by other teams. Like I said, there are roughly eight people working on our OpenStack environment; if that small team had to take all of those, engineer them and build them into our OpenStack cloud, we'd be way behind what our engineers actually need.

Another area where it gets a lot of use is what we call our hybrid cloud lab. This is a part of our data center that we have extended out to public cloud, and we've got roughly 300 people that work in that lab today. That's growing; we have a lot more products that we're starting to build that work in the cloud. That is a smaller part of our OpenStack environment, but, like I said, growing, and all of the on-demand compute from there comes from OpenStack.

And by far our largest consumer of the global engineering cloud, and of OpenStack as part of that, is what we call our common test lab. This is where we've consolidated a lot of our hardware, and engineers go there and get hardware test beds, virtual test beds, or some combination thereof. This is where the majority of our automated tests get run. And that's another key part of how we were able to go from that lengthy release cycle of 18 to 24 months down to six months: not only having those continuous integration tests, but also higher-level automated tests that our QA group runs, to the tune of about 20,000 tests a month. And that uses quite a few VMs; we've got roughly a 6,000 VM turnover per day in that common test lab.
Today, the way we accommodate that is we do a bunch of pre-pooling of those VMs. They're mostly based on VMware, and we revert to snapshot when they're done. But with the advances we're making in our V2 architecture with OpenStack, we're able to boot those VMs on demand and offer them up faster than we can actually revert back from snapshot. Today we pre-pool about 15,000 of these to accommodate for spikes. So in about a month or so, we're going to take that 15,000, which represents a bunch of wasted capacity, convert it over to OpenStack, offer it up on demand, and still meet the 6,000 VM turnover per day.

All right, so let's shift gears a little bit and talk about enterprise-grade software-defined storage. This is something that we just introduced two weeks ago into our OpenStack environment. It's essentially software-defined storage on demand, our software-defined storage as a service. This is powered by our ONTAP Select product and, obviously, OpenStack. This is not a marketing pitch; if you want to know more about our software-defined storage, I've got links in the back here, or come talk to me and I'd be happy to. What I really want to do is just tell you a little bit about it so we can talk about why this is valuable, why we put it in our OpenStack environment on demand for engineers.

So essentially this is our ONTAP operating system running in a VM on commodity hardware, on your own hardware; no engineered hardware in this case. It's available on VMware ESX today, we've got a version coming out on KVM later this summer with our 9.2 release, and you can do either single or four-node clusters. We're working on the ability to do other cluster sizes as well; that's on our roadmap. Today it's direct-attached storage with disks in the host; also on our roadmap is array storage. And you can get this in different configurations depending on what your requirements are.

This is a common reference platform. It doesn't really matter who the vendor is; the solution is vendor-agnostic. The reason I show this is because it represents a test bed that we have in our common test lab: a four-node cluster. The key here is really that we run it in the same way that our customers would in this configuration, so it is one ONTAP Select instance per host. And while we have people that need to run it that way, because they're either doing development or testing the solution, that actually represents a fairly expensive test bed for us to get and maintain.

So what have we done to offset that? Essentially, we've made it available on demand as part of our OpenStack environment, and at the foundation of it is that same FlexPod configuration that I talked about at the beginning, with the addition of SolidFire, to be able to offer up different performance tiers to our engineers. We've got different classes of people that are using this. Again, these are the same configurations that we offer to our customers: you can get a single or a four-node cluster. It's completely dynamic, on-demand provisioning in roughly 20 minutes, and I'll show you in the demo (the demo is not 20 minutes) exactly how an engineer would go request one of these things. Like I said, in about 20 minutes you can have a four-node cluster; it's roughly seven to 10 minutes to get a single-node cluster. So that's pretty fast.
That reference architecture I showed on the previous slide, the turnaround for that is usually hours if not days, depending on what level it needs to be rebuilt from, the RAID controller on up, and what state the previous tenant left it in. So, hours to days versus 20 minutes. And the rest of it is the same OpenStack architecture that we have in the rest of our cloud: all KVM, RHEL 7.3, RDO.

On the performance tiers: if you're a developer, or you're just doing functional tests and you're not going to drive any load to it, then your instance is going to be backed by our FAS solution. However, if you're part of the system test team, or you're trying to do something that requires a little bit more performance and you're actually going to drive load to it, then your instances are going to end up on SolidFire with different QoS tiers. That's completely abstracted away from the user; they just say, I want this level of performance. It's a relatively small environment that we've put together for now, since the entire engineering team is not working on our software-defined product right now. We've got about 200 to 220 instances available in this environment, with the split weighted more toward the functional side.

So, what are the key benefits of this? It's really all about efficiency, VM density and lower cost, being able to provide that lower-cost test bed on demand. There's the rapid provisioning we talked about, and then being able to provide different performance tiers to engineers based on their different needs.

So, why does this matter? Why do we even care? A couple of reasons. First off, this represents the first software-defined storage offering that we've put in our cloud. Most of the other VMs we make available are relatively simple; they provide a tremendous amount of value today, but this really does represent a higher-level, more complex service that we've put together. And we're a portfolio company, so we've got a number of different storage offerings, a number of which also have software-defined versions. So this really provides a blueprint and a roadmap for how we could offer those up as part of our engineering cloud on OpenStack as well.

The other thing is it does provide a reference architecture for customers. Most customers would want to use our software-defined storage solution as underpinnings, maybe as part of your undercloud, as foundational services there. However, we have had some conversations with customers; there was one large service provider that had a fairly simple use case. They had a number of NFV templates they wanted to replicate across the different clouds they had in different geographies, all based on OpenStack, and they were interested in using ONTAP Select as a single node just for its SnapMirror capability, to be able to replicate those. So it has opened up a number of conversations, regardless of how customers do or don't end up using it.

And then the other thing is, as a side project of this, we created some heat templates. This is something that we want to make available on our NetApp GitHub page. We don't use them internally; we use the OpenStack APIs, and we put together a bunch of Python classes and scripts, but that's not something we'd really want to, or really could, share easily with other people. The heat templates really are.
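Coming back to those performance tiers, here's a hedged sketch of how a tier choice like that could be mapped onto pre-defined Cinder volume types, so the backend and QoS details stay hidden from the engineer. The volume-type names below are illustrative placeholders, not the actual type names in our environment.

```python
# Hypothetical mapping from the tier label an engineer picks in the portal to a
# Cinder volume type whose backend (FAS vs. SolidFire) and QoS spec are baked in.
import openstack

TIER_TO_VOLUME_TYPE = {
    "standard": "fas-standard",     # FAS-backed, functional test / no load
    "mid":      "solidfire-mid",    # SolidFire, mid QoS tier
    "high":     "solidfire-high",   # SolidFire, high QoS tier
}

def create_data_disk(conn, name, size_gib, tier):
    """Create one data disk on whatever backend the requested tier implies."""
    return conn.block_storage.create_volume(
        name=name,
        size=size_gib,
        volume_type=TIER_TO_VOLUME_TYPE[tier],
    )

# Example: the ten 10 GiB high-tier data disks picked in the demo
conn = openstack.connect()
disks = [create_data_disk(conn, f"select-data-{i}", 10, "high") for i in range(10)]
```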
All right, so I'm not going to go through this bullet by bullet; there's a whole other session that could be given on how we built this, how we architected it, and the challenges we had along the way. I just wanted to hit on a couple of things here. Part of this is that, when we started this project a little over a year ago, we didn't know what we didn't know. We understood what the requirements were, but we didn't necessarily know how to map the different components to OpenStack constructs.

Avoiding template sprawl: we've already got hundreds of templates anyhow, but think of this as bring-your-own-image, which is essentially what it is. Developers are working on this, they're building code in their sandbox, and then they point at this image and say, build me one of those software-defined clusters based on this image. That's something that can happen multiple times a day, spread across hundreds of engineers, and all of a sudden, within a week or so, you've got 2,000 templates. So we just decided, okay, forget it, we're not going to do templates at all. We create a bootable Cinder volume, we dd the image onto that Cinder volume, connect the rest of the Cinder volumes to the Nova instance, boot up, and away you go. Because of that, we're avoiding having to keep multiple copies of the image, and since none of these images are the same, you're not really going to be able to take advantage of cloning and image caching anyhow.

And then the other thing I'll hit on is the second-to-last bullet here: is this an infrastructure problem or a software-defined storage problem? Like I said, this is code that's under development by the ONTAP engineers. So when we do hit a problem, how do we dig in and figure out whether it's a problem in our OpenStack infrastructure or a problem because of some bug that was introduced in ONTAP? This is an evolving process; like I said, we just rolled this out two weeks ago. We have an ops checklist that they go through, and we're working jointly with our software-defined storage development team to work this out until we have a really solid process for figuring that out.

All right, way too much talking. Let's do a demo. What I'm going to show here is essentially going to our portal and requesting a four-node cluster. Like I said, this roughly takes 20 minutes; I've sped up parts of it considerably, but I just wanted to show you the process and, really, the simplicity that our engineers get. So what you've been staring at here for, hopefully, a few seconds is what we call our shared compute services portal. This is the web UI version, but everything that's done on the web you can also do via the API. Is this actually running? No. All right, sorry about that. Let's jump out of here; looks like this might be the longer version, so I'll have to skip through portions of it. Essentially, there's a separate link for what we call sDOT; that's our internal name for the software-defined product. I don't have any, so let's go ahead and create one. You're presented with this relatively simple form, and I'm not showing a damn thing. Yes. Sorry. So, it's a relatively simple form. You basically go in and say, this is what I want, this is the testing purpose. One or four nodes; let's go ahead and select four nodes. We only have this available in our RTP facility today.
And then I say, okay, how many data interfaces do I want? Let's go ahead and pick eight. I can personalize this a bit with the DNS name that's going to show up. And then this is really the crux of it right here: which performance tier do you want? Again, standard goes onto our FAS, and mid and high go onto the different QoS tiers on SolidFire. Let's go ahead and pick high. The other thing is you have to say how many data disks you want; let's go ahead and pick 10, at 10 gigs each. And then, this is the bring-your-own-image piece I was talking about: this is where the developer or QA engineer specifies the full path to the image, a KVM raw image. The last thing is we do a cluster health check afterwards, so you can decide whether or not you want to keep the cluster if it fails. Then you just press go and it's off and running.

Let me stop this real briefly. So we pressed go there. Essentially what that did was pass it off to our workflow engine in the background. That's the thing that spans the different hypervisors and can select different regions and different clusters. And essentially what it did is say, okay, this is only available in our OpenStack cluster, let me hand this off to the provisioning script that's going to do that. What that provisioning script does, like I said, is use the OpenStack APIs. It goes off and creates all of the Cinder volumes that it needs, based on the system volumes that are needed for this as well as the number of data disks that you asked for; each of those data disks is a different Cinder volume. It dynamically creates the ports that you need based on the number of data interfaces that you requested, creates the Nova instances, and ties it all together. And then I can just go look at the build logs and see what's happening there.

This is the part that takes 20 minutes; let's speed this up. Yep, I went a little too far. Essentially what happened was, as it went through and did the build, we also connect serial ports to it. Serial ports are very important for debugging as well as for jumping on the console, and what we do is present the telnet line to the end user, saying, this is how you actually get to the console. What I'm showing here is jumping onto that console while ONTAP is doing its boot-up process. This happens to be the first node, because that's the one that does the clustering, and that part itself takes about 10 minutes; we'll skip through that. When that's done, you can get on the cluster management port, which is essentially a virtual interface that spans all of the nodes in the cluster and can fail over. So what I'm showing here is just going through a couple of different commands showing that, yes, the four-node cluster is up and running, there are the different disks, and so on. And I'm showing here that, yes, it is indeed OpenStack; here's what the instances look like in the Horizon dashboard. We don't give end users the ability to log into Horizon itself; like I said, we have that abstracted behind the shared compute services portal. And then, because it's the same ONTAP, you can use all the same management tools, so I'm just showing here going into our on-box System Manager. It all looks much better with everything sped up. All right, so we're in; there's our four-node cluster.
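For a sense of what that provisioning script is doing under the covers, here is a hedged openstacksdk sketch of a single node: boot volume plus data volumes in Cinder, one Neutron port per data interface, then a Nova instance booted from the volume. All of the names, sizes, the flavor and the network below are illustrative assumptions, and the step where the engineer's raw image gets dd'd onto the boot volume happens out of band and isn't shown.

```python
# Hypothetical per-node provisioning flow (names, sizes, flavor and network are
# placeholders).  The real script dd's the engineer's raw ONTAP Select image
# onto the boot volume before the instance is started; that step is not shown.
import openstack

conn = openstack.connect()

# 1. Boot volume and the requested data disks (tiered via volume type).
boot_vol = conn.block_storage.create_volume(name="select-node1-boot", size=100)
data_vols = [
    conn.block_storage.create_volume(
        name=f"select-node1-data-{i}", size=10, volume_type="solidfire-high")
    for i in range(10)
]
for vol in [boot_vol] + data_vols:
    conn.block_storage.wait_for_status(vol, status="available")

# 2. One Neutron port per requested data interface (eight in the demo).
network = conn.network.find_network("select-data-net")
ports = [
    conn.network.create_port(network_id=network.id, name=f"select-node1-e0{i}")
    for i in range(8)
]

# 3. Nova instance booted from the boot volume, with all the ports attached.
flavor = conn.compute.find_flavor("select-node")
server = conn.compute.create_server(
    name="select-node1",
    flavor_id=flavor.id,
    networks=[{"port": p.id} for p in ports],
    block_device_mapping=[{
        "boot_index": "0",
        "uuid": boot_vol.id,
        "source_type": "volume",
        "destination_type": "volume",
        "delete_on_termination": False,
    }],
)
conn.compute.wait_for_server(server)

# 4. Attach the data disks; ONTAP Select sees each one as a data disk.
for vol in data_vols:
    conn.compute.create_volume_attachment(server, volume_id=vol.id)
```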
From there, you can go in and create the higher-level constructs: aggregates, storage virtual machines, volumes and all that good stuff. It is full-fledged ONTAP.

So where do we go from here? We've already given our engineers a lot of value with our global engineering cloud, with on-demand compute as well as that new software-defined storage as a service that we introduced two weeks ago, but we'd still like to offer some higher-level services. This is not a baked, set-in-stone list, but there are a couple of things we would like to add. Bare metal as a service: it's not a huge demand from our consumer base, but it is something we'd like to take a look at and see what we can provide with Ironic. Maybe database as a service; there are a couple of other things. And then Manila: we've got developers that write Manila drivers for our storage products, and that's something we'd like to be able to offer out to our engineers as well, file share as a service. Containers: like I said, there were so many sessions on containers this week. We have a number of groups that are now starting to work with containers, so we really want to get our heads around how we can provide centralized services for them. So containers as a service, as well as whether there are interesting architectural things we want to do, like putting the control-plane part of OpenStack into containers and containerizing that. Our V2 OpenStack architecture: the session this afternoon goes into that a little bit more, but essentially we're going to move from that region-based architecture that I showed to availability zones, and rather than limiting each of those to 15 compute nodes, that'll scale to hundreds of nodes.

And the other thing is, like I said, we're on track to be at 80% OpenStack KVM in our global engineering cloud by next January. That's going to reduce our enterprise license agreement with VMware; that's cool. More significantly, though, it represents our commitment to OpenStack. Our journey with OpenStack has been going for roughly two and a half years now. It was crawl, walk, run; we're certainly at the walking stage now, moving towards a run. But essentially, internally, for our private cloud, we're almost all in on OpenStack.

And then lastly, just a few resources here. I've got a couple of links if you want to know more about our software-defined storage solution. There is a pointer to the session last year at the Austin Summit that was a discussion of what we've done and how we built this out. But I would really recommend going to the session this afternoon at 1:30, where Seth Forgosh is going to talk about what we've done and how we built this out, more on the architectural side, and how we've automated the whole thing with Puppet. And then lastly, go to netapp.io for all things containers and OpenStack at NetApp. And that's it.