Hello and welcome. My name is Jeff Kite and I work for Hewlett Packard Enterprise in our professional services organization, which focuses on cloud solutions and platforms. I wanted to share an interesting case study I've recently been working on, where we were asked whether we could take something complex, like installing OpenStack, and fully automate it so it could be installed in the factory. As part of that, there were certain constraints: we had to work with certain sets of tools that would be available to us. So this talk focuses on that journey and some of the things we did.

If you step back and look at what it takes: you have disparate pieces of hardware, all the physical rack-and-stack, all the cabling to connect it together, and then you have to make sure that infrastructure is at the correct firmware levels and so on. We wanted to take that physical provisioning, plus the configuration needed to get ready for OpenStack, and put it into an automated process so we didn't have to have an expensive person sitting there the whole time babysitting the installation. That's the background.

This slide covers some of the deployment challenges I mentioned. We wanted to take hardware that was physically connected together, with no other configuration done, and use automation to deploy all of the infrastructure configuration as well as OpenStack. I divided this into phases. First, the physical integration, which was done by another part of the factory. Then the infrastructure configuration: we wanted to configure the top-of-rack switches, both the data switches and the management switches, as well as the server's integrated Lights-Out (iLO) management processor — anything IPMI-like. We wanted the server BIOS to be consistent, and we also wanted to configure the storage arrays in the servers themselves. After that, we wanted a certain level of testing to make sure the network was correct and that everything was consistent between all the different nodes — that's the infrastructure testing and validation — and then we'd deploy OpenStack itself. In this case we were using Helion OpenStack, whose installation was already Ansible-based. Finally, we wanted to run a set of tests and validation against the installation to make sure OpenStack was functioning.

We looked at the existing tools the factory had access to — the same ones our customers in the field have — and I realized I didn't want to go too far outside their realm of experience in case they had to do any troubleshooting. So I selected pre-existing tools that could take XML configuration files and load them onto the servers before other interfaces were configured; those are listed here for completeness. One notable thing I found out: whenever our high-performance compute teams put together very large clusters, they use a test and validation suite that the factory was very familiar with, so I used that for test and validation.
All the other pieces were primarily open source: using Cobbler instead of something like OneView, and using Vagrant to spin up the management VMs that would actually do the work, so we could get some scale out of this. We also took advantage of things built into Ansible, like Jinja2 templating. Because I wanted to be able to scale this, the platform was instantiated on a VM host connected to the top-of-rack switches.

At a high level, I knew I wanted to easily take customer input in and get it into Ansible configuration files. We were already using Excel spreadsheets for part of the customer integration experience, so I came up with a spreadsheet where I could dump those values out into the same basic format that Ansible requires for its hosts file — and this is an example of that. As part of our customer intent documentation, I wanted this to be fully configurable and variable in terms of which networks I'm implementing, the number of servers, and the number of data disks per server. This particular one was focused on what we call the Swift object store, so I also wanted to be able to use advanced configurations for that. Once again: the Excel spreadsheet generates the Ansible configuration file.

Originally I was just using bare-metal servers, but because this needed to scale, I decided to build images for Vagrant to spin up each set of infrastructure-deploy VMs as needed, so I could have multiple of these going simultaneously. I took the ClusterTest environment — basically a Red Hat-based environment with the test and validation suite on top of it — and the Cobbler that was already being used in Helion OpenStack, and made Packer images of both so we could use them with Vagrant. Then I customized the Vagrantfiles with Jinja2, so I could have custom networking schemes or handle variations between the sites where this would run. So we basically vagrant up the ClusterTest VM, do some additional network configuration, and then bring up the Cobbler VM as well. This is just an example of the kind of Jinja2 logic you can put into any file you're templating.

So: you have a whole lot of hardware, and nothing's configured yet. The first thing I had to do was configure my top-of-rack switches. I knew I wanted to take a pair of 10-gig switches, for example, and be consistent in how I connected the servers into them, so I could use templating for these files. If you look at the bottom left-hand side, I made it a requirement that we would wire these the same way every single time we pushed them through the factory, with a variable number of object servers as well as PAC (Swift proxy/account/container) servers. Then I had the possibility of generating two configurations: one where you take the two switches and link them together in high-availability mode — we call that IRF — and one where we keep them distinct. But primarily we're using them in IRF mode. So we push that out.
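To give a feel for the kind of Jinja2 logic involved, here is a minimal sketch of a top-of-rack switch config template, rendered through Ansible's template module. The variable names (irf_mode, vlans, object_servers group) and the Comware port numbering are assumptions for illustration, not the actual factory templates:

```jinja2
{# tor-switch.cfg.j2 — illustrative fragment only; variable names and
   port numbering are assumptions, not the production template. #}
{% if irf_mode %}
irf member 1 priority 32
irf-port 1/1
 port group interface Ten-GigabitEthernet1/0/48
{% endif %}
{% for vlan in vlans %}
vlan {{ vlan.id }}
 name {{ vlan.name }}
{% endfor %}
{% for server in groups['object_servers'] %}
{# one server per port, in inventory order: the fixed wiring scheme #}
interface Ten-GigabitEthernet1/0/{{ loop.index }}
 port access vlan {{ hostvars[server].data_vlan }}
{% endfor %}
```

Because the wiring order is fixed every time the rack goes through the factory, the loop index can double as the physical port number, which is what makes this templating approach work.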
Now, I could have automated actually loading this configuration up into the switches, but it made the people who would be running this really nervous. They wanted to manually copy it up to the switch as a file, then copy it to the startup configuration file, which makes sense. So this was a semi-automatic step, so to speak. But once again, we dump out the configuration and use Jinja2 to customize it for things like IRF — which is at the top here — as well as the VLAN definitions, and we put some logic in there as well. For example, if your management processor network was the same as your OpenStack management network, you had some flexibility to make sure the switch configuration would still come up correctly. With this step, we move the file up, reboot the switches, and now they're configured as expected.

That allows me to do some discovery, because at this point nothing else knows what its IP addresses are. What I really want to do next is boot into the ClusterTest environment so I can use tools with commands that will change the management processor on each of the individual servers, apply BIOS settings, or configure the storage arrays. This is an example: once I boot into the ClusterTest environment, I use Ansible, with that same exact hosts file, to go out to each node in the cluster and apply different XML files I created to do things like change the network settings and hostname, add an administrative user and password, or change that administrative password if the user already exists — basically playbooks and logic that drive each of these steps forward.

If you think about it, there are two IP address schemes over the course of this automation. One is in the ClusterTest environment, which is consistent every single time — that's why I can do the configurations consistently at this stage of the game — and there's another when we're actually installing OpenStack itself.

In a similar way, once iLO is configured, I can take a captured BIOS XML file and apply it consistently based on the server model. I know which model each server is and which role it plays. One thing I forgot to mention: when I first turn the servers on, before configuring iLO in the previous step, I have a script that runs against the network switches and looks at all the ports to see which MAC addresses are coming up on them — I'm only looking at port 1 on each server and how they're attached. I grab those MAC addresses so I can PXE boot into the ClusterTest environment. Later on I'll be able to discover these servers through the RESTful API, now that the management processor has an IP address. But for these particular steps we decided to keep using HP's conrep tool: you configure one server exactly the way you want it, use conrep to dump the XML out, and then I upload that XML file based on the server type and role. This is just an example of that.

The server storage RAID controllers were a little more interesting, because I really wanted them to be variable.
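As a rough illustration of that per-node configuration step, here is a minimal sketch of a playbook that renders an XML file and applies it to the management processor. It assumes the nodes are booted into the ClusterTest image and that an hponcfg-style utility applies the XML locally; the file names, template, and variables are illustrative, not the playbooks from the talk:

```yaml
# ilo-config.yml — a minimal sketch, not the actual playbook.
# Assumes an hponcfg-style utility is available on each node and that
# the XML template carries the per-node network/hostname/user settings.
- hosts: all
  gather_facts: false
  tasks:
    - name: Render the per-node iLO settings XML (network, hostname, admin user)
      template:
        src: templates/ilo-settings.xml.j2
        dest: /tmp/ilo-settings.xml

    - name: Apply the XML to the management processor
      become: true
      command: hponcfg -f /tmp/ilo-settings.xml
```

Because the playbook is driven by the same hosts file generated from the customer spreadsheet, each node gets its own rendered values without any per-node manual work.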
I knew this would vary even more from customer to customer. So one thing I do is recursively discover, through the RESTful API that Hewlett-Packard has on the management processor, the array controller information. The responses contain further URLs that point to other parts of the configuration: you start at the top and find out how many controllers the server has, follow a link to each of those controllers to get more information, then go down and grab the JSON from each controller to find out which disks sit underneath it.

So we're doing a bit more logic here. We get the inventory information from the management processor itself, and I build two kinds of lists: the boot disks I want to configure, and the data disks for the object store. Later on I realized I also wanted to produce some as-built documentation, so I build a new array of information containing the specific elements I need to discover, pull out, and put in the as-built document. In this particular implementation we map each of the disks based on the disk's model type; other options were to do it based on the location in the server or the size of the disks — we played around with some different variations — but that's how we separate them into boot and data disks.

We then take the Smart Array storage controllers, clear out the configuration, and, knowing which are the boot disks, build a RAID 1 set for those. Some of the servers I'm using may have up to three storage array controllers, but we already captured that when we got the information from the RESTful API a little earlier in the process. We then create RAID 0 volumes for each of the data disks, because Swift is going to manage those itself. We still want to use the array controllers because that lets you activate things like full-disk encryption — Swift object encryption is only just appearing now. So we recursively go through the data disk set and build all of those volumes.

Once this is done, all of the hardware is in the state I want before installing OpenStack, and we move into the next phase. First, I want to test this to make sure it's good. The ClusterTest suite I got from our high-performance compute folks has this neat thing called cluster consistency. This screenshot is from before the cluster was consistent — I wanted you to see the red items on the list — but you refresh it and it checks the BIOS, the storage, the memory, all these different elements, to make sure the cluster is consistent and equivalent across the different types of nodes in the environment. I can also run network tests and performance tests in here, or just double-check that the VLANs are set up right on the top-of-rack switches and consistent among all the nodes in the cloud.

So far we've been running all these commands from the ClusterTest VM with Ansible playbooks.
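Here is a minimal sketch of that discovery and boot/data split. The iLO RESTful API resource path is the documented SmartStorage location on iLO 4, but the walk of the controller links is compressed into a comment, and the boot-disk model string is a hypothetical value, not from the talk:

```yaml
# disk-discovery.yml — illustrative sketch; the API paths, variables,
# and boot_disk_model value are assumptions, not the production playbook.
- hosts: all
  gather_facts: false
  vars:
    boot_disk_model: "EG0900FBVFQ"   # hypothetical model string tagging boot disks
  tasks:
    - name: Get the array controller collection from the management processor
      uri:
        url: "https://{{ ilo_address }}/rest/v1/Systems/1/SmartStorage/ArrayControllers"
        user: "{{ ilo_user }}"
        password: "{{ ilo_password }}"
        validate_certs: false
        return_content: true
      register: controllers
      delegate_to: localhost
      # A full playbook would follow the hrefs in this collection down to
      # each controller's DiskDrives and register the combined list as `disks`.

    - name: Split the discovered disks into boot and data sets by model
      set_fact:
        boot_disks: "{{ disks | selectattr('Model', 'equalto', boot_disk_model) | list }}"
        data_disks: "{{ disks | rejectattr('Model', 'equalto', boot_disk_model) | list }}"
```

The same registered JSON can then feed both the RAID configuration tasks and the as-built document, so the inventory is only gathered once.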
Now I'm moving to a new phase, where I want to image the deployer node for OpenStack. I have to shut down the DHCP server that's been running in ClusterTest, and then I use the Cobbler that's built into Helion OpenStack to create that very first node. There's a Cobbler that sits outside the environment — just another VM — and I use it to image the first node in the cluster. I use the RESTful API to grab things like the MAC addresses so I can build the Cobbler configuration files, and then I install that first node, the deployer node, via Ansible: I run a script that says "image the HLM node."

Once that's done, I need to go in and configure it so that node can image everything else in the environment. There are some housekeeping things to do: we run an SSH key scan so we aren't prompted for certain passwords; instead of the ClusterTest IP addresses, we redirect to the ones the OpenStack deployment is going to use — still, once again, that same exact host configuration file; and we update the SSH keys. I have some scripts in the environment that give me added levels of security, and some scripts I might want to run later to do things like test performance in the environment — things included in basic OpenStack. I set all of those up and run through the initialization process for the Lifecycle Manager.

Next I want to alter the model for the Swift object store. This is basically just updating configuration files, but it's a little different in Helion OpenStack in that it's a model you compile, with revision checking through Git so you have a revision control system. We go through that process, install the model, and get the MAC addresses from the remaining nodes — I also generate customized security certificates — then use the Jinja2 templates to fill things in and commit the model.

At that point, we're ready to deploy OpenStack. I run another playbook, still from the ClusterTest node, which goes out and runs the playbook on the Lifecycle Manager. We make sure we wipe the disks before the installation, we deploy the cloud, we patch it if any patches have come out since then, and we configure the CPUs and any other customizations we need for performance.

After this, I run the OpenStack Tempest tests targeted at the type of solution that's there. There are additional functional tests for certain services — I think the Swift developers like to use their own set of functional tests instead of the Tempest ones — and some performance tests I run to make sure it's operating in the expected range of behavior. Every time I run those tests, I keep a history of the reports and generate an as-built document. I use a combination, once again, of Jinja2 with Markdown templates, then turn the Markdown into HTML so you can either display it in the installation environment or convert it to PDF to be delivered or printed, whatever needs to be done. So, that was a lot.
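A minimal sketch of that as-built generation step is below. The template name and the use of pandoc for the Markdown-to-HTML conversion are assumptions — the talk doesn't name the converter:

```yaml
# as-built.yml — illustrative sketch; template names, gathered facts,
# and the pandoc conversion are assumptions, not the actual tooling.
- hosts: localhost
  gather_facts: false
  vars:
    report: "reports/as-built-{{ lookup('pipe', 'date +%F') }}"
  tasks:
    - name: Render the as-built Markdown from the discovered inventory facts
      template:
        src: templates/as-built.md.j2   # Markdown body with Jinja2 placeholders
        dest: "{{ report }}.md"

    - name: Convert the Markdown to HTML for display or later PDF conversion
      command: pandoc {{ report }}.md -o {{ report }}.html
```

Dating the report file name is one simple way to keep the history of runs the talk mentions, with each pass through the factory leaving its own document behind.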
So with this particular automation, I can literally go from the point where the network switches are configured and rebooted and make it one button from there on, or I can split it up into different phases. Right now I have it in different phases, because the experience level of the people who will be running this day to day is more entry-level — they don't necessarily understand OpenStack — and the phases let them have some confidence in it. With the Ansible playbooks being idempotent anyway, if something fails we just have them run that command again, and that's usually how we go through the troubleshooting process. In the scripts, we wound up doing a bit more checking than I normally would, to make sure people weren't making mistakes with things like the input file. But that's basically our journey. Are there any questions? I assume we're supposed to use this.

Q: We do a very similar thing — we made a lot of similar choices to yours for deploying in our factories and also for our customers. One thing we had to do, which I'm curious if you had to do, was build a fairly large array of mirrors, because often where we were doing the installs in the factory there was limited to no internet access. Did you have issues with that, or was that dealt with at the Helion install level?

A: We have four integration centers around the world with different levels of connectivity, but all of those are fairly consistent. I also wanted a version of this where a consultant could just take their laptop, spin up the ClusterTest VM and the Cobbler VM, and — if they had to quickly reinstall in the field — use the same input file. That hasn't been done yet, so we might hit some of those issues there.

Q: And as far as organizational memory — being able to keep a record of what was done?

A: In the factory, one thing that will always happen is we keep a record of the as-built document, which has all the serial numbers, the configuration, all the disks, and all the individual pieces as well.

Q: And when you were PXE booting, were you installing from an image itself, or doing an Ansible-driven kind of install?

A: There were two levels of PXE booting. The first was PXE booting into the ClusterTest environment, which is a diskless environment with a shared root across all the different nodes. That one's used a lot in the factory because they don't want to put anything on the disks unless they actually have to. That environment was discovered through the network switches, but the servers would have to be PXE booting for me to discover them, right? We did find out — I guess one time the factory reversed the cables on all the servers and we had nothing — so we had to write a little more documentation for them on that. The second time around, now that I have access to the management processors, I have Ansible discover those as part of the playbook tasks.
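For the first, switch-based level of discovery, something like the following sketch could pull the MAC addresses off the top-of-rack switch ports. The Comware command, port naming, and filtering are assumptions — the talk's actual script isn't shown:

```yaml
# mac-discovery.yml — illustrative sketch of the switch-port MAC scan;
# the command, port prefix, and parsing are assumptions.
- hosts: tor_switches
  gather_facts: false
  tasks:
    - name: Read the MAC address table from the top-of-rack switch
      raw: display mac-address
      register: mac_table

    - name: Keep the entries on server-facing ports for the PXE configuration
      set_fact:
        server_port_macs: "{{ mac_table.stdout_lines
                              | select('search', 'GE1/0/')
                              | list }}"
```

Because the wiring is fixed, the switch port an address appears on also tells you which physical server it belongs to — which is also why the reversed-cables incident broke the discovery.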
So then I grab that from a PXE boot and put it in the equivalent of our servers YAML file, so that when we do the Cobbler deploy it pulls all that information in, and when we do the bare-metal re-image it does that as well. Thanks.

Thank you. Other questions?

Q: This is more for swapping out servers and whatnot, on the hardware side: is there the ability to validate firmware levels and adjust them as needed, so it's consistent across your network?

A: Right. In the factory there is, because when I boot into that ClusterTest environment, firmware revisions are one of the things it discovers. So right there, if this winds up being all green, I know I'm good. Firmware is one of those checks — it even looks at the firmware on the disks themselves and on all the different components — and it highlights anything that's different, and we'd address it then. In the field, what I'd probably do is discover all of that through the firmware part of the RESTful API, which I could also query to make sure everything was consistent. Now that I'm thinking about it, that's probably a really good idea, because we have to allow for what's going on in the factory and also what's going on in the field. As you mentioned, if you have to replace a server, many of the components will be different.

Q: Do you have any issues with IPv6, or is that something you haven't dealt with yet?

A: I believe the revision of OpenStack I was using here was all IPv4. I think there were certain networks that were IPv6 capable, but I haven't had to deal with that, although I do have a customer asking for it right now.

Q: And that view you have there — that's not HP OneView, is it?

A: No, no. I'm not even sure it's a formal product — I'd have to check on that — but it's out of our high-performance compute team. It's a package they call ClusterTest, and I was using it because the factory uses it. However, I have used OneView to do these types of things as well; it's just that the factory didn't want to import all of these servers into a OneView instance and do all that type of update, so to speak. That's why I had to use a different set of tools. But this ClusterTest environment, because it boots up diskless and has all of the configuration tools built into it, enabled me to do a lot of those things. In my next revision, as we're looking to the next generation, I'd probably do a lot more with configuration through the RESTful API — you can POST as well as GET information — so I might evolve it to that as well.

Thank you. All right, any other questions? I hope this was useful to you. Thanks for coming.