Let's get going. Welcome. We're going to be talking about maximizing your hardware by simulating a lot of your server workloads. I've got a catchy little tagline: are we building a fake cloud? Are we hyper-converging? What are we doing, and what does that really mean? This presentation was put together by myself, Ale Radoui, and Shannon. Unfortunately my counterparts are unable to join me on stage, but shout out to those guys.

So, who am I? I describe myself as someone who loves being challenged to create the impossible and productize it; those are some high-level things that cover who I am. My name is Kevin Carter. I work for Rackspace Private Cloud, and I've been with the organization for roughly six years now. For the last year and a half I was working with the group formerly known as the OSIC. My counterparts were Ale, an Intel engineer and a brilliant guy who worked on the server simulator piece, making it possible to deploy medium- and large-scale clouds without a huge hardware commitment, and Shannon, who doesn't like public speaking but is a brilliant engineer I've had the pleasure of working with. I don't see him in the crowd, but anyways.

A quick show of hands — not too many people here, I imagine most are still sleeping. How many people have been testing OpenStack with DevStack for a while now? Sure, everybody, maybe. And then using an AIO, an all-in-one — not DevStack, but some sort of deployment tool: Kolla, OpenStack-Ansible, OpenStack-Helm, et cetera? And how many people have done an entire deployment in a cabinet of gear — one cabinet, 22 nodes, a couple of switches, a load balancer, et cetera? Ten cabinets of gear? A hundred? That works. Anybody who's done a multi-node deployment of OpenStack knows there's a bit of pain that goes with it; you're dealing with interesting issues that aren't necessarily tested in the gate, so I'm just calling that out.

I call that the level of pain — where people are in the world of OpenStack. There's "no pain": it worked in DevStack, you hit a button, 70 minutes later it shows up and, oh my god, I have clouded, and it works. Don't reboot it, but it works. Then there's "many cabinets of gear," which isn't necessarily a lot of pain — it's just known pain. You know where things go wrong, where things fall off, what to look for, and how to troubleshoot it. When we were building the tooling for the simulator, we were really focused on the groups in between: it worked on a bunch of different hosts, or I have, let's say, less than ten cabinets of gear, maybe a little more. We called that "actual pain," because you're still finding out where the wheels fall off. Run it long enough and it eventually becomes known pain, but while you're getting there, you're in those two groups.

With that, the mission we were given — which we didn't have a choice to accept — was deploying large environments with minimal hardware commitments. We eventually built about a 500-node cloud on 50 physical servers, and we needed to be able to test that kind of workflow without actually buying 500 physical servers to go beat on. So, like I said, we were testing networking — I mentioned running multiple cabinets of gear.
You've got networks, everything coming out of interfaces — tagged interfaces, VLANs, et cetera. If you're doing everything on an all-in-one, it's all hairpinned in the kernel, and you're not actually testing some of the things you're going to run into over a longer period of time. You can do some of it with bridging and veth pairs, but it's not really a true test of how your cloud is going to run in production.

Orchestration: as I mentioned, I work at Rackspace on the OpenStack-Ansible project, so we use Ansible for all of our orchestration needs, and we're orchestrating at greater than one node. If you're running a bunch of bash scripts, or Puppet, or whatever the case may be, on one physical box, everything just runs through linearly and everything will be fine. But if you have many nodes, you have to orchestrate your services, your callbacks, your handlers, et cetera. So we needed to be able to test our tooling at greater than one node.

Parallel operations: once the cloud is up and running, you're going to run API calls — nova list — and it comes back and gives you some output, but that doesn't yield a lot of data about how well your cloud is running. We wanted to upload millions of objects into Swift, run a whole bunch of Nova commands, build a bunch of VMs, create flavors, delete flavors, upload images, do all of that at the exact same time, and then figure out what isn't responding well, what can be improved, what can be tuned, et cetera.

And the environment — this one is kind of a no-brainer, but if you're running your cloud for a long period of time, even if you're not using it, it's still doing things: sending messages back and forth, creating logs, filling disks, with replication going on in the background. Just running an environment for a long time yields usable data about how your deployment of OpenStack is going to run for the foreseeable future.

So why were we simulating our servers? Like the mission said, we wanted to develop scale testing — to test how OpenStack runs at 500 nodes without cells, in a single region, without host aggregates, et cetera. We wanted to empower our operations folks. The ops guys are the ones running these things; I'm just a developer kicking things around, and I get the pleasure of messing with production environments every once in a while when they need me. If they have a single piece of hardware on which they can run tests, mimic a customer environment, or try out a new piece of hardware — say they want to partner with a vendor, and the vendor claims they've solved world hunger — is that true? Can I validate it without actually spending $10 million in CAPEX?

Network stress testing — I've beaten on this a few times already, but networking, networking, networking. If you can't get networking to behave in your environment, you're going to have a really bad time, and that was something we really focused on with our tooling. And developers like myself — my IRC handle is cloudnull because I enjoy breaking things — want to be able to test and validate what we're doing before actually selling it to the organization and saying, yes, you should use this thing.
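To make the parallel-operations idea concrete, a rough sketch of that kind of simultaneous load might look like the following. This is not the OSIC test suite — the container name, image file, flavor values, and iteration counts are placeholders, and it assumes OpenStack CLI credentials have already been sourced.

    #!/usr/bin/env bash
    # Rough sketch: hit several OpenStack APIs at the same time against an
    # otherwise idle cloud. All names and counts here are illustrative.
    set -u

    # Continuous Swift uploads in the background.
    while true; do
        swift upload demo-container /var/log >/dev/null
    done &

    # Repeated flavor/image churn in the background.
    for i in $(seq 1 100); do
        openstack flavor create --ram 512 --vcpus 1 --disk 5 "scratch-$i" >/dev/null
        openstack flavor delete "scratch-$i"
        openstack image create --file cirros.img "scratch-image-$i" >/dev/null
    done &

    # Constant read load in the foreground.
    while true; do
        openstack server list >/dev/null
        sleep 1
    done

Watching which of those loops starts erroring or slowing down first is exactly the kind of data an idle cloud never gives you.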
If I want to be able to validate my testing, then a simulated environment, even on a single piece of hardware, gives me the ability to say: yes, I've run the code, I've executed it on multiple hosts, it works across all of my compute hosts or whatever the case may be, and it isn't causing major downtime over a two-week period. It gives you really good test feedback without, again, having to invest too deeply.

What this is not: just to call it out real quickly, the simulated-server effort is not a new production deployment model. It's for testing, development, operational tooling, capability testing, vetting hardware, vetting vendors, et cetera. It is not a new deployment capability, and it's not how you should run clouds in production. This is also not a new deployment project. We're not competing with TripleO; we don't want to compete with any of the other projects out there. We're simply taking off-the-shelf pieces and building an environment that looks, smells, and feels like production.

The core technologies of the server simulator are Ubuntu 14.04 and 16.04. In OpenStack-Ansible we now have CentOS 7 and SUSE coming, targeted at Leap 42 — I think we have a few Tumbleweed repos in there right now to make things work — but all of this testing was on 14.04 and 16.04. We use Cobbler, which is off the shelf and has been around forever and a day, plus libvirt, KVM, and Ansible. So it's pretty stock-standard. Most of this is powered by some dirty bash scripts, but the way it all comes together is through those core technologies.

And it wouldn't be a presentation of mine if I didn't show you an ASCII diagram: the host with its ethernet devices, everything on a bridge, and HAProxy, libvirt, Cobbler, and about 14 VMs running. We're building out an HA cluster with our infra nodes hosting all of our APIs, MariaDB with Galera, and RabbitMQ in a multi-master setup; all of the schedulers are there too. We have a deploy node — everything is orchestrated out of that single deploy node, which happens to be a VM. Two compute nodes give us the base capability of doing live migration between them: does this live migration technique work? Swift gets three nodes, each with about five drives, and a replication scheme of three replicas per object, so we're actually testing a production-like Swift environment with a replication network, et cetera. Cinder gets two nodes; both come up by default using logical volumes, but we also tested Ceph and a few other backends. And then a log node: all of these things generate logs, so we send everything over rsyslog and aggregate it there. You can use external tools — Elasticsearch, Logstash, Kibana — to visualize those logs, but they're all being collected on an actual logging node, which creates load and network latency in the environment and gives you a more production-like experience.

So where do we go from here? We've got a single multi-node AIO presenting roughly a dozen servers. If I were to kick one of these for you and give you SSH access, and you just poked around, it would look, smell, and feel like real servers.
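Conceptually, the glue underneath that diagram is just Linux bridges plus libvirt guests. The sketch below shows the general pattern, not the project's actual scripts — the interface name, bridge name, sizes, and image path are all placeholders.

    #!/usr/bin/env bash
    # Illustrative only: one bridge per network, VMs attached to the bridges
    # so they behave like racked servers. eth1, br-mgmt, and the disk path
    # are placeholder names.

    # Create a management bridge and enslave a physical NIC to it.
    ip link add name br-mgmt type bridge
    ip link set dev eth1 master br-mgmt
    ip link set dev br-mgmt up

    # Boot a guest that PXE boots from Cobbler and lands on that bridge,
    # just like a bare-metal infra node would.
    virt-install \
      --name infra1 \
      --memory 8192 --vcpus 4 \
      --disk path=/var/lib/libvirt/images/infra1.qcow2,size=60 \
      --network bridge=br-mgmt,model=virtio \
      --pxe --os-variant ubuntu16.04 --noautoconsole

Repeat that for each network (management, storage, replication, and so on) and each VM, and you end up with something that looks very much like a cabinet of gear from the inside.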
There's nothing in there that would give you the impression it isn't a production rack of gear that I'm running all of this on. So within the OSIC group, we decided to multi-node our multi-node AIO: we took 50 servers, all kicked with Ironic, and did the exact same thing on each of them. The Ironic deployment we used came out of the OSIC cloud and was all on a single flat network. But like I was saying, we're trying to test real, production-like environments, so we created a VXLAN mesh across all of those nodes, which gave us ten tagged VXLAN interfaces, and then we plugged all of that into our VMs and bridges, et cetera. Every host effectively becomes a switch: we're not running anything on the host except virtual machines, and we're deploying everything into virtual machines to build our large-scale clouds.

This is the simulator work — Ale primarily worked on the simulator. In the end, we were testing system management of greater than or equal to 500 compute nodes. We had a three-node control plane, Swift, Cinder, a Ceph cluster in the background, and a few other services all running at the same time, targeting 500 compute nodes. This is also using KVM with nested virtualization, so we're getting the same instructions in our virtualized compute nodes as we do on our physical ones. We're testing our whole technology stack — like I mentioned before, Cobbler, Ansible, libvirt, et cetera — and we needed to vet that stack. Is this going to work? Does it actually scale? We've run production deployments with it, but where do the wheels fall off, and how do we make it better for the community? With OpenStack at 500 nodes you have a ton of messaging going on — RabbitMQ just gets beaten on by your compute nodes over and over again — so we ended up needing to do a ton of tuning to make our RabbitMQ clusters work in a way that they weren't going to continually fall over. And we wanted proper failure scenarios: an environment where we could introduce human error, network latency, hardware problems, and a whole bunch of other things. So that's how we went about it.

The simulator architecture — I'll let you stare at that for a few minutes. It's black square boxes with other rectangles and dotted lines, but it's the exact same thing as my ASCII diagram from before: NICs to VLANs, except in our case, for our test, it was VXLANs going over a single physical network to virtual machines, simulated on up to 50 hosts. As I mentioned a second ago, we built out a single deployer, three controllers, three Ceph, two Swift, two networking, one logging, and 50 simulator hosts running 10 VMs each, all of which were compute nodes.

The hardest part of getting all of that to work was getting the VXLAN mesh up and running. What we did for that: in OpenStack, we have metadata, and as you provision your nodes in Ironic, you're putting a public key on everything. So we used your public key to generate some magic numbers, which then become your VXLAN group, and an individual user ends up with ten isolated VXLAN networks. That way we could keep handing out environments to other people.
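The mechanics of that per-user isolation can be sketched roughly as below: hash the deployer's public key down to a small "magic number," use it to pick multicast groups and VNIs, and build ten tagged VXLAN interfaces per host on top of the single flat network. This is a simplified reconstruction, not the original script — bond0, the key path, and the numeric ranges are assumptions.

    #!/usr/bin/env bash
    # Simplified reconstruction of the per-user VXLAN mesh idea.
    PUBKEY=~/.ssh/id_rsa.pub

    # Derive a small "magic number" from the public key so each user lands
    # in a distinct multicast group / VNI range.
    MAGIC=$(( 0x$(sha1sum "$PUBKEY" | cut -c1-4) % 200 ))

    for i in $(seq 0 9); do
        VNI=$(( (MAGIC * 10) + i ))

        # One VXLAN interface per virtual network, all riding the single
        # flat provider network on bond0.
        ip link add "vxlan${VNI}" type vxlan \
            id "${VNI}" group "239.0.${MAGIC}.${i}" dev bond0 dstport 4789
        ip link set "vxlan${VNI}" up

        # Bridge it so the libvirt guests can plug straight in.
        ip link add name "br-vxlan${VNI}" type bridge
        ip link set "vxlan${VNI}" master "br-vxlan${VNI}"
        ip link set "br-vxlan${VNI}" up
    done

Because the group addresses and VNIs are derived from each user's key, two people sharing the same flat Ironic network get meshes that never see each other's traffic.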
We had 250 nodes in Ironic total. We used 50 of them, and there were other groups using some allotment of nodes in there too, and we didn't want our traffic impeding on what they were doing. In Ironic we still had only a single flat network, which is where our entry points were, but all of the tests we did were over VXLAN. We built out our environments, again, using the OpenStack-Ansible code base, and eventually we did all of this on a total of 100 servers, but the main focus of the test was 50 hosts at 500 nodes.

So the large-scale simulator tests — what were our findings? That everything is, in fact, awesome. No, that's actually not true. We knew the VMs were going to come up fine, we knew everything was going to deploy OK, we validated that work, and we validated that we could get everything up and running in a reasonable period of time. Where things fell over, though, were the usual suspects: MariaDB with Galera, RabbitMQ, Nova, and our own config management. We found issues with how Ansible performed at 500-node scale when we set forks to 500 — that was the original test: can I do everything at once, simultaneously, across the entire cloud? The answer is no. We had to dial that back to about 25 to 50, depending on what our workloads looked like and what we were deploying at the time.

Some of the issues we ended up fixing: in Galera, the binlog was filling up way too fast, so we needed to set some tunables to make that function a whole lot better, and innodb_buffer_pool_size was another one we had to tune up. RabbitMQ — like I said, we did a ton of tuning there. We ended up diving into the Erlang VM to figure out how it was running, how many threads and workers we needed, and what the memory footprint of all those workers was. We then moved that code and all of our learnings into the OpenStack-Ansible code base — we tried to upstream everything we could, we wrote blog posts, et cetera. nf_conntrack_max needed to be increased for HAProxy with that many nodes under significant load — basically the barrage of tests and scenarios we were attempting to run against our clouds. We had to throttle back our forks, like I was saying, because we were killing our environment with what is effectively distributed SSH. It wasn't necessarily killing the environment; it was that we were getting SSH errors and Ansible would just keep cruising along, giving you an inconsistent result. And IP management, which I put under config management: inside these projects, whether it's OpenStack-Ansible or a few others, we're creating networks and then taking note of the IP space, which goes into inventory, and that becomes unruly when you have a lot of nodes, especially when you're managing them with two different types of processes. We were kicking the VMs, and then after we kicked the VMs we would run our OpenStack-Ansible deployment, and we didn't want things to overlap, collide, or run into other problems, so we were allocating different subnets across the cloud for these things. Still, it was a little unruly, so we could do with some improvement in overall IP management.
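None of the exact values are on the slide, so treat the numbers below as illustrative only, but this is the general shape of the knobs just mentioned: connection tracking for HAProxy, a couple of MariaDB/Galera tunables, and the Ansible fork count.

    #!/usr/bin/env bash
    # Illustrative values -- the right numbers depend on node count and RAM.

    # 1. Raise connection-tracking limits on the HAProxy / load-balancer hosts.
    sysctl -w net.netfilter.nf_conntrack_max=1048576
    echo 'net.netfilter.nf_conntrack_max = 1048576' > /etc/sysctl.d/99-conntrack.conf

    # 2. MariaDB/Galera tunables (binlog growth and buffer pool sizing).
    {
        echo '[mysqld]'
        echo 'expire_logs_days        = 2'
        echo 'max_binlog_size         = 256M'
        echo 'innodb_buffer_pool_size = 16G'
    } > /etc/mysql/conf.d/tuning.cnf

    # 3. Dial Ansible back from "everything at once" to something sane.
    export ANSIBLE_FORKS=25    # or set forks = 25 in ansible.cfg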
From here, I'm actually going to do a live demonstration of one of our environments up and running and work through one of our scenarios. I'm going to do this on Pike, and we're going to test Swift. I've got a little latency here, but I'm going to connect to the environment. If you're not familiar with virt-manager, it's the Virtual Machine Manager, and it can connect to my host over SSH; I've got that pre-configured here, and if the internet works, great success. I've got my machines running: two Cinder, two compute, a deploy node, infra nodes, a logging node, et cetera. You can treat this like a DRAC or an iLO console, so I can get to the nodes themselves and start playing around with them. Our environment, as you can see, is on 16.04.2 at this point in time.

Diving into the terminal here: I've got my environment up and running, and this is my hardware — you just see a ton of virtualization going on. There's Swift 1, which I'm going to jump in and beat on, and this is my deploy node, which currently has the OpenStack-Ansible code base on it. Coming back up, I'm going to log into Horizon just to show you that the cloud is, in fact, operational. Dead air, sorry. And, like I said, this is running on — 403 Forbidden — this is running on Pike, so I may have upset the gods of live demonstrations.

All right, so we're going to do a Swift test, so we'll jump into our containers here under object storage — LastPass can go away. Right now I have no containers; I've got a basically idle cloud, and like I was saying before, you don't get a whole lot of data if you only run one thing — clouds are amazingly easy to keep online if you don't use them. So we're going to come over here and re-kick the Swift 2 VM. I'm going to destroy my environment in a way it probably doesn't want to be destroyed, by corrupting the disk, which will force the node to go down, and when it comes back up it will PXE boot. Should I jump into Swift 2? It would seem I've made an error. Then we'll go ahead and show you that Swift is in fact working: I've got three nodes, and they're operational. Actually, what I'll do is shut it down — it's killing me — and we can watch the node come back online. Then, from our deploy node, we can begin rebuilding our little environment using our OpenStack-Ansible playbooks. It's going to take a little while to come back 100% online.

While that's all running, we can start uploading objects into our Swift container. We're going to create a log stream that's continually uploading all of the logs from one of our Swift nodes into the environment. If I refresh this, with a bit of luck, we should have — yes, our test container is there: var, log, et cetera, these are all of our logs. I turned off the checksum comparison in the swift upload command itself, so we should see these logs keep increasing over time as the various playbooks execute. And all of this is happening at the same time. If we look at the load on the box, I've got 40 cores, it's relatively idle, but we're using a lot of memory right now. So we're able to give ourselves an environment that is running various deployments, re-kicking one of the nodes that I busted, uploading a bunch of objects into our Swift container, and staying online the way a production environment would.
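For anyone wanting to reproduce that flow, the rough shape of it is sketched below: knock over one Swift VM from the KVM host, let it PXE-rebuild from Cobbler, keep a log-upload loop running the whole time, and re-run the playbooks from the deploy node. Treat this as a sketch rather than a supported procedure — the disk-corruption step from the demo is elided, and the VM name, container name, playbook name, and limit argument are assumptions based on a standard OpenStack-Ansible layout.

    #!/usr/bin/env bash
    # Sketch of the live-demo flow, not a supported procedure.

    # On the KVM host: hard-stop the swift2 VM and start it again. In the demo
    # its disk was corrupted first, so on restart it PXE boots from Cobbler.
    virsh destroy swift2
    virsh start swift2

    # Meanwhile, keep a log stream uploading into Swift from another node.
    while true; do
        swift upload test_container /var/log >/dev/null 2>&1
        sleep 5
    done &

    # From the deploy node: re-run the playbooks to rebuild the re-kicked host.
    cd /opt/openstack-ansible/playbooks
    openstack-ansible setup-everything.yml --limit swift2

The point of running all three pieces at once is exactly what the demo shows: the cluster keeps serving uploads while one of its members is being rebuilt underneath it.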
This is going to take a few seconds, so as these things continue on, I'm going to jump back over to the presentation. To answer the original question — are we building a fake cloud? Are we hyper-converging our testing? Are we building ops tools? We're doing all of the above. We really want to empower operators, deployers, and everyone else to take advantage of what the OSIC did with the operational tooling, the server simulator, and what is now known as the multi-node AIO, so that they can reduce the hardware commitments they'd otherwise have to buy into and improve their overall scale testing, their agility, their capabilities, and their ability to vet hardware vendors. If you're partnering with a vendor, they're going to send you some hardware, but you don't have 500 nodes to test with — you can take 10 and put some workloads underneath it all to create a more production-like cloud. I don't want you to sit here and stare at a black terminal the entire time — we can circle back on that — but if you have questions on what we did and how we did it, I'd love to answer them. I have uploaded the slides. Any questions? Anything you'd like to see in the environment?

Yes, we are — we're using nested KVM, and that allows us, like I was saying, to give the compute nodes and the VMs we're building the same instruction sets as we would have in a production-like environment. Yeah, by all means. Yeah, sure.

So, two questions: one about the hardware configuration of the nodes, especially the storage, and the second about what tool was used to put the workload on the test environment. We built all of our test scenarios using Ansible. There are various Ansible playbooks, currently in the OSIC GitHub — I believe under the QE section. They cover a ton of different test scenarios doing things similar to what I attempted here during the live demonstration, as well as others for live migration, upgrades, rolling replacement of nodes, and running multiple operating systems simultaneously — 14.04 and 16.04 side by side under the same code base. As for the hardware configuration: this specific piece of hardware has Intel S3700s in a RAID 10 with Xeon E5 processors. What enabled our VXLAN mesh were Intel X710 NICs, which gave us VXLAN offloading. That's actually a good point to make: with VXLAN, if you don't have VXLAN offloading, you're going to have very poor performance — it comes down to a single channel at about a gigabit per second per node. You can have a 100-gigabit network on the back side, but if you don't have a card that gives you proper VXLAN offloading, you're going to run into terrible, awful slowdowns, and you won't necessarily know why, because you've invested in hardware — it just doesn't have the instruction sets to deal with it. The Mellanox ConnectX-3 Pro and later have it, as well as the Intel X710; there are a few other cards out there, but those are the only ones I'm familiar with. Our hardware was the Intel X710. And it was on SSDs — yes, they were on SSDs. For our Ceph testing, the nodes had two SSDs for journals and then a bunch of 15K SAS drives — not RAID 10, just a bunch of drives. Yeah, of course. Any other questions?
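Both of those capabilities — nested KVM and VXLAN offload — are easy to sanity-check on a given host before committing to a build. A quick sketch (the interface name is a placeholder):

    #!/usr/bin/env bash
    # Quick checks for the two hardware capabilities discussed above.

    # Is nested virtualization enabled for KVM on Intel CPUs? Expect "Y" or "1".
    cat /sys/module/kvm_intel/parameters/nested

    # Does the NIC expose VXLAN (UDP tunnel) segmentation offload?
    # "tx-udp_tnl-segmentation: on" is what you want; "off [fixed]" means the
    # card can't do it and encapsulated traffic will bottleneck in software.
    ethtool -k eth0 | grep -i udp_tnl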
Yeah — were we working with cells? No, we did not. We played with cells, and we really wanted to use them, but the end result was that we didn't find we needed them. We did use regions, though: once we got over 500 nodes, we decided to create multiple regions. A lot of people describe regions as Chicago, Dallas, Virginia, but we were doing row one, row two, row three, with a single Keystone overarching everything so that we had a unified identity. We would have liked to use federation, but we ran into some usability problems with it. It'll get there — it's getting there, actually, it's come a very long way. But cells, no, we didn't, and cells v2 is not necessarily 100% supported across the board yet. Don't be shy — beat me up. Any other questions?

Well, our playbooks are done, and our stuff is still uploading. If I scroll up, I'm hoping to find that there are no errors — and we can come back to validating our servers and see that we still have three Swift hosts. So yeah, we effectively went through and nuked one of our nodes, continued to upload during the entire rebuild process, then re-ran our playbooks while we kept uploading, and our cloud is now back to being perfectly idle. Anyways, anything else? No? Yes? Well, I will get off the stage, stop talking, and let them get back to being on time. Thank you.