So let's go ahead and get started. This presentation, as you can see, is on using OpenStack Infra to benchmark the cloud. I'm Melvin Hillsman; I work with Rackspace. With me is my good friend Isaac Gonzalez, who works with Intel. A little bit about me: previously, I worked on Rackspace's Private Cloud team, providing support to large customer install bases in finance, e-commerce, mobile, big data, et cetera. I currently have the honor of being a technical lead on the OSIC operations team; that's where Isaac and I work together. And as you can see here, you can reach out to me, Mr. Hillsman, pretty much anywhere: IRC, Snapchat, Instagram, whatever. Thank you, Melvin. So I'm Isaac Gonzalez. I've been working for Intel for almost five years now. I'm on OSIC with Melvin, based in San Antonio; we're both on the ops engineering team. You can reach me on IRC; that's my IRC handle there. So basically, we're going to talk about the OpenStack Innovation Center, for those who do not know what it is. Then we'll talk about workloads, and we'll talk about OpenStack Infra, as you can see in the agenda here: what we deployed, and then some issues and remediations. So, the OpenStack Innovation Center: exactly what is it? Essentially, for those who are not aware, it's a joint effort between Rackspace and Intel. It started in 2015, and we celebrated our one-year anniversary a few months back. At the heart of our efforts is the passion to accelerate enterprise adoption of OpenStack. We contribute all of our work upstream, so you can go to github.com/osic, and the things that we've done, you can participate in as well as download and use if you so decide. So why should you care? At the end of the day, we foster open source principles, we align with the goals of the OpenStack Foundation, and again, we contribute all of our work upstream. So for Rackspace and Intel, it's a pretty good deal. But what does that mean for you?
As contributors, developers, operators, consumers, producers, end users, et cetera, however you see yourself in the community, at times you can find it difficult to implement a new feature or use a tool or a third-party resource. In your current environment, you may not have the funding, you may not have the people, you simply may not have the time. OSIC is a great resource to address most, if not all, of these needs. We align with the OpenStack Foundation, so it only helps in the long run: if you begin to work with us or use some of the resources we offer, there's a great opportunity to get your code implemented and put in front of the right folks, and so on. Working with us can only help your relationship with the community; it's not something outside the community, or something we're trying to do to circumvent the community. OSIC's roadmap focuses primarily on manageability, scalability, reliability, high availability, and security, as you can see on the slides here. You can go to osic.org to learn more about us and see this and some additional information. These things can primarily be seen in ease of deployment, live migration, high availability of services, testing, and validation. We're also heavily invested in training and recruitment. We've had issues with our slides, unfortunately, but I did want to show you how much time has gone in over this first year: how many developers we've trained, how many hours of development. There's basically more than a year of work that has already gone in; we have folks who have gone through development and become core reviewers. There's just a lot of work, and again, go to osic.org to get all of that. And last but certainly not least, if you go to osic.org, there's the Developer Cloud, which is basically a 2,000-node cluster from which you can request resources.
There are bare-metal resources, as well as virtual machines within an existing OpenStack cloud. All you have to do is sign up. Generally, our only request is that two weeks after you're done using the resources, you write a white paper on what you were trying to do, how you succeeded or how you failed, and how the resources helped you. So again, we're here because we're on the operations team, so operators are our general stakeholders: we provide them data and feedback on solutions, and they provide the same to us. We also assist heavily in the operator community by helping to schedule the ops summit sessions and the mid-cycle, and by contributing to the operators' code bases that are up online. Isaac is going to talk about the what and the how. All right. So as Melvin mentioned, on the OSIC ops team we have our roadmap and several KPIs. We want to test out features like live migration, high availability of services, scalability, and upgradeability of pure OpenStack, and we do this at scale. We are using the Developer Cloud to test all of those things at scale. So we got this task, and we figured out that in order to test all of this, we need our cloud to be production-like. We don't want to deploy a 44-node cluster, a 100-node cluster, that isn't being used; it does us no good to test all of this in an empty cloud. So we need workloads. We got into a room for several hours trying to figure out how to put workloads into our clouds so we can actually test at scale in a production-like environment, right? We considered a lot of options. As you may know, there are a lot of tools out there that might help us do that: you have Rally for exercising the control plane, PerfKit, and several others; and for the data plane, you may also find tools like Shaker.
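As a rough illustration of the synthetic, control-plane style of load these tools generate (which, as we'll see, isn't production-like enough on its own), a minimal Rally task might look something like this; the flavor and image names are hypothetical placeholders:

```yaml
---
# Minimal Rally task sketch: repeatedly boot and delete servers to
# exercise the Nova control plane. Scenario name is a real Rally
# scenario; flavor/image names are assumptions for illustration.
NovaServers.boot_and_delete_server:
  - args:
      flavor:
        name: "m1.small"
      image:
        name: "ubuntu-16.04"
    runner:
      type: "constant"
      times: 50        # total iterations
      concurrency: 5   # iterations run in parallel
    context:
      users:
        tenants: 2
        users_per_tenant: 2
```

You'd kick this off with `rally task start <file>`. The point of the talk, though, is that even a task like this produces a steady, predictable load rather than the bursty, transient traffic a real user base generates.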
A few minutes ago, there was a talk in this same room about Browbeat and how it can help you exercise your cloud and test scalability. So we came up with Infra. How many of you are not familiar with OpenStack Infra? All right, so we have a couple of folks. The OpenStack Infra team is in charge of the systems that run all the gating tests for OpenStack developers, right? They run all the Zuul jobs, and they run the infrastructure for Gerrit and so on. So they basically run workloads and tests every single day; they create a lot of VMs per day. We got to reviewing what the requirements are for them to run their workloads in our clouds. So why would Infra help us? Well, it gives us a transient workload. It's not a synthetic, static workload that runs constantly; OpenStack Infra gives us transient peaks that would be very hard to achieve with synthetic tools. And on top of that, we'd be doing the community good, sharing our resources with OpenStack Infra for the greater good of giving OpenStack developers a better testing infrastructure, right? So that's where we decided to get in touch with them: hey, what do you guys need to run your workloads in our clouds? We have two clouds; Melvin is going to talk about that. So this is what we deployed for OpenStack Infra. All right. So the idea is that we had some tasks we needed to complete, and we needed to generate workloads that were as close as possible to what would actually be happening in production. In OSIC we have a bunch of clouds, but we identified two specifically to help with this effort. We have a smaller cloud, Cloud 8, which is basically a 22-node cloud for the OSIC engineering team.
This cloud is primarily focused on development and testing, and it's not expected to be available to Infra for very long; that's relevant to Infra. So reaching out to Infra, that was our initial ask of them: hey, if we gave you guys these 22 nodes, could we take them away if we needed to make some adjustments based on the data we got back? And they were perfectly fine with that, so that worked for us. It goes back to the flexibility and the why of Infra. Then we have Cloud 1, which is probably 300-plus bare-metal nodes, running a lot of stuff that we can't just tear down when we see fit. The benefit of Cloud 8 is that the things we learn in Cloud 8, we can translate into Cloud 1. So again, Cloud 1 is long-standing, it offers more resources, and it digests the findings from Cloud 8. So here is a graph of Cloud 1. Of course, like I said, it's at least 300 nodes, and not all of those nodes can be seen here, but this is basically the architecture of Cloud 1, in a sense. Right now, between Cloud 1 and Cloud 8, the OSIC team is actually providing just as many VMs to OpenStack Infra as every other provider combined. I say that because, again, the benefit is that we're doing the community good, and you can as well; and you can get feedback on your environment in development, prior to testing or doing something in production. So, issues and remediations. Deploying for OpenStack Infra is not really a simple task per se. You will definitely want the help of OpenStack Infra unless you have considerable folks to dedicate to it; and again, that's where the benefit came in for us: we had stuff we needed to do right then, and Infra was definitely there for us. The issues we ran into were IPv6-related issues, raw images versus QCOW2 images, and provider network priority.
That last one was regarding multi-homed networks; and then we ran into some FDB table max issues. One second, I'm sorry. So what I'm going to show you is a chart from grafana.openstack.org; these are the OSIC clouds here, the resources that we're providing to Infra. I'm going to go back a little bit in time and show you where we had issues, and then we'll talk about our remediations and you'll be able to see the differences. All right, these graphs here are the most relevant to us. As you can see, ready node launch attempts: this is across the clouds, Cloud 1 and Cloud 8 in particular, broken out by the different quote-unquote provisions given to Infra within those two clouds. This is Infra launching a bunch of nodes: some are being built, some are being deleted, some are currently in use. And here you can see there are obviously a lot of errors early on; at one point, almost 500 nodes were failing to launch. And again, here as well, the time to ready was increased: in this early portion, you can see a significant amount of time for those nodes to become ready. And you can see here those peaks and valleys Isaac mentioned, where you're getting somewhat real, production-like activity going on within these clouds. So with these issues, this is what we were having: we were seeing hundreds of nodes failing at times, and a higher amount of time for those nodes to become ready. Further down, job runtimes were higher; and going back up, some of the API calls, for creating servers, getting servers, deleting servers, and so forth, were all high. So, how we fixed it. I'll talk about the IPv6 issue first. What was happening is that traffic was getting dropped.
So again, the issue was that we didn't have enough IPv4 addresses, because Infra needs a static IP for each VM they launch. Of course, we're giving them 1,000 VMs, and we don't have 1,000 IPv4 addresses to match. So we said, let's move to IPv6. But the requirements weren't only within the clouds themselves; there were edge devices that needed to be changed in order to support IPv6. What we ran into was a configuration we had on the switches regarding LACP. LACP needed to be disabled on the switch in order for traffic to not get dropped when we switched from IPv4 to IPv6; this was dealing with the router advertisement issues that you get with IPv6. I'll let Isaac hit on the raw versus QCOW2 images. Yeah, so when we first started launching VMs: in our reference architecture, we have Ceph installed, right? We need shared storage because we were doing some live migration testing. At some point, we realized that some VMs were taking 30 minutes or more to build. We were using the Grafana dashboards that OpenStack Infra provides right out of the box as soon as they're up and running in your cloud, and we figured out that every single time a VM was built, it was trying to convert the QCOW2 image to raw. The fix was to put all the images in raw format so that no conversion was needed anymore. Pretty easy fix, right? But Infra was running perfectly fine before that, before we introduced the Ceph backing storage; folks were launching plenty of VMs. But since we wanted to do some live migration tests, we needed to bring shared storage back in.
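The remediation Isaac describes, pre-converting images to raw before uploading them to Glance so a Ceph-backed cloud doesn't have to flatten the QCOW2 on every boot, looks roughly like the following sketch; the file and image names here are hypothetical:

```
# Convert the QCOW2 image to raw locally, then upload the raw version.
# File names are placeholders.
qemu-img convert -f qcow2 -O raw ubuntu-xenial.qcow2 ubuntu-xenial.raw

# Upload with disk-format raw so Ceph-backed Nova/Glance can clone the
# image directly instead of converting it at boot time.
openstack image create \
  --disk-format raw --container-format bare \
  --file ubuntu-xenial.raw ubuntu-xenial-raw
```

The trade-off is that raw images are much larger on the wire and on non-Ceph disk, which is why clouds without copy-on-write backends tend to stick with QCOW2.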
And of course, this introduced an issue where there wasn't one before. I'll show you in a second what those times look like after the fix. Then provider network priority was another one; this was related to the upgrade from Liberty to Mitaka. In Liberty, with multi-homed networks (again, IPv6 and IPv4), the first NIC that was available was set as the gateway; and in our situation, IPv4 was the one being set rather than IPv6. In Mitaka, this changed from the first NIC to the fastest NIC, and that's where the problem came in. So we had to do some routing magic to make the NIC that we wanted available for IPv6 rather than IPv4, so that the gateway would get set appropriately. And with that, we were able to resolve some of the failures you're seeing here. So as you can see, during those times there were a lot of failures. If we look at this week so far, there are not as many error node launch attempts; we're magnitudes down, from 300 to 500 down to 3 or 4. And even some of those aren't necessarily real failures: as a talk earlier mentioned, sometimes if they push a new image, you'll get a failure just because it's a new image. But again, going from 300 to 500 down to 1 as the norm, maybe 3, is very good. For ready node launch attempts, you can see there's a lot more color there, which means there was a lot more activity that could be handled. And then here's the number of resources they use. If we zoom out just a tad, you'll see things have died down now because we're all here; but if we go back, you can see the number of launch attempts: you've got 200 here, over 100 here, over 100 here, and there are large numbers here as well.
So close to, I don't know, 400 or 500 VMs receiving the benefit of the changes that we made, and of us pushing that code and information upstream, like the LACP thing: we got to talk with Cisco about that and start working on resolving issues in our switches across the board, rather than just in Cloud 1 specifically. And so, next steps. More than next steps, I'd like this to be a call to action, right? We started all this just thinking about workloads, about our KPIs and roadmap goals: how are we getting someone to use our overcloud so we can catch things that otherwise, with synthetic workloads, won't happen? So if at your company you have dev environments and some resources to spare, don't hesitate to contact OpenStack Infra and share your resources, and you will get, like I said, all the monitoring we showed you, right out of the box, as soon as they're up and running. So here's what they need: basically two tenants, with access to the Nova and Glance APIs. The first tenant needs an instance with a 500 GB disk; that's for the mirrors, images, and a bunch of other stuff. And for nodepool, which runs the actual test VMs that Zuul will spawn, those are their requirements. So there's no fancy requirement you'd need to get Infra up and running; it's just those requirements, plus a public IP for every single VM. So like I said, other than testing your own stuff, you get someone who can give you feedback before you go to production, and you get metrics right out of the box. But more than anything, I think one of the keys for OpenStack to be successful and to be adopted by the enterprise is that we all support our developers, which is core. It's really important for them to have a really good testing environment, so they don't have to wait so long for their tests to run. So I encourage you to contact OpenStack Infra on IRC.
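As a sketch of where those requirements end up, a nodepool provider entry for a cloud like this might look roughly as follows; the names, region, and limits are hypothetical, and the exact schema depends on the nodepool version in use:

```yaml
# Hypothetical nodepool provider entry for an OSIC-style cloud.
# Credentials would come from the matching clouds.yaml entry.
providers:
  - name: osic-cloud8          # placeholder provider name
    cloud: osic-cloud8
    region-name: RegionOne     # assumed region
    max-servers: 100           # cap on concurrent test VMs
    boot-timeout: 120          # seconds to wait for a node to boot
    images:
      - name: ubuntu-xenial
        min-ram: 8192          # Infra test nodes want ~8 GB RAM
```

The notable operational knob here is `max-servers`, which comes up again later as the mechanism for draining a provider.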
They have fungi, their PTL, there. So that's pretty much it. Some credits to the people involved in this effort; we have Paul right there. Questions on the Developer Cloud? If you could go to the mic. It's not there. This one right here. I can talk loud. OK. Oh, there it is, right there. Yeah, there's a mic right here. First, great presentation, thank you very much. I appreciate it. So in terms of your reference architecture, is that documented anywhere, and is that something that we can take advantage of? Just based on what the goal is here, I'm very excited, and I'd like to do this inside of our company as well. Yeah, definitely. Cool. Are you asking in regards to what we did for Infra? Yeah. OK. So we don't have that documented, I would say no. However, you are welcome to reach out to us. And this guy in the back, is that Paul? Paul, raise your hand. Yeah, don't look back. So Paul is a great resource; he helped us get ours up and running. If you have resources, please reach out to them, and they can definitely get you up and running really quickly. And if you run into issues, they're very good at sticking with you and helping you resolve those issues as well, looking into the logs that they have available to them that you may not be able to see. A very, very good team to work with. Yeah, and like we mentioned, if you go to the osic.org website, you will find blogs about the things we are doing there, and we will be publishing white papers and all the reference architectures we have. I would say there's probably not a specific architecture that you would have to implement, but there are certain requirements, as we showed, that they would need. Any other questions? For the onboarding and the requested provisioning, is that automated, or are you having to manage that in the background after somebody requests access?
So if I were to come in and say, I'd like to get three racks and this many vCPUs, do you have that automated, or does somebody just go out there and assign it? Yeah, so if you go to osic.org, as Isaac is showing you, and click on Access the Developer Cloud, it's simply a form that you fill out. I think the process is going to move away from this into GitHub; don't quote me on that, but I believe so. We're actually trying to make it easier and, again, publicly visible, because our effort is to align with the OpenStack Foundation's openness. But yeah, basically you sign up and we give you servers that have been tested, that we know can run an operating system, and that, at the time we give them to you, have no issues with NICs, hard drives, and so forth. We did, however, perform an effort called the novice install. There's this myth that OpenStack is crazy difficult to install, and we wanted to debunk that. So over eight iterations, we took people we identified as novices: some folks who already had some experience with OpenStack and knew a little about it, and some folks who were totally green. And over those eight iterations, we went from bare-metal provisioning all the way to logging in to Horizon, from 40 hours down to just over six hours. So again, github.com/osic, you should be able to find the material there. If it's not there, you can reach out to #osic-ops on Freenode, where we all hang out, and we can definitely get you the information you need, because it's not proprietary. Anything else? Sure. So when I think of benchmarking to improve a product, I usually think of gating on a benchmark, right? Making sure that you block the developer who slows your cloud down. In this case, you've moved your benchmarking out of phase with your CI.
So what are your feedback processes like to make sure that when your Infra workload on Cloud 8 discovers a problem, that gets rolled back into your product? So, I mean, the good thing is that you are pretty much in control; that's kind of what you alluded to, right? I have the control to restrict something from happening, and you do have that control with Infra. Again, we said, hey, we need to be able to stop you guys from using our cloud, make some adjustments if we need to, and then re-enable it. The good thing about Infra is that at the end of the day, it's just code. These guys are very helpful, like I said, and you can do it yourself: you send a patch that sets max-servers for this provider, which is me, to zero. Once that's implemented, and you can talk with them, they can get it done in short order; then no more Infra jobs come in, and the ones that are currently there finish. Then you can do whatever you need to do: get whatever feedback, they'll provide data, you'll have data, make your changes and adjustments. And then send another patch: hey, re-enable X number of servers, or whatever max-servers value it may be. Wait for them, or talk with them, and in short order you're back up and running, and you can get more data relevant to your cloud and your adjustment. Does that answer your question, or kind of? OK. I guess I'm less curious about how you work with Infra, and more curious about how you work with your internal developers and integrators: when there is a problem with your cloud, how does that get noticed, and how does the fix get into your product? Yeah, we monitor it. So first of all, we don't see our reference architecture as a product, right? A core thing of OSIC is that everything we do goes back upstream.
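The max-servers toggle described a moment ago is just a small change to Infra's nodepool configuration; a patch to drain a provider might look roughly like this (the provider name is a placeholder, and the exact file layout is Infra's to define):

```yaml
# In nodepool's configuration, setting max-servers to 0 drains a
# provider: no new test VMs are scheduled there, and jobs already
# in flight simply run to completion.
providers:
  - name: osic-cloud8       # hypothetical provider name
    max-servers: 0          # was e.g. 100; restore to re-enable
```

Because the change goes through the same Gerrit review flow as any other Infra change, the provider gets a reviewable, revertible record of every time capacity was pulled and restored.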
So every single time we find something in our reference architecture, the output should be reported back to the community, right? That's what we do. Maybe we found bugs; we have filed bugs against OpenStack-Ansible; we're trying different deployment tools. The whole goal here is to get this feedback out to the community and to the developers, through white papers, through blogs, and so on. But again, you would have your own internal monitoring team, right? Your own processes and procedures. For example, we decided on the InfluxData TICK stack to monitor, and so we were able to see certain things happening. We already had some monitoring tools, because it's an effort between Rackspace and Intel, and we were able to lean on Rackspace's expertise. With the things these guys were seeing, and the line of communication and openness that we had, we could say: hey, we're having issues trying to get IPv6 running; let me grab a network engineer off whatever he was doing and spend some time with them; let's start tracing some packets, and so forth. So that's kind of the process. I would say it's the same process you would use for any other issue you find, and the way you'd resolve it, within your own particular company. Does that answer it better? OK. Any other questions? All right. All right, well, thank you guys so much. Thank you very much. Please reach out to us if you have any questions. Or if you want to use the cloud, go use it: sign up at osic.org and start playing.