All right, I think we're pretty much on time, so we'll go ahead and start. Good afternoon. My name is James Penwick, and I'm from Yahoo. I don't know if you saw the keynote this morning, but we talked very briefly there; we're finally able to share a little bit about the things we've been doing with OpenStack at Yahoo. Today I want to go into a bit more depth on what we've done so far and where we're going. So the first question is: why are we using OpenStack at all? Yahoo built one of the first mega-scale infrastructures in the world, and being first is pretty lonely. You have to forge the trail on your own. With everything you do, you're the first person to hit that scale issue, and being the first person to run into these issues means you're also the first person who has to solve them. As we blazed that trail and figured out how to do things, we made a lot of mistakes along the way, accumulated some cruft in our infrastructure, and some of the ways we did things were a little bit legacy, a little bit out of date. OpenStack is a fantastic way of wrapping our entire infrastructure in a single consistent API that you can use to manage your resources. I think you all know that; that's why you're here. So: we have hundreds of thousands of compute resources at Yahoo. Tens of thousands of them are managed by OpenStack right now, and tens of thousands of those are VMs managed by OpenStack. Let me talk for a second about exactly how we're managing those things. We actually run three different types of clusters, three different flavors of OpenStack, at Yahoo. The first we call developer productivity: every developer at Yahoo gets up to five VMs per colo. No matter what, your quota is set, it's static; you log in, you build instances, you start working.
There's no waiting, and this has been a fantastic use case for us. Then we have provisioned VMs, the ones serving production traffic right now: you go to the well, request quota, get it, build instances, and you're off and running. And then we have bare metal. Again, you go to an OpenStack API, either the Horizon UI or the Nova CLI, you boot a host, it comes up, and it's off and running. As of October 1st, all new hardware at Yahoo is provisioned through OpenStack bare metal, which was very exciting for us. And we're working on moving all existing hardware, in situ, onto OpenStack; that's one of the things we're going to talk about today. So how do we use OpenStack to manage our bare metal? Well, Yahoo is like a big aircraft carrier. Our infrastructure is huge, a lot of it is well thought out, some of it is old, and it takes a while to build momentum to move things. We couldn't replace our entire provisioning infrastructure overnight. What we could do was take OpenStack, specifically the Grizzly bare metal code and its bare metal controller, and kind of lobotomize it. We use OpenStack as a puppeteer, with the rest of our legacy provisioning infrastructure as the marionette. Host provisioning, our equivalent of kickstart, our power management, our operational asset database, our DNS database: all of these are interfaced through this custom bare metal controller. The nice thing is that this reduced the target area where we had to make modifications. We could focus most of them into the bare metal controller and leave the rest of OpenStack alone. But then, what about everything that existed before OpenStack, those hundreds of thousands of compute nodes?
It sounds simpler than it was; we did a lot of work and added a few features to OpenStack. One of the things we added is a no-re-image flag. First we use nova bare metal to import the host: we register a bare metal node with OpenStack. Then we issue a nova boot call with the no-re-image flag, assign a tenant name, and pass along the asset tag of the host. This causes OpenStack to go through what it thinks is the entire boot flow, but we stub out things like actually rebooting the host and actually re-imaging it; we leave all that alone. That way a host goes from non-OpenStack-managed to OpenStack-managed in a very short period of time. The user never knows; it just happened. From that point forward, we flip a flag in our asset database, and it's no longer possible to use the legacy provisioning tooling alone to re-image this host. Its state is now owned entirely by OpenStack. We used nova bare metal to do this, a highly modified version of nova bare metal, and this is kind of sick and wrong. We did it because Ironic was not ready yet, and we really needed to get the company moving, to get this API in front of our infrastructure. The whole time we've been working aggressively to develop nova bare metal and move the entire company onto it, we've also been investing very heavily in Ironic. At pretty much the same time we finish moving the entire company onto OpenStack bare metal, we anticipate we'll be finished developing and deploying Ironic. Then we're going to move all that hardware again: pick it up, forklift it over, and drop it into Ironic, again with the hope that there is no impact.
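The adoption path above can be sketched as a boot flow where the destructive steps are stubbed out, so the running host is never touched. This is a minimal illustration, not Yahoo's actual code; all function and step names are invented:

```python
# Sketch of a "no re-image" boot flow: OpenStack walks what it thinks is a
# normal boot, but the destructive steps (power cycle, re-image) are
# skipped so an already-running host can be adopted without impact.

def boot_flow(host, re_image=True):
    """Return the list of steps the boot flow would perform."""
    steps = [f"register {host} as a bare metal node",
             f"assign tenant and record asset tag for {host}"]
    if re_image:
        steps.append(f"power cycle {host}")
        steps.append(f"re-image {host}")
    else:
        # Adoption path: skip anything destructive; just record state.
        steps.append(f"mark {host} ACTIVE without touching it")
    steps.append(f"flag {host} as OpenStack-managed in the asset database")
    return steps

adopted = boot_flow("host42", re_image=False)
assert "re-image host42" not in adopted
```

The key property is that the adoption path and the normal path share the same bookkeeping steps, so the host's state ends up in the database either way.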
So you can see that about this time next year, we hopefully should have hundreds of thousands of compute resources managed by OpenStack Ironic, and none on the nova bare metal Grizzly code. (No, that's right, I animated it.) That's where we are right now. One of the reasons I wanted to give this talk, besides covering what we're doing, is to have a broader Q&A session at the end; we've got a lot of our engineers here who did much of the work. It's an opportunity to share some of the pitfalls we ran into, in case you too would like to move your legacy gear onto OpenStack bare metal. But there are features we've had to implement in nova bare metal that we don't necessarily want to bring into Ironic in the same form, and these are the ones where we want to work closely with the community to find upstream ways to solve the problems. So let's dig into them. One example: OpenStack does not yet fully understand the topology of your data center. It knows what instances are, and it might know what a compute host is, but it doesn't truly understand what a rack switch is, or the relationship between a compute resource (whether a VM on a hypervisor or a standalone bare metal host) and a rack, a rack switch, or a power domain. A good example: at Yahoo we build very large applications with a lot of users, and we want to make sure that no matter what, we can keep serving your data. We don't ever want to go offline. In a typical fault-domain setup, you stripe your hosts across racks, so if you lose a top-of-rack switch, it's okay; your hosts are spread across different racks. But if your hosts were all in the same rack and that top-of-rack switch went down, you lost the entire app. You've gone out of business, and you're probably going to lose your job.
So OpenStack needs the ability to say: boot me some hosts, VM or bare metal, and ensure they are in different failure domains; make sure my hosts are on different rack switches. With VMs there is anti-affinity insofar as making sure your VMs are on different hypervisors, but we need to go a bit deeper. (Or restart the presentation... and we're back.) Power is another issue. Okay, so your hosts are on different top-of-rack switches. That's good. But then something bad happens and you lose a PDU in your data center. If your hosts are in two cabinets on the same PDU, your application went down. For example, imagine each color is an entire rack populated with one application: the PDU goes down, the hosts in the blue racks are gone, that application is gone, and again you're probably going to lose your job. So we need the ability to tell OpenStack: boot me some hosts, make sure they're in different racks, and also make sure they're spread across a minimum number of power domains so we stay available. I'm sure other companies will have different needs for representing this sort of fault-domain locality; "blast zone" is another term people use for it. Now, touching back on the rack switch: this is a touchy issue with me, one I fought very hard against, and we're still deciding how we want to handle it. In general, saying "build me hosts and make sure they're in different racks" is a totally acceptable thing to do. But to say the opposite, "build me a bunch of hosts and ensure they're all in the same rack", is sick and wrong, and I'm angry at you now. In general, for every property that had this requirement, every group that said they wanted to make sure hosts were in the same rack, I'd talk to them, work it out, and find that no, they were really just trying to make sure their hosts were in different racks.
That was just how they used to do things; change the way they think and they're totally okay with it. If I say, look, I can automatically ensure you're in different racks, they say that's great. There is one team, though, where this is not the case. Assume we have a bunch of compute hosts and a bunch of, say, HDFS hosts, the compute nodes are talking to the HDFS hosts, and they're moving a lot of data, a really intense grid computing environment. The link between your switches is going to become saturated. So you add additional trunk ports, and those get saturated, so you add more, and the whole time you are continually raising the cost of your network. In this case, the cheapest thing to do is actually to bundle your compute and storage hosts in the same rack. In that one context you're trying to do the right thing: save the company money, take load off the backplane, and keep traffic local on your top-of-rack switch. All right, fine. In that one context we should support it. If we have rack anti-affinity, we should also support rack affinity for that one specific use case. We'll allow it; ideally, though, we wouldn't tell anyone else about it, because it's sick and wrong. Those are some of the technology gaps we're working on, and I'd like to engage a lot of y'all after this. If you have interest in working with us in this area, let's get some specs together; we really want to get this upstreamed. Something I found interesting: one of our engineers mentioned this need to an engineer from another company, and they said, oh yeah, we had the same problem, we implemented our own internal thing. So okay, good. We're not the only ones; we're not totally crazy.
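The placement rules discussed above, rack and power-domain anti-affinity by default with same-rack affinity as an approved exception for the grid case, can be sketched as a scheduler-style filter. The topology data, tenant names, and function shapes here are all invented for illustration:

```python
# Sketch of a fault-domain placement filter: reject candidates sharing a
# rack or PDU with an already-placed group member, unless the tenant is
# explicitly approved for same-rack affinity (the grid/HDFS case).

TOPOLOGY = {  # host -> (rack, power domain); illustrative data
    "h1": ("rack-a", "pdu-1"),
    "h2": ("rack-a", "pdu-1"),
    "h3": ("rack-b", "pdu-1"),
    "h4": ("rack-c", "pdu-2"),
}
AFFINITY_APPROVED = {"grid-hdfs"}  # tenants allowed to request same-rack

def acceptable(candidate, placed, tenant="", policy="anti-affinity"):
    rack, pdu = TOPOLOGY[candidate]
    if policy == "affinity" and tenant in AFFINITY_APPROVED:
        # Approved exception: require the same rack as existing members.
        return all(TOPOLOGY[p][0] == rack for p in placed)
    # Default: reject shared racks and shared power domains.
    return all(TOPOLOGY[p][0] != rack and TOPOLOGY[p][1] != pdu
               for p in placed)

assert not acceptable("h3", ["h1"])  # different rack, but same PDU
assert acceptable("h4", ["h1"])      # different rack and different PDU
```

Note the asymmetry in the design: anti-affinity is the default anyone gets, while affinity has to be opted into per approved use case.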
Chances are, then, that other companies have had to do this too, for VMs or bare metal, so this is a good opportunity to bring it to light and see how we can find a good, sane, upstreamable way of accomplishing it. Touching on that some more: some of the challenges we've had with OpenStack as a whole involve integrating business processes. Look at bare metal. When you're using nova bare metal, or OpenStack generally, to manage your bare metal, you're using hardware, and hardware costs a lot more money than a VM does; it can cost you thousands of dollars. That usually implies tighter business-process control around the host. You're filling your data center with all these hosts, and you're not just going to hand them out like candy. In a public cloud, generally speaking, you get a pretty big quota, because from the public cloud's perspective, if I give you lots of quota and you use it, you give me money in return. A private cloud is a little different. We need to make sure you're only using as much as we're prepared to give you, within certain constraints, obviously with a good amount of headroom. If we say, yes, go ahead and boot some servers on some old 15K SAS drives, and you turn around and boot a bunch of machines with SSDs, that's a problem. We need to make sure you're only booting the hosts you've been approved for, especially the ones that cost a great deal of money and serve a specific use case. So we need a way of doing quota by flavor. Then there are the different security zones. We don't want to have dozens and dozens of OpenStack entry points; we actually do today, and it sucks. If you have a different entry point for every single security zone, your users need to know exactly which API to go to to boot which server.
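The quota-by-flavor idea can be sketched as limits keyed by project and flavor, so approval for commodity hardware doesn't let you boot the expensive SSD config. The project and flavor names here are invented examples, not real configs:

```python
# Sketch of per-flavor quota: limits keyed by (project, flavor) rather
# than a single global instance count per project.

quota = {("my-project", "sas-15k"): 20, ("my-project", "ssd"): 0}
usage = {("my-project", "sas-15k"): 5}

def can_boot(project, flavor, count=1):
    """True if the project has headroom for `count` more of this flavor."""
    key = (project, flavor)
    return usage.get(key, 0) + count <= quota.get(key, 0)

assert can_boot("my-project", "sas-15k", count=15)   # 5 + 15 <= 20
assert not can_boot("my-project", "ssd")             # never approved
```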
With separate entry points, it's like: well, these are internal corporate hosts, so I go here; these are internal corporate hosts for this colo, so I go there; and these are external user-facing traffic hosts, so I go over there. It's a real pain in the butt. We want to bring these things together into a single cluster. But if we do that, we need a way to make sure that if you've been approved to boot hosts in corporate, you're not accidentally booting them in, say, the DMZ and installing software in the wrong place, because then nobody's going to like you. Another thing to gripe about with quota: quota logging. Someone comes and says, I want quota, and I say okay and grant it. Someone else comes a month later and asks, why does this property have additional quota? Who gave them this SSD quota, or these extra VMs? There is no logging inside the Nova quota management mechanism. There's no reference, and no way for someone to request additional quota through OpenStack and have it reviewed and approved. There's no audit trail, so everything has to be done externally, and there's no graceful way to bolt the two together. That's something else that needs to be addressed. Other challenges we had were specific to the horizontal migration. At first, with Grizzly, bringing each new host in would take eight-plus minutes: the nova bare metal add and create calls, followed by the nova boot with the no-re-image flag, could take up to eight minutes. Well, we have hundreds of thousands of servers; that's not going to scale. A lot of this was sins of the past, inefficient API calls in OpenStack Grizzly that have since been fixed upstream, so we had to cherry-pick those fixes back.
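The missing audit trail could look something like an append-only log of quota grants, so the question of who granted what, and why, has an answer. This is a hypothetical layer alongside quota management, not anything that exists in Nova:

```python
# Sketch of an append-only quota grant log: every grant is recorded with
# an approver and reason before it takes effect, so changes stay
# attributable long after the fact.

import time

audit_log = []

def grant_quota(project, flavor, amount, approver, reason):
    """Record the grant so it can be traced back later."""
    audit_log.append({"ts": time.time(), "project": project,
                      "flavor": flavor, "amount": amount,
                      "approver": approver, "reason": reason})

grant_quota("my-project", "ssd", 10, "capacity-team", "approved review")
assert audit_log[-1]["approver"] == "capacity-team"
```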
That slowness is definitely an example of why I say don't do what we did, because it was sick and wrong, and Ironic is in a good place now. So that's what we've done, and those are some of the gaps we have. I wanted to turn this next bit into more of a discussion and say: all right, help. Is anyone else looking at doing this as well? Is anyone interested in working with us on it? Are you aware of any solutions we've missed that would actually help us address some of these problems? All right, what have we got? Anyone with other solutions? Okay, my email address is up here for the recording. I don't know if the camera could hear it, but someone is emphatically saying this is a huge problem, they want to help us with it, and they're going to do all the work for me. That's what she said, right? Absolutely. And if you can get it done by the end of the summit, I'd really appreciate that. Anyone else? No? Okay, do me a favor: drop me an email. What's up? Oh, someone asked the question: why bare metal? It's not a naive question; it's a very good question. If you have a massive data center, why would you ever use bare metal when you can just run everything inside VMs? The truth is that there are some workloads for which bare metal is still better. VMs are great and super fast, containers are lightweight and super fast, but there are still use cases where bare metal is faster and more powerful. Grid computing is an example, and there are other high-usage applications where you need low latency, where you need nothing between you and serving traffic. In those cases, you use bare metal. Does that answer your question? The follow-up question is whether it's purely about doing things faster.
So yes. Again, think about it at scale. At a micro scale you can say, look, we can buy a few servers, or we can use twice as many VMs; it's not that big a deal, and that's fair. But when you get to mega scale, hyper scale, hundreds of thousands of compute resources, "just use twice as many VMs" stops being a throwaway answer. In general, by the way, what I tell properties, what I tell people, is: just build them smaller and wider and stop worrying about it. But there are still cases where, when you do the math, it still doesn't work out; when we get to the hundreds-of-thousands use case, some of that starts to come apart. That said, I am still working with a lot of properties and pushing them harder and harder to move to VMs, because I think the majority use case should always be VMs or containers, with specific, isolated bare metal as the minority. But remember, going back to that first slide: our infrastructure existed predominantly before VMs. Our infrastructure is 20-plus years old. We have a lot of applications that started over 20 years ago, and there's a lot of momentum involved in moving these things. So even though VMs are coming in and seeing pretty rapid adoption, getting ahead of that is tricky, and we still have hundreds of thousands of compute resources in our data centers. We need something to manage them. Yeah, question. The question is: I don't know Ironic well, but with so many machines, do you have a way to group them together? Do you use the equivalent of host aggregates, availability zones, or even cells to help organize these huge lists of machines? That's a good question. To group these hosts together, we have abused the availability zone concept pretty maliciously.
When you go to boot a host right now with nova bare metal, under this marionette plan, there's a lot of information you pass in through the availability zone. In that availability zone you specify which network backplane you want, whether you want a public or private IP, and, with the limited support we have for power domains right now, the power domain: you can pass power domain one or power domain two. All of that is captured entirely through this heavily abused concept of availability zones. We have not upstreamed this, because the way we've represented it in the database is not what we would want to do upstream. We had to iterate, we had to get this done; we needed to make this thing work so we could have it stood up while we work on Ironic. Do we use our nova bare metal management to deploy the compute nodes on which we run VMs? I was really hoping you wouldn't ask that. We do not, and here's why. One of the rules, and this is me shooting us in the foot, because I told people, look, we're moving to the future: if you're operating in a Layer 2 network, or you have these other constraints, you don't get to be on OpenStack; you go live in the legacy environment until you conform to the right way of doing things. One of the things no one is allowed to do is have custom, specialized networking requirements. And our hypervisors have custom, specialized networking requirements: they need to be in special racks whose switch ports have been set to trunk, with a second subnet. We use Layer 3, so there are two subnets for every rack, one for the hypervisors and one for the guests, and we actually isolate the two at the network level. They are never allowed to communicate.
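The overloaded availability zone described above can be pictured as several fields packed into one string that tooling parses back out. The delimiter and field names here are invented for illustration; this is not the real format:

```python
# Sketch of an "abused" availability zone: backplane, IP type, and power
# domain encoded into a single AZ string, then decoded by the scheduler
# or tooling on the other side.

def encode_az(backplane, ip_type, power_domain):
    return f"{backplane}.{ip_type}.{power_domain}"

def decode_az(az):
    backplane, ip_type, power_domain = az.split(".")
    return {"backplane": backplane, "ip": ip_type, "power": power_domain}

az = encode_az("bp1", "public", "pd2")
assert decode_az(az) == {"backplane": "bp1", "ip": "public", "power": "pd2"}
```

The drawback, and the reason this was never upstreamed, is that one opaque string is carrying three unrelated concepts that really want first-class representation.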
So for that reason, that network isolation, in our particular environment, at this exact moment, we do not. However, we are changing that; we are moving to actually support it. Among the availability zones we have, aside from public versus private, there's also single versus multi-IP, so we're going to reuse that same concept and adapt it. The next question is how many different flavors or hardware configurations we have. Why do you all keep asking me questions I don't want to answer? We have a team that works very hard to ensure that when someone requests bare metal hardware, they get exactly the right config for the workload they're running, and this has resulted in a good number of configurations. There can be a large number of what we call custom configurations; they're not actually custom, just more specialized. However, we are pushing towards a more limited model: look, we're going to have a menu of five things, and those five things you can get really quickly and easily, while the other things you'll have to wait a little longer for. So I can't give the exact number, and I don't even know it off the top of my head, but there is a good number of different flavors in there. The follow-up question is whether all these different flavors live in different availability zones. Not as such; we actually just use the flavor concept to represent them. When you go to boot a host, we use a config string, a string of alphanumeric characters, to represent the type of host you're booting. For the sake of simplicity, let's pretend one of them is "webserver"; it's easier than spelling out what they really are. So the flavor ID you pass is just going to be webserver, or whatever that config name is. We don't stash them in their own availability zones. Sure.
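The config-string-as-flavor scheme can be sketched as a get-or-create lookup where the config string serves as both flavor name and flavor ID. The dict here stands in for Nova's flavor table, and the config names are invented:

```python
# Sketch of config-string-as-flavor: the asset database's config string
# is used as both the flavor name and the flavor ID, created on first use.

flavor_table = {}

def get_or_create_flavor(config_string):
    if config_string not in flavor_table:
        flavor_table[config_string] = {"id": config_string,
                                       "name": config_string}
    return flavor_table[config_string]

f = get_or_create_flavor("webserver")
assert f["id"] == f["name"] == "webserver"
assert get_or_create_flavor("webserver") is f  # reused, not re-created
```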
The question is whether I can describe this horizontal migration process in a little more detail. There are a few steps. First, we do a pre-validate to make sure the information in our asset tracking database is correct. The reason is that every host is represented by a config string, and if we import the host and record the wrong flavor, that could bite us down the road. So we pre-validate to make sure it's correct. Then we have a script we wrote that pulls the host in. It goes out to our asset database, looks up the host, and determines all the information about the host and its state: what OS it's running; which property (which project, in Keystone terminology) at Yahoo is using the host; its hostname; its config, which is its flavor; all that data. It brings all that data together and runs a nova bare metal create command, loading the host into nova bare metal as if it were just a piece of inventory. After that inventory has been registered, it issues a boot call, a nova boot with the no-re-image flag, and OpenStack goes through what it thinks is the process of booting a new host. However, when it gets to actually rebooting the host and actually re-imaging it, it does nothing; it just passes right over those steps. That's what the no-re-image flag does. So this is basically mimicking all the steps so that OpenStack thinks it's the one that booted the host, and the exact state of the host ends up in its database. One of the final things the script does is make a call out to our asset database and flip a flag, a Boolean field meaning "is OpenStack managed", to true.
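The steps just described can be sketched as one small pipeline: pre-validate the asset record, register the node, issue the stubbed no-re-image boot, and flip the managed flag. All names and record fields here are illustrative, not the actual script:

```python
# Sketch of the horizontal migration script: validate, register, issue
# the stubbed boot, then hand ownership of host state to OpenStack.

def migrate_host(asset_db, hostname):
    record = asset_db[hostname]
    # Pre-validate: an empty config string would import the wrong flavor.
    if not record.get("config"):
        raise ValueError(f"{hostname}: missing config string")
    steps = [
        f"register {hostname} in inventory as {record['config']}",
        f"boot {hostname} with no-re-image for tenant {record['project']}",
    ]
    # From here on, legacy tooling alone can no longer re-image the host.
    record["openstack_managed"] = True
    return steps

db = {"h1": {"config": "webserver", "project": "frontend"}}
migrate_host(db, "h1")
assert db["h1"]["openstack_managed"] is True
```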
From the point that flag is true, only the OpenStack service can enact changes to that host. The question is whether we image the bare metal machine. No. To be clear, we take a host in its current state and do nothing to it directly. It never changes, it never reboots; nothing changes. We mimic the process of doing those things in OpenStack. We basically lie to OpenStack and convince it that it's booting a new host, but it actually isn't; it's non-destructive and touches nothing. The only thing the project would notice is that a value flipped in our asset database; nothing else happened. From that moment on, if they do a nova list, the host will simply have appeared: it was in the asset database before, now it shows up in nova list, and they can do a nova reboot or whatever they want to it. The next question is about networking. Last question, okay. The question is what we do to import all the network data. When we import the host, we look up its IP address and stick it in the nova-network database; we're using nova-network for our bare metal right now. We don't do anything special with it; we just record that data in the database. Sir? So this is the last question, about the flavor. When we look up the host for the first time in our asset database, we look up what its flavor is supposed to be, and if it does not yet exist in the flavor table, we create that flavor. We use the flavor name for both the ID and the name. Yeah? Sure. So if every type of host were a different configuration, you would end up creating a flavor for each of them? Yes, that's exactly what would happen. That was the last question. Thank you.