All right, good evening, ladies and gentlemen. My name is James Penwick. I'm with Yahoo, and I'm here to talk about what we've been doing with Yahoo, Ironic, and Neutron, and how we've been bringing these three things together. To start with, this talk was originally proposed by Shradha, my colleague. However, she was unable to make it, so I'm presenting it in her stead. I just want to make sure it's known that she did all the heavy lifting on this.

So why am I here? Yahoo operates at mega-scale. It's a very large company with a very large infrastructure, and we've been around for a long time. We were one of the first mega-scale computing companies out there, and as a result, when we were overcoming a lot of scaling challenges and designing our data centers and our applications, there was no one else out there for us to learn from. Fast forward to today: there are a lot of things Yahoo does really, really well, and we've learned a lot of really hard scaling lessons. There are also some things we don't do as well, bits of our history, a little bit of legacy from our past. Now, when we try to wrap OpenStack around our entire infrastructure, we can't just exclude the things that are inconvenient; we need to be able to address them. Specifically, the only way to truly move out of our legacy and onto a modern commodity compute platform, to truly wrap our infrastructure with an API, is to support these old use cases and ease the transition from the old to the new.

So why I'm here is to share some of the interesting work we've done with Neutron. Some of these things are probably very specific to Yahoo's environment, so they're not necessarily something anyone else wants, and if no one else is interested in them, we're not going to upstream them. But some things are representative of what we've seen at scale in our experience. Those we do want to upstream, and in many cases we have specs up.

Now, speaking of our environment, our model differs from a lot of other private and public cloud companies in that we're kind of a hybrid. We have a very large infrastructure and a lot of internal customers, which ultimately means many different tenants, or projects. However, it's one company, and this relationship of many tenants all under one umbrella has some interesting implications for how we design our network. For example, we don't have tenant networks; we don't try to isolate our tenants from each other in that way. We have security zone constructs, but in general we want our tenants to be able to talk to each other, because many of them comprise different pieces of the same application.

A little more about that network design: our legacy networks are layer 2, but all of our modern, go-forward networks are multi-dimensional folded Clos, layer 3 designs. Those are awesome and performant, but we still have to be able to support the old layer 2 networks. Other interesting things about us: all of our tenants share the same network space, and we don't have any overlapping IP space. Globally, none of our IP space overlaps, which is definitely a bit of a change from a lot of other deployments out there. Bare metal hosts all have static IPs, which I'll get into in a bit. And we do not use a separate provisioning network for bare metal.

So in terms of scale, Yahoo operates at hundreds of thousands of physical nodes.
Our VMs are in the tens of thousands. Our legacy provisioning system could handle hundreds of concurrent boots. With Ironic, if you saw Simon's talk, we're currently managing hundreds of concurrent boots, but we're targeting 10,000 plus. So these are the kinds of constraints we take into account when we design.

Internally, we also have to support a lot of internal applications. Whereas OpenStack is the source of truth for the logical state of our infrastructure, we have a CMDB, a config management database, that tracks the physical state of all our machines: what facility each one is in, what rack it's in, what power supplies it's attached to, all that stuff. We have a separate DNS system. It's not Designate, but we've had it for a while, so we need to be able to integrate with it. We use Chef, and we have a whole variety of other internal systems. So we need to be able to bring OpenStack in, make it performant, and make it work with all the stuff that was already there.

To do this, we've had to make a few changes. We kind of tongue-in-cheek refer to these as hacks, but they aren't just simple hacks; we've actually invested effort in doing a lot of these things right.

Change number one: static IPs. I've drawn a little picture to help with this one. Our legacy networks are layer 2, and as many of you know, layer 2 domains have some intrinsic scaling limitations, spanning tree for example. If you have a lot of different subnets and a lot of different VLANs on the same layer 2 domain and a lot of notifications going across, you can have a spanning tree meltdown. Something we've done to help prevent this is to implement automation that, during a nightly automated change window, finds unused VLANs on switches and removes them.

To illustrate: I have a couple of racks here and two VLANs. The orangish one is VLAN 100 and the green one is VLAN 102. Let's say in the left rack the tenant using those three green machines doesn't need their instances anymore and deletes them; those machines are now just inventory. Someone else comes along and boots some hardware, and they may just happen to end up on VLAN 100. When this happens, the green VLAN no longer has any hosts on that top-of-rack switch, the ToR, so that night, during the automated change window, that VLAN gets removed from the switch trunk and from the backplane as well. The problem is, of course, that if those instances are deleted and re-created under a different VLAN, we could configure Neutron to handle that, but the trunk change has to occur during an automated change window. Commodity compute and infrastructure as a service doesn't do you any good if you have to wait 24 hours to boot a host because the situation isn't exactly right.

So to mitigate this, we've said, OK, all of our hardware right now is going to use static IPs. We're just going to make sure we have an IP set aside for each host, and when we go to boot it, we will always use the same IP address. We do want to dynamically select IPs for our hardware, and we'll move towards that when we can, but for right now we have to address our lowest common denominator.
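As a rough illustration of what that static IP reservation looks like at the API level, here's a minimal sketch of creating a Neutron port with an explicit fixed IP via python-neutronclient. The credentials, network ID, and subnet ID are placeholders, and this is not our actual automation.

```python
# Minimal sketch: boot a bare metal host on the IP reserved for it by
# creating its Neutron port with an explicit fixed_ip (placeholder values).
from neutronclient.v2_0 import client

neutron = client.Client(username='admin', password='secret',
                        tenant_name='admin',
                        auth_url='http://keystone.example.com:5000/v2.0')

def create_port_with_static_ip(network_id, subnet_id, mac, ip_address):
    """Create a port that always carries the IP set aside for this host."""
    body = {'port': {
        'network_id': network_id,
        'mac_address': mac,
        'fixed_ips': [{'subnet_id': subnet_id, 'ip_address': ip_address}],
    }}
    return neutron.create_port(body)['port']
```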
How does this whole process work? Our site ops personnel, the people who are on site physically installing servers, get the call to install the gear. They find an available IP address for it. They make sure the trunk is set on the switch port and that the access setting on the switch port is correct. They rack the hardware and turn it over. At that point, we have automation that picks this up and imports it into Neutron, and inside the node definition we store the IP address. Thus, when we go to boot hardware, we look up the IP that's already in the node definition, make a call to Neutron, make sure we get the correct port, and turn up the host. So will we upstream this? Maybe. If anyone else needs this, come talk to me and let us know, and we'll be happy to work together to share and upstream it. Otherwise, if we're the only ones in the world that need this, we'll just kind of hang our heads in shame and hide this one.

The next thing we did is a single-process dnsmasq driver. By default, dnsmasq runs as a separate process for every subnet. At mega-scale, that would be a lot of subnets and a lot of processes, so it wasn't going to scale for us. What we did is modify things so that we just point dnsmasq at a directory and dump config files in there; it gathers all the data it needs from all the config files in that directory and serves DHCP for those subnets with a single process. That was great, it was very handy, and it got us to our testing phase. However, we felt dnsmasq still didn't scale. So if you're using dnsmasq, you're seeing this problem, and you'd like some code to help with it, maybe we can share this with you and you can make use of it. Otherwise, we quickly realized dnsmasq wasn't going to work for us, so we moved on to the next thing: ISC DHCPD.

We looked at what dnsmasq does and said, look, this is not proven to us at scale, we found it difficult to properly debug issues, and every time you want to image a new server, you have to restart the process. That's not something you really want to be doing in an enterprise-grade system. ISC DHCPD, however, has OMAPI, an API you can call to dynamically add hosts. The only time you ever need to restart ISC DHCPD is when you're adding a new subnet, which is a far less common occurrence than simply imaging a host, so that's not too bad. Our legacy provisioning system also uses ISC DHCPD, so we have experience using it at scale, and that gave us a lot of faith in it. It works great, actually. So we've written the driver, and we manage this thing using OMAPI.

As for the current state: there are two ways you can interface with OMAPI. There is omshell, which is a command-line utility, and there is pypureomapi, a Python library. The downside of pypureomapi was that it was GPLv2, not Apache licensed. So we originally wrote this against omshell, which is pretty clunky because you're shelling out to call a command-line utility. However, we worked with the folks who wrote pypureomapi, and they actually changed the license to Apache v2. We've now done all the work to move to pypureomapi, and in our next release we're going to switch to it. I have some additional information on why we made the switch, and we have already put the spec up for this; it's ready to go, and it's linked at the bottom of the slide.
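To give a feel for what driving ISC DHCPD over OMAPI looks like, here's a minimal sketch using pypureomapi. The server address, port, key name, secret, and addresses are placeholders, and this illustrates the library rather than our driver.

```python
# Minimal sketch of managing ISC DHCPD over OMAPI with pypureomapi
# (placeholder server, key, and addresses; not our driver code).
import pypureomapi

KEYNAME = b'omapi_key'
BASE64_SECRET = b'+bFQtBCta6j2vWkjPkNFtgA=='  # base64-encoded shared key

omapi = pypureomapi.Omapi('dhcp.example.com', 7911, KEYNAME, BASE64_SECRET)

# Add a host entry so the node gets its reserved IP at boot time,
# without restarting dhcpd.
omapi.add_host('10.20.30.40', '52:54:00:12:34:56')

# Remove the reservation when the node is torn down.
omapi.del_host('52:54:00:12:34:56')
```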
You can barely read the link, I know; that happens. So if you want, just drop me a line and we can send you a link to the spec, which you can find up on Gerrit. One thing I will call out, and I put it in red: just like the dnsmasq driver, this does not support overlapping IP space, because, as I said near the beginning of my talk, we don't do that. It wasn't relevant to us, so it wasn't something we felt we needed to think about.

The next few hacks are things that we are either still in the process of writing or are just about to go out. One of them is the multi-gateway subnet. Like most of you, we are fanatical about uptime. We want to keep our website online so you keep using all our products, clicking ads, and paying for my house. Part of that fanatical approach to redundancy is that, on top of power, network, and all the other things we have redundancy for, we make sure that all hosts in a given segment of a data center, down to the smallest piece we can go, have two different gateways, and we actually alternate which default gateway each host uses. It's a simple algorithm: you take the last octet of the IP modulo the number of gateways, plus one; there's a little sketch of that selection logic a bit further down. The problem is that Neutron doesn't support this and doesn't know anything about it. So we simply added it, or rather, we're working on it. There's a bit of a trick here, because it's a schema change in Neutron's database tables. Today there's a single column for the gateway IP, but we're going to have multiple gateways, anywhere between one and many, depending on exactly which network, which part of the world, et cetera. So we're working through this because it has a schema impact and an API impact, and we really do think it should be an upstream patch. This is one of those things, that perspective of scale, that we'd like to share with the rest of the world, so we want to upstream it. If it's something you're interested in, please, let's work together and get it in.

All right, the next thing: multi-IP assignment. That might sound a little weird, since you can easily add additional IPs to a node, but the trick is that we need to be able to assign extra IPs to a host in the same subnet as the host itself. We need to make sure that when you do a nova list, you can see all the different IPs, and possibly hostnames, that have been assigned to it, and we need to stay in sync with all of our internal systems. Also, as part of our requirements, we're simplifying it: we're not actually going to attach the extra IP to the host itself; we're going to reserve it and then make it possible for you to log in and alias the IP yourself. Now, it sounds easy, but it's really not. Neutron, and even Nova Network, make the assumption that you have a static IP pool and a floating IP pool and that these are two different subnets, and there is also the assumption that Neutron only allows one IP per subnet per MAC address per port. That makes sense, right? It's like a NIC; you're not normally going to put multiple addresses from the same subnet on the same NIC. So that's what we had to work around. I've skipped ahead of myself there; that happens.
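Here's that little sketch of the gateway selection logic. It's my illustration of the algorithm as described, so the exact handling of the plus-one offset is an assumption; the point is just that hosts spread deterministically across the redundant gateways.

```python
# Sketch of spreading hosts across redundant default gateways: the last octet
# of the host IP, modulo the number of gateways, picks the gateway. (The "+1"
# in the described algorithm presumably maps onto how the gateways are
# numbered; that detail is an assumption here.)
import ipaddress

def pick_default_gateway(host_ip, gateways):
    """Deterministically choose one of the redundant gateways for a host."""
    last_octet = int(ipaddress.IPv4Address(host_ip)) & 0xFF
    return gateways[last_octet % len(gateways)]

# Hosts with an even last octet get one gateway, odd the other.
print(pick_default_gateway('10.20.30.41', ['10.20.30.1', '10.20.30.2']))
```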
Another issue is that there is no IP quota support in Neutron. In Nova Network, you could actually control how many floating IPs someone could attach; in Neutron, we found there is not yet quota support for this, which I think just got skipped. That causes some problems when you're trying to make sure that one person or one group doesn't accidentally go and snake all of the available IPs.

We also have another issue: what happens if all available IPs in a rack have been exhausted? If you go to provision something, it finds a compute resource, everything looks great, and then it goes to grab those extra IPs and there aren't enough, it returns an error, and then people hate me. Users get mad and go after our service engineers, the service engineers then get mad and come after me, and as a result everyone hates me and I'm sad.

So the solution we're looking at right now is to have several different allocation pools for static IPs and extra IPs. The way this works is that we take a subnet and artificially split it up. We reserve some IPs for our network devices, so our gateways and broadcast addresses. We reserve a section of the IP space for the hosts in the rack; if there are 40 hosts in a rack, we reserve 40 IPs for those hosts. Then the remainder of the pool we set aside and attach to the floating IP construct. To handle the quota aspects, we do want to look at getting this into some kind of quota support in Neutron, but what we're doing right now is just using Nova's old floating IP quota. It still works, so we're using it.

And of course, the next thing is that we have to be able to schedule on IP availability. We learned that, obviously, if we get this wrong, users are not very understanding about it; they get upset and come after us. So we need to make sure that when they go to schedule a resource, it's going to succeed. Our current thought is to create something called the IP availability filter, which can search for a network with available IPs. What this would do is, at provisioning time, call out to Neutron, find a network that has enough free addresses, and use that filter when selecting compute nodes; there's a rough sketch of what such a filter might look like just below. Then, after we find the compute node on which we're going to build, we do one last check to make sure there isn't a race condition and we still have enough IPs free. If there are not, we enable the retry filter so that we quietly start the scheduling process again, find the next host, and try again.
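Here's that rough sketch of what such an IP availability filter might look like. It assumes the classic Nova BaseHostFilter interface, and the two helpers, network_for_host and count_free_ips, are hypothetical stand-ins for the host-to-network mapping and Neutron queries a real implementation would need.

```python
# Rough sketch of an "IP availability" scheduler filter, assuming the classic
# Nova filter interface (BaseHostFilter.host_passes). The helpers below are
# hypothetical: a real driver would map a host to its rack network and ask
# Neutron how many addresses remain in that network's allocation pools.
from nova.scheduler import filters


def network_for_host(host_state):
    """Hypothetical: return the Neutron network ID serving this host's rack."""
    raise NotImplementedError


def count_free_ips(network_id):
    """Hypothetical: ask Neutron how many free addresses the network has."""
    raise NotImplementedError


class IPAvailabilityFilter(filters.BaseHostFilter):
    """Only pass hosts whose network still has enough free IPs for the boot."""

    def host_passes(self, host_state, filter_properties):
        # Primary IP plus any extra IPs requested via scheduler hints.
        hints = filter_properties.get('scheduler_hints') or {}
        required = int(hints.get('required_ips', 1))
        return count_free_ips(network_for_host(host_state)) >= required
```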
The last thing is IPv6. This is one of those things that seems like a no-brainer: just support it. However, again, legacy infrastructure; we have some applications that are, let's say, allergic to IPv6. They just don't support it. Some of these legacy applications we're actually working on deprecating, so there isn't a lot of value in spending time and effort converting them to support IPv6; we're just going to let those be IPv4 only. Of course, we have new applications that are trying to get with the future and want to be IPv6 only, and others that want dual stack. So how do we support this use case? The simple approach we're taking right now is to add a flag that lets you specify which IP type. So you can say nova boot, blah, blah, blah, with an IP type of IPv4, IPv6, or both for dual stack; if you don't provide one, we just take a default value.

The simplest way to do this is that we already have all of our IPs in DNS with a default hostname, so we'll do the same with IPv6 and create a record there. When you go to boot the host, you provide a hostname and we rename it in DNS. If you say IPv4, we do the A record; if you say IPv6, we do the AAAA record; if you say both, we rename both to whatever your custom hostname is. It's a pretty simple way to do it, it addresses the use case, and it's actually fairly simple and beautiful.

So that is the last of the things we were presenting on. If there are any questions, or if anyone wants to come chat about it afterwards, let me know. And if you want any of the code, or you want to work on upstreaming it, that'd be great. Sounds like no questions. Cool. All right, thanks, y'all.