Hi, welcome. We're here today to talk about some of our fun scaling and deploying Neutron at Rackspace — some of the things we ran into, and some of the things we think need to get improved.

So let's get started. Right now Rackspace is still fairly early in our deployment. We've done about half the DCs, and we'll continue with the other half when we get back from the summit. We're moving from early versions of Quantum and Melange fully into Neutron. The primary focus of this talk is the interaction between Nova and Neutron, and some of the things that can get out of control when you operate at a big scale.

When we talk about scale, we're talking about tens of thousands of compute nodes, hundreds of thousands of instances, and most of those instances on two or more networks. That's a lot of calls that can come into Neutron.

When we decided what we needed to deploy Neutron into and what we needed to achieve with it, we had a couple of main criteria. Obviously, any time you replace a service, you need to make sure everything you were already using still works — those same APIs need to be there. We also really wanted Neutron to now be the authoritative source for all network data in our public cloud. Yes, we have different things below it,
but Neutron is the ultimate state controlling all of that. Also, if we're getting rid of Melange, Neutron now needs to be able to take care of our IPAM in an open-source fashion.

At Rackspace we also have a lot of different types of networks — overlays, bridge ports, and coming up we're going to have containers and a lot of other things that Neutron is going to be plugging into — so we needed a very modular backend. No matter what network you're plugging into, Neutron has a driver that can orchestrate it. As all the new products in our wide portfolio get integrated into Neutron, we have an easy way to bring them in, and it's much more upstream-friendly.

What ended up coming out of those requirements is Quark. It's an open-source plugin that we wrote in-house for the Neutron v2 API, and it comes with all the IPAM we needed to keep track of all the IP addresses in our public cloud.

We needed a couple of other things to orchestrate the conversion to Neutron. Along with the Quark plugin that runs on Neutron,
we also needed a database migration to take the data out of Quantum and Melange, aggregate it, and put it into our Quark plugin.

Also, upstream generally has one view of what the API should do, and sometimes Rackspace's business requirements require that we do something slightly different. We don't want to keep going to upstream with every little niche case we need, so we implemented a stack of middlewares we call Wafflehaus, to tweak some of those business requirements while still maintaining the regular Neutron API.

As for what we actually set up and deployed, it's generally three tiers. Up front, taking the first API requests, are load balancers. Those do the health monitoring of the API nodes and direct requests to the active nodes. We actually have quite a few API nodes — they scale out horizontally, anywhere from two to eight depending on the size of the DC and the scale that DC needs to achieve — and those run our Quark plugin underneath the API node, with Wafflehaus in the API stack. Then we borrowed from a lot of what we already do with our other Nova services, in that we already have playbooks to build very highly available databases, with Corosync, Pacemaker, and STONITH to respond very quickly to any kind of outage along the database.

With that, I'll turn it over to Justin, who's going to talk a little more about our Wafflehaus implementation and what we can achieve with it.

Hi everyone. I don't know what's more comfortable, standing or sitting in those chairs.
They're pretty terrible. Well, Wafflehaus, as he mentioned before, was a way for us to meet our own specific requirements without having to push things upstream. These requirements are very, very specific, and they probably wouldn't make sense for upstream. It also helps prevent us from carrying other differences from upstream code, so we don't have to constantly merge them in during our deploy process and then deal with conflicts — that's terrible. And we feel that upstream's efforts would be better spent helping the broader community instead of dealing with all these nitpicky, tiny things.

At Rackspace we jokingly call it the API mullet: business logic in the front and a party in the back. It really does help with dealing with the business logic for a company — any company, really — but it still allows us to do all the stuff we've got to do in the backend.

Here's a very basic example — I have two or three examples of what Wafflehaus does, and by the way, this is an open-source project that you can freely put in front of any Neutron or Nova environment. A request comes in, and just like the normal middleware pipe-and-filter model, it hits the Wafflehaus middlewares — those are called waffles, by the way. The request hits the first waffle, and what we want to check, for this example, is whether the request has a particular UUID in it. It could be a network ID or a port ID.
It doesn't matter which: since the waffle sits with the request, it has access to the whole body, so you can introspect the body and learn whatever you need from it.

Another thing you can check is whether the request violates policy — say you don't want a person to have a particular CIDR, some very specific range you don't want people to have — or it could be IP policies, for instance.

Or what if Quark, our plugin, wants an additional piece of information that Nova doesn't normally provide to Neutron? Using this, we can have a waffle insert that new information, either by querying some external service — it could be Keystone, or it could query Nova again for more information — and then just pass the request through as if nothing happened. Neutron is none the wiser, and it goes through and does its job.

Another thing we do is called routing. Keystone has roles, so you can apply a role to any one of your tenants. If there are certain tenants you want to do very specific things with — such as enforcing that a tenant has these particular two networks, ServiceNet and PublicNet for instance, plus some additional networks for your own business logic — you can use this routing waffle. It introspects the role from the headers or from the context, and then picks a different WSGI path. If the tenant isn't part of that particular role, the request goes through normally, like nothing happened; but if they are, it inserts these other waffles in front of them and lets you do more checks. This is very helpful because it doesn't change Neutron at all, and Nova really doesn't know anything about it either.
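The waffle pattern described above can be sketched as an ordinary WSGI paste filter. This is a minimal illustration under assumptions, not Wafflehaus's actual code — the class name, the `network_id` field, the 400 response, and the constructor options are all hypothetical:

```python
import io
import json

class RequireUuidFilter(object):
    """Sketch of a 'waffle': reject requests whose JSON body lacks a
    required UUID field (e.g. a network or port ID). Hypothetical names;
    real Wafflehaus filters are feature-toggled off by default, which is
    mirrored here with enabled=False."""

    def __init__(self, app, required_field="network_id", enabled=False):
        self.app = app
        self.required_field = required_field
        self.enabled = enabled

    def __call__(self, environ, start_response):
        if not self.enabled:
            # Toggled off: behave as if the filter were not installed.
            return self.app(environ, start_response)
        try:
            length = int(environ.get("CONTENT_LENGTH") or 0)
            body = environ["wsgi.input"].read(length)
            data = json.loads(body or b"{}")
        except (ValueError, KeyError):
            body, data = b"", {}
        if self.required_field not in data:
            start_response("400 Bad Request",
                           [("Content-Type", "text/plain")])
            return [b"missing required field"]
        # Restore the consumed body so downstream apps can re-read it.
        environ["wsgi.input"] = io.BytesIO(body)
        environ["CONTENT_LENGTH"] = str(len(body))
        return self.app(environ, start_response)
```

Because the filter sees the full WSGI environ and body, the same shape also works for the info-injection case: instead of rejecting, a waffle could add a key to the parsed body before handing it downstream.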
So it's very quick and very easy to debug, actually.

The first reason we used Wafflehaus was that there were a lot of calls to Keystone whenever Nova and Neutron interacted. Before, when we had Melange and Quantum, none of those calls were ever reauthorizing or contacting Keystone or identity. But then we went to Neutron, and Neutron by default is like, "hey, you need to re-auth all the things." Every single request would be: "Is this token valid?" "Yeah, okay" — and then it would hit Keystone again. Given the amount of traffic we had and the amount of info-cache updates — which Andy will get to later — it turned out to be debilitating.

So we added noauth support to Neutron — which it already had, kind of — but using Wafflehaus we were able to actually make it work. Our Nova works with normal Keystone authentication, and Neutron on the other side has noauth, so every single request is just perfectly not authed.

The API request comes in with the X-Forwarded-For header and hits the Neutron API with Wafflehaus. Wafflehaus does a PTR query on the original requester against the DNS server, and if that hostname is part of our configured trusted domains, we do an additional step where we make sure that host's address record really points back at the original requester. Mail servers have done this for a long, long time.
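The forward-confirmed reverse DNS check described here can be sketched as follows. This is a hedged illustration, not the Wafflehaus implementation: the resolver arguments default to stdlib lookups but are injectable so the logic can be exercised without real DNS, and the trusted-domain handling is an assumption:

```python
import socket

def is_trusted(ip, trusted_domains,
               ptr_lookup=lambda ip: socket.gethostbyaddr(ip)[0],
               a_lookup=socket.gethostbyname):
    """Forward-confirmed reverse DNS: PTR-resolve the requester's IP,
    check the resulting hostname against a domain whitelist, then
    confirm the hostname's address record points back at the same IP."""
    try:
        hostname = ptr_lookup(ip)          # reverse (PTR) query
    except OSError:
        return False
    if not any(hostname == d or hostname.endswith("." + d)
               for d in trusted_domains):
        return False                       # not in a trusted domain
    try:
        return a_lookup(hostname) == ip    # forward confirmation
    except OSError:
        return False
```

In a deployment, the `ip` would come from the X-Forwarded-For header mentioned above, and a failed check would cause the filter to reject the request before it reaches the noauth pipeline.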
So this is pretty normal. By doing this, we at least know that a request going through noauth is coming from a trusted source.

This may sound a little complicated, but here's how you would configure such a thing. At the top is the normal composite in the api-paste config — all of this Wafflehaus junk happens inside api-paste, so you can deploy it with Puppet or Ansible or whatever you need to use. You have the normal composite where it says noauth, and the noauth Keystone strategy we're using has the DNS filter portion right at the front. You can see that we took out the authtoken and keystonecontext middlewares for our noauth, so we're no longer using those middlewares as provided by OpenStack. The configuration for the DNS filter is right below, and our whitelist is a space-separated list of domains. You can also see that it says enabled = true, because all of the Wafflehaus features are toggled off by default — you can install it and it will not affect your environment until you explicitly enable each one. So it's relatively safe, and you can turn it off and on by changing that flag.
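Something like the following api-paste layout matches the description above. This is a hedged reconstruction of the slide, not its exact contents — the filter section, factory path, and option names are illustrative assumptions:

```ini
# api-paste.ini (illustrative). The DNS filter sits at the front of the
# noauth pipeline; the stock authtoken/keystonecontext filters have been
# removed from it.
[composite:neutronapi_v2_0]
use = call:neutron.auth:pipeline_factory
noauth = dns_filter request_id catch_errors extensions neutronapiapp_v2_0

[filter:dns_filter]
paste.filter_factory = wafflehaus.dns_filter:filter_factory
# Space-separated whitelist of trusted domains.
whitelist = example.com example.net
# Wafflehaus features are off by default; each must be enabled explicitly.
enabled = true
```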
In addition to the auth problems, we also had a lot of other call-volume issues, which Andy will talk about now.

So here's a picture of one of our deployment nights. The green is the call volume we had pre-deployment, the two vertical bars are the start and finish of the deployment, and the red is the number of calls to Neutron after the deployment. We had nearly triple the call volume just by turning Neutron on. This was a big problem — our API workers were overloaded. So we had to really dig down and see where all this was coming from, and it turns out it's the info-cache updates from Nova: almost a hundred percent of these requests are those info-cache calls.

Really quickly: the info cache is Nova's view of the network model, so Nova can respond quickly to requests — say a server list, which returns the network info — without making a call all the way to Neutron to service the request. It keeps its own copy of the network information. This cache is refreshed on any operation that goes out to Neutron: if you add an interface, remove an interface, add a floating IP, whatever, the cache gets refreshed from the Nova side. We would really like to see a callback system for this kind of thing, because these updates happen a ton. They happen on nova-compute restarts, which can be very interesting if you're restarting several thousand of these at once — you can get a big stampede of requests coming in. They also happen by default every minute, on every heal_instance_info_cache_interval, and that's six calls to Neutron per port. These things were crippling our API nodes.

It turns out we just set the value to zero, and we don't have these updates happening anymore. There is a little bit of risk to the consistency of the cache, but since any networking operation is going to refresh it anyway, it's been fine for us in production to just have that at zero, and it completely reduced our call volume.
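The change described above is a single nova.conf option (flag name as in the Icehouse-era releases discussed here; the trade-off noted in the talk applies — the cache is then only refreshed by actual networking operations):

```ini
# nova.conf on the compute nodes
[DEFAULT]
# Default is 60 (seconds); 0 disables the periodic info-cache heal task.
heal_instance_info_cache_interval = 0
```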
Oh And speaking of cache updates, there is some issues around that with Nova cells This is just kind of a graph of our global sales workers rabbit in queue after that deployment that we saw and Let's say The messages that were coming into that queue were coming in at a rate faster than they can be consumed This was causing all kinds of chaos for us. It builds were being stuck. They would be issued They would be building but they would never quite go active because that requires the message to go back up to that global queue. So For an idea of context Our normal kind of fluctuation in this graph wouldn't even show up here This was just exponentially larger in terms of message volume We found this during a deployment and we sent this patch upstream, but we since Disabled the healing of info cache updates anyway, so we're double safe from that. I guess Kind of the things that we need as operators for Nova and neutron to work at the scale that we have Quickly is a callback system the periodic update Just didn't really seem to work for us at the scale that we're at We'd also enjoy benefits of having a read-only slave for some calls that would be read only Some of that needs to be worked on in our plug-in specifically, but I think the neutron project itself could benefit from that as well and Neutron does work with cells, but it's It's not something that cells is cells isn't like native to neutron We only have one collection of neutron workers. We don't have a collection of neutron workers per cell Right, so we would really like to see cells and neutron work together so that we can kind of segment a lot of this load that we have to deal with and Yes, we would like to have fewer calls that do more so building my instance Getting my ports and my addresses and all this stuff I really like it to just be one call for all of the networks that I'm going to need instead of a Back-and-forth that says okay. What do I have? Oh here it is. Okay. 
here are the ports I can use; here's what I want." It's a little too chatty for the scale we're at, so we would just like fewer calls that do more.

So, what's next for us at Rackspace: we're looking to publicly expose Neutron later this year. One of our team members is also doing blueprints in the community for security groups with OVS, so we're really excited about that. Here are some links to patches we've submitted that may be particularly relevant for other implementers, and to blueprints we're working on, specifically around the callback system I mentioned earlier and the OVS firewall driver. We also have links to the Wafflehaus middlewares we use for Nova and the Wafflehaus middlewares we use for Neutron, and we've linked our Quark Neutron plugin, which handles the IPAM these guys mentioned earlier. Thanks. Any questions?

Q: Do you think the current bulk-update support could be used in place of what you're doing for multiple updates?

A: Sorry, that mic was really quiet — bulk updates? How effective are bulk updates for Nova and Neutron compared to what you have, and did you attempt to use them? Well, I believe the problem wasn't that we couldn't make many ports at the same time. The problem was specifically that it was a bunch of steps Nova doesn't already do in bulk — a question-and-answer conversation between Nova and Neutron, rather than Nova just saying "allocate for instance," with all the information, to Neutron.
If Neutron had, say, an "allocate for instance" call, that would work great, versus Nova doing all the Neutron calls one by one by one — which it has to do, because it doesn't know the network yet.

Q: At the cell level, what is missing in the Nova-Neutron interaction?

A: Cell interaction with Neutron. We can't have a Neutron node per cell. We break our deployments into several different cells per region, and we can't have a Neutron instance per cell with one above it at the region level that the cells replicate up to. That's not supported right now — it doesn't exist in Neutron — but it would help us with some of our scale issues.

Q: Do you think the support for regions that's being introduced will help?

A: We have to point all of our cells to one bank of Neutron servers now, instead of logically segmenting them out.

Q: There was a good presentation from eBay yesterday about similar problems with Neutron at scale. They mentioned the healing interval, which they had to tune down as well. They also mentioned Keystone call volume, and they went to PKI tokens for that. How did you find the Keystone volume?

A: Well, the Keystone volume was terrible, but instead of trying to change it, we switched to noauth between the services, and Wafflehaus was how we did that without changing a lot of code. It appears that OpenStack works great with Keystone, or it works great with noauth; it doesn't appear to work great as kind of both, and that's where the Wafflehaus filter comes in, because it creates the context for you. When you're in noauth you don't really have a good context to work with, and it also inserts some of the missing information, such as the roles, which we use to make some decisions.

Q: So you did keep Keystone for Nova, though, right?
You just turned it off from Nova to Neutron?

A: Right. Nova is our main consumer of Neutron, and we just turned it off between those modes of communication. We did an operational-readiness exercise, noticed this call volume, extrapolated it to scale, and said, "oh my gosh, we've got to do something about this."

It's also worth mentioning on that point that Neutron right now is only servicing the internal stuff — we're only servicing the computes, nothing directly from consumers or customers. When we actually publicly expose it, there will be separate nodes running the full auth stack.

Q: With Wafflehaus, would you characterize it as a framework? It seems like at its core you're just adding filters to the WSGI pipeline. What exactly is Wafflehaus — a framework to chain those together on top of what you already get out of api-paste?

A: To be honest, it really is just a collection of filters. It's a pipe-and-filter model, very basic. Each one of the waffles — the filters — stands alone. There are recommendations in the Wafflehaus code not to assume that another filter exists; some of them have to exist, such as the routing, but otherwise they're very small. Most of them are less than 50 lines and very easy to test. It's just a collection, and all of them have been stripped of our specifics so anyone can use them. We really do wish it becomes a collective place where everyone can put these really strange business-logicky things, so we can all share them without having to bloat infrastructure such as Nova or Neutron with these weird requirements. Cool.
Thanks.

Q: Good presentation — we talked yesterday about the same issues my team faced at eBay. I'm just asking: how do you test before you upgrade your environment? Do you have a huge lab, or do you use your cloud itself to test all the upgrades?

A: Rackspace as a whole has a pretty big CI/CD dev environment, so we go through dev and QE, and we've got some very talented QE devs who have written a lot of our scale and break tests. We emulated all of the info-cache updates to see where the breaking points were, and the same with the API volumes, at the same time. We've got a fairly big QE and integration testing environment where we can do those kinds of tests. We also went through, as I mentioned earlier, an operational-readiness exercise, in which we identified metrics to watch in our smaller pre-prod environments, looking for abnormalities and jumps to pinpoint bottlenecks — hot zones of the code — to help us make the decisions to make it work at scale.

Q: Is your CI infrastructure open source anywhere? What test tools do you run in your performance lab — are they open source, or internal?

A: For testing, in getting ready for this deployment, we made two major efforts. One was functional testing, which is equivalent to what's done upstream with Tempest.
We have our own functional testing harness, but it's essentially the same thing as upstream's Tempest. For the performance side, we have a harness we call "performance testing as a service." You can think of it as a REST request engine: we write a script, deploy it to a number of agents, and with those agents we can simulate any number of users — say, a thousand. That's how we create load. We built three load tests: one does basic CRUD operations on ports, subnets, and networks; another simulates compute nodes doing the info-cache update cycle — we simulated loads of 100, 500, and a thousand computes, with a number of simulated instances updating the info cache every minute, as they mentioned; and the third load test we run against the system is an IPAM test.

Q: So is it open source, or internal tooling?

A: I'm not sure if it's open source, but if you talk to me later, we can discuss it.

Q: Hi, you're running multiple Neutron instances in your deployment, right? So how are they synchronized? Is there a mechanism to synchronize, or is it not required?

A: Luckily, that's one of the big benefits of moving to the Neutron v2 API — everything happens transactionally, so we can scale these out horizontally and not worry quite as much about the sequence of things happening. Anything else?
Yeah — and quite a bit of what went into Quark was making sure those transactions happen very cleanly across several worker nodes. That's kind of what the plugin does; feel free to check it out.

Q: On almost the other side of things — the network performance itself — how did you ensure the actual performance of the network would stay the same?

A: Most of Neutron is about the orchestration of the configuration. The network performance itself lives a bit above that — in OVS and the flow orchestration — which isn't changing by switching to Neutron.

Q: You've split things up — noauth for internal nodes, and eventually you're going to publicly expose Neutron and need the full auth stack. How do you deal with that?

A: Those will be separate nodes, so we'll have a different pool to handle the customer load versus the internal load. As for the actual scale of the cloud, we benefit at Rackspace in that all of the infrastructure controlling our public cloud runs in another cloud, so we can dynamically spin up the resources we need to build those things out. Anything else? There's got to be something — mic, please.

Q: What's your lead time to take code from upstream and deploy it in production?

A: We pull down upstream daily, and then it goes through our build process — there have been a couple of talks here about Rackspace's approach to that — but I think what's running now is based on a release candidate for Icehouse.

Q: Okay, thank you.

Well, thank you all very much.