Okay, there we go. Let's get going today. We're going to talk about getting our DNS-as-a-service to production using Designate. I'm Matt Fischer, and this is Eric Peterson; we're both principal engineers at Time Warner Cable who work on the OpenStack team. Clayton O'Neill could not join us today; he actually did a lot of this work, so I want to leave his name and contact info up here.

Today, basically, what we're going to talk about is how we got Designate running in production, starting from our initial investigations. A little bit of background: we typically provide infrastructure-as-a-service, giving people VMs, but the most requested feature we got was self-service DNS. Before we had Designate, the process was: manually go file a ticket, and someone on our team would manually go add the entry. It didn't scale, and it was annoying. Engineers don't really like to work tickets or do manual processes like that.

Another little piece of background: we use VXLAN-based private tenant networks. This means we want to attach records to floating IPs when they're associated, and we also share a single DNS namespace across both our regions. We also share the same namespace with a VMware infrastructure cluster, so that customers can use the same domain regardless of which type of infrastructure they're running on.

So how do you get started with Designate? First you watch all the summit videos from Atlanta. Once you do that, you hop in the IRC channel and you ask Kiall and the other guys there a ton of questions, and just bother the heck out of them, usually. As all this came about, Kilo was still in heavy development. We actually asked, you know, should we just go ahead and use Kilo, and we were told absolutely not, it's not ready yet.
So all this talk today is based on Juno; we'll talk about Kilo here at the end. We read a ton of docs. During this process we learned that the PowerDNS backend was the one we wanted to use, both from a usage point of view and a testing point of view. We also decided at the time how we were going to do the architecture across our multiple regions and have the shared domain space.

I'm going to give you a brief overview of what the architecture of Designate looks like. This is a zoomed-in version: one control node in one region. We have multiple regions and multiple control nodes per region, but for simplicity's sake this is what we'll show. If you look at the designate-api box here, kind of in the middle, this is where all the REST requests come in. We front this two ways: for internal requests and Horizon, we front it with HAProxy; for external requests from customers or CLI kinds of things, we front it with an A10 hardware load balancer.

Further to the right, you'll see a designate-central box. This is where all the heavy lifting of Designate occurs; this is where all the database things happen, where all the data is stored. In our case, we're using a cross-region Galera cluster to store this data, because we want the database records shared across both regions, since the domains are shared across both regions. Designate central also hosts all the PowerDNS backend code, and any communication between the API and central happens through a RabbitMQ message service.

You'll see a PowerDNS box as well. PowerDNS is what serves up the domain data. It's configured to use its own MySQL database, and designate-central keeps this up to date; basically it keeps these two databases in sync. PowerDNS also has a connection to Infoblox.
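A quick aside on that keeps-in-sync job: conceptually, checking whether the Designate and PowerDNS databases agree is just a set difference over the records each side knows about. Here's a rough sketch of that idea; in a real deployment you would query each service's MySQL database, and the dicts below are only stand-ins for those queries.

```python
# Sketch of a Designate/PowerDNS consistency check. Plain dicts stand in
# for the two MySQL databases, mapping record name -> record data.

def find_out_of_sync(designate_records, powerdns_records):
    """Return (missing_in_powerdns, orphaned_in_powerdns)."""
    designate = set(designate_records)
    powerdns = set(powerdns_records)
    missing = sorted(designate - powerdns)    # Designate knows it, PowerDNS lost it
    orphaned = sorted(powerdns - designate)   # deleted in Designate, left in PowerDNS
    return missing, orphaned

designate_db = {"db.ericsstuff.cloud.twc.net": "10.0.0.5",
                "webserver.ericsstuff.cloud.twc.net": "10.0.0.6"}
powerdns_db = {"webserver.ericsstuff.cloud.twc.net": "10.0.0.6",
               "stale.ericsstuff.cloud.twc.net": "10.0.0.9"}

missing, orphaned = find_out_of_sync(designate_db, powerdns_db)
```

The direction matters both ways: a record left behind in PowerDNS still resolves even though Designate no longer knows about it.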
We are using Infoblox to basically serve our external-facing DNS, so it made sense to make this connection here and have Infoblox basically secondary the DNS data from PowerDNS.

Finally, in the middle, something we're going to talk a lot about today is the Sink. The designate-sink is something that essentially listens for Neutron and Nova events and then takes action on them. The kinds of events we're interested in are: somebody associated a floating IP, or somebody disassociated one. When those events happen, we want records to be created automatically, so that customers don't have to go in and click a button and create a record. If they spin up a VM and associate an IP, they get a record; that would be our goal. Okay, Eric.

Yeah, so delving a little bit more into the groundwork to get Designate running in our environment: one of the first things we needed to work on was Puppet, since a lot of our infrastructure is deployed via Puppet. So one of the first things to do was go out to the Puppet community, the OpenStack Puppet modules, and take a peek at them. We started with that, and then we realized that we couldn't get a Designate package from Canonical at the time that we were ready to go.

In addition to that, we were also going to be running this on our control nodes, and any time you're adding a service to a control node that for the most part is stable, you want to be careful how many things you're changing, adding, doing things like that. So we decided to deploy Designate in a virtual environment. Right now this work to have Puppet deploy Designate through a Python virtualenv is a local patch that we're holding.
We're curious to see how other people react to it. We're happy to share this approach with others, but it's not something that maybe the entire Puppet deployment community is ready for just yet.

The other area that we wanted to work in, to add a little bit more polish around, was the UX: the Designate plugin for Horizon. Our end users vary in skill level. Some of them might be very familiar with setting up their own DNS records; some of them might just say, I've got a VM, I need a pretty name to go with an IP address, could you please just make things simple for me?

So another aspect that we looked at was policy enforcement changes. Within Horizon, you can see up above I've got an admin user: they can create a domain, delete a domain. We're really trying to isolate what our customers see and what they have access to for Designate. So we've added policy enforcement to the Designate Horizon plugin. Down below is what a more typical user of our Designate service would see: they don't have the ability to create their own domain.

Just one more example of some more policy support: you can see DNS shows up on the left-hand side.
It does not show up on the right-hand side. We're doing this through Keystone roles. If we grant a Keystone role for DNS to a particular tenant, or a particular user in a tenant, then they'll have these DNS features show up within Horizon. Taking this approach enabled us to run a limited beta and have a service out there in production, but not expose it to every single one of our users just yet. So it enables us to work out some of the kinks and control who has access to it early on.

Another thing that we changed was the ability to create a record. The screen on the left was maybe a little bit harsh for some of our customers, so we've softened that up a little bit. When you want to create a record, some people will know what an A record is and they'll say, oh, that makes sense; some people will say, what's an A record, or a AAAA, or whatever? So we've simplified some of that stuff. We've added some extra error handling and messaging for end users when they don't use the right format for the name, and some additional information like that.

Some other things we needed to consider for rolling Designate out to production: we needed to figure out how we were going to limit things for our end customers. What top-level domain were they all going to share and live off of? We also had to go through and provide some documentation to the end customers, these new folks we were going to be bringing on. What could we set as the right level of expectations for them?
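The role-based enforcement described above uses the standard OpenStack policy.json mechanism. As a sketch, rules like the following would produce exactly that split between admins and ordinary DNS users; the rule and role names here are illustrative assumptions, not our exact deployment.

```json
{
    "admin_or_dns_user": "role:admin or role:dns_user",

    "create_domain": "role:admin",
    "delete_domain": "role:admin",

    "create_record": "rule:admin_or_dns_user",
    "update_record": "rule:admin_or_dns_user",
    "delete_record": "rule:admin_or_dns_user"
}
```

With rules along these lines, a user holding only the DNS role can manage records, while the domain-level operations never apply to them.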
DNS-as-a-service was kind of a new concept for some of our customers, so we had some documentation to work through. Also, we're working with Infoblox, and we had to write a little bit of our own custom synchronization code, our own kind of registration code, to keep Infoblox happy with what Designate thought the state of the environment was.

So, rolling out Designate: our schedule, and where we are. We've got limited use; April was really when we started. As I said before, usage is controlled through Keystone roles, so you have to have this certain Designate role to be able to use it. We've got the designate-sink beta starting kind of after the summit, and really right now we use a tool called nodepool that runs within our production cloud. It spins up hundreds of VMs all the time; it's a tool we use to help support some of our CI/CD infrastructure, and it's using Designate right now. So when it stands up instances, it's really creating a lot of records and tearing them back down. Eating your own dog food in this case has helped us flush out issues and make sure that it's going to work in a reasonable way.

We're looking for general availability here in the next month. A lot of that depends on our upgrade-to-Kilo schedule. We'll look at Kilo; there are changes coming with Kilo that we're looking forward to. But upgrading all of our other infrastructure to Kilo is no small feat, so we're going to have to watch how those efforts track against the timeline.

So what do we offer to our end customers? What do they see? Right now they get their own domain.
You can see that right now it's tenant-based: whatever their tenant name is, dot cloud dot twc dot net. We've talked about giving them the power to choose their own domain, so we might do that; instead of the tenant name, you might be able to have an arbitrary name that you would like. The CRUD operations on the actual records are allowed; the CRUD operations on the domains are not. You can see that in that earlier screen, where I couldn't see the create-domain button. We're doing that because all the synchronization we need to do with Infoblox, keeping everything in sync and making sure all the DNS infrastructure components are in agreement, became extra complicated for us, and it wasn't something we were ready to bite off just yet.

So what would a customer see? I'm going to show this through an example. This is me logged in to Horizon. You can see I've got three instances there: an instance named db, and two instances named webserver. All three of these instances have a floating IP address. And if you go up at the top, you can see that my project, or tenant, is called Eric's Stuff; it's not very imaginative, you can tell a developer did that.

Yeah, so if I've got this information, what would I see for my DNS records? This is what you would see. The interesting thing to point out here is that I had two instances in Nova called webserver, so how do you resolve that problem? The first instance will get webserver.ericsstuff.cloud.twc.net, and it'll get its IP; the db one, of course, will get its own as well. But then, for the ability to disambiguate, to clarify when we've got collisions, we use the floating IP address, except we use it with dashes. And that's how we give that extra webserver instance its own domain name. That's the policy we use.
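The naming policy just described fits in a few lines of code. Here's a standalone sketch of it; the base domain and the helper name are just for illustration, not the real Sink code.

```python
def record_name(instance_name, tenant, floating_ip, existing,
                base="cloud.twc.net"):
    """Pick the FQDN for an instance's floating IP.

    The normal form is <instance>.<tenant>.<base>. On a collision
    (e.g. two instances both named "webserver"), fall back to the
    floating IP with its dots replaced by dashes.
    """
    name = "%s.%s.%s" % (instance_name, tenant, base)
    if name in existing:
        name = "%s.%s.%s" % (floating_ip.replace(".", "-"), tenant, base)
    return name

existing = set()
first = record_name("webserver", "ericsstuff", "10.1.2.3", existing)
existing.add(first)
# The second instance with the same name collides and gets the dashed form.
second = record_name("webserver", "ericsstuff", "10.1.2.4", existing)
```

So the first webserver gets webserver.ericsstuff.cloud.twc.net, and the duplicate gets 10-1-2-4.ericsstuff.cloud.twc.net.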
That's the policy we use so Okay So tools and monitoring is something I worked on for designate And so I want to talk to you about what you might have to do for this First testing we have a standardized test framework that we use I wrote some smoke tests for this This is basically is designate working. Can you create a record? Is it resolvable? The stress test is an extension on this let's blast designate Let's just create like thousands of records simultaneously deleting them Let's see what happens Basically it's designate gonna blow up when this goes out to production. It's in heavy use When we did this we found that was taking a while to get some of the records to be resolvable We found the priority in us has a refresh option set to a few minutes I think so we had it set to every five seconds so that when a record was created it would get out Refreshed out to info blocks very quickly We had some basic timing with this we found 80% of the records were Resolvable within 10 seconds within 20 seconds. It was 98% and by 30 seconds. They were all resolvable Those are just rough numbers But when is my record going to be resolvable as a customer every quest a question every customer asked us all the time so we wrote all these great tests and We found problems The first one is when you try to create and delete thousands of records at the same time the database deadlocks Your API call will time out after 60 seconds and you look in the log and it says the database is deadlocked I believe this is fixed in kilo We don't think we're going to run into this product in run into this in production except possibly with the sink Customers going into the gooey and making a record. 
That's not going to be tens of thousands of operations at the same time. When we saw the deadlocks, we started diving into the database, and there's a concept in there of records and record sets. The problem we found was that there were record sets without records, and things seemed to be out of sync. We talked to Kiall, and apparently this is a normal situation, a known issue, and not something to worry about.

We did see a second database problem, which is something to worry about. Designate has a database; PowerDNS has a database. Those two can get out of sync: for example, a record is gone from Designate but still present in PowerDNS. You now have a name entry that Designate doesn't know about. For these database problems, we actually wrote Icinga monitoring to go look for them. It would probably be easy to have Icinga also fix them, but we want to see the scale of the problem first. To my knowledge, in production we've not seen the orphaned record sets; it was only when I was stressing it quite heavily.

So what do we do for monitoring? With Icinga, as I mentioned, we run basic API checks: is the API responding on the VIP and on the nodes? We do the database monitoring that I mentioned. In the future, I would like to know, for all the records Designate has, are they actually resolvable? Because there are a couple of pieces in this chain: Designate, PowerDNS, and Infoblox, and anywhere along the way that chain breaks, the record's not usable anymore. Finally, for the Sink: the Sink is creating and deleting records all the time. You can have an issue where you have a floating IP that doesn't have a record, and you can have a record that doesn't have a floating IP. I'd like to know when those things happen so we can clean them up.

So at this point I'm going to go into the Sink handler. You can run Designate without the Sink; the point of the Sink is automatic record creation. Your customers might be fine
just going into Horizon and doing this themselves. But the Sink is a pretty important feature just because it simplifies things for them: they automatically get a record, and they don't have to think about it.

An overview of the Sink: the Sink listens to events from Nova or Neutron, registers handlers for them, and does things based on what handlers it registers. The configuration is done on a per-handler basis.

So what were our requirements for the Sink? This might be different depending on whether you're using nova-network, Neutron, or whatever, but basically: when a floating IP was associated or disassociated, we wanted to create or delete a record. Records had to go in the tenant domain, as we've discussed before, and we should base them on the instance name, as Eric showed you: webserver becomes webserver.<tenant>, with the fallback rule if we already have a record of that name. We wrote this code to support multiple tenant domains, although we don't currently use that, and we wanted the names to be flexible, so that if you don't like your DNS records being under ericsstuff, you could possibly set it to something else. So the code is flexible, but we don't currently use that feature.

So how do you get started with the Sink? We started with the Neutron sink handler. It does everything based on floating IP addresses, which is how we got our great idea for the fallback name. It puts everything in a single domain, not a domain per tenant, and it doesn't handle instance metadata, which is something we found out later was important to have. But it was a good basis to start from.
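That listen-and-dispatch shape of the Sink can be sketched minimally. This is not the real handler interface, just the pattern: only a whitelist of notification topics gets acted on, and everything else is dropped.

```python
# Minimal sketch of the Sink's dispatch pattern (not the real interface).
# Only *.end notifications are acted on: the *.start events fire whether
# or not the operation ultimately succeeds, so they can't be trusted.
HANDLED_EVENTS = {"floatingip.update.end",
                  "floatingip.delete.end",
                  "port.delete.end"}

class SinkDispatcher:
    def __init__(self):
        self.seen = []  # events we chose to act on

    def process(self, event_type, payload):
        if event_type not in HANDLED_EVENTS:
            return False  # start events and unrelated topics are ignored
        self.seen.append((event_type, payload))
        return True

d = SinkDispatcher()
ignored = d.process("floatingip.update.start", {})   # may not succeed
handled = d.process("floatingip.update.end",
                    {"floating_ip_address": "10.1.2.3"})
```

The per-handler configuration mentioned above would decide which topics go in the whitelist and which handler each one maps to.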
We also, just for completeness' sake, dug into the Nova sink handler, but since we're using Neutron, we didn't get the useful events out of it that we would need for associating floating IPs. So, as I said, we started by forking the Neutron handler. We took the handler and wrote a CLI wrapper around it, so that instead of putting it out there and waiting for events to come in, we could just exercise it by telling it a floating IP was associated: go do what you need to do, and let's see what happens. Initially this was calling into the Designate REST API, and again, we were targeting Juno for this.

So how do you write a Sink handler? Well, you start by looking at all the messages Neutron sends out, and there are lots of messages, and you figure out which ones are going to be useful. Once you get those messages, you look at the payload and you ask: given the information in this event, which I'm going to show you here in a minute, how can I get all the data I need to create instancename.tenantname.cloud.twc.net? What we found as we did this was that associating a floating IP was great; the event had a ton of information in it. But when we started doing disassociates or deletes, things got to be a problem.

Now, this here is an actual payload from Neutron when an associate event comes in, and I'm going to talk through it in a minute. The first thing you need to know before we get into this: Neutron sends out events at the beginning and end of a CRUD operation, but the event that goes out at the beginning, the start event, happens whether or not the operation succeeds. So you could get an associate start event for an operation that might not succeed. So you can't use that event; you have to use the end event, because then you know it actually went through. The other thing is, we couldn't find any documentation on any of these events, so we just watched them, read them, thought those fields look kind of useful, and saw what happened. This is the payload specifically for a floating IP
update end event. This goes out after a successful associate happens. We know it's successful because we have a fixed IP address and a floating IP address in here. The other interesting thing here is the port ID: this is the ID of the Neutron port associated with the instance. From this information, though, we need to find the instance name, because that was our naming strategy. So what we do is query Neutron for the device ID associated with the port ID, take that information to get the instance UUID, and then go ask Nova what the name is. I had to read the notes on that one; it's a little complicated. We would love for this payload to have everything in it, but that's not a choice we get.

Okay, so what happens with the disassociate event? Remember, we can only use the end event, because the start may fail. The problem is, after the end event, the floating IP is already disassociated, so all the really cool fields we were using are now null. What we did at first was take the tenant ID, go query Designate, walk through every record, and try to find something that matched. This was a reasonable workaround, but it's not very efficient.

So at this time we started digging into the records database a little bit, and we found a bunch of fields called "managed", and we started digging into the source code trying to figure out what those fields are. It turns out that these managed fields look like something you can use to track metadata inside Designate, using the RPC API. So at this point we basically scrapped most of the code and started using the Designate RPC API. We hadn't used it originally because it's not documented, and we thought it might be really hard to use. It turns out it's actually way more flexible. Remember before, I said we took the tenant ID and had to search every record. Well, using these managed fields, like managed_resource_id, you can do a query: find me every record that has this key-value pair
that matches. And so the search was much more flexible and much more efficient. This ended up being a way better way to handle disassociating floating IPs.

There's one last problem. I mentioned we check for associate and disassociate events; when you delete a VM, you do not get any floating IP events. You get a port delete event, and this is the sum total of the useful information you get in it. So now we have this, and we need to go delete a record. How do you do that? Well, the fortunate thing is, again, we used a managed field, managed_extra, and we just match this port ID to a record we've put into Designate. Without this, we actually had no way to solve the problem.

Okay, so the Sink is getting complicated at this point, so let's break it down, because there are only four actual steps the Sink does. Step one: this is when your VM goes away. If you get a port delete end event, find the record in Designate and remove it. That's pretty simple. Step two: if you get a floating IP update end or delete end event, then we go delete the record. This is interesting because the floating IP update event could also be an association happening, but just for code simplicity, we go ahead and delete any related records anyway, and then immediately re-create them. This just ensures that if we've somehow gotten out of sync, those records get deleted before we go re-add them. So that's the deletion. Step three, the creation: if it's an update end and it has a fixed IP address and a floating IP address, we go find what domain we should be using based on the tenant, go get the instance information, which is the name, and make an A record following the pattern Eric talked about before. Step four also matches what Eric said: if the record creation fails because there's already something there called, say, "myinstance", then we fall back to using the dashed floating-IP-address format. So that's the Sink.
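The four steps above, together with the managed-field lookups, can be sketched as one function. This is a simplified standalone model: the record store here is a plain list of dicts standing in for Designate central behind the RPC API, the payloads are flattened, and step four's collision fallback is left out.

```python
def find_records(records, criterion):
    """Mimic the central API's search: every key/value pair must match."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criterion.items())]

def handle_event(records, event_type, payload):
    if event_type == "port.delete.end":
        # Step 1: the VM is gone; match on the port ID we stored earlier.
        for r in find_records(records, {"managed_extra": payload["port_id"]}):
            records.remove(r)
    elif event_type in ("floatingip.update.end", "floatingip.delete.end"):
        # Step 2: delete any record tied to this floating IP first, so we
        # recover even if we had somehow gotten out of sync.
        for r in find_records(records, {"managed_resource_id": payload["id"]}):
            records.remove(r)
        # Step 3: an update.end carrying both addresses is an associate,
        # so re-create the record immediately.
        if event_type == "floatingip.update.end" and payload.get("fixed_ip_address"):
            records.append({"name": payload["name"],
                            "data": payload["floating_ip_address"],
                            "managed_resource_id": payload["id"],
                            "managed_extra": payload["port_id"]})

records = []
handle_event(records, "floatingip.update.end",
             {"id": "fip-1", "port_id": "port-1",
              "fixed_ip_address": "192.168.0.5",
              "floating_ip_address": "10.1.2.3",
              "name": "webserver.ericsstuff.cloud.twc.net"})
after_associate = len(records)
handle_event(records, "port.delete.end", {"port_id": "port-1"})
```

Here the record name is passed in already built; in the real Sink it comes from the Neutron-then-Nova lookup described earlier.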
It's running in production now with nodepool, but there are some problems with it. Cleaning up these managed records is a pain. If you wanted to do this in Juno, you set up a policy.json file and then you made curl calls, which is obviously so much fun to do. However, I think Clayton harassed Graham Hayes enough, and Graham's in the audience today, and this feature to change managed fields was added to the CLI, I think for Liberty possibly. You don't want users to be able to delete managed records, because managed records also include things like your DNS server records and your SOA records; you don't want customers messing with those.

The next thing is, if the Sink fails in any way along here, leaving a record behind or failing to create a record, users have no idea, because the Sink is completely transparent to them. So what they'll do is file a ticket with us saying, I don't have a record, what's going on? And we'll have to go dig in the logs.

Also, the name-conflict instance-dash-floating-IP format: that doesn't make sense to me; it doesn't make sense to anyone. Our customers are never going to look at that and think, oh yeah, that's a real obvious name. It'd be easier to just use the IP address, if that's the DNS record you're going to give me.

Next, we really want to get reverse IP lookup supported here. I know if I were a customer and my record wasn't working, I might do a reverse lookup on my floating IP and see, oh, you gave me a different record or something. I think PTR records are supported in Designate for this, but we just haven't implemented it yet. Finally, Eric showed you a bunch of Horizon screens: instance name, IP address. Why not the domain name too? That would be great; that's something our customers want.

Okay. So, this whole big section about the Sink: this is all public code.
It's on GitHub. We mirror it from our internal Gerrit, so what you see on GitHub is what we're deployed to production with right now. We'd love to get patches on this; if you send us a patch and it's a good patch, promise, we'll take it. Also feedback or comments; we have a great README there. It's not completely TWC-specific: even though we've made these rules about domain-per-tenant and that kind of thing, we've tried to make the code flexible so that other people can use it.

Yes, so what are we looking forward to? One of the main things, as we look toward the future, really, is this architecture, and a lot of it looks very similar. The one difference is the lower-right-hand section: Designate MiniDNS. We're looking forward to that because it's going to help with the synchronization between PowerDNS and Designate; it's going to resolve a lot of those issues where you've got a simplified version of what should be reality. So, looking forward, we really look to this new architecture in Kilo.

And about Kilo: we plan to migrate as soon as possible. Obviously, when you're trying to upgrade all your OpenStack infrastructure, upgrading to a new version is not always the quickest thing; it usually provides plenty of excitement for you. But we're trying to do that as soon as we can. As I just said, we're excited for MiniDNS and the transaction retry support. Right now we're working on migration plans; actually, Clayton is looking at a lot of the migration work right now. We're also looking for better integration with our Infoblox components. As for Infoblox right now,
they have a prototype Designate driver available, and they're also looking at tighter integration between Neutron and Designate. So with that, I hope that everybody got a little bit of insight into what it takes, or at least what it took us, to get Designate running in our limited-production, almost full-production, environment at Time Warner Cable. Finally, we'd like to thank Kiall Mac Innes and Graham Hayes, everybody in the OpenStack DNS channel, and all the support that the community has been able to provide to help us roll this out. With that, I don't know if there are any questions; a lot of times going up to the microphone is, I think, what they prefer. Except maybe Dave might be loud enough.

I believe it was written to support that, yes, in the Sink. Yeah, in the Sink. Yeah, that would take some reconfiguration and such, but customers have asked about basically bring-your-own-domain, or bring-your-own-subdomain. And customers have also said, when I made this tenant called whatever, I didn't expect you to call my domain that, so can you give me something different? Even if it's a domain per tenant, give me a better name for my domain. That's also something we'd like to do. But the goal is to get a minimally viable product out, see how people use it, see what complaints people have, and then roll from there.

Yes, that's my understanding. Yes. Can we please use the microphone? Oh, can you go to the mic, please?

All right, so the first question was: do the domain sync problems go away when MiniDNS replaces PowerDNS? And the answer is: we think so, we believe so, yeah.

How many SQLAlchemy calls are left in the code? The reason we're asking that, obviously, is because you're making the SQL queries yourself. I really appreciated you saying, hey, by the way, use the RPC the project has there and don't make the database calls yourself if you can avoid it, because that's the migration issue, right?
Yeah, but are you still doing... have you committed to moving entirely toward using the RPC calls in the Sink handler? It's completely RPC, yeah; the Sink handler for sure, yeah.

So I was curious as to why you didn't just use Debian packaging or something like that; why you decided to go with a virtualenv. So there are two reasons. The first reason was, when we started, there were no Ubuntu packages for this; they came out in mid-April, for Kilo, I think. That's number one. Number two, I think in general, and this is another discussion, we're really interested in looking at virtual environments. There's something to being able to better isolate our services. You know, Eric runs Horizon off master, and he can upgrade that without breaking everybody else. If Eric drops Horizon on there, or another service, with a bunch of dependencies, you're going to cause problems with libraries; it's the typical packaging problem. Designate was also perfect for testing out how we might get a Puppet module to support virtual environments, because it's new, it doesn't have a lot of use, doesn't have a lot of development activity. So that was a good test bed for us.

My other question was: the Horizon UI changes, or the UX changes, did you guys do that yourself, or were you pulling that from somewhere? Yeah, so as background, I've spent a few years as a Horizon developer, and all the changes we made we contributed back to the community, and we continue to work with them to try to figure out, you know, can we improve it a little bit more, and can we add this, that, or the other thing.

Do you have any plans for IPv6 support? It's probably something we're looking into; we haven't specifically discussed it in terms of Designate. I see somebody in the audience who's wincing, or smiling,
I'm not sure, with that question. That guy right there is going to do it.

So I had a question here. I saw in your architecture diagram that you have a Galera cluster. Yes. So it's basically to have just one DNS infrastructure across all your regions, right? So, we have a global identity system: Keystone already uses this sort of global Galera cluster, and our tenants are thereby global. Therefore, if we're doing a domain per tenant, Designate, from a database point of view, should also be global, and that was the decision we made.

And so is your PowerDNS still local? Because PowerDNS also has a MySQL database. No, it's... I believe it's also replicated. Okay. Wait, I'm sorry, no, it's not; it runs on the control nodes, so it's not replicated. Yeah. Okay.

Okay, and one more question I had: when you send a port delete message and Designate consumes it, doesn't it do a reverse lookup, or try to do a reverse lookup, to find the PTR record and delete it, as opposed to going through the database? I didn't understand that part. We take the port ID that comes out of the delete, and we have saved that into a managed_extra field in Designate's database, so we just query and say: find me all records that match. All the records, yeah, and then delete those records if you find them. Yeah; otherwise you'd really be out of luck, because you'd get a port delete and you'd have to look up what device this port was attached to, and at that time it might be gone, so you're really not sure what's going on at that point. Yeah. Cool, thank you.

Thank you for the good introduction, and my question is: does Designate support DNS round robin? Round robin, yeah; I don't know the answer to that.
I should say there is a DNS deep-dive talk, I think at 11:50 today; that would probably be a good place to go visit and check out for that question. Okay.

Thank you very much. I really want to use it in my OpenStack environment, and I also want to manipulate some other DNS services, like Google Cloud DNS or Amazon Route 53, through Horizon. Do you intend to, plan to, cooperate with those? If I understand the question right, I think the answer is no. Thank you.

All right, sorry to be back again, but I wanted to make a point; it's less of a question. We have the same problem, by the way: you're grabbing RPC messages that are specific to an undocumented plugin-agent pairing. So if we go to an SDN vendor that's using Neutron but not supporting the same kind of plugin agent for the L3 agent, those messages aren't there. Okay, so it's just a little cautionary tale we tried to give the community last year, and we'll say it again: we don't have common networking constructs. We had them before, but because we have such varying implementations of the L2 and L3 stuff, things like this, which are awesome, we have to do ourselves, and they aren't possible without proprietary APIs in many cases. Okay, thanks.

All right, is that it?