Sweet. Thanks for coming today. My name is Darron Froese. I'm darron on Twitter and darron on GitHub, and when I'm not crawling around an underwater wreck, I'm a site reliability engineer at Datadog.

To start things off today, let's briefly talk about what service discovery actually is. In its simplest form, and for our purposes today, it's made up of two main components. Service registration is where some service on some node, or in a container, or maybe even in a unikernel, says to a central authority: I provide this service at this IP address and port. On the other side, service discovery is where some process on a node or in a container says to that central authority: hey, can you tell me where to connect to this service? There are obviously other parts around that, but that's what we're going to focus on today. And that's really it.

Datadog's journey to service discovery started near the end of 2014. We had about 370 VMs in AWS and we were ingesting about 1.2 million metrics per second. The company had been around for four years, and we were in the process of cutting apart our monolith and taking out the components piece by piece. We were growing in staff and in machines we monitored, and we were having some pain around configuration management.

Rapid growth is always challenging. It exposes the areas you need to deal with next. We had gotten to where we were by doing things a certain way, but we couldn't do those things the same way anymore. In order to scale to the amount of traffic we were seeing, we needed to add many more machines to share the load across our entire platform. As that quote indicates, you can get pretty far - in our case, up to 400 machines - but it was getting increasingly cumbersome to manage raw IP addresses. We needed a better way. And by the way, you may not be able to see it, but there's an article at that link at the bottom, on the blog, and it's great. Even though it's two years old, it's got a lot of good stuff in it.

At the time we were using a hybrid of Chef searches, which take about 30 minutes to update, and large numbers of manually managed IP addresses. Those environment files in our Chef repository were some of the hottest files we dealt with on a day-to-day basis. There's nothing really wrong with that, but it was getting harder and harder to manage. As you can see from the graphic above, which I unfortunately had to obfuscate a little bit, the number of services we were extracting out of our monolith and adding to our application to keep up with growth was growing and growing. If we were to prepare for a future where we all move to containers, pods, and unicorns, there was no possible way to keep the locations of all the things in a single static file. Plus it's really error-prone to manage that file, and it was getting really troublesome to merge. We could see the writing on the wall.

I had first used Consul back in June of 2014, as a backing store for environment variables for my Docker project octohost. I was only using the integrated key-value store at the time, but I knew Consul includes service discovery, so I wanted to give it a shot. I got approval to take a quick spike into seeing if it would work for Datadog, and here I am, 16 months later, still on that quick spike. At the time we thought our desired end goals were pretty simple: register and provide a catalog of services on our cluster, and provide an integrated key-value store in our cluster.
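Just to make those two components concrete before we go on, here's roughly what registration and discovery look like against a local Consul agent. This is only a sketch, not how we actually wired it up at Datadog; the service name "web" and the port are made up.

    # Registration: a process tells its local Consul agent what it provides and where.
    curl -s -X PUT -d '{"Name": "web", "Port": 8080}' \
      http://127.0.0.1:8500/v1/agent/service/register

    # Discovery: some other process asks the catalog where that service lives.
    curl -s http://127.0.0.1:8500/v1/catalog/service/web
    # => [{"Node": "...", "Address": "10.0.1.23", "ServicePort": 8080, ...}]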
And unfortunately for people who like to complete things and move on to the next project, those two little goals led us to an almost infinite amount of yak shaving and rabbit trails on the quest for infrastructure nirvana.

Some of you may not even know what Consul is - thanks for coming regardless - so I'll give a quick introduction. Consul is a great tool written by the folks at HashiCorp. There's a distributed, strongly consistent key-value store inside Consul that you have access to from any node in your cluster. It has pretty flexible ACLs that allow you to lock down parts as needed. You can register a watch against keys in the key-value store, and when a key changes, Consul will automatically run a handler for you - a very efficient way to get information out. It has an opinionated service discovery framework that gives you built-in DNS and HTTP endpoints to query. You can also create locks, and you can do remote orchestration and job execution as well. It's really quite cool.

Consul has server and agent nodes. You run Consul on every node, and the binary is identical; the only thing that changes is the configuration. The server nodes participate in the Raft consensus protocol to keep things consistent - it's how they agree in a distributed system. There's always a single leader among those server nodes, and if the leader slows down and stops responding, the other server nodes hold an election, kick it out, and choose a new one. That leadership election is not really a big deal. It's not like a Postgres failover where you actually have to do something; it's hands-off and it happens automatically. Now, during that election, for approximately five to ten seconds depending on your network, you can't read or write to the key-value store, and most of Consul is in a degraded state. If you have time and want to learn more about Raft, there's an animation and lots of information at the link above. It's got some great explanations of what Raft actually does.

So given that Consul is awesome - and it is, I'm telling you - we still weren't sure if it would work for us and if it would help. How would it fit into our environment? How would it work given our needs? How would we even end up using it? We really had no idea. So we rolled it out into staging. There were about 100 nodes in that environment, and we used m3.mediums for the server nodes. Our phase one plan was pretty limited: it was really just an initial deploy of the server and agent nodes. We added some registered services, and we were exploring the service catalog. We really wanted to see how it would act in our environment, and whether it would interfere with anything. We quickly found that no, it didn't interfere with anything. In fact, the agent binary only took between 15 and 60 megabytes of RAM on each node. Everything seemed pretty calm - maybe a little too calm.

Now, given that I work for Datadog and Datadog monitors things like Consul, this is the part of the talk where I say you need to be able to monitor Consul if you want to roll it out. At Datadog we have a philosophy of monitor first, which means if we can't monitor it, most of the time it doesn't get rolled out into prod, and it really helped us with rolling this out. Over the next couple of weeks we built a Datadog integration to monitor Consul. We learned how to break it, and we learned how to fix it.
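To give you a feel for the pieces I just described - the key-value store, the DNS endpoint, and watches - here's roughly what they look like from any node in the cluster. The key, the service name, and the handler path are made up for illustration.

    # Key-value store: available from localhost on every node, one HTTP call away.
    curl -s -X PUT -d 'hello' http://127.0.0.1:8500/v1/kv/example/greeting
    curl -s 'http://127.0.0.1:8500/v1/kv/example/greeting?raw'
    # => hello

    # Service discovery over the built-in DNS endpoint (Consul listens on 8600).
    dig @127.0.0.1 -p 8600 web.service.consul +short

    # Register a watch: run a handler whenever that key changes.
    consul watch -type=key -key=example/greeting /usr/local/bin/handler.sh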
We figured out that it likely wouldn't break the world if we rolled it out to prod. So it's probably fine, and we shipped it. We shipped it in a state where it was enabled but not really being used - sort of like a dark launch. It was running and active on every node, but we didn't depend on it and it wasn't being utilized, except in a very exploratory sense. At that time we were at about 370 nodes in production. We spun up five m3.large instances, and then we started adding agents. You can have three, five, or seven server nodes, and at the time three didn't seem to be cutting it - it was having a few too many of those leadership transitions - so we wanted a bit more cushion to survive a failure if something were to happen. And to the astute viewer: no, that's not a token ring; they're not connected in a circle. They all connect to each other - every node talks to every other node. When everything finally got rolled out into prod, it was stable, which was pretty awesome.

During our first explorations in staging, we discovered what we considered to be the two most important metrics to monitor to know if Consul is working correctly: do you have a leader, and when was the last leadership transition - was there one recently? Overall, these metrics will tell you whether or not your cluster is healthy. If you have a leader, things are good right now; and if you've been having lots and lots of leadership transitions, even though they're hands-off, it's not really a great sign. We're going to come back to this, because those two metrics play an important role in today's talk.

So now that we're live in prod, what do we do? One of the first things we did was add what we called a "datadog" service. We wanted to use the Consul service catalog to have every node in the catalog, so that we would be able to do all sorts of fun stuff with it - fun stuff like the next slide after this, of course. We used Chef to install the service as a small JSON file; that's an example of what the JSON file ended up looking like, and I'll sketch a rough one below. We also used Chef to add all of the node's roles and its availability zone as tags.

So now that we had a complete picture of all nodes, we could do things like - can you guys see that? Okay - use a command we called ssh-to-role to SSH to a node that has a specific role. We could also use another command called host-by-role to find all hosts with a particular role. It's incredibly useful, and it's the primary way that people get around our clusters today.

At Datadog we already had an orchestration solution that many of you are probably familiar with, called Capistrano. It did all sorts of things for us, and Consul has something similar called consul exec. It can run any command you want on any group of nodes; we limit running it to just the Consul server nodes. The catalog of nodes is always up to date as nodes are added to and removed from your clusters, so you don't have to manually do anything, which is great. And it's fast. Things that used to take multiple minutes under Capistrano you can do in five, maybe ten seconds if everything's slow. Unfortunately, we aren't super happy with some of the security trade-offs of consul exec. You can't really turn it off unless you do a lot of extra limiting that we weren't willing to do, so most of the time it's disabled, except when I need to do something really fast. Sorry, Mike, wherever you are.
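Back to that per-node "datadog" service for a second: here's a rough sketch of what the registration and a host-by-role style lookup can look like. The role tag, availability zone, and port below are illustrative, not our real ones.

    # /etc/consul.d/datadog.json -- shape of the per-node registration that Chef
    # could drop on each box (role and AZ carried as tags).
    {
      "service": {
        "name": "datadog",
        "tags": ["role-postgres", "us-east-1a"],
        "port": 9999
      }
    }

    # Reload the local agent, then do a host-by-role style lookup: a tag-filtered
    # DNS query answers with every node carrying that role tag.
    consul reload
    dig @127.0.0.1 -p 8600 role-postgres.datadog.service.consul +short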
One quick pro tip: if you're using consul exec, do not tail a large or active log file, because all of the server nodes participate in sending you those bits and all have to agree - they use Raft to agree - and that's a very quick way to melt down your server nodes. They're aware of it, and I'm hoping there's maybe a fix coming, but we'll see.

Remember the strongly consistent key-value store I talked about earlier? Well, it turns out it's really handy to have a global data store on all of your nodes that's available from localhost, one HTTP call away. We wanted to use this for configuration data, but we also wanted to know who made a particular change and when those changes were made. It turns out somebody had already built most of this: git2consul is the solution we used, and it works great. So we took git2consul and created our config repo, which has some selected configuration data that's widely used across our stack. It's a very popular repo at Datadog - who knew that if you build something that's really easy to use and does things quickly, people will use it a lot? Every 60 seconds git2consul checks whether there are any changes in the Git repository, pulls and merges those changes into the key-value store, Consul distributes those changes to all nodes, and the processes we have in place then act on those changes. It's given us a whole bunch of really cool capabilities, like quick reaction time and flexible configuration. It's also capable of sending a broken config file to every node very quickly. It's a tool with a very sharp edge.

As we used git2consul and our Consul config system more and more, and because of how we were using and abusing Consul, this exposed what we felt was the weakest part of Consul at the time: how it reacted when we were reading from the key-value store at high velocity. Consul's leadership transition mechanism is tied to latency. After not hearing from the leader for about 500 milliseconds, for whatever reason, the other server nodes kick it out and elect a new leader from the group. The old leader comes back with its tail between its legs, promises to do better next time, and rejoins the others, waiting for its next turn. When you read from the KV store at too high a velocity, or from too many locations at once, Consul 0.5 has a tendency to freak out and have those leadership transitions. As you can see from the graph above, from January to May it wasn't that awesome - and in all fairness to HashiCorp, this was at least partially self-inflicted. Each leadership transition took approximately six to ten seconds on our network to complete, and during that time, again, the KV store is unavailable for reads and writes. So we adjusted our code, made it a little more tolerant of those interruptions, and moved along.

Here we are at the end of May, and everything is up and to the right. We're growing in all areas as we start to use more and more features of Consul. By this time we're getting serious about service registration - as you can see, we only go up to the letter D there. Many of our services were being registered, and we were playing with how to use the various discovery mechanisms. There's a simple HTTP API built into Consul that can answer many things about nodes and services in your cluster. It's flexible and easy to use, but for us to use it would have required a little too much re-architecting of some of that monolith.
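To give you an idea of the kinds of questions that HTTP API can answer, here are a few example queries; the service name "intake" is just a placeholder.

    # All services the catalog knows about, with their tags.
    curl -s http://127.0.0.1:8500/v1/catalog/services

    # Every healthy instance of a single service.
    curl -s 'http://127.0.0.1:8500/v1/health/service/intake?passing'

    # Every node the cluster knows about right now.
    curl -s http://127.0.0.1:8500/v1/catalog/nodes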
So we decided not to go that route in our app and in our extracted microservices. We decided to primarily go with the DNS interface, at least inside of our application and those microservices. It's simple and flexible: as long as you know the name of the service, you can easily get a list of IP addresses where that service is running. As with anything to do with DNS, it has its own drawbacks - there's some strangeness with some libraries - but it just works the vast majority of the time. Our newest services, a custom data store and a metrics service that we deployed last summer and fall, can only be reached through this DNS interface. There's no other interface allowed. As we add additional instances to handle the load and our requirements, those new nodes are added into rotation, the DNS updates, and we're good.

As with anything new, though, we were worried about services being unstable - flapping in and out of the service catalog. What would that do to our app? Would it happen because of Consul? Would the health check mechanism work as we hoped it would? So we did what many pragmatic engineers do: we cheated and rigged the game a little bit. In some cases, like the one above, we made the check never fail by calling /bin/true every 60 seconds - it's never going to fail. In other cases we removed the check altogether. In still other cases, like Cassandra and Kafka, we use proper health checks at a decent interval, and I'm happy to say that service flapping generally isn't a problem. I know some of the HashiCorp team have told us there are some issues with services, but it's not a Consul problem - at least we're not seeing it, and we have many, many services. If your service is flapping, in our experience it's because your service is flapping. Sometimes we'll see a bunch of Cassandra nodes roll in and out of the catalog, and that's because Cassandra is doing whatever the heck Cassandra does. It's proven to be a very reliable mechanism. Another quick pro tip on the whole flapping thing: Consul has the concept of a datacenter, and within that datacenter every node needs to be able to talk to every other node on multiple ports, over TCP and UDP. If any of those aren't correct, or you have some firewalls in the way, that's when you see flapping - but again, that's not a Consul problem.

One of the side effects of having a very fast-paced system that ingests millions of data points per second is that using DNS was always a very risky endeavor - everything's always a DNS problem, right? In fact, before Consul we didn't use DNS internally at all. It was always bare IP addresses, for speed. And given Consul's proclivity for read-induced leadership transitions, we were a little concerned about what adding millions of queries per second across hundreds of machines would do. So we did a few things to help mitigate this. First of all, we installed dnsmasq in front of Consul so we wouldn't be querying Consul directly. dnsmasq intercepts anything with a .consul domain name and directs it to port 8600, which is the port Consul listens on for DNS requests. Secondly, we added a short 10-second TTL to all of the services that Consul knew about; Consul's TTL by default is zero, and that's just a little too quick for us. And we also looked at creating a hosts file based on the contents of Consul's service discovery database.
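Concretely, the dnsmasq piece looks roughly like this. The drop-in path is an assumption and the service name is just an example.

    # /etc/dnsmasq.d/10-consul: hand anything under .consul to the local Consul
    # agent's DNS port; everything else resolves and caches as usual.
    server=/consul/127.0.0.1#8600

    # After restarting dnsmasq, the first query is forwarded to Consul and
    # repeats within the 10-second TTL are served from dnsmasq's cache.
    sudo service dnsmasq restart
    dig @127.0.0.1 web.service.consul +short
    dig @127.0.0.1 web.service.consul +short   # cached until the TTL expires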
So we looked into our existing bag of tricks around Consul. We grabbed consul-template, another HashiCorp product, and we started with that. Our plan was to build a hosts file on every node and load it into dnsmasq directly. It seemed like the most straightforward option, and it would have been great. But even in our staging environment it was chaos. It looked a little something like this, but with everything flying and no little girl. When everything was updating, with each node querying Consul's service catalog, it was putting so much read pressure on Consul's data store that we had pretty much eternal leadership transitions. Multiply the nodes by the number of services and by the number of records for each service, and yeah, it was not feasible. To be clear, this is not consul-template's fault either, and as a side note, Consul 0.6 fixes this. But here we were, Consul 0.6 was still a twinkle in HashiCorp's eye, and we couldn't wait until December.

So we got to thinking: let's build this hosts file on one node. We'll use the KV store to distribute the file to all the nodes, and we'll use one of those nifty watches to write it out on the other end. (I'll sketch what that looks like below.) And it worked. It worked really, really well, actually. We weren't seeing any problems at all. There were no transitions, there was nothing - it was super stable.

We very quickly found out that, without rate limiting that process, reloading all of your Consul agents for whatever reason leads to services dropping in and out of the catalog on each node. And when you're dealing with an automatically generated hosts file that has all the nodes in it, over a 30-minute Chef run that's about 20 nodes per minute dropping in and out of the catalog. That meant the hosts file was being regenerated about 40 times a minute, sent to the 600 nodes we had at the time, and written out every second and a half. That was pretty stressful for me personally as I was watching it. So we made sure to enable some rate limiting.

There was one thing we noticed during this whole exercise that surprised me: there wasn't a single leadership transition the entire time. Even when we were sending that 40K file around to 600 nodes every second and a half, there wasn't a single problem. Consul didn't crack under the pressure. On the other hand, with our Consul config repo, every single time we made a config change and Consul updated, we had one or two leadership transitions. It wasn't a big deal, but it happened every single time. That was an important clue for us in learning how to handle Consul.

The very next day, this commit - anonymized, obviously, to protect the blameless - was pushed to our configuration repo. It says "JSON files are now pretty and standardized." Somebody took it upon themselves to lint and clean up some inconsistencies, and that's great - most of the time it's a really good thing. Not this time. We spent the next couple of hours valiantly battling against Consul and git2consul. git2consul would grab the changes and try to load them into Consul; the first few keys would get updated, the watches would fire, everyone would try to grab them at once, Consul would crash, and then git2consul would crash. So that was tons of fun. It was way too much read pressure, and again we were seeing that limitation. In the end we had to disable Consul on a whole bunch of nodes and then let Chef restart them slowly over the next 30 minutes.
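Here's a rough sketch of that single-builder approach. The key name, file paths, and the build_hosts_file helper are all made up for illustration, and the real thing also rate-limits how often it fires.

    # On the one builder node: render a hosts file from the service catalog and
    # push it into the KV store. build_hosts_file is a hypothetical helper.
    build_hosts_file > /tmp/hosts.consul
    curl -s -X PUT --data-binary @/tmp/hosts.consul \
      http://127.0.0.1:8500/v1/kv/dns/hosts.consul

    # On every node: a watch fires when that key changes and runs a handler that
    # writes the file out and HUPs dnsmasq so it re-reads its addn-hosts file.
    consul watch -type=key -key=dns/hosts.consul /usr/local/bin/write-hosts.sh
    # write-hosts.sh (sketch): decode the value from the watch payload, write it
    # to /etc/hosts.consul, then: pkill -HUP dnsmasq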
So what have we learned here? Well, one of the first things we learned is that it's very hard to be one of the early adopters of a new distributed system. There's no Consul Stack Overflow where you can go and steal some magic incantation that will fix everything. At the time there was very little real-world information available other than the docs, which are great, but we wanted to know how other people were using and deploying this.

Another thing we figured out around the same time, and something I had suspected for a little while, is that we were sort of doing it wrong - blamelessly, of course. We had about a hundred tiny keys, and those tiny keys were being read at extremely high velocity across the whole cluster. Essentially, we were DDoSing ourselves. This was not the sort of problem that showed up in staging, of course, because staging didn't have enough nodes, so we didn't know this was the problem until it was a little too late. Just to remind you: the Consul servers stage a coup d'etat when the leader just can't keep up. It can't keep up because it's doing all of the things all at once. Bigger CPUs can do all of the things, and more of the things, at once - and we're in the cloud at Amazon, and what the hell, we had just raised a Series C - so we went to bigger nodes. We upsized our server nodes, deployed the new and bigger servers one by one, and put the old ones out of their misery. And it was good, and things were beginning to be made right again. The transitions were still happening, but nowhere near as much.

We just took a little detour to talk about leadership transitions, and you might be asking: well, how did the dnsmasq and hosts file thing go? What happened with that part of the story? Because I know you're all really interested: it worked really well. You might be able to see it above - or below, I guess - we added an additional hosts file and gave it that 10-second TTL. And dnsmasq is one of the few things in the known universe that actually seems to honor a DNS TTL properly. It works great with Consul. In testing you can see it forwarding only the first request, serving the cached answer for the next 10 seconds, and then going back to Consul to re-resolve. It's pretty nice to see software that actually respects a TTL.

And it's quick. I'm not a DNS guy, but I was trying to measure with dig and I was getting zero milliseconds, which is obviously pretty hard to work with - zero. So I built something to query for me. For queries dnsmasq forwards to Consul on this node, it takes between 600 and 700 microseconds to get an answer back. When dnsmasq is serving the cached answer, it's between 100 and 200 microseconds. And for things dnsmasq has in a hosts file, it's again between 100 and 200 microseconds. So: fast enough for us at the moment. I added some metrics generation around a dnsmasq log watcher on one of our nodes last week, and that single machine, one of our thousand-ish, is constantly doing about 20,000 Consul DNS requests per minute. So clearly the Python libraries are not respecting the DNS TTL - but they're querying dnsmasq, so I don't really care. Because of the multiple responses, that's about 50 to 60 thousand responses per minute. Querying Consul directly on our system would be mayhem - it would be a problem, to put it mildly. Take that and multiply it by a thousand, and yeah, it just wouldn't work.

So even though we'd been having some success with Consul, we were still sort of tiptoeing around.
People were freaked out. We were only at 600 nodes at the time - what would happen as we grew even further? We had already doubled in size. Internally there were some concerns as we scaled; some people didn't trust that Consul would be able to keep up. So a parallel service discovery app was written and deployed beside Consul. Unfortunately it's still there - I'll get to that at the end. Some people were certain that the apocalypse was coming, and I honestly wasn't sure either.

The next month - this was in August - we hit 700 nodes, and it seemed the fear was warranted, as all of a sudden nodes started randomly going deaf and mute. They couldn't talk to the servers. They couldn't see the updates we had placed in the key-value store. They lost their Consul locks, and their services disappeared. You know those two services where Consul is the only way to find them? Yeah. That was not good. It was bad. Bouncing Consul on those nodes usually fixed the problem, but that gets pretty tiring after five minutes. And the biggest problem was that I couldn't duplicate it reliably - though I could see the grumbling in Slack whenever it happened.

Shortly after that started happening, I was in New York for the week - I work from Canada most of the time - and I heard someone mention, verbally, that a node had gone deaf again. So I immediately went into Brendan Gregg mode. Well, at least partial, pseudo Brendan Gregg mode. And I was finally able to duplicate it reliably. And then it cleared up. So I did what we normally do at Datadog: I made some graphs. I wrote some code to watch the URLs that were exposing the problem, and I waited. Those graphs, and a large amount of Wireshark captures that James from HashiCorp - sitting right over there - ended up reading a lot of (thank you, sir), helped us track down what was happening. A server node, or nodes, was losing its connection to the leader, and as a result the agents talking to that particular server were going deaf as well. HashiCorp quickly found and fixed two deadlocks in the multiplexer code underneath it. There was also another bug that spun up hundreds of additional connections per node, which was proving to be a bit of a problem. But it was still happening. The last puzzle piece dropped into place when James - again, that guy deserves many beers - asked me about the Xen AWS Linux bug, the "rides the rocket" bug, the old Quake reference, which we thought we had fixed. For whatever reason it was still there, and it was interfering with the server and agent communication: they were dropping packets that just never got read. A couple of days later that bug was totally eradicated from our systems, and prevented from returning through judicious use of Chef-fu.

So I held my breath. The GitHub issues that our team had filed internally remained open, and the parallel service discovery app kept running alongside Consul. But the tide had turned, and the sentiment was trending in the right direction. That takes us up to October, and we're fluctuating between 800 and 900 nodes at this point as we retire some services and add new ones. Kafka is one of the most important systems at Datadog: all the data we ingest goes into Kafka, and all of the consumers read from Kafka to do all the things with all the metrics.
So in October we used Consul and confd to completely swap out our primary Kafka cluster without so much as an external peep the entire time. It was pretty cool.

We also started using Consul Exec... sorry, Consul Events. Consul Events are sort of like consul exec but with predefined actions: you can't just do whatever you want, you do the things you've already predetermined. So that watch right there waits for an apt-update event, determines whether the event is new, and if it is new - not an old one - it actually runs apt-get update. In our old cluster, using Capistrano, it used to take between 20 and 30 minutes to do an apt-get update, and that was only if the static host file was exactly the same when you started the update as when Capistrano got to each node. If a node in that file had been removed in the meantime, the run would crap out and you'd get to do it all over again. Now it takes between 60 and 90 seconds to run an entire apt-get update on over a thousand nodes in production. Our staging environment takes about 10 seconds now.

We were also using Consul Lock for a number of processes. With Consul Lock, you can run a highly available application and have a hot spare step in if something happens to the currently running instance. We normally run three instances of these jobs with only a single one actually running; when that one decides to take a rest, or crashes, or does whatever it feels like doing, one of the remaining processes automatically takes over so there's always at least one of them running. It works pretty well, and we have a few of these at the moment: git2consul runs under this, so it's always pulling in the config changes, and our DNS hosts file builder runs under this as well. There's an example of an Ubuntu Upstart script that works with Consul Lock - I'll show a rough sketch of one in a moment. It's pretty easy to get quite a bit more reliability out of an unreliable application that you know is going to crash all the time.

When the leadership transitions grew, we again bumped up the size of the server nodes one more time, because that's what you do with Consul 0.5 when you get a lot of leadership transitions. Now - that's the very bottom right, obviously - we see a leadership transition every one or two days, and that's it. It's pretty awesome. This is one time where money did buy happiness, at least my happiness.

One important note: we had two small outages last year related to Consul, but at no time during this entire period was it ever Consul's fault, and there was never something that happened to Consul that we couldn't explain. The first, in March, was a three-minute outage caused by a packaging and Chef problem; that was kind of annoying, but it was quick and it was done. We had one outage in July that was related to somebody - who again shall be blamelessly nameless - restarting all the Consul server processes at the same time. That is unfortunately one of the big no-nos with Consul: you don't do that. Just don't. That's a pro tip.

So Consul has been super solid this entire time. In fact, there's been more than one time where some kind of network partition has happened in us-east-1 - you know, it's Amazon, you never know what's happening. The status page is always still green, of course, but there's been more than one time where Kafka or ZooKeeper or Cassandra or something got screwed up and we had to intervene. We've never had to intervene with Consul. We just break it ourselves.
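Going back to Consul Lock for a second, here's a minimal sketch of that kind of Upstart job. The job name, lock prefix, and command are illustrative, not our real ones.

    # /etc/init/hosts-builder.conf -- keep one instance of a process running
    # cluster-wide by holding a Consul lock.
    description "hosts file builder, held behind a Consul lock"
    start on runlevel [2345]
    stop on runlevel [!2345]
    respawn
    # consul lock blocks until it holds the lock, runs the child command, and
    # releases the lock if the child or the node dies, so a hot spare elsewhere
    # can take over.
    exec consul lock locks/hosts-builder /usr/local/bin/build-hosts-loop.sh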
As a side note, if you want to come work on these sorts of fun problems, we're hiring - so make sure you come chat with me or go to the website. Just let me know.

So here we are, at scale. It's January 2016. What can I say we've learned over the last year of working with Consul? Well, no matter what you heard me say during this presentation, Consul is awesome. It acts like an incredible datacenter backbone that helps you scale your operations by putting some really helpful primitives at your fingertips: orchestration tools, a persistent local data store, service discovery. I'm a total fanboy, if you can't tell already - I love it. And I promise you, I don't get a cut or a kickback from HashiCorp if you run it, but you all should run it if you have a need like this. Even the Hoff loves it - what's not to love?

Monitoring it as you deploy it is not optional. It really should be a no-brainer. I happen to know a few people who might be able to help with that - just saying.

If you're starting out, make sure you start with 0.6. And if you're not there yet, find a way to upgrade. It's really not that hard; I did it a couple of weeks ago in an afternoon. The HashiCorp team took a lot of community feedback, and a lot of the bugs I've been filing for the last couple of years, and they ended up totally rewriting the storage back end. Version 0.6 has completely solved our read velocity issues. That's not to say it doesn't occasionally have a leadership transition, but it's not every time - like I said, it's once every couple of days now. They also fixed a number of bugs and added a whole bunch of new features, ones I haven't even touched on today. The new client binary takes about one-third of the memory it did before: from the 15 to 60 megabytes earlier, it's now stabilized around 20. The servers take about a quarter of the memory. It's a little harder to see, but that red line is when I finished upgrading all of the Consul server nodes, and now it sits at around half a gig. It moves up and down a little, but it's nowhere near the four or five gigs it was before. When's the last time a software upgrade actually used less memory? Consul 0.6 is the bomb, there's no question.

Consul servers really love a sizable CPU, so make sure you feed them the right size of machine. If you're in the cloud, just keep upgrading your server nodes until you don't see leadership transitions anymore. If you're not in the cloud, get your purchase order in early and get yourself some new machines racked, because it'll be a while. I have some example sizing that I did with 0.5, so you might actually be able to get away with smaller nodes now, because Consul 0.6 is just that much more efficient. And as always, your mileage may vary depending on how you use it, how many services you have, all of those things.

One thing that's been emphasized through this whole process at Datadog has been to architect for failure better. Consul is a distributed system, so you don't have the luxury of having everything on one node. Connection problems will happen. Nodes will connect and disconnect. Add retries to your connection routines, add exponential backoff and circuit breakers. All of these things will make your stack more resilient - and that's obviously not a Consul-specific recommendation, but it's something that was really pounded into us when we were dealing with last year's shenanigans. Something like the sketch below.
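Here's a minimal sketch of that advice applied to a KV read: retry with exponential backoff instead of failing the first time the cluster happens to be mid-election. The key name and retry limits are made up.

    # Retry a KV read a few times, backing off 1s, 2s, 4s, 8s between attempts.
    get_kv_with_retry() {
      local key="$1" delay=1 attempt value
      for attempt in 1 2 3 4 5; do
        if value=$(curl -sf "http://127.0.0.1:8500/v1/kv/${key}?raw"); then
          printf '%s\n' "$value"
          return 0
        fi
        sleep "$delay"
        delay=$((delay * 2))
      done
      echo "giving up on ${key}" >&2
      return 1
    }

    get_kv_with_retry config/example/feature_flag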
The next one is fairly self-explanatory: if you're inducing consistent leadership transitions with the velocity or volume of your reads, you need to upsize your server nodes and/or change how you're doing it. For example, if you're running an app server and that app server spins up multiple processes, please don't make each of those processes read from the KV store on the same node. That's bad. If you can avoid having each process do that, it's a pretty big win. Make efficient use of those connections and Consul will go much further. If you've got a number of machines reading all of your keys at once, you might end up having a lot of pain, like we did. One thing we're trying now is using fewer, larger keys - not anywhere near the 512-kilobyte maximum, but having lots and lots of really small keys was really the root of a lot of our problems.

One thing we decided early on was to lock down the key-value store using those ACLs, so that changes only came from places that we knew and feeds that we could audit and examine. If your configuration can induce behavior changes - and configuration is supposed to be able to do that - then having it available as a free-for-all is a bit of a nightmare. Lock it down and make those configuration changes auditable.

Consul watches are super powerful, and they're how we distribute data using the key-value store. But they can fire a bit too often. So I have a small Go utility that I built on a weekend that we've been using to make sure a handler only fires on a new event: it tracks the state from the last time the watch fired, and if it's the same event, it doesn't fire again. It's called Sifter and it's available on GitHub. We use it in production at Datadog and it's pretty nice.

My last tip is pretty key: if your output isn't unique and you can get away with it, don't build your configuration from Consul data or services on every node. Build it on a single node and then use the KV store as the transport mechanism.

That's my last tip, but related to it, I have one more thing today. The last year has been a huge learning curve for us working with Consul, and transporting configuration data using the KV store has been a very pleasant surprise. It's fast, it's stable, it's reliable, and so we've been steadily revising how we do it. As with many things, the first version was written in Bash, the second version was written in Ruby, and the latest version is written in Go. It's been in production now for a couple of months, it operates tens of thousands of times per day for us, and it's been really fun to develop. It handles the whole delivery lifecycle: inserting the configuration data into the key-value store as well as extracting and delivering it on the other end. We also have a Chef cookbook that helps you create the Consul watch needed to most efficiently distribute the config files. We call it kvexpress, and we're making it available today. It's a tiny Go binary that handles the inserting and delivering of the files. And - we're a metrics company - it emits metrics and events so you can audit and measure what's happening. The file that's uploaded is the same as the file that's delivered.
We compare hashes to make sure it makes the journey intact, and we only use the finest of all the hashes. It's efficient: if a file hasn't changed and something tries to reinsert it, it stops and doesn't reinsert it; if Consul thinks the watch needs to re-deliver it, it checks whether it's the same file and doesn't re-deliver it. It's optimized for safety, because we don't want to write a blank file - that's generally pretty bad for whatever service depends on that config file. And you can run commands after the file is delivered.

It's also super fast. Once it triggers from a Consul watch, it's under 500 milliseconds to deliver a 40-kilobyte file to 1,000 nodes. The stats are weighted most heavily under about 300 milliseconds, but there are always stragglers to make my histograms and heat maps look terrible. I originally tried to measure the entire process from start to finish based on syslog timing, but everything was happening within a single second, so that wasn't helpful. So we added some higher-precision logging: inserting into the key-value store usually takes about 100 milliseconds or less - that's on the left - and delivery from the key-value store usually takes under 300 milliseconds, very often much closer to 100.

When it inserts records into the key-value store, it can throw an event to Datadog so we can audit what's happening. You can see on the bottom we added a Postgres node somewhere, and on the top there's a Kafka node getting removed. These events can be shown on a timeline, they can be graphed, they can be measured - you name it. Again, because we're a metrics company, we emit a ton of metrics: the size of the file, the number of lines in the file (zero-line files are bad), when it's firing, how often it's firing, how long it takes. We also emit other things like panic metrics, file-too-short metrics, checksum mismatches, things like that.

And I have a very, very quick demo - it should take just a couple of minutes. I originally tried to make a video of the demo from start to finish, and unfortunately, even for me, it was super boring, because it's one of those things that just happens in the background. So what I'm going to do today - let me clear that - is show an ad hoc usage of kvexpress on a prod cluster, so don't tell anyone. In addition to using watches to deliver things, sometimes you need to deliver a config file to a bunch of nodes quickly: you're having an outage, you need to change something, and Chef is busted or something. So we also have what we call ad hoc usage, where I've set aside a temp keyspace in Consul that we can use for writes like this.

So I'm just going to cat my demo script, and here I'm just going to sudo. This command is the "in" command: kvexpress in. I'm feeding it a URL - it's just a 600-line config file I grabbed from a SaltStack website this morning - and I'm going to insert it under the scale-14x key, in the temp prefix. When you do it with a URL, it automatically shows the output, just to make sure that the thing you put in is the right thing. And if we take a look here, listing the keys under the temp prefix, we can see it's there - and there's the SHA-256 checksum for the text file that we grabbed.
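That command was roughly shaped like this. The flag spellings follow how I just described them and may not match the current kvexpress README exactly, and the URL is just a placeholder.

    # Pull a config file from a URL and push it into the KV store under the
    # temp/ prefix as the scale-14x key.
    sudo kvexpress in -u https://example.com/sample.conf -k scale-14x -p temp

    # Sanity check: list the keys that landed under the temp prefix.
    curl -s 'http://127.0.0.1:8500/v1/kv/temp/?keys'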
So what I'm going to do here is take this file and use consul exec - again, one of those scary things that you sometimes get to use. What we're saying is: for the datadog service, run kvexpress out; -d sends out the metrics; again, we're looking at the right key in the right prefix; and we want to write out temp/scale-14x.conf. The -e at the end runs a command at the end of the process - if it's a config file for Apache, for example, you might want to do a sudo service apache restart or something like that. In this case I'm just going to correct the file name first so it's not left lying around, and I'm going to change the name, because I think I did this a few times testing earlier today - like I said before, if it's the same file with the exact same text, it won't redo it. So that's going to go across 1,100... 1,200... almost 1,200 nodes, and it's done. It's kind of fun. Before, we used to do that with Capistrano, and sometimes we would just end up stopping everything and doing it slowly, because it would take so long to do everything in parallel. So that's kvexpress. It'll be available as soon as I can click "open" - so quite shortly. And I think we have a little bit of time for some questions. Sir?

Okay, so: encryption for secrets, and the scalability of the leaders. We don't have any secrets in there at the moment, because there's no good way to do it right now other than maybe using something like Vault. Hopefully, now that this has calmed down and the presentation is done, I can start working on Vault for that. It's another HashiCorp product, it ties in with Consul, and it's probably a better way to do it. We looked at a few other ways, and there's nothing really good. As far as the horizontal scalability of the servers: like I said earlier, you make them a little bit bigger until the leadership transitions go away. I don't see a problem with Consul 0.6 - you could probably get away with three server nodes in production, depending on your use case. There are some people who have very little read traffic; we have a lot. We were hitting the key-value store from a thousand nodes, some of them with ten processes at once, reading a hundred keys. We're still doing that, and it's not breaking anymore under Consul 0.6. So it's great, it's ready for production - unless you have 10 or 20 thousand nodes; I think at that point it starts to break down. Does that answer your question? Okay. Anyone else? Oh, in the back, sir.

Well, we're not saving any money, that's for sure - not with those five c3.2xlarges. The biggest problem was around service discovery. Whenever we were trying to update any of the things in that environment file, it was taking two hours to get out, and that's too long. If we had a problem, the first thing people would do was: okay, everybody stop Chef. And we would stop Chef and roll things out manually, just because we didn't have enough time and no one trusted Chef in that moment. Chef's great, and we use a lot of it, but in that 500 milliseconds versus 30 minutes or two hours, you have a lot more capability. Like I mentioned earlier, when we replaced our entire Kafka cluster, we wouldn't have been able to do that with Chef without a lot of pain.
And it would have taken us a lot longer than the two or three days it took. It just gives us the agility we need to do things, especially when things are hitting the fan. Sorry - you there. I know, that's crazy, we wrote them at the same time. consul-template is great - we still actually use it to template things from the service discovery side. The dedup feature - Armon had a talk on Tuesday, and Armon is one of the original authors of Consul - essentially does a similar thing. The thing that's different for us is that we've integrated kvexpress into our Chef workflow with an LWRP: when it runs through Chef, it looks to see if the file is there, and if there's no file it grabs it through kvexpress and then writes the watch automatically. So we don't have to do any of the setup on every node for all of these watches, because it's handled through the LWRP. Essentially they're both doing very, very similar things, and if that had been available in November when I was writing this, we might not have done it - but it wasn't, so. Sorry, sir.

Oh, yeah, we do all that. We have stale, we have all that stuff. We actually never found some of the consistency modes to make a very big difference in our environment. I might have been doing it wrong - I know I do a lot of things wrong - but we never found it to be a very big difference for us. We still have it on, but it really doesn't make a difference. Yes, some of the KV requests that we make are in that mode. You know what, I never got around to timing it at the time - it was more "just get it to work, because everything's burning." Sorry, over here.

Sure, so your first question, about ZooKeeper: Consul comes with all of the service discovery built in. That parallel service discovery app we wrote was based on ZooKeeper, because we already have a bunch of ZooKeeper clusters because of Kafka. But ZooKeeper doesn't have the DNS interface, it doesn't have the services - it's not built in, you have to handcraft it all yourself. We like ZooKeeper, it's pretty good, but for this we also wanted to be able to query it over HTTP and not have to have a really thick client, and that just isn't available for ZooKeeper as far as I'm aware. The second question was around AWS services. Part of it is that we don't depend on anything that only AWS offers. We're doing about 5 million metrics per second - we've increased five times over the last year - and a lot of the things Amazon comes out with can't handle the volume we're at right away. At this point, for some low-volume stuff, I'm looking at using some Lambda things to throw data into Kinesis, for things like logging and some metrics that aren't in our critical path. But Kafka is our... if Kafka doesn't work, we're screwed. So that's just how it is. Sure. And it totally depends on your volume, too - if you have a similar volume, it might work, but for us, we couldn't at the time. Anyone else? Sorry, I saw you first. Yeah, you.

No, sorry - that's 300 milliseconds from the time the watch fires on the local node to the time it's done writing the file and finished. It's all in one datacenter. I actually wanted to measure, for this talk, how long it takes to do that write,
and then what's the gap between the two, before nodes start reading. The only problem was that, with NTP, I couldn't get sub-microsecond precision. Sometimes nodes were already starting before it was even done writing. So I want to figure out exactly how long that takes, but it's so fast that I haven't been able to measure it yet.

No, we're just in three AZs in us-east at the moment. We have hundreds and hundreds of terabytes of data in S3 and in all sorts of other things, so moving to multiple datacenters is pretty interesting for us. Sorry - you, yes.

Yeah, that would be a problem for this, for sure. We don't have that situation: we're still all on VMs and we have consistent ports. Containers haven't been an option because of the amount of data that we have - we have almost nothing that's stateless - so we can't use containers at the moment. When we go down that route, we might have to use an overlay network or something else that gives you an IP address per container. Otherwise you'd have to use SRV records, which Consul supports as well; we just haven't needed that yet. Yeah, I think so - we haven't tried that yet. Is there another one? Sir, over there.

Yeah, our applications are not currently aware of SRV records; they're not querying for those. We register with a small JSON file on each node, based on the role from Chef. When we have a Chef role, it writes a specific JSON file for that role only, and the JSON file registers the service. Once the node gets killed, the service goes away. Does that answer it? Okay. Who else? Yes - oh, one back there, and then, sorry, you after that.

Not yet. Like I said, I just finished upgrading two weeks ago. We haven't, but we're looking at that because of the AZs and some of the weird stuff we've seen between them. Sorry.

No - so on my octohost project, which is basically a very tiny, very lightweight Heroku that does none of the things Heroku does around state, I register with an API call and point at those containers. So you can do it that way as well. We just chose to do it with these JSON files because of the stability of our nodes and Chef. But you can do it with API calls, and it's no problem to say: here's the port, here's the IP, and you're good. Does that answer your question? Yeah, you could - I do that myself. I host probably 60 containers on one box, and all the config is built out of the Consul service discovery catalog, registered via API calls.

Oh, sorry - I didn't totally understand that. We haven't integrated with it, but I know there's some movement around that. I know HashiCorp is building their own Docker image, and I know there's some integration; I just haven't used much of it yet. Sorry.

Okay, thanks for having me, everybody. And if you have any questions, let me know.