Just a quick disclaimer: there is not a single mention of Docker or containers in this talk.

I don't know about many of you, but before I started working at Datadog, even though I love monitoring, I would monitor my boxes at the end of a project. I'd build the thing, test the thing, launch the thing, then maybe monitor some of the things and write the docs. So I would set up my tools, I'd grab some metrics, I'd get some pretty graphs, and I thought that I was doing the right thing. But I kind of missed the point. I thought "monitor all the things" was it, but nope. In fact, I had failed, and I deserved a wicked dragon kick from a DevOps ninja. I had missed one of the crucial times to monitor my service.

Think about monitoring from scratch, and start thinking about it when the packages are still in the repo. How else do you know that it's doing the thing right, even while you're building the thing? It might not be doing the thing you think it's doing. So monitoring last is a fail. You should be monitoring first. Plan to monitor it before there is even data, especially if it's big data.

For example, we decided we needed to prototype something that could help us. We had hundreds of VMs, and a 30-minute Chef run to change a feature flag was too long. We looked at a few options (no offense to Chef), and Consul looked like it had the components that we were after: small binary, DNS and HTTP interface, key-value storage, failure detection. We were pretty excited, but we were a bit afraid of it. This is a new tool. How much memory will it take? Will it interfere with other processes? Would it be destabilizing to our clusters and impact production? There were a lot of unknown unknowns, things we didn't know.

So we started, as many of us do, by fixing staging. We read the docs, cheffed up some recipes, and got a cluster running. We seasoned it to taste. Now, what should we monitor?
Well, on the Consul server side, we started with a few things: overall average networking, networking per server, and CPU per server. And great, we've replicated Munin in 2015, so I guess my work is done here. But we also wondered if the agent would use up all our precious memory, drive the OOM killer crazy, and start stopping our processes. Nope, didn't do that either. None of our worries materialized, most likely because we weren't really using it for anything.

But as we worked with Consul, and broke and fixed the cluster, we found the two metrics that were the most important. One: do we have a leader? And two, the orange line: has there been a leadership transition event?

So after a couple of weeks of exploration, and watching it idle without taking down the world, we thought: well, staging isn't prod. Let's see how the cluster behaves with more nodes. It's probably fine. So we push it to prod and, well, that's not so good. That sure looks like a lot of leadership transitions to me, especially since it's a lot more than staging. How about we add a couple more server nodes and see what happens? You know, there's three and a half times the nodes in prod. So that looks a lot better. Five server nodes is about right.

So now that it's all calmed down and we're feeling lucky, let's try installing the app that writes configuration files and reloads our services on the fly. That sounds so awesome. Well, maybe that wasn't such a good idea: too many things querying for too many other things, all at once. Maybe building that file on every machine is not the right thing to do. There's got to be a better way. So we decided: let's build it on a single node and distribute it with the integrated key-value store. No more unending leadership transitions. No more scary graphs. Much wow.
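As a rough illustration of the two metrics above (is there a leader, and has the leader changed), here is a minimal Python sketch. It is not the setup from the talk. It assumes a Consul agent reachable on the default local HTTP address and polls Consul's `/v1/status/leader` endpoint, which returns the leader's address as a JSON string, or an empty string when there is no leader. The helper names (`parse_leader`, `fetch_leader`, `has_leader`, `count_transitions`) and the sample addresses are made up for the example, and counting a transition as "any change between two consecutive polls" is a simplification.

```python
import json
import urllib.request

# Assumption: a local Consul agent listening on the default HTTP port.
CONSUL_ADDR = "http://127.0.0.1:8500"


def parse_leader(body: str) -> str:
    """Decode the body of /v1/status/leader.

    The endpoint returns a JSON string such as "10.0.0.1:8300",
    or "" when the cluster currently has no leader.
    """
    return json.loads(body)


def fetch_leader(addr: str = CONSUL_ADDR, timeout: float = 2.0) -> str:
    """Ask the agent who the current Raft leader is."""
    with urllib.request.urlopen(f"{addr}/v1/status/leader", timeout=timeout) as resp:
        return parse_leader(resp.read().decode("utf-8"))


def has_leader(leader: str) -> bool:
    """Metric one: do we have a leader?"""
    return bool(leader)


def count_transitions(samples) -> int:
    """Metric two: how many times did the observed leader change?

    `samples` is a sequence of leader addresses from successive polls.
    Every change between consecutive polls counts, including losing
    the leader (a change to "") and gaining a new one.
    """
    transitions = 0
    previous = None
    for current in samples:
        if previous is not None and current != previous:
            transitions += 1
        previous = current
    return transitions
```

In practice you would call `fetch_leader()` on a timer, send `has_leader(...)` and the running transition count to whatever metrics system you already use, and alert when the count climbs. Polling can miss transitions that happen between two samples, so treat the count as a lower bound.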
So because we monitored first, we can experiment and see the impact of our choices before they become next year's technical debt. Because we monitored first, we have data that we can make decisions with, rather than just our gut feels. Because we monitored first, when we ran into strange pauses, we could then collect additional metrics and discover: whoa, the individual nodes aren't going deaf. The server is, and that's affecting groups of nodes.

So please, please monitor first, not last. Make sure the thing that you're building is doing what you think it's doing before it's too late and you have to do a 270 away from certain peril. Monitor first, because you just never know what she might do if you don't.

Thanks for your five minutes. Come find me to talk more about Consul, or even Docker. I'm Darren Froze, and thanks.