So my name is Roger Ignazio. I'm a tech lead at Mesosphere, specifically for the day two operations team, which covers pretty much everything that comes after provisioning: logging, metrics, debugging, that kind of thing. Today we're going to be talking a bit about the metrics side of things. I'm also the author of Mesos in Action, a more operations-focused book on Apache Mesos that also goes into DC/OS a little bit. If you're interested in picking up a copy, you can use the code on the screen at manning.com and get 39% off.

Today I'm going to talk a little bit about what day two operations means in the context of Mesosphere and DC/OS, provide a quick introduction to the various Mesos APIs for monitoring, and then introduce Snap, a cool new plugin-based metrics collection system. After that, I'll touch a little bit on what it was like to develop the Mesos plugin for Snap and how you can write your own plugin. I was going to give a demo, but because of technical difficulties I won't be able to do that today, so maybe I'll record something and post it online after the talk.

First, I just wanted to start off with what day two operations means to us at Mesosphere. We joke a little internally that day two operations is everything that comes after day one. So what does that mean? Everything that happens after you provision a Mesos or DC/OS cluster falls into day two operations: logging, debugging, metrics collection, really anything you need to do to operate a cluster and ensure its health. We recently started a day two operations working group in the DC/OS community Slack, so this isn't just a team at Mesosphere anymore; there are thirty or forty people in the community Slack who are interested in this kind of work, and we would love it if you joined us.
You can join the day2ops channel on Slack, or we have a mailing list, the day 2 ops working group at dcos.io.

Before I get into the Snap portion of this talk, I want to introduce the various Mesos metrics APIs that are available. These APIs provide information about workloads running on the cluster, individual containers, frameworks, the resources being utilized, and so on. There are a lot of different API endpoints, and I just want to highlight a couple here for the purposes of this talk.

The first one is the /redirect endpoint. In a production, highly available Mesos deployment, you commonly have somewhere between three, five, or seven masters. The /redirect endpoint returns an HTTP 307 redirect to the leading master, and this is important because you never want to query a non-leading master for its metrics. I'm not sure if this has been improved in recent versions of Mesos, but it used to be that if you queried a non-leading master, it would return information that might not always be up to date. The leader always has the latest state of the world for your cluster.

Next is the /metrics/snapshot endpoint, which is a summary of the master's metrics, a high-level operational view: things like how long it's taking to query the Mesos internal registry and how many messages are being sent back and forth between the frameworks. There are also the /state and /state-summary endpoints, which provide a detailed or summarized view of all of the cluster state: your frameworks, distributed systems, containers, configuration flags, and so on.

Now, that was just on the master, the three or five or seven machines that make up the master quorum of your cluster. On the agent, on port 5051, we have a few different endpoints.
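To make the leader check concrete, here's a minimal Python sketch. It assumes the `pid` and `leader` fields that appear in the master's /state response; it's an illustration of the idea, not the actual plugin code.

```python
def is_leader(state):
    """True when this master's own pid matches the elected leader's pid,
    based on the "pid" and "leader" fields of the master's /state JSON."""
    return state.get("pid") is not None and state.get("pid") == state.get("leader")

# Illustrative /state fragments: the leader reports itself as leader,
# while a standby master reports the other machine.
leading = {"pid": "master@10.0.0.1:5050", "leader": "master@10.0.0.1:5050"}
standby = {"pid": "master@10.0.0.2:5050", "leader": "master@10.0.0.1:5050"}
```

In practice you'd fetch /state from the master you're running next to and skip collection entirely when the check fails.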
Specifically, we have this new /containers endpoint, which I believe came out in either Mesos 1.0 or maybe 0.28, and it provides metrics about running containers: container IDs, CPU, memory, disk usage, and so on. This used to be available at the executor level, but now you can query it for individual containers if you need that level of granularity. Just like on the master, the /metrics/snapshot endpoint provides the agent's metrics: the number of tasks and what state they're in (running, failed, or staging), the used CPUs, the allocated memory, and so on. There's a lot of information available there, and all of it is returned as JSON over HTTP. Then /state provides the detailed information about the node: what frameworks and tasks are currently running on that specific agent, what labels are associated with those tasks, what configuration flags are present for the agent, and so on.

But this is just the tip of the iceberg. There are dozens of different API endpoints available; I don't have time to cover all of them, and not all of them are necessarily relevant for this talk. One endpoint that's available on both the masters and the agents is /help, and it details all of the available endpoints, a little bit about what they're used for, and what response you can expect. You can query /help on either the master or the agent for your specific version of Mesos, or if you go to mesos.apache.org, you can see the documentation for the latest version of Mesos.

So that's just an overview of the information we have available, and that information can be very valuable depending on what you're trying to do in your company or with your cluster. Now I'm going to get right into Snap, and really the best way to describe Snap is with the blurb from its project website.
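To give a feel for what /metrics/snapshot returns, here's a small Python sketch that pulls the task-state counters out of an agent response. The response shown is a truncated, illustrative subset; real responses are flat maps of metric name to value with many more keys.

```python
# Illustrative subset of an agent's /metrics/snapshot response.
snapshot = {
    "slave/tasks_running": 12.0,
    "slave/tasks_failed": 1.0,
    "slave/tasks_staging": 2.0,
    "slave/cpus_total": 8.0,
    "slave/cpus_used": 3.5,
}

def task_state_counts(snapshot):
    """Pull just the task-state counters out of the flat metric map."""
    prefix = "slave/tasks_"
    return {key[len(prefix):]: int(value)
            for key, value in snapshot.items()
            if key.startswith(prefix)}
```

A monitoring system can watch these counters over time, for example alerting when the failed count climbs.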
Snap is "an open telemetry framework designed to simplify the collection, processing, and publishing of system data through a single API." That sounds great, but what does it mean in practice? It's a plugin-based architecture for collecting, processing, decorating, and ultimately publishing metrics through to an end system, whether that's something like KairosDB or InfluxDB or some other time series database. It allows you to decouple the publishing of metrics from the collection of metrics, and from some of the processing you might want to do, like aggregating or taking a moving average of specific values. This is pretty powerful: you've really decoupled the publishing of those metrics from the collection, and Snap is there to handle all of the translation between collection, processing, and publishing.

The Snap workflow, just borrowing these images from the website, looks a little bit like this: you have varying workloads, where you're collecting all of these heterogeneous metrics; you can filter, add context, tag metrics, and aggregate them; and then you ultimately publish them, either onto a message queue or to a time series database, and visualize them with your tools of choice. All of this happens on each of the individual agents, so the processing is distributed. Like I touched on, it's a plugin-based architecture: the collectors, processors, and publishers run on each node.

The really cool thing is that all of this supports dynamic updates. There's no need to restart the Snap daemon if you want to make a configuration change or upgrade a plugin; you can do all of that without losing data. There's also a feature called Snap Tribe, a gossip-based protocol for handling configuration changes.
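The collect, process, publish pipeline described above can be sketched in a few lines. This is a toy illustration of the pattern, not Snap's actual internals; the metric name and values are made up, and the "publisher" is just a list standing in for a time series database.

```python
from collections import deque

def moving_average(window):
    """Processor stage: keep a sliding window of samples, emit the mean."""
    samples = deque(maxlen=window)
    def process(value):
        samples.append(value)
        return sum(samples) / len(samples)
    return process

published = []  # stand-in for a time series database or message queue

def publish(name, value):
    published.append((name, value))

# Wire the stages together: collect -> process -> publish.
smooth = moving_average(window=3)
for sample in [1.0, 2.0, 3.0, 4.0]:  # pretend these were just collected
    publish("cpus_used", smooth(sample))
```

The point of the decoupling is that any stage can be swapped out independently: a different processor or a different publisher, without touching collection.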
So if you were to push a configuration change out to a hundred- or thousand-node Mesos cluster, to collect different metrics, change the polling interval, and so on, you're able to do that without a configuration management tool and without logging into each one of those machines; Snap handles that for you. That's pretty powerful if you're using something like CoreOS that's minimal by default. To control Snap, there is a command line tool called snapctl. There's also a REST API, which the plugins use to communicate with the daemon. I believe you can use the REST API directly as well, although most people just use the command line interface.

Now, a plugin is implemented in Go, or in anything that speaks the gRPC protocol Snap uses. At a high level, a collector plugin looks something like this: you have to implement three methods. There's GetConfigPolicy, which declares the plugin's configuration. There's GetMetricTypes, which is called when the plugin is loaded and populates the available metrics on that machine, so that you can then build a task that queries the machine for its metrics; without knowing the available metrics, there's really no way to know what you can collect. And finally, there's CollectMetrics, which is called by Snap on the configured interval and runs whatever code you need to collect metrics from the system you're querying. In the Mesos plugin's case, that means hitting the various APIs I mentioned, on the interval the Snap daemon has set. At a high level, that's what it takes to implement a plugin.

So if we take a look at the Snap collector plugin for Mesos, which is available on GitHub, it's open source, and we would love contributions, it looks a little bit like this. Just to break it down: we have one Mesos plugin that handles both masters and agents.
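The three-method collector interface described above can be mocked up as follows. The real plugin implements GetConfigPolicy, GetMetricTypes, and CollectMetrics in Go against Snap's plugin library; this is a simplified Python analogue, and the config keys and metric names are illustrative.

```python
class MesosCollector:
    """Toy Python analogue of a Snap collector plugin's interface."""

    def get_config_policy(self):
        # Declare the configuration the plugin accepts (hypothetical keys).
        return {"host": "localhost", "port": 5050}

    def get_metric_types(self, config):
        # Called once when the plugin is loaded, so Snap knows what this
        # machine can report and tasks can be built against those names.
        return ["master/tasks_running", "master/cpus_used"]

    def collect_metrics(self, requested):
        # Called by the Snap daemon on each scheduled interval; the real
        # plugin queries the Mesos HTTP APIs here. We return dummy zeros.
        return {name: 0.0 for name in requested}
```

Snap drives the lifecycle: it calls get_metric_types once at load time, then calls collect_metrics repeatedly on the task's schedule.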
So the first thing you'll notice in CollectMetrics is that we determine whether we're on a master or an agent, or whether both are running on the same machine for development purposes. Next, we determine whether the master is actually the leader. Like I said before, we don't want to query a non-leading master; if it isn't the leader, we just don't do anything. We get information about the frameworks, about the running containers, and so on, put all of it into a metrics array, and return it to the Snap daemon. From there, depending on the task and the workflow we've defined, those metrics can be pushed through a processor, or directly into a publisher and off to a time series database.

These are the namespaces currently available in the Mesos plugin: the master and agent namespaces, along with these asterisks. The single asterisks are referred to as dynamic namespaces, which are built dynamically based on the underlying service you're querying. I'm not going to read everything from this slide, but essentially these dynamic namespaces can be filled in with framework IDs and executor IDs, so that we're able to associate metrics with individual workloads. That way you can determine how much memory or CPU Marathon is using, versus Chronos, versus Jenkins, and so on, and really get a good snapshot of what your cluster is doing and who's using it.

Deployment looks a little bit like this, with the Mesos master and Mesos agent running on each of these nodes. The Snap daemon is also running on these nodes, and you'll see that on the standby masters the daemon is still attempting to collect every, let's say, 10 seconds, but it doesn't do anything on a machine that isn't the leader. And to actually configure Snap, you have this concept of a task.
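Before getting to tasks, the dynamic namespaces just mentioned can be sketched like this: a namespace is a path of segments, and each `*` wildcard gets filled in with a concrete ID at collection time. The segment names and the framework ID here are made up for illustration; the real plugin's namespace layout may differ.

```python
def expand(namespace, substitutions):
    """Replace each '*' wildcard with the next concrete value."""
    values = iter(substitutions)
    return [next(values) if part == "*" else part for part in namespace]

# A dynamic namespace with a wildcard where the framework ID goes.
template = ["intel", "mesos", "master", "framework", "*", "cpus"]
concrete = expand(template, ["marathon-0001"])
```

This is what lets one metric definition fan out into a per-framework or per-executor series without enumerating every workload up front.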
The task contains the metrics you want to collect and where you want to publish them: the hostname or IP address of your time series database, the IP address and port that Mesos is listening on, and so on. And like I said, each daemon is running on its own node, so you need a way to get those files onto the nodes; you could use Snap Tribe to roll out configuration changes. All of this is then rolled into some time series database, or an HTTP API, or some other publisher plugin that you develop, or one of the publisher plugins already available in the Snap ecosystem.

This is a little bit hard to see, and like I said, I couldn't do the demo today because of technical difficulties, but this is a screenshot of a Grafana dashboard I created quickly that plots all of these metrics. Up at the top, we have cluster health: the number of tasks in their various states. In the middle, we have overall cluster utilization: how much memory, disk, and CPU is available on the cluster and how much is being used. Then we have resource utilization by framework; I think Marathon is the only framework running here, and it's using somewhere around 8 gigabytes of memory. Cut off a little bit down at the bottom, there's also resource utilization per executor. So if you have a misbehaving process or application, or you want to determine where your network traffic is going on a specific host, you can build dashboards and query something like InfluxDB for all of those metrics over time. And like I said, I really wanted to give a demo today; unfortunately, it's just not going to work out, so I do apologize about that.
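A task like the one described above might look something like this, expressed here as a Python dict for readability. The version, schedule, and workflow layout follows Snap's task manifest shape, but the metric names, hosts, and plugin config values shown are hypothetical, not taken from a real deployment.

```python
# Hypothetical Snap task: collect two Mesos metrics every 10 seconds
# and publish them to InfluxDB.
task = {
    "version": 1,
    "schedule": {"type": "simple", "interval": "10s"},
    "workflow": {
        "collect": {
            "metrics": {
                "/intel/mesos/master/tasks_running": {},
                "/intel/mesos/agent/cpus_used": {},
            },
            # Plugin config: where the collector finds the Mesos master.
            "config": {"/intel/mesos": {"master": "10.0.0.1:5050"}},
            "publish": [
                {
                    "plugin_name": "influxdb",
                    "config": {"host": "10.0.0.5", "database": "snap"},
                },
            ],
        },
    },
}
```

Changing the polling interval or the publishing target is then just an edit to this one document, which is what makes rolling it out via Tribe so convenient.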
But if you'd like more information, this GitHub repository contains a Vagrantfile, and in that Vagrant environment you can pretty much just run `vagrant up` and it will provision a single-node Mesos cluster with Snap, the Snap Mesos plugin, InfluxDB, and Grafana. Pretty much everything I wanted to demo today is available in that repo; assuming you have Vagrant installed, you can just run it and try it yourself. Snap itself is also available and open source under the Apache license, so you can grab it and check out the various plugins. You can also check out snap-telemetry.io, where there's a whole plugin catalog with all of the different collector, processor, and publisher plugins. At Mesosphere, we're also working on a metrics project for DC/OS; a lot of this talk has been Mesos-specific, but it also applies to DC/OS, and we're working on a few different things there if you want to go check out the DC/OS metrics project. And like I mentioned earlier, we'd love for you to join us on Slack in the day2ops channel, or on our mailing list, the day 2 ops working group at dcos.io.

Because I didn't have time for the demo, I'd like to open it up for questions if anyone has any, or if you want to send me an email at roger@mesosphere.com, I'd gladly accept it. I don't see any hands, so we're going to end this talk a little bit short. Thank you for attending, and enjoy the rest of the conference. Thank you.