OK, let's make a start. So next up is day two operations best practices: Janet from SignalFx and Ben from Mesosphere.

Cool, thanks. OK, so we're doing day two operations. The main impetus behind this is, I'm primarily in a customer-facing role, and over the last year or so we noticed that there's much more emphasis being placed on operations versus deployment and installation. That's a great thing, right? It means that Mesosphere is finally in a place where doing the generic install is simple enough. And depending on the company size, you could be a small startup, you could be an enterprise, you could be a service provider, with a team ranging from one or two individuals to an entire DevOps team. The most common theme that we saw was that people want simplified, streamlined metrics pipelines. So that's what we're going to be talking about today.

So the agenda is: we have a brief overview, we're going to talk about all the different categories of metrics in Mesos, and then we're going to take a deeper dive into the Metrics API. We've actually got a live environment, and we're going to have a demo where we run a script against the API and pull in that information. But most importantly, we're going to look at the data in a back-end monitoring system, interpret the data, and look at the different visualizations, to give you some ideas on how you can build dashboards around what's meaningful to you.

So I've got a couple of marketing slides here; I'll zoom through these pretty quickly. Generally, for most companies, it's all about data, right? In most cases, what they want is a single platform that runs on top of any infrastructure: cloud, private cloud, you name it. And then this will connect with billions of different users and client devices, as well as, increasingly, IoT devices. One example: we were talking to a customer in Singapore that's running a project called Smart City. Things like bus schedules, down to the minute, they want to be able to optimize, and all of that requires additional sensors and data.

This is largely led by the developers, right? So we're transitioning from a model where we've got traditional enterprise applications, these monolithic applications with a single code base, very brittle, persisted into relational databases on the back end, your Oracle or SQL Server, into more cloud-native applications. These are microservices: stateless applications running in containers that are lightweight, flexible, and modular, communicating over RESTful APIs. And increasingly, companies are adopting all the latest open-source technology: Spark, Kafka, Cassandra, Flink, TensorFlow, and so on.

And then the second thing is, apart from microservices, customers want to build out data pipelines. This involves ingesting data from all these different sources into something like Kafka, persisting that into, say, Cassandra or HDFS, running analytics on it using Spark or Flink, and then writing applications that act on that particular set of data. The thing to note here is that these are all very complex distributed systems for the most part, and each of them has metrics that operators need to be aware of. So the goal is really to paint a picture, this holistic view of everything that's running in this particular environment. Now, for operators that are looking at applications, we talked to a bunch of customers, and these are some of the common challenges that they face.
Scalable capacity, dynamic architecture, load balancing. What is scalable capacity? This refers to when you have an application that is elastic and ephemeral; it may scale up and down. We need to know, based on the characteristics of the application, when we need to add additional compute or networking infrastructure. So that's identifying what's important relative to the application, and then scaling it up and down.

Dynamic architectures: we now have this new developer pattern where we've got these modular services that talk to each other. The nice thing about that is you can swap out services; you can exchange one component of your application, do a blue-green deploy, or whatever. But the idea here is that each application is now a set of constituent services. So instead of just monitoring for a single process that's running on a particular operating system, we might now have tens or hundreds of different applications, each of which has its own tasks, possibly hundreds of tasks per application. We need to understand what that looks like and how to measure it going forward.

And then load balancing. This is pretty straightforward. You have an application that might have one or more instances, typically with either an east-west load balancer or some other load balancer, like an ELB or maybe an F5, sitting in front. How do you make sure that the load balancing algorithm is performing properly over time?

So next, we're going to look at metrics. Metrics are just characteristic measurements that are used to determine the health and performance of the system. How is my cluster doing? How is it performing over time? Do I have any specific issues with my infrastructure? How many masters do I have? Is the leading master changing periodically? And then, most importantly, it's really about the application performance. Part of the challenge previously was: how do I get the application metrics from the container running in my Mesos environment? We're primarily going to be covering the first two categories, so infrastructure (CPU, memory, disk) as well as the application metrics. But you can also capture things like users, logins, and usage for use cases such as chargeback. Increasingly, we're seeing more and more customers wanting multi-tenancy out of their Mesos environment, so chargeback is going to be key for understanding how much of the resources each individual group is consuming.

OK, who's seen this diagram before? This is the canonical Mesos architecture diagram. Again, the goal is just to reinforce the fact that there are different sources for all these different metrics: from ZooKeeper, to the Mesos masters, to the agents, to the frameworks that are running on top of the Mesos environment and scheduling tasks, and then down to the executors that are running the actual tasks.

So the metric sources: we have the Mesos metrics. We also have container metrics, which are more or less OS-level metrics that we capture. And then for the application, what's really interesting is things like requests per second, latency, active users, and response time. The masters, the agents, and Marathon each have their own metrics endpoint, and that's what we use to capture the information and funnel it to our back-end system. So I won't go through these in detail.
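To give a concrete flavor of those endpoints, here's a minimal Python sketch, assuming a reachable master at a placeholder address, that polls the Mesos master's /metrics/snapshot endpoint. The metric names shown are from the Mesos observability metrics documentation; everything else is illustrative.

    import requests  # assumes the 'requests' library is installed

    MASTER = "http://master.example.com:5050"  # placeholder master address

    # Both the master and each agent expose a flat JSON map of
    # metric name -> value at /metrics/snapshot.
    snapshot = requests.get(MASTER + "/metrics/snapshot", timeout=10).json()

    # A few of the health indicators mentioned above:
    print("elected:", snapshot.get("master/elected"))
    print("agents connected:", snapshot.get("master/slaves_connected"))
    print("agents active:", snapshot.get("master/slaves_active"))
    print("uptime (secs):", snapshot.get("master/uptime_secs"))

The same pattern works against an agent (port 5051 by default), just with agent-side metric names.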
I'm going to actually ask Janet to show off a couple of dashboards that she built to convey the same information.

So we're going to take a quick look at this observability metrics page from the Mesos documentation. It details all the metrics you can get from the different resources, like the master nodes, the agents, frameworks, allocators, et cetera. So lots and lots of data coming in. It also outlines some recommendations for things you want to look out for in your cluster to know whether it's performing normally or not. Using that information, we built a collectd plugin that grabs metrics from the master and agent nodes and then sends them off to our monitoring app. And here's one of the dashboards for that: the problem indicators. As an aside, for the charts we use Dygraphs and D3, the JavaScript libraries.

This first chart here on the left, cluster health, gives us an idea of whether or not the agents are performing normally. Taking a look into that chart, we're using the agents-connected metric, comparing that with the agents-active one, and seeing what that ratio is. Luckily for us, our percentage is 100%. But if it dips below 100%, we've got rules set up to notify ourselves based on which range it falls into, and therefore how severe the problem might be.

Going back to the dashboard, another example: number of leading active masters. You want that to be one, right? Too many masters, not good; too few, also not good. So we've got this one that sums up how many are being reported as active, as well as master uptime. What we look out for there is whether it's below the recommended value of five minutes, or below, say, a minute for some duration of time; that's indicative that the master may be flapping, so we'll notify ourselves if that's a problem.

So, Janet, I have a question for you. Totally unrehearsed. When you're actually building these dashboards, can you talk about how you decide what to bubble up based on the requirements? Like, what's important to actually convey on these screens?

That's a great question. From users we've talked to, people are interested in knowing: are my tasks running, is my workload being handled? And if not, what is the thing that's preventing that from happening? So we organize information mostly around the infrastructure, like the CPU, disk, and memory, as well as what's going on with the tasks, like what their throughput is.

If we want to drill down further, we have a dashboard with the clusters overview. This shows me all my clusters and what's going on in them: my CPU, my memory, and, maybe more interesting, the number of tasks running. In blue here we have how many are running currently, and we compare that with this red dotted line of the week-over-week growth. So looking at past data and comparing it with the current data: if your growth is continually increasing, you may need more capacity to handle your workloads.

Going further down, we have another dashboard for a specific cluster: what are the tasks running in it, and what does the CPU and memory usage look like? Actually, let me go back to the overall clusters view. Something we commonly use is comparing percentiles. In this chart we have the CPU percentiles over all hosts. There's something running in here that's taking 100%, which is the max value as well as the P90, whereas the median, or P50, is only around 33.
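Going back to the cluster-health ratio for a second, here's a hedged sketch of how a check like that could be computed from the master's snapshot. The severity thresholds here are made-up illustrations, not the ones from the talk's alert rules.

    import requests

    MASTER = "http://master.example.com:5050"  # placeholder master address

    snap = requests.get(MASTER + "/metrics/snapshot", timeout=10).json()

    connected = snap["master/slaves_connected"]
    active = snap["master/slaves_active"]

    # Ratio of active to connected agents; 1.0 (100%) means healthy.
    ratio = active / connected if connected else 0.0

    # Illustrative severity bands; tune these to your own environment.
    if ratio < 0.75:
        print("CRITICAL: only %.0f%% of agents active" % (ratio * 100))
    elif ratio < 1.0:
        print("WARNING: %.0f%% of agents active" % (ratio * 100))
    else:
        print("OK: all connected agents are active")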
Back to that, we have this other chart that sorts all the hosts by their CPU usage, and we can see that the top two are at 100%. So maybe we want to log into those instances and see what's going on, if that's something we're not expecting given the current workload. All right, so cluster detail. Per node we have the master node view: how many tasks are running on that specific master node, the uptime, frameworks, tasks running. And then similarly for the agents: tasks, uptime. Yep, so we're gonna switch it back to slides. Give it back to Ben.

Okay, so thanks, Janet. I'm not gonna go through all of these, but again, these are examples where you can look at particular metrics and then, based on the current value, understand what's happening with your cluster. For instance, if your uptime for the master is low, obviously the master has restarted; and if your number of elected masters is zero, then obviously there's some problem with your cluster.

Okay, so that's from a Mesos standpoint. Next we're gonna look at container-level metrics. So what do we do for traditional apps? Typically you'll have your application running, and you might have a specific agent that pulls metrics from that particular application. When we move to a containerized world, sure, you can still do this with pods, right? Just have a sidecar container pull metrics from the primary container. But the problem we found here is that this approach isn't very scalable, and it's kind of an unnecessary footprint.

So what we did is we built this metrics module for Mesos. The great thing it does is you get the application metrics basically for free. The metrics module is bundled in with the agent as an isolator, and when the Mesos containerizer brings up the container, it exposes two environment variables: STATSD_UDP_HOST and STATSD_UDP_PORT. So as a developer, all I need to do is send my metrics to the address in those environment variables, and the metrics module will capture that information (there's a short sketch of this below). It's also gonna tag it with context, right? So that's the second thing: understanding the context surrounding those metrics. You have things like container ID and agent ID, and that allows you to filter through all these metrics very easily.

So this is the general architecture. As you see here, we have a set of hosts. Each of these hosts has your Mesos agent with the metrics module. We have containers one, two, and three, and we have applications that are writing to those environment variables. Those metrics are sent to the metrics module, and the metrics module then forwards them to the collector, which is a systemd unit that runs on every single host. The collector is also collecting stats from the operating system, and then we expose a metrics API, which you'll see shortly. This allows you to easily poll the API, grab that information, and send it off to whatever backend you choose.

I'll quickly show the environment that we have here. We spun up an environment on AWS. We have a total of eight nodes, so seven private agents and one public agent, and then we've got a variety of different services: Cassandra, Kafka, Kubernetes, and then a Marathon app. This app, all it really does is emit StatsD metrics, so you'll see that as part of the demo later on. Okay, so I'm gonna turn it over to Janet, who's gonna talk about the API.
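Here's the promised sketch of that developer workflow, assuming a plain StatsD-format emitter in Python. The metric names are made up for illustration; the environment variable names are the ones the metrics module injects, and the fallback values are only there so the snippet runs outside a container.

    import os
    import socket

    # The metrics module injects these into containers it launches.
    # The fallbacks below are assumptions for running outside Mesos.
    host = os.environ.get("STATSD_UDP_HOST", "localhost")
    port = int(os.environ.get("STATSD_UDP_PORT", "8125"))

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def emit(name, value, metric_type):
        # Plain StatsD wire format: <name>:<value>|<type>
        # "c" is a counter, "g" is a gauge.
        payload = "%s:%s|%s" % (name, value, metric_type)
        sock.sendto(payload.encode(), (host, port))

    emit("myapp.requests", 1, "c")       # count a request (hypothetical metric)
    emit("myapp.queue_depth", 42, "g")   # report a gauge (hypothetical metric)

The module picks these up over UDP and tags them with the container and agent context before they ever leave the host.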
So using the Metrics API, we can get information about the cluster: about the hosts, the containers, and the applications. This is basically a collection of GET requests you can make to pull that information. Here's an example: it's getting metrics from a specific agent, using the agent ID plus a resource path, and it takes two headers, the Accept header for the response type and an authorization token. And your response is gonna look something like this: you get a list of data points that come back, and then an object of dimension properties. So I'm gonna go into detail now about what these data points and dimensions are.

A data point is a single value you get for a specific metric, from a specific source, at a specific point in time. There are a few attributes inside that data point. The first three are hopefully self-explanatory: the metric name, the metric value, and the timestamp at which you got that data point. And then there's the metric type and the dimensions. The Metrics API exposes two metric types. There are counters, which, like it sounds, count something happening; the value keeps increasing. So, for example, the number of successful tasks, the number of failed tasks, or the number of agent registrations. In comparison, there are gauges, which are values that can increase or decrease over time; you're getting a snapshot of the value at a specific point. That's something like the current memory usage in a cluster, the current disk space being used, or the number of connected agents.

And dimensions are key-value pairs that describe the source of your metrics. A unique set of dimensions tells you information like the host, or, say, the region that these values are coming from. With these dimensions, you can correlate your data points. So you don't just have random values; you can actually link them together into patterns and identify trends over time. Dimensions let you classify your data, and they help you aggregate, group, and filter it. It's a way to manage your data so you don't just have a random collection of numbers.

Here's a slide comparing metrics versus dimensions. Examples of metrics are things like CPU idle, disk operations, and load, whereas dimensions are descriptions of the sources, like the data center region I mentioned. When you combine them, you can make a time series: the data points and dimensions define a pattern, or a plot line, that you can actually put into a chart to look at your data in a more cohesive manner. So in figure one, we have one plot of CPU idle; that's one set of data points coming from one source. In figure two, we have two different plots, because one is coming from a data center in the east and one from a data center in the west. When we plot them together, we can compare them based on their dimensions and do groupings and whatnot.

While using the Metrics API, I noted a few things that may be useful to know. One thing is that you can get an authentication token using a POST request: the auth login endpoint gets you that token, and you send it with every GET request you make following that first request. The timestamp of the data points that come back may vary in format, so if you're trying to parse the timestamp out, you want to look out for longer strings versus shorter strings. Another thing to check is the data point value. It should typically be a number, because that's what you'll plot on charts and compare against each other.
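Here's a hedged sketch of that request and response shape. The cluster URL, agent ID, and token are placeholders, and the resource path is illustrative of the agent metrics endpoint; check the documentation for your cluster version before relying on the exact path.

    import requests

    CLUSTER = "https://cluster.example.com"   # placeholder cluster URL
    AGENT_ID = "<agent-id>"                   # placeholder agent ID
    TOKEN = "<auth-token>"                    # obtained from the login endpoint

    headers = {
        "Accept": "application/json",          # response type
        "Authorization": "token=" + TOKEN,     # auth token header
    }

    # Illustrative resource path for one agent's node-level metrics.
    url = CLUSTER + "/system/v1/agent/" + AGENT_ID + "/metrics/v0/node"
    body = requests.get(url, headers=headers, timeout=10).json()

    # The response carries a list of data points plus a dimensions object.
    for dp in body.get("datapoints", []):
        print(dp["name"], dp["value"], dp["timestamp"])
    print(body.get("dimensions", {}))

Note the loop is printing each data point's value, which is normally a number.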
For example, though, in this data point we got back, the value was not a number. And in general, for sending metrics, if you're making your own metrics and sending custom values, some tips: you may wanna structure your names hierarchically, as a way to organize your metrics. Within those names, use a consistent delimiter, like a period or a dash, which helps with wildcard searches. Also, if you're using dimensions on your metrics, you don't need to put that same information in the metric name; it just clutters the name and makes it longer than it really needs to be. One thing we see our users doing that really works against managing their data is using dimensions with high cardinality, so something like timestamps, really long error messages, or IDs that are only used for a very brief amount of time. You won't be able to use those to organize, group, and aggregate your data, because they're always changing. There's not gonna be a way to link your data together with values that vary that frequently.

So I just wanna show how we send the data to our app. We send POST requests to this URL, with the content type and a token, and then lists of the different metrics that we have, the gauges and the counters. So I'm gonna show the script that we have that gets metrics from our DC/OS cluster and then sends them off to our app. Using that cluster that Ben showed earlier, I'm gonna kick off the script. It will do the login and then send all these GET requests to the cluster every 10 seconds to get everything about the different nodes, the containers, and the apps running in it.

While that is filling into these charts that I built before, you can take a quick look at the script. It just loops over making these requests. We get the hosts from the master, so the master URL here. And then for each host, we get metrics about that host, get the containers inside that host, get information about the containers, and then get all the information about the apps running in those containers. So it's just gathering all this information up and sending it off into something we can use to look at the data a little more easily (there's a stripped-down sketch of the same flow below).

So it looks like that's still going on. Something I did to help myself find what's being sent in is I added a dimension for the source. So I have the source, dcos-metrics, that I can search for, and then it shows me all the matching metrics that I have. And then I built some charts that I thought might be useful for figuring out what's going on in the cluster. Don't see stuff coming in yet. Maybe it's slow; blame the network. But let's take a look at some data we had collected in the past.

So here's a three-hour chunk where we had been collecting the metrics. We'll take a little closer look at the container bytes-received one. It has this message here about there being too much data to show within the real estate I have on my screen. I mentioned several times that we can use dimensions to help us filter this data. So we can get a closer look, maybe by host name; maybe I only want to know the ones coming from here. Then we get a little bit less data, which fits better into the chart, and, you know, we can zoom in on a time span. I can also sum by host name, so that each bar shows me, by host, what is going on. So let me take off the host filter again there. So we have our two hosts mainly doing most of the work. What's going on in there?
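That stripped-down sketch of the polling loop follows. To be clear, this is an assumed reconstruction, not the demo script itself: the login and Mesos endpoints are the standard DC/OS ones as far as I know, the metrics resource paths are illustrative, and the backend ingest URL and payload shape are pure placeholders.

    import time
    import requests

    CLUSTER = "https://cluster.example.com"   # placeholder cluster URL
    BACKEND = "https://ingest.example.com"    # placeholder monitoring backend

    # Log in once; send the returned token with every subsequent GET.
    token = requests.post(
        CLUSTER + "/acs/api/v1/auth/login",
        json={"uid": "<user>", "password": "<password>"},  # placeholders
    ).json()["token"]
    headers = {"Accept": "application/json", "Authorization": "token=" + token}

    def get(path):
        return requests.get(CLUSTER + path, headers=headers, timeout=10).json()

    while True:
        # Walk hosts -> containers -> apps, gathering data points as we go.
        datapoints = []
        for agent in get("/mesos/master/slaves")["slaves"]:
            aid = agent["id"]
            node = get("/system/v1/agent/%s/metrics/v0/node" % aid)
            datapoints += node.get("datapoints", [])
            for cid in get("/system/v1/agent/%s/metrics/v0/containers" % aid):
                c = get("/system/v1/agent/%s/metrics/v0/containers/%s" % (aid, cid))
                datapoints += c.get("datapoints", [])
                app = get("/system/v1/agent/%s/metrics/v0/containers/%s/app" % (aid, cid))
                datapoints += app.get("datapoints", [])
        # Forward everything to the backend in one POST, then wait 10 seconds.
        requests.post(BACKEND + "/ingest", json={"datapoints": datapoints})
        time.sleep(10)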
And then the different visualization types can also help you see what's happening. So in this one, I used the bar chart for JVM memory, with the lines. It's pretty clear there's a sawtooth pattern going on; probably at these points, garbage collection is happening. It's very easy to see that there is definitely a pattern there. For CPU user time, the heat map visualization makes it very easy to see the outlier, this dark green square. And for disk usage, this bar, sorry, the area chart, gives us an idea of blocks of time where disk was being used by different resources. And then some other metrics about Kafka usage. We had a Cassandra client app running also, so some more metrics about that.

Maybe show the one with all the dimensions at the bottom. Which one? Oh, right, so just to get a better idea of what the data coming in looks like: everything to the right of the value column here is all the dimensions we're getting from the different metrics. You can use that information to help you navigate through your data as it comes in.

Okay, so we've got two minutes left before lunch. Key takeaways, right? We touched on all three in our presentation. Scalable capacity: identify the system and application metrics that are meaningful. Dynamic architecture: your app may be composed of multiple elements, so use dimensions to simplify your overall view of the application. And then load balancing: again, this is really just keeping track of how each of the instances is doing over time. If you want more information, we're certainly happy to talk to you; come visit us at our booths. And with that, we'll open it up for questions.

Nobody? Okay. Yeah, so the question is, do we need DC/OS? DC/OS makes it a lot easier, but no, you could take this and roll it yourself; the Mesos metrics module is open source. So the answer is no, but it's easier if you use DC/OS.

Question? Right, so the question is, I believe, about uptime for tasks and jobs and things like that. Okay, so typically that's done at the framework level; whichever scheduler handles that may expose the information. A lot of the new frameworks being built off of the SDK are StatsD-compliant, so they will send metrics automatically, and we can pick that up using our metrics module.

Okay, any last questions? All right, well, thank you guys for coming, and we'll let you go to lunch for now.