Good morning, everybody. Thanks for coming and listening to the talk today. I'm really grateful to be able to share with you some of the things my team and the open-source community have been doing around monitoring Cloud Foundry. I'm Jeff Barrows, and I'd like to answer a question that may be right at the top of your mind as you look at this image. No, that is definitely not me up on top of that windmill. That's somebody who's crazy and does some pretty extreme maintenance work on these machines. I am an engineering manager and technical lead for the Cloud Services team at GE Digital, and I work much more comfortably on distributed compute systems like Cloud Foundry. My team, the Cloud Services team, is responsible for building and running the Cloud Foundry ecosystem for a GE service called Predix. Predix is a cloud platform. It allows developers to build and run applications that can interface with some of the things that GE produces, like jet engines.

Again, that guy is not me. He's standing in a test facility with one of GE's latest jet engines. Our aviation division makes state-of-the-art jet engines that are used on many of the commercial jets flown today. In fact, every two seconds, an aircraft powered by GE jet engines is taking off somewhere in the world. And at any given moment, like right now, there are 2,200 aircraft in the air carrying over 300,000 people. It's kind of crazy to think that there are actually 300,000 people above the Earth at any given moment in time. But GE is powering some of those jet planes.

Our transportation division makes locomotives. A brand-new, state-of-the-art GE locomotive features hundreds of sensors and generates hundreds of thousands of data points a minute. By connecting these rolling sensors to Predix, we can develop apps that help unlock new efficiencies in rail transportation systems. It's estimated that every 1% increase in rail efficiency in the US today is worth about $1.8 billion.

We also make many of the machines that make up oil and gas production facilities: things like industrial compressors, generators, and pumps, all the things that are critical to the safe extraction and production of oil and gas around the globe. Efficiencies in these systems, gathered through insights created by software, help extend running periods of these machines, minimize production disruption, and reduce costs.

Our power and water division makes these massive gas turbines that generate electricity. This one alone generates enough electricity to supply half a million homes with power. GE equipment generates half of the world's installed power base, and 80% of the electricity flowing in North America is controlled by GE systems. And you guessed it, those four guys working on that turbine are definitely not me.

This is not my brain, but GE Healthcare makes some of the most advanced imaging systems in the world. By connecting these MRI machines to GE Health Cloud, which runs on top of Predix, we can provide better, faster diagnostics and provide actionable information to patients more quickly, which directly impacts their lives and their quality of care. So, as you can kind of get the point, by connecting a lot of these industrial machines to the Predix platform, we're enabling brand-new efficiencies and opportunities that only a connected software platform can unlock. It's truly an exciting time to be working at GE Digital.
If you have any questions about it, please feel free to ask me. The reason I'm talking about Predix today is because it's built on a number of components. GE Digital software engineers build a whole bunch of industrial microservices that can be composed and used to build some pretty awesome industrial applications. But Predix is also built on top of Cloud Foundry. As you're probably aware, Cloud Foundry makes it really easy for app developers to quickly deliver production-grade applications to market. Over the past 16 months or so, our team, the Cloud Services team, has built a global Cloud Foundry deployment footprint. We've enabled thousands of developers to start writing industrial applications, and we're currently running tens of thousands of application instances across the globe today. So it's been quite a journey. But we didn't just get here magically overnight. So how did we get here?

It all started pretty simply, and it probably started the way it's starting for many of you running Cloud Foundry today: it started with a POC. We began with a greenfield: an empty AWS VPC, an account, and some credentials. We were fortunate enough to be able to work with Dr. Nic and the Stark & Wayne folks, and in a few months we had a full-blown dev environment up and running in our Amazon environment. With the help of a handful of app devs at Digital, we showed that we could get a Brilliant Factories application MVP up and running really quickly. So after proving the point that Cloud Foundry was indeed helping developers go really fast, we got a challenge from leadership: hey, listen, can you actually deliver four production applications to a paying customer in less than three months? And we were like, oh man, it's getting serious. So it's game on; it's time to productionalize Cloud Foundry.

Being the operationally inclined kind of guy that I am, I thought, well, first things first: let's get some monitoring and telemetry on the system. Let's see if we can get some kind of metrics out of this black box called Cloud Foundry. Being new to Cloud Foundry, it was time to start digging into that black box to figure out how things worked. But have you actually looked at all the things that comprise Cloud Foundry? You're like, holy crap, that's no moon, it's Cloud Foundry. You realize it's composed of 12 different subsystems that make up the underpinnings of Cloud Foundry: Cloud Controllers, NATS servers, GoRouters, runners, Loggregator, databases, and BOSH. So you're like, such wow, this is gonna be a crazy adventure. This is gonna be a really challenging thing to get our hands around.

So where did we start? We started pretty simply, with CloudWatch. We were deploying on AWS, so we had CloudWatch stats for free. We took a peek at CloudWatch and we were like, all right, great, we can see all the different components that make up Cloud Foundry now, and we have a general feel for the Linux stats coming off of them. But we quickly realized that CloudWatch wasn't gonna be our long-term solution, for a number of reasons. We wanted a unified monitoring approach across both AWS and data center deployments, and we wanted something with a more flexible metrics collection and rendering service. So we pretty quickly moved away from CloudWatch. So, time to build. The engineer inside me says, awesome, this is gonna be so much fun. It's gonna be a great project.
I get to learn all the cool new monitoring solutions that are out there today. And then the engineering manager inside me says, no, this has the potential to be a black hole. We may not be able to make something that's actually gonna work. I'm really nervous about approaching this adventure. So what did we do?

First we started detailing some high-level goals for our metrics and monitoring collection system. We believed the system should be like a utility service: it should be highly available, it should be ubiquitous, it should be everywhere, and it should be as easy for anyone to use as turning on a light switch. Coverage should be automatic. A system should be born with monitoring coverage, it should be driven by configuration management systems, and when the system retires or dies for whatever reason, it should automatically be removed from coverage. It should be extensible, so as we get good at the basics, we can easily add increasingly sophisticated capabilities. We knew how to monitor Linux systems really well and get the base statistics out of them, but Cloud Foundry was a little more abstract; we didn't know exactly how we were gonna interface with the system and pull data that could show its health and well-being. Lastly, it should integrate really well with our existing configuration management tools. We naturally use BOSH for all of our Cloud Foundry deployments and for some services deployments, and we use Chef quite a bit for service deployments and some supporting systems as well.

So we took a look at the current state of monitoring back in 2014, and after a quick bake-off, we decided to build an MVP solution using Sensu as the monitoring framework, Graphite as the metrics collection system, and Grafana for data visualization.

So let's talk a little bit about Sensu. What is Sensu? Aside from being the Japanese word for a folding fan, which is where the logo gets its inspiration, it's a composable framework. With it, you can do things like execute service checks, send notifications and alerts, and collect metrics, and you can drive all the setup and configuration using configuration management tools. So it checks all the boxes for the things we wanted to do.

Let's give a quick overview of the Sensu architecture. It's composed of a couple of different layers. The first is the Sensu server layer. It's an n-tier, stateless system; you can deploy as many Sensu servers as you want. The Sensu server is responsible for publishing the check requests and then processing events as they come back. There's a RabbitMQ cluster; we run a multi-node RabbitMQ cluster with distributed and replicated queues, so it's tolerant of node failure and availability-zone failure. Then there are the Sensu clients: a Ruby client that gets distributed out to all of the machines we want to have coverage on. They execute checks and report data back to Sensu for processing. There's also a Redis cluster. Redis keeps track of a couple of things, but mainly check state, so you can do things like occurrence-based checks; you can say, if my CPU's been over threshold for the last three checks, then go wake somebody up. And then lastly, the Sensu API servers. These expose the REST APIs that interface with the Sensu subsystems.
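For a flavor of that API, here's what poking at it with curl might look like, as a sketch; the host name is hypothetical, and 4567 is the Sensu API's default port:

    # List registered clients, defined checks, and current events:
    curl -s http://sensu-api.internal:4567/clients
    curl -s http://sensu-api.internal:4567/checks
    curl -s http://sensu-api.internal:4567/events

Each endpoint returns JSON, which makes it easy to script against.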
Those APIs are what all of the different Sensu admin dashboards are built on; the dashboards talk to the Sensu APIs to pull a list of clients, pull lists of checks, and get check data. They can also be used to integrate with third-party systems, which is pretty cool.

So let's walk through how we actually execute a service check using Sensu. First, the Sensu server publishes a check request to subscriber queues on RabbitMQ. The Sensu clients that are configured to subscribe to that particular queue see the message published to the queue, take it off, execute the check, and then publish a response back to RabbitMQ. The Sensu servers process that check response, and as I mentioned before, we can scale that tier out for scalability and resiliency. The Sensu server processes the event, triggers actions if any are configured, and then updates Redis with the check state.

Let's take a quick look at what a service check actually looks like. This is the anatomy of a service check. It's really simple: it's a command or script which runs and outputs data to standard out or standard error. If you're familiar with Nagios health checks or Nagios plugins, it follows the same standard; if you have a fleet of Nagios scripts that you've collected and built over time, you can drop those into Sensu and run them right away. It's all supported right out of the box. Basically, the command or script runs and produces an exit code: zero is OK, one is warning, two is critical, and three is custom. We use an exit code of three to indicate that we have a metrics-type response for the check. The check can also attach an optional response payload, usually JSON, to the response that goes onto the message bus. You define a list of subscribers; this is the list of nodes that should be interested in running that particular health check. Then there are handlers, which take action on events if any are configured. And lastly, there's a check interval, so you can specify how often you want the check to run.

This is a quick overview of what an actual check definition looks like: simple JSON. You can see it's got the check name, which is check_disk_usage. It's got a couple of flags in there for warning and critical thresholds. It's got the subscriber defined as production_dea, so this would run on all of our production runner nodes. The handler is configured as PagerDuty, so if this goes bump in the night, it's gonna go wake somebody up. And it runs every 60 seconds.

So, handlers. I've mentioned handlers a couple of times, and it's really hard to overstate the power and flexibility of the handler construct within Sensu. Handlers are basically actions executed by a Sensu server when events are received: things like sending a page to PagerDuty, sending a metric to Graphite, integrating with Flowdock, or maybe sending an email. There are four primary handler primitives. There's the pipe handler type, which is an external command that gets run. It can consume the JSON payload that the event response puts onto the message bus; you can parse it in any language you want, bash, Ruby, whatever you're most comfortable with, process that data, transform it, and then do something with it, like send an email or integrate with Flowdock. The second type is the TCP/UDP handler type, which knows how to make a network socket connection to an external system. This is how we get stats shoved over to our Graphite system today.
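To reconstruct roughly what was on that slide, a check definition along those lines looks like this; the plugin command and threshold flags are illustrative, but the fields match what's described above:

    {
      "checks": {
        "check_disk_usage": {
          "command": "check-disk-usage.rb -w 80 -c 90",
          "subscribers": ["production_dea"],
          "handlers": ["pagerduty"],
          "interval": 60
        }
      }
    }

And, under the same assumptions (host and command names are made up), minimal definitions for the two handler types just described might look like this; the tcp one is the shape of handler that shoves check output over to Graphite:

    {
      "handlers": {
        "pagerduty": {
          "type": "pipe",
          "command": "handler-pagerduty.rb"
        },
        "graphite": {
          "type": "tcp",
          "socket": { "host": "graphite.internal", "port": 2003 }
        }
      }
    }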
So it's a pretty powerful construct. The third type is the transport handler type. If you wanted to, you could have a second named queue on RabbitMQ, publish a message to that queue, and then have external resources watch that queue in a pub/sub model and pull events off for extra third-party integrations. The last one is really a combination of all of these: if you want to take multiple actions on an event, you can say, I want to send the stat over to Graphite, and then I want to page somebody and wake them up, and maybe put a message on a message bus for a third-party system.

So, metrics and metrics collection using Graphite and Grafana. Graphite is an open-source system that allows us to collect, store, and render time-series data. It's got a simple line protocol that basically consists of the metric name, the timestamp in epoch format, and the metric value. You can literally netcat that out to a socket and get metrics persisted in Graphite. It's got flexible storage backends: it supports Whisper flat-file databases, which is what we use today. We're handling hundreds of thousands of metrics a second using Whisper, and we know there's a limit to that runway; we're not gonna be able to support much more unless we really scale it out. It also supports InfluxDB, OpenTSDB, Cyanite, and others. And it's got a really robust API for metrics retrieval and analytics functions, so you can execute a curl command, get data back out of Graphite, and have it do things like averaging or percentiles or sums, which is pretty cool.

We'll walk through how Sensu gets metrics into Graphite real quick. As you'll remember from before, the metrics check is scheduled, the clients run that check, and then they publish the event back to the message bus. The Sensu servers are configured with a Graphite handler, which knows how to make the TCP connection to Graphite. A server processes the metrics event and then connects directly to what's called carbon-relay, a Python process that's basically responsible for metrics ingestion and routing. It knows how to get that metric off the wire and into a persistence layer. There are a number of cool things it can do to help distribute storage using consistent hashing and replicas. In this case, it uses consistent hashing to send that metric out to the carbon-cache and Whisper layer: it sends the metric to three nodes that are split across multiple availability zones, to support that level of fault tolerance. And then it's written to the Whisper flat file. So now your metric is off the wire and on the disk.

We want to be able to look at those things, because what good is a metric if it's just stored on a disk? So we use Grafana. Grafana is a really awesome web interface that lets you build these really cool dashboards and KPIs. It knows how to talk to the Graphite API, and the Graphite API knows how to pull metrics off the disk and push them back into the dashboard for visualization. Lastly, we also have a Sensu client Graphite metrics health check that can execute similar requests against the Graphite API, pull metrics out, and do some thresholding and alerting off of those.

So, great. What we have now is a monitoring system and a metrics collection system: we can execute health checks and metrics retrieval checks, and we can get that data shoved into a time-series database.
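As a quick aside, that line protocol really is as simple as it sounds. Pushing a single data point by hand might look like this; the relay host name and metric path are illustrative (the path follows the naming anatomy described next), and 2003 is carbon's default plaintext port:

    # Format: metric-name value epoch-timestamp, one metric per line
    echo "production-usw.runner_z1.0.cpu.user 42.5 $(date +%s)" | nc carbon-relay.internal 2003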
But how does that help us monitor Cloud Foundry? One of our original goals was automatic coverage of all the things. So what we did was create a BOSH release of the Sensu client. We bundled up all the Sensu client bits, the Ruby parts of it, and all of the health checks that have to get executed across the fleet, and we included that Sensu client job in all of our BOSH deployments. So anything that BOSH deploys, whether it's Cloud Foundry or any of the tiles or other things we use BOSH to deploy, also includes the Sensu client job. Now every node that gets deployed and pushed out using BOSH has coverage. We configure it to belong to the "all" group. That's just a default, just a word; we could have called it "digital," we could have called it whatever. But it basically allows us to capture all the base Linux statistics for every node that BOSH pushes out, which is pretty cool. We capture things like CPU utilization, network utilization, memory, disk, and such.

A quick note on metric names, and this probably can't be overstated: setting a naming standard for metrics is one of the most important things you can do as you plan your Graphite deployment. If you do it in a rationalized way, it will enable you to do things like wildcard aggregation of statistics, which is super nice to have in a distributed-systems world like Cloud Foundry's. It also minimizes maintenance of our dashboards and KPIs, so things are kind of self-maintaining as we scale out subsystems of Cloud Foundry.

This is the anatomy of one of our metric names. We base our names on the BOSH deployment. The first part is the BOSH deployment name; this one is our US West production Cloud Foundry deployment. The second component is the actual BOSH job name. If you're familiar with looking at BOSH manifests (I'm sorry), you'll recognize the different sub-components that comprise the deployment. This one is runner_z1, a runner job for availability zone one. Then there's the instance number, or index, of that job; if your deployment has 10 runners within availability zone one, this will range from zero through nine. And the last part is the actual metric name, which is generated by the metrics check that you've scheduled to run. This one in particular grabs eth0 transmit-bytes statistics.

This is a dashboard I was gonna walk through real quick to show how we actually construct dashboards, but it's a basic example of what you can capture just by getting the base Linux statistics. And some of the cool things, I don't know if it's really easy to see here, but what we've done is, for each of the Cloud Foundry sub-components, we've created a high-level KPI that captures these Linux statistics. Using wildcard aggregation with our metric definitions, we can get CPU utilization or memory utilization across the entire fleet of, say, the runner pool, and then we can do things like percentiles on top of that. So we can see where the outliers are and where the common cases are, and drill into those if we need to. It's really easy to basically rinse and repeat.
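To sketch what that wildcard aggregation looks like in practice, these are the kinds of Graphite target expressions such fleet-wide panels can be built from; the paths follow the anatomy above, but the exact names are illustrative:

    # Average CPU across every runner instance in AZ1 (wildcard on the index):
    averageSeries(production-usw.runner_z1.*.cpu.user)
    # 95th percentile across the same fleet, to surface outliers:
    percentileOfSeries(production-usw.runner_z1.*.cpu.user, 95)
    # Total eth0 transmit throughput across the whole job:
    sumSeries(production-usw.runner_z1.*.eth0.tx_bytes)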
You define your standard panels for one set, and then you can just go through, change the names, and generate quick dashboards for all of the others, which is great. But it still doesn't give us a lot of depth into the Cloud Foundry subsystems, right? Now we can monitor all the stuff Linux is spitting out, CPU, memory, network, and disk, for everything. It gets us a little further than what we had with CloudWatch, because we have a nice environment for generating dashboards and exploring that data more flexibly. But we really need to start peering into the health and well-being of the Cloud Foundry subsystems themselves.

After a bit of research, we came across a few open-source tools that helped us crack that case. The first one we used is the Collector. It's a Ruby program, an open-source project. It listens on the NATS bus for CF subsystem announcements, polls the subsystems' /varz and /healthz endpoints, parses that data out, and pushes it directly to the carbon daemons for the Graphite system to persist to disk. Unfortunately, that's being phased out, and pretty rapidly; I just heard that as of the v236 Cloud Foundry release, a lot of the subsystems no longer support those /varz and /healthz endpoints, which kind of sucks. But there's hope: they're changing the model and starting to publish statistics to Loggregator, and you can build nozzles that attach to the logging subsystem and start parsing that out. So we're actively working on moving in that direction so we can continue to get all the stats we need from Cloud Foundry.

But the Collector gives you a ton of data. This is a sample dashboard that we created. It's based on a lot of the good work the PWS folks have done, and a lot of the documentation up on the Cloud Foundry website about monitoring Cloud Foundry. I followed those outlines pretty closely and built this kind of stoplight-style dashboard. It gives a really high-level overview of all the different subsystems in Cloud Foundry: things like the total number of runners, the expected number of Cloud Controllers and UAA servers, total DEA memory used, and routes. The yellow one is pretty interesting; it's the available memory ratio, which helps us know when we need to scale out and add more capacity to the runner layer. We could spend tons of time just talking about this one. This is kind of the high-level view; this dashboard's up in our NOC area so people can look at it and get a general sense of, hey, is everything pretty green? Yep, everything's pretty green.

If you need to, you can continue to drill in and make specialized dashboards for each Cloud Foundry sub-component. This one shows some detailed router stats: things like total routes, and HTTP response codes both from your DEAs, the applications that are running, and from the core Cloud Foundry components like your Cloud Controllers. If you see a lot of 500s coming off a Cloud Controller, maybe there's something wrong with that Cloud Controller. It also shows CPU aggregation and some other things that might be of interest, like the total network throughput coming across your whole GoRouter fleet. You can do aggregations like that, which is pretty cool. And yep, that's a sweet dashboard, but I'm not watching that dashboard all the time, 24 by 7.
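As a taste of the mechanics behind what comes next: pulling an aggregated series back out of Graphite is a single HTTP call against its render API, and anything that can make that call can alert on the result. The host and metric path here are illustrative:

    # Last five minutes of fleet-wide 95th-percentile CPU, as JSON:
    curl -s "http://graphite.internal/render?target=percentileOfSeries(production-usw.runner_z1.*.cpu.user,95)&from=-5min&format=json"
    # The response is a list like:
    #   [{"target": "...", "datapoints": [[value, epoch-timestamp], ...]}]
    # so a check script can parse the datapoints and exit 1 (warning) or
    # 2 (critical) when they cross a threshold.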
So how do we actually take actions off of that thing? That's where we developed the Sensu HTTP health check that can query the Graphite API. We schedule the HTTP health check on a couple of different Sensu clients. It knows how to construct basically a curl statement: you can go into those dashboards you created, pull the definition out, put it into a curl statement, and fire it off against the Graphite API. Graphite goes and retrieves the metrics, doing whatever type of aggregation you have in mind; you can get an individual data point or a set of data points, or you can average or sum, and the response comes back to the Sensu client as a JSON payload with the details. And then Sensu can actually take action on that: you can define thresholds and do all the good things you can do with handlers, like wake somebody up with PagerDuty.

And that's about it. I'm sorry I only had a few minutes to give you a high-level overview of this, but hopefully it gives you a better idea of how we're using Sensu and Graphite as the backbone of our monitoring solution. We're working on getting the BOSH release of the Sensu client up on our GitHub, so it should be publicly available; hopefully that'll give you a head start so you can get working with it. I know the Stark & Wayne folks have helped put a BOSH release of the Sensu server out in the ecosystem, so that's out there too. And we look forward to putting up dashboards and templates and other components on GitHub as soon as we get those cleared by our lovely legal department. I think we have little or no time left, maybe a minute or two for questions, but otherwise, thank you very much. Thank you. Thank you.