All right, I'll go ahead and get started. My name's Adam Hevenor. I work for Pivotal, and I'm the product manager for Loggregator. We have some pretty dense content coming up, so I uploaded these slides to the CF Summit site; there are a bunch of links you can get from them. I'm also on Twitter at @ahevenor, and I've put together a couple of Medium posts that distill some of these thoughts in written form as well. So feel free to reach out to me on either of those networks, and I appreciate your interactions there. Let's use the fancy clicker, maybe.

Before we get started, I just want to express my gratitude to the Foundation and to Pivotal. Loggregator has a lot of very interesting problems, and Pivotal pays me to work on them, so it's a really fortunate and virtuous cycle that I find myself in. I'm really thankful to be part of such a great community, and it's fun to be here and meet a lot of different folks.

For today's agenda, we're going to start by talking about what service level indicators we like to look at for Loggregator. I'll talk in particular about working with Pivotal's SRE team, which manages two major environments: Pivotal Web Services and Pivotal Tracker. We're looking at service level indicators on both of those, and we'll see how we're meeting service level objectives. We're exceeding our service level objectives in one of those environments, and struggling a little more to keep up with the goals we've set for ourselves in the other. I'm also going to spend some time on the future of Loggregator and the V2 API. I was just asked some questions about how you can learn what Loggregator is working on, so I'm going to share the resources we use for publishing our feature proposals and tracking our work. If you didn't attend the CF Summit in Santa Clara, I gave a talk there called Improving Message Reliability in Loggregator. We've made pretty dramatic improvements in message reliability over the last year. Some of that content is repeated here, but the talk is available on YouTube and is a great primer for the improvements we've made.

Before defining a service level objective for Loggregator on your platform, there are a few things we think operators should consider. The first is how many developers your platform is supporting and what their expectations are for the CF log stream. Are they writing tests against that log stream? Are they expecting a certain amount of reliability when testing against it? Those are important questions to ask yourself. Additionally, are you using syslog drains on your platform? Many operators, especially in multi-tenant environments, use syslog drains. They're an important tool for long-term storage of your logs, and really the preferred way to store app logs long term. Also, are you using the firehose? As you mature your operational abilities and start troubleshooting, you're certainly going to want to use the firehose, and there are tools for monitoring the firehose itself, with or without relying on the firehose. And last but not least, consider what compliance or auditing needs your business may have.
With any service level objective, 100% is not a realistic target, so be prepared to have that tough conversation with your business stakeholders. Also know that whatever objective you set for Loggregator is only as good as your network and your infrastructure. Being able to measure reliability over the course of the year and report on how well you're doing is the foundation of the roadmap Pivotal Web Services has been on, which we'll walk through.

For each of these levels, we have a bit of tooling to walk through. If you think about monitoring and defining service level indicators much like a pyramid of testing, the base of the pyramid for Loggregator is monitoring your log stream. I'm going to share a tool that makes that easy and simple, something any operator can get set up quickly on the platform. Monitoring syslog drains is a little more challenging, but if you're using syslog drains, especially for an important business function, it's something you should also invest in tooling around. I'll share a repo with some syslog servers and other pieces you can easily deploy to create black-box monitors. And last but not least, if you are monitoring the firehose, there's a BOSH add-on called the BOSH HM forwarder that forwards the BOSH Health Monitor metrics for each of the BOSH-deployed components on your IaaS. This is an important tool. By default it emits those metrics into the firehose, so they're typically viewed through the firehose, but you can look at them independently. That makes it a good way to verify, outside of the firehose, that your BOSH deployment is up and running; if Loggregator has a problem but the rest of the platform is still functioning, you can account for what's going on.

Like I said, at the base of the monitoring pyramid is monitoring your log stream. A really simple, black-box approach is an application that both produces logs and consumes those logs. We created a simple application called cf-logmon, and kudos to the Pivotal Labs team, who helped us put a very simple interface on it. You can deploy this app with cf push; it doesn't require operator scope or firehose access, and it gives you a simple numerical value for how reliable the log stream is. You can do a couple of things with it. You can set different rates of emission, and you can use a new service that Pivotal is in the process of open-sourcing, which takes metrics produced by an application and emits them back through the firehose. If you're interested in monitoring through the firehose with your existing tooling, cf-logmon will let you do that; you'll just need that metrics-forwarder service to get the measurements into the firehose.

One thing we found working with the Pivotal SRE team is that their needs exceed the capabilities of cf-logmon as a prepackaged application, so we implement our black-box monitoring using a pipeline. We use Concourse, which lets us run a consistent set of monitors across a number of foundations; we currently monitor five.
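To make the black-box idea concrete, here is a minimal sketch of the measurement loop, my own simplification rather than cf-logmon's actual code. It assumes something upstream (for example, a `cf logs` stream or a syslog drain) feeds the emitted lines back into the `received` channel:

```go
package main

import (
	"fmt"
	"time"
)

// measureReliability emits `total` sequenced log lines at `rate` lines per
// second (CF apps log to stdout), then counts how many come back on
// `received`. The ratio is the reliability figure a monitor would report.
func measureReliability(total, rate int, received <-chan string) float64 {
	interval := time.Second / time.Duration(rate)
	for i := 0; i < total; i++ {
		fmt.Printf("logmon-seq-%d\n", i)
		time.Sleep(interval)
	}

	// Give stragglers a moment to arrive, then tally what came back.
	deadline := time.After(5 * time.Second)
	count := 0
	for {
		select {
		case <-received:
			count++
		case <-deadline:
			return float64(count) / float64(total)
		}
	}
}
```

At two nines you would expect this to return 0.99 or better over a 24-hour run; cf-logmon packages the same idea behind a simple web interface so that cf push is all you need.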
This is a first look at the service level indicators we've been using for each of our environments. With service level indicators, it's really helpful to work with your operators to set a simple indicator that can represent the overall health of the system. As I said, we're working our way up the pyramid, and relying on the log stream to assess whether your Loggregator deployment is accurately scaled is a really good first step.

We've started to target a high level of reliability for Pivotal Web Services. Over the two days before I grabbed this screenshot of our monitor, we had a perfect score: we didn't drop any logs in that period. That's pretty unrealistic to sustain; we haven't seen it hold for longer than a few days. Expanding out to a week, we hit the target you see below, which is three nines and an eight. What we've been landing on, and what we're shooting for in our first year of consistent monitoring, is two nines: 99% reliability over the course of a year is the target for the Pivotal Web Services environment.

I mentioned Pivotal Tracker; that's the third one on the list, and as you'll see, we're not currently meeting our reliability target there. We're working with the Pivotal SRE team to understand why. We try to take a hands-off approach so that they can consume our documentation and scaling recommendations and we can see whether those are actually effective. So we're working with them to understand why that deployment isn't seeing the same reliability as Pivotal Web Services.

Rolling your own black-box monitors is really not too difficult. The loggregator-tools repo is where you can find all of these things: logmon is included there, there's an application called logspinner that produces logs at known rates, and there's a link to our CI, which has pipelines that run these checks on regular intervals for you.

As you move up the pyramid, you'll want white-box monitoring as well, which lets you peer into the Loggregator system. We've found a couple of ways this is especially useful. One is monitoring infrastructure metrics you're probably already familiar with: CPU and network I/O are especially important for Loggregator, so checking each component's CPU and network I/O is often where we look when investigating why something isn't meeting the service level objectives we've set. The other thing we often look for is the ingress into the system and what's creating any high-volume ingress. We do that by looking at Metron ingress per BOSH job name. On a typical platform, the Gorouter and the Diego cells produce most of the traffic into Loggregator. That view is often how we identify a noisy application: we look for a particular cell whose ingress has changed in the last 24 hours, or around when an incident occurred, or for a particular service that's sending more into Loggregator than usual.
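As a sketch of that white-box view, per-job ingress is just a running count keyed by job. This assumes you're tallying firehose envelopes, which carry deployment, job, and index metadata; the struct here is a stand-in, not the real envelope type:

```go
// Envelope is a stand-in for the firehose envelope type; the real one
// (sonde-go's events.Envelope) carries Deployment, Job, and Index fields.
type Envelope struct {
	Job   string // e.g. "router" or "diego-cell"
	Index string // which instance of that job
}

// ingressByJob tallies envelopes per BOSH job so a noisy source stands
// out. Comparing a cell's count against its 24-hour baseline is how a
// noisy application typically shows up.
func ingressByJob(envelopes <-chan Envelope) map[string]uint64 {
	totals := make(map[string]uint64)
	for e := range envelopes {
		totals[e.Job]++
	}
	return totals
}
```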
As a rule of thumb, if a service is exceeding Diego or the Gorouter, it's probably doing something it shouldn't, because that's pretty uncommon in the deployments we've had experience with.

I mentioned syslog drains. The syslog drain functionality has been separated out into its own release, with a small asterisk: we haven't been able to include that packaging in cf-release, because of some challenges with BOSH links and how we've done service discovery; we moved away from etcd in this release. It won't come to the open-source distribution until cf-deployment goes GA. But this was a ground-up release for us, so the white-box monitoring is a little fresher and was designed with scaling, and with appropriate indicators, in mind. On this monitor you'll see a specific display for the max number of drains over a time period. The syslog drain release has a really simple scaling mechanism: each syslog adapter can handle 500 drains. We haven't done it, but you could certainly autoscale the syslog drain release off that indicator. We're not really sure that's a good idea; autoscaling is a bit of a slippery slope. But we use a monitor to tell us when that deployment needs more headroom and we need to scale it.

What we found working with Pivotal Web Services, with the benefit of hindsight, is that the way to set a reasonable service level objective is roughly the following steps. First, ensure you have a monitoring solution that gives you the precision you need. You really need to be able to look past the decimal point and say you're hitting a specific number of nines for your reliability. When we started this process, most of our monitoring was white-box, and because of the challenges of distributed math, we might see 98% reliability for one minute and 101% the next. You can average that out to 99%, but that doesn't give us the precision we need to start setting service level objectives. So we came up with the black-box solution. Hopefully, if you deploy the cf-logmon application, you can achieve two nines of reliability over your first 24-hour period.

From there, it's about repeating and expanding the time period. First we achieved a week at 99% reliability, and as we spent time on the platform, we started to see 24-hour periods where we could add nines to that 24-hour target. As I showed, we have now exceeded our monitor's precision, in that we regularly have 24-hour periods where we're perfect and don't drop any logs. But it may take some scaling and adjustments to your deployment to add another nine to that 24-hour period. As you expand to something like a 30-day window, you'll likely have a deployment or other platform updates that eat into your error budget. So for the first 30 days, if you can get 99%, you're doing well. It's taken us quite a while to add nines over a 30-day period; I don't think we've actually achieved higher than 99% over 30 days, though as I mentioned, we have added nines to our one-week target.
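The arithmetic behind "adding nines" is just received over emitted, measured at enough precision. A toy example of how a monitor might report it, with hypothetical sample counts:

```go
package main

import (
	"fmt"
	"math"
)

// countNines reports how many leading nines a reliability figure reaches:
// 0.99 -> 2, 0.999 -> 3, and so on. Capped at 9 to avoid float noise.
func countNines(reliability float64) int {
	n := 0
	for n < 9 && reliability >= 1-math.Pow(10, -float64(n+1)) {
		n++
	}
	return n
}

func main() {
	// Hypothetical counts from one week of black-box monitoring.
	emitted, received := uint64(10_000_000), uint64(9_998_000)
	reliability := float64(received) / float64(emitted)
	fmt.Printf("%.4f%% (%d nines)\n", reliability*100, countNines(reliability))
	// Output: 99.9800% (3 nines), i.e. three nines and an eight, like the
	// one-week figure above. A single deploy can still eat the remaining
	// error budget over a 30-day window.
}
```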
So, to where the Loggregator team and the Pivotal SRE team are with our service level objectives: because of moving tooling around and a couple of adjustments to our monitors, we're just at the six-month mark, achieving 99% reliability over a six-month period. Our goal is to achieve that over a year, and to be able to confidently say it's a scaling approach operators can reproduce. You may have targets above 99%. If you're considering that, weigh the deployment implications: it will limit your ability to regularly update CF, and that may run contrary to the goals you have for your platform.

Another important input to your reliability targets is capacity planning: how much log volume your deployment can handle. Over the summer, I conducted a series of experiments to determine exactly how much load a single Doppler can handle. Through these experiments, which I was able to repeat consistently using pipelines, we found that a single Doppler can handle just about 30,000 envelopes per second before it starts to exceed the 1% loss target I mentioned. Based on that ceiling, we actually recommend 10,000 envelopes per second as the scaling indicator for the number of Dopplers. I know this image is hard to see, but we built a monitoring tool that tracks the max envelopes per second into a Doppler and uses that 10,000-per-second figure to understand each Doppler's capacity. Most monitoring tools will let you pull that max figure, and it's pretty simple math from there. One reason we're able to achieve such high reliability on Pivotal Web Services is that we have plenty of headroom on Dopplers. We have quite a few of them, and typically they aren't spiking even to 40% of capacity. It's actually been an action item to start scaling down our number of Dopplers, since we've been exceeding our targets.

I've collected these thoughts into a white paper called the Loggregator Operator Guide. It's available in the Loggregator repo and linked from the README. If you want more detail, would like to dig into the math, or want access to all the tools I just mentioned, it's a great paper to check out.

I'm going to switch gears and spend the rest of the talk on what we're planning for the future. I've had a number of questions about new ideas for Loggregator and how we communicate what we're working on, so this is a good forum for me to do that, and I'm going to take the chance. Before I jump into specifics: strategically, the approach we've been trying to hold ourselves to is an agile one. We've had a couple of big ideas. I've posted some to cf-dev around giving developers access to metrics, along with a few other custom app metric ideas we think are really exciting. But when we've gotten together with other teams, it's been challenging to figure out where to start, and which experiments could take some of the risk out of a larger initiative. So I'll come back to this slide and talk about a few of the small things we've done recently that have helped inform what that big vision looks like.
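The capacity math is simple enough to sketch. Under the rule of thumb above (plan one Doppler per 10,000 envelopes per second of peak ingress, comfortably below the roughly 30,000-per-second ceiling we measured), the Doppler count falls out of a ceiling division; the peak figure here is hypothetical:

```go
package main

import (
	"fmt"
	"math"
)

// recommendedDopplers applies the rule of thumb from the operator guide:
// one Doppler per 10,000 envelopes/second of peak ingress, well under the
// ~30,000/second ceiling observed before loss exceeded 1%.
func recommendedDopplers(peakEnvelopesPerSec float64) int {
	return int(math.Ceil(peakEnvelopesPerSec / 10000))
}

func main() {
	// Hypothetical peak pulled from a monitoring tool's max() function.
	peak := 42000.0
	fmt.Printf("peak %.0f env/s -> %d Dopplers\n", peak, recommendedDopplers(peak))
	// Output: peak 42000 env/s -> 5 Dopplers
}
```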
Over the summer, I spent quite a bit of time putting together lots of different ideas and feature narratives, prioritizing them across different users and different releases. A couple of key themes emerged. One is that, with improved reliability, nozzle developers were having trouble handling the spikes Loggregator now delivers. Previously we had a UDP protocol; now we have gRPC, and those spikes make it through to the nozzle, sometimes disconnecting it from the traffic controller. Additionally, as I mentioned before, app developers don't have access to the firehose, and they can't get metrics about the dedicated service instances they create, which leaves a monitoring gap for app developers on a multi-tenant system. There are also some missing configurations operators tell us they'd like: some want to put more into the firehose, some want less, so we've been looking at additional tooling and ingress points. And last but not least, I mentioned noisy neighbors. These are really hard to diagnose. We've developed some techniques for figuring them out, but they're nuanced, hard to explain, and still lacking; we often have to SSH onto the virtual machine and look at which processes are consuming resources to conclusively determine what's going on. We can't do it purely through a monitoring tool.

As I mentioned, I put together a handful of feature proposals. I haven't spammed all of them to the cf-dev list, but I've sent out a fair number, and I'll show another slide where we're starting to organize them on the Loggregator repo itself.

Another thing we found helpful in charting a path forward was briefly reviewing where we've been. Part of the transition from UDP to gRPC meant creating a new V2 API. If you're familiar with dropsonde, this is a replacement for dropsonde, and Diego is using it to send logs into Loggregator as of just a few releases ago in cf-release. So this is the new way to emit metrics, and of course logs, into Loggregator. There are some key features I want to highlight. First, as I mentioned, it's built on gRPC, which gives us a more reliable transport: under the covers, gRPC is a TCP-based protocol, so it has acknowledgments. It also requires mutual TLS, so it's a little more setup for a nozzle developer or a service developer, but it provides secure transport.

We've also gotten feedback that there's a need to emit events into the Loggregator system. If you think of metrics as something emitted consistently over an interval, a synchronous process, then logs provide an asynchronous way to record events; but services don't emit logs into Loggregator, only applications do. Events bridge that gap: they allow a service to emit a specific asynchronous event into the Loggregator system. When we were designing the API, I knew Kubo was a new reality. I didn't have time to change the name in time for this presentation, but we've tried to take out the Cloud Foundry-opinionated portions of the API and make it more generic for Kubo, and we'll take a look at how powerful that can be. This is now GA; I mentioned Diego is using it.
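For a sense of what V2 ingress looks like from a component, here is a rough sketch using the go-loggregator client; the file paths and port are deployment-specific assumptions, and option names may differ between versions of the library:

```go
package main

import (
	"log"

	"code.cloudfoundry.org/go-loggregator"
)

func main() {
	// The V2 API requires mutual TLS; these paths are placeholders for the
	// certs provisioned to your component's VM.
	tlsConfig, err := loggregator.NewIngressTLSConfig(
		"/path/to/ca.crt",
		"/path/to/client.crt",
		"/path/to/client.key",
	)
	if err != nil {
		log.Fatal(err)
	}

	// Dial the local Metron agent's gRPC ingress (3458 is the usual port,
	// but confirm it against your deployment manifest).
	client, err := loggregator.NewIngressClient(
		tlsConfig,
		loggregator.WithAddr("localhost:3458"),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Logs, counters, and gauges all travel as V2 envelopes over gRPC.
	client.EmitLog("hello from the V2 API",
		loggregator.WithAppInfo("app-guid", "APP", "0"),
	)
	client.EmitCounter("requests")
	client.EmitGauge(
		loggregator.WithGaugeValue("cache_size", 42, "entries"),
	)
}
```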
For ingress, any component can now use this via the Go Loggregator client. We're also trying to envision a more flexible egress. You're probably familiar with the current egress model. On one hand, we have cf logs: you can create streams specific to an origin GUID, which gives you a small, manageable volume of logs; that's something like a syslog drain or a log stream. On the other hand, there's the everything, all-you-can-eat option. We had a feature request come in on our GitHub repo to provide a little filtering there, so we do have a filter that lets you choose logs or metrics, but it's still an overwhelming amount of data. In fact, that is how we see nozzle development with the V1 API today. So we wanted a way to really improve that experience.

What we've been talking about is selectors, and we chose that word because of its use in CSS. This is not final syntax, but if you're interested in, for example, the event concept I mentioned, you could write a nozzle that says, just send me the events, and subscribe to only those. You might also be interested in a specific metric, so you could specify a type, say, give me all the gauges, and then within that type, a specific metric. Again, this really reduces the volume a nozzle has to handle. I mentioned that app developers don't yet have access to metrics, but this model also allows us to specify a scope, and we're working to have all the metrics emitted by components tagged with org and space details so they can be mapped to a specific scope for a space developer. And then I mentioned Kubo: if you're spinning up a Kubo cluster, you could specify the scope of that cluster, so only logs and metrics from that cluster are sent through that subscription. Internally, Loggregator passes those subscriptions through to Doppler, which creates the subscription stream.

This lets us think about a new, simplified world; this is what we see happening over the next year. There may be a scary moment for some nozzle developers on this slide, but trust me, it's not going to happen too fast. The first step is the V2 ingress I mentioned: get everyone onto the V2 API so that ingress starts from a consistent V2 standpoint. Here's the scary moment, and trust me, we're not about to deprecate the firehose anytime soon. But eventually supporting only V2 egress is our vision for a simpler Loggregator. Recent logs and container metrics are another area Doppler currently handles that's a little out of place from a microservices standpoint, so we see that going away as well. In its place, we're bringing in a V2 Doppler, a simplified version that only speaks V2. Right now, Doppler operates in a mixed mode, providing both V1 and V2 streams. You may ask who is consuming V2 now: V2 subscriptions are currently handled by the cf-syslog-drain release, so that's how we're dogfooding the new API standard. Eventually, that will let us simplify Doppler into what's really a 2.0 version of the Doppler functionality, speaking only V2.
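Since the talk stresses that selectors are not final syntax, here is a purely hypothetical illustration of the idea in Go; every name below is invented for this sketch and none of it comes from the real API:

```go
package main

import "fmt"

// Hypothetical types for illustration only: the actual V2 egress selector
// syntax was still being designed at the time of this talk.
type Selector struct {
	Types  []string          // envelope types, e.g. "event", "gauge", "log"
	Names  []string          // narrow a type to specific metrics, e.g. "cpu"
	Scopes map[string]string // e.g. {"org": "my-org"} or {"cluster": "kubo-1"}
}

type Subscription struct {
	ID        string     // subscription/shard identifier
	Selectors []Selector // Doppler fans these out into one filtered stream
}

func main() {
	// "Just send me the events": the nozzle never sees logs or metrics.
	eventsOnly := Subscription{
		ID:        "my-nozzle",
		Selectors: []Selector{{Types: []string{"event"}}},
	}
	fmt.Printf("%+v\n", eventsOnly)
}
```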
So back to our approach. The big vision, the one that really encapsulates everything, is allowing developers to consume metrics through a developer-segmented firehose. Some of the steps we're taking toward it: first, a suggestion that came in through the community, adding container metrics to syslog drains. That has let us start working on some of the selector specifications I mentioned, using existing tooling, so developers don't have to worry about creating nozzles or a service or any of the other challenges of a full firehose implementation. The next concept we're working on is an event envelope type. As I mentioned, that's a new concept, but it also lets us de-risk the overall goal and start building out the selector and V2 concepts. All of those fall under the theme of the selector subscription model, work that's in flight now and just entering an experimental mode.

The next thing we've come up with to improve the nozzle development experience is a better canonical reference nozzle. Because nozzles have to consume from the firehose and be multi-threaded, implementing them is often challenging for developers. A reference nozzle that isn't tied to a specific metrics platform and simply posts generic bodies is something we see as a powerful tool for nozzle developers to fork and build their own nozzles from (a minimal sketch follows below). And last but not least, thinking about the developer-segmented firehose as a "nozzle as a service" has helped us arrive at a better UX for the developer and a cleaner implementation. That's a feature narrative I've written and posted to cf-dev, and I think it's a better implementation than how I originally described the developer-segmented firehose.

All right, I know that's a lot, and some of you will want to learn more and read the feature narratives. One of the things we've been experimenting with is the GitHub Projects view: I've taken all the features I just described and put them into issues pinned to our GitHub Projects view, which lets us show what we're working on at a macro level. If you want to know more specifically what's in flight, story by story, we use Pivotal Tracker. And if you have specific questions, we're always available in Slack. So that's it. I'd love to answer any questions, and I'll be hanging out afterwards if we don't have much time for them. But I hope everyone enjoyed it.
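Here is the minimal reference-nozzle sketch mentioned above: consume the firehose and post generic JSON bodies to a configurable endpoint. It assumes the noaa consumer library's interface roughly as published; a real nozzle would add batching, retries, and multiple consumers:

```go
package main

import (
	"bytes"
	"crypto/tls"
	"encoding/json"
	"log"
	"net/http"
	"os"

	"github.com/cloudfoundry/noaa/consumer"
)

func main() {
	// Deployment-specific assumptions, supplied via the environment:
	// the traffic controller URL (e.g. wss://doppler.example.com:443),
	// an OAuth token with firehose scope, and a generic HTTP sink.
	tcURL := os.Getenv("TRAFFIC_CONTROLLER_URL")
	token := os.Getenv("CF_ACCESS_TOKEN")
	sink := os.Getenv("SINK_URL")

	// Configure the TLS config with your platform's CA certificates.
	c := consumer.New(tcURL, &tls.Config{}, nil)
	msgs, errs := c.Firehose("reference-nozzle", token)

	go func() {
		for err := range errs {
			log.Printf("firehose error: %s", err)
		}
	}()

	// Post each envelope as a generic JSON body; no metrics platform
	// specifics, so forks can swap in their own encoding and transport.
	for env := range msgs {
		body, err := json.Marshal(env)
		if err != nil {
			continue
		}
		resp, err := http.Post(sink, "application/json", bytes.NewReader(body))
		if err != nil {
			log.Printf("post failed: %s", err)
			continue
		}
		resp.Body.Close()
	}
}
```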