My name is Adam Hevner. I'm the product manager for Loggregator, and I'm going to walk you through improving message reliability in Loggregator. I have a fair amount of content, so I'm going to get going. Before I really get started, a quick thanks to the Foundation and also to Pivotal. I work for Pivotal on the Foundation project, and it's a tremendous honor; I really enjoy it. One thing to note: we work closely with the Cloud Ops team at Pivotal, so when you hear me talk about our production environment, that's the Pivotal Web Services environment, also known as P-dubs.

All right. So on the agenda today: I'm going to try to make all the bad classic computer science jokes for you. We're going to walk through all the boxes and lines of the Loggregator architecture and highlight some of the new functionality. This is not a technical talk, so it will stay pretty high level, but hopefully you'll come away with a good idea of how Loggregator works and the changes we've made over the last year. We're then going to switch gears and talk a little bit about service level objectives for the Loggregator system and some of the monitoring tools we've been working on over the last six months; I have some cool numbers to share and some charts to look at. I'll touch on a couple of strategic and tactical items we're looking at for the upcoming year, and hopefully we'll have a little time to answer questions. I'll be hanging out afterwards as well.

So, as promised, there are two hard problems in Loggregator. You probably know what these are. The first one is naming things. If you've dealt with the Loggregator system or looked at a metric from the Cloud Foundry platform, you'll know that we're still in the adolescence of our naming conventions, and it is hard. It is really hard to go back and change these names once they're established. We'll come back to that. The second is guaranteed order of delivery. If you've worked with the Loggregator system you may not have noticed this, but you probably have: the CLI or a syslog server attempts to order the logs for you, but there's a good chance you'll see logs out of order, especially when they're produced really rapidly, like in a stack trace. And of course, off-by-one errors. We'll take a look at how those apply to our challenges around white-box monitoring, and how we've come to see black-box monitoring as a better way to monitor the Loggregator system.

Before we do the boxes and lines, I wanted to do a quick overview of the Loggregator product charter. There are some distinct and unique aspects of the Loggregator system: it provides a secure and highly available aggregated stream of app logs and metrics without affecting app behavior. That last part is really unique and important to our mission. There are other systems that are transactional in nature; if you're an architect in a scenario where you really do need to disrupt app behavior when a log can't be written, we recommend you use a different system. Our goal is to give you as highly reliable a stream as possible without affecting app behavior. One other thing to call out here is that secure is new for us within the last year. Chip Childers talked about being enterprise ready.
Security of logs in transit is a big part of that, so we'll touch on some of the protocol updates we've made that make using mutual TLS pretty easy with the Loggregator system.

Just a reminder about the user experiences Loggregator supports. Everyone is exposed to Cloud Foundry first through the cf push experience, but that second experience is just as powerful. Hopefully you're all familiar with streaming logs using cf logs. This is a really powerful moment for developers; I think most developers' gears start turning about ways they can debug and monitor using log streams. The challenge for us, of course, is not a single app developer streaming a single stream of logs: we support many app developers streaming many applications. We also provide experiences for operators. If you're an operator, you probably want to manage and monitor all of the logs and metrics on the platform, and you'll probably set up some tooling to store those logs and monitor those metrics. Operators use BOSH, so they'll deploy a nozzle to our Firehose to be able to consume all of the things. It's that simple, I swear. One more feature to cover: binding a syslog drain to an application. If you're an app developer and you want to be able to go back in time and look at logs over a longer period rather than just streaming them, you can set up a syslog drain and store those logs at a syslog server endpoint. Those are the three main experiences. There are a couple of other things, like the --recent flag and an endpoint for container metrics; we'll touch on those a little bit. But let's dive into these boxes and lines. There's quite a bit to go through.

We're going to start by walking through the delivery of a log from an application on the system. Loggregator handles both logs and metrics, so this flow really applies to a metric as well; this could easily be one of the Cloud Foundry components emitting a metric. But for our case, we're going to follow an application running in Diego that's producing a log. And time for more bad jokes: the Metron process listens on UDP, so there's really no guarantee you're going to get any of these jokes. There's also a hidden Princess Bride joke in here; hopefully you'll get it.

So Diego hands the log to the Metron process over UDP, and this is where some new functionality comes in. I mentioned that we are secure by default now. Once the log is received by the Metron process, it uses a protocol that's new to us: gRPC, the standard that Google open sourced. Just a quick note about gRPC: it has really helped us in that it suits our needs out of the box quite well. It's scalable, it allows for mutual TLS right out of the box, and it really helps with managing a pool of connections. Additionally, it prompted us to go in and do surgery on a really important part of the Loggregator system. So when we look at the improvements, I get asked a lot: wow, how did gRPC make that big a difference? We don't know for sure that it's solely gRPC. We did find places where there were improvements to be had around blocking and other things, so it's a combination of refactoring and a new protocol that really gave us a lot of the reliability improvements between Metron and Doppler. And once you have a gRPC connection, you can get into all those fun TCP jokes as well.
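To make the mutual TLS part concrete, here's a minimal sketch of a mutual-TLS gRPC dial in Go. This is a generic illustration, not Loggregator's actual source: the certificate file names and the Doppler address are made up, but the credentials setup is the standard gRPC pattern.

```go
// Sketch: a component like Metron dialing Doppler over gRPC with mutual TLS.
// File names and address are hypothetical.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	// The client presents its own certificate (the "mutual" part of mTLS)...
	cert, err := tls.LoadX509KeyPair("metron.crt", "metron.key")
	if err != nil {
		log.Fatal(err)
	}
	// ...and verifies the server against the CA that signed the Doppler certs.
	caPEM, err := os.ReadFile("loggregator-ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	creds := credentials.NewTLS(&tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      pool,
	})

	conn, err := grpc.Dial("doppler.example.internal:8082",
		grpc.WithTransportCredentials(creds))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// conn is now ready for a generated gRPC client to stream envelopes over;
	// gRPC manages the secured connection from here.
}
```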
One of the challenges Metron has is that there are a lot of Metrons. Because anything that wants to produce a metric needs a Metron, and all the Diego cells need a Metron, there's a pretty challenging service discovery problem: Metrons need to find Dopplers. And that's where our five-year-old friend BOSH comes in, with a new feature that's helping us: BOSH DNS. BOSH is now able to tell each of the Metrons about five Dopplers it can connect to. The key win for us is that we've been able to get off of etcd. etcd was previously how we discovered the locations of those Dopplers, and we were getting reports, and had experienced ourselves, problems where etcd got into split brain. Getting off of it is really going to help with our uptime and the service level objective around that. One thing to note, with a little asterisk on it: we are still in the process of transitioning to BOSH DNS. Discovery is currently handled by Consul, with plans to move to BOSH DNS soon.

Metron will also speak gRPC on ingress. This is still experimental, but we're really close to trying it out on our production environment. It will provide secure transport all the way from the Diego cell into the Metron process. That path is already secure in Linux, but in a Windows container there's still some listening on UDP. It will also improve message reliability in that one last place where we still have UDP. And it's another opportunity for more Princess Bride jokes.

Walking into our next component: when the message arrives in Doppler, it's now in our routing component. Back to those classic hard problems: Doppler's name tells you nothing about what it does. If we were to rename it, we would probably call it something like a highly available routing mechanism. So the log comes in, and it actually goes out in multiple places. Again we have a new protocol here: we're using gRPC, now connecting to our Traffic Controller component. Traffic Controller is an existing component; we've replaced the protocol from WebSocket to gRPC, but Traffic Controller is largely unchanged. The steps here are in line with what has been in the past, and that is, of course, you can still cf logs. You can go get your app's logs using cf logs, and that uses something called the logging API. If you're streaming logs, you're connecting to the Traffic Controller and pulling logs from that logging API. Additionally, the Traffic Controller provides the Firehose, so another copy of that log ends up in the Firehose. It's important to know that there is more coming out of Doppler than there is going into Doppler.

I mentioned gRPC has been put in place between Doppler and Traffic Controller, and I forgot to mention that our gRPC connection between Metron and Doppler, which was formerly UDP, also has some buffering in it. With UDP, we were getting back pressure relief for free from the kernel: if UDP gets overwhelmed, it will just start dropping messages on its own. The big problem is that we don't know about the messages that were dropped, so we're not able to emit a metric or anything when that occurs. With gRPC, we introduced what is called a ring buffer, or as we call it, a diode, and that allows us to handle writes coming in faster than the system can read them out, while keeping count of anything we drop. We've actually used that throughout: wherever we have a gRPC connection, we're putting things into a diode and reading off of that diode on a separate thread.
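Here's a toy version of that diode idea. Loggregator's actual diode is a lock-free structure and more sophisticated; this mutex-based sketch just shows the behavior that matters: the writer never blocks, the oldest entry gets overwritten when the reader falls behind, and, unlike UDP, drops are counted so a metric can be emitted about them.

```go
// Sketch: a fixed-size ring buffer ("diode") where a fast writer never blocks.
package main

import (
	"fmt"
	"sync"
)

type Diode struct {
	mu      sync.Mutex
	buf     [][]byte
	head    int // next write position
	size    int // entries currently buffered
	dropped int64
}

func NewDiode(capacity int) *Diode {
	return &Diode{buf: make([][]byte, capacity)}
}

// Set never blocks: if the buffer is full, the oldest envelope is dropped
// and the drop counter is bumped.
func (d *Diode) Set(envelope []byte) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.size == len(d.buf) {
		d.dropped++ // overwrote an unread entry
	} else {
		d.size++
	}
	d.buf[d.head] = envelope
	d.head = (d.head + 1) % len(d.buf)
}

// TryNext pops the oldest buffered envelope, if any.
func (d *Diode) TryNext() ([]byte, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.size == 0 {
		return nil, false
	}
	tail := (d.head - d.size + len(d.buf)) % len(d.buf)
	d.size--
	return d.buf[tail], true
}

func main() {
	d := NewDiode(2)
	d.Set([]byte("a"))
	d.Set([]byte("b"))
	d.Set([]byte("c")) // overwrites "a"; dropped becomes 1
	for e, ok := d.TryNext(); ok; e, ok = d.TryNext() {
		fmt.Printf("%s ", e) // prints: b c
	}
	fmt.Println("\ndropped:", d.dropped)
}
```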
So the Firehose, as I mentioned, is managed by the operator, and they can deploy a nozzle to read from it. And then we also have our syslog drain functionality, and this is where some changes come into Doppler. Dopplers previously handled syslog drains on their own, but there are some challenges around that. From a microservice perspective, we want to see Dopplers purely as this highly available routing mechanism. They still do a few more things than that, and the biggest thing they had been doing was managing the connections to syslog servers. So instead of managing those in Dopplers, we're now handing off to a new component called the Reverse Log Proxy. The Reverse Log Proxy is pretty much the equivalent of the Traffic Controller, but this component only speaks gRPC. It's the updated version of what the Traffic Controller is, and we got to name it what it should be named, which is a reverse proxy. For our purposes, it only handles logs at the moment.

As I mentioned, Doppler still does a couple of other things. One of them is that it holds all of the recent logs for your application: when you do a cf logs --recent, the Traffic Controller actually has to go to the Dopplers and reassemble all those logs. We'll take a look at some of the shortcomings of that.

From the Reverse Log Proxy, we hand off to a whole new system that we are calling scalable syslog. I should say we've called it that thus far; naming things is hard, and there is another release called syslog-release, so it's probably going to change names before it gets onto bosh.io. I think it's going to be called the syslog drain release, but stand by, we're still finalizing that detail. When a developer decides to bind a syslog drain to an application, that request is handled by CAPI, the Cloud Controller. We have a new component that lives in our release called the scheduler, and it's polling the Cloud Controller every 15 seconds to learn about new bindings that may have been set up. Once a binding is set up, the scheduler allocates that drain to one of many adapters, and this is where the scalable aspect of the name comes in. The new release allows operators to appropriately scale adapters based on the number of drains they have. Dopplers are a routing mechanism, so they're responsible for all the logs and metrics on your platform and need to scale with the overall log throughput of your platform. This release lets operators scale adapters specifically to suit their drains' needs. Additionally, a syslog drain lives out on the internet, so it's inherently susceptible to network latency, or to third-party tooling that may be slow or non-performant, and isolating drains on separate adapters really helps. You may have one drain going to a testing stack that's quite slow, filling up its buffer and losing messages, while a faster drain keeps up, and neither of those will interrupt the other, or, most importantly, the Firehose. So that's scalable syslog.
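To sketch what the scheduler's job looks like: every poll interval it asks the Cloud Controller which drains are bound, then spreads them across the available adapters. The types, the fake bindings, and the round-robin assignment below are all illustrative assumptions, not the actual CAPI contract or the real scheduler's algorithm.

```go
// Sketch: a scheduler-style loop that polls for drain bindings every 15
// seconds and assigns each drain to an adapter.
package main

import (
	"fmt"
	"time"
)

type Binding struct {
	AppGUID  string
	DrainURL string
}

// fetchBindings stands in for the poll against the Cloud Controller API.
func fetchBindings() []Binding {
	return []Binding{
		{AppGUID: "app-1", DrainURL: "syslog-tls://logs.example.com:6514"},
		{AppGUID: "app-2", DrainURL: "syslog://10.0.0.5:514"},
	}
}

func main() {
	adapters := []string{"adapter-0", "adapter-1"}
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()

	for {
		// Round-robin each drain onto an adapter. Because adapters scale
		// independently of Dopplers, a slow third-party syslog endpoint
		// only backs up its own adapter, never the Firehose.
		for i, b := range fetchBindings() {
			adapter := adapters[i%len(adapters)]
			fmt.Printf("assign %s -> %s (app %s)\n", b.DrainURL, adapter, b.AppGUID)
		}
		<-ticker.C
	}
}
```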
That's a quick summary of the things we've done over the last year, and now I'm going to jump into how we're starting to monitor all of this and work with Loggregator from an operational perspective. If you haven't had a chance to read the Site Reliability Engineering book, I highly recommend it. The Loggregator team all participated in a book club reading the SRE book, and it was really helpful for us. Also, thanks to Google: through their CRE program, they gave us a roadmap of some of the deficiencies they saw in our system and helped us make service level objectives part of our culture. And we worked closely with Cloud Ops to wear the pager and practice some of the things the book talks about.

About six months ago, we held a workshop to determine what service level objectives an operator would care about when they look at how their Loggregator system is working. If you're not familiar with service level objectives: you define indicators to look at, and then you define objectives for the goals you want to meet around those indicators. So the first step was really to decide what we wanted to measure. We had various tools for monitoring, but they were telling us lots of different things about components within the system. We sat down and talked about what operators actually care about. They just care about getting their logs, and hopefully not having to manage the system too frequently.

The first thing, of utmost importance, is uptime, and we decided to define that as the ability to receive a message. It's possible the system is overloaded and that message could get lost, but the system is still up, so from an uptime measurement we still see the system as functioning at that point; it's just not properly scaled. Message reliability, which is the title of this talk, is really the trickiest of these objectives, and that is defining how likely a message is to get through the system. That relies on the system being appropriately scaled, and it's also a challenging thing to measure when you look at some of the metrics emitted from the system. We'll walk through a quick demo, but that's where our off-by-one errors and distributed math can really bite us. The last thing we decided we needed to keep an eye on is latency, especially with the cf logs command. Users and developers have gotten used to seeing their logs near real-time, pretty much instantly when they run that command. Additionally, we don't want to be a product that holds on to logs for hours or days; there are other products that can do that for you.

We took a look at what we were using to monitor, and we found it completely overwhelming. When you look at a chart and it says something like "you have received 101% of your logs," you're not really sure what that chart means. We had quite a few monitors where we were trying to subtract the count of logs delivered in one place from the count received in another and use that to determine a loss percentage. Like I said, it would bump up above 100%, and whenever there's any doubt, there is no doubt.

Why is that so hard? I just want to walk through a quick example of that other cliché problem. Keep in mind, when a log comes into Metron, we emit a metric about receiving that log: one envelope comes in, and it produces another envelope, so two envelopes come out. And what do you know, Doppler now creates another envelope; we're already up to three envelopes for this single log. It's likely the Traffic Controller has a log sitting in its queue as well. In a fraction of a second, the single log that was sent through the system may end up egressing as four messages. You can understand, as we started to apply some distributed math about how much was sent versus how much was received, why this got hard. We could do some filtering on which metrics were our own, but we were having a really hard time measuring down to a percent, or a tenth of a percent, of reliability.
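With some made-up numbers, the trap looks like this: suppose an app emits 10,000 logs, but between the copies made for the stream and the Firehose and the platform's own metric envelopes riding in the same counters, the egress side counts 10,100. The naive white-box math then reads

$$\text{naive delivery rate} = \frac{\text{envelopes counted out}}{\text{logs sent}} = \frac{10{,}100}{10{,}000} = 101\%$$

which is exactly how you end up with a chart claiming you received more than 100% of your logs. (These figures are hypothetical; the real counters and how far they drift vary by deployment.)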
We decided to take a different approach. When we were doing a deploy of gRPC, I found myself sitting in front of the command line wondering: what could I do to look at some of our performance during this deployment? I of course started to look at logs, and I piped them into a file. Then I decided to just run a word count on that file. I knew how many logs my application had produced, so I just counted them after the fact, essentially cf logs piped into a file, followed by a wc -l on that file. I started doing that over and over through the test: I started it before the deploy and finished after the deploy. It was pretty remarkable; I actually did see the numbers come up quite a bit. We often used 10,000 as the number of logs we produce, and it was exciting to see some tangible results happen while we were doing the deploy. That's a pretty simple test. It really is that simple; just a couple of command lines will get you there.

Additionally, after we had our workshop to define service level indicators, that latency idea came out of it. When we started repeating these tests, we actually did see that cf logs --recent could be quite slow. Half a second is probably our goal for cf logs --recent; you'll see in one of my later slides that we are not currently meeting that particular service level objective.

One more test that we've implemented, and this one's a little trickier: counting logs that go into a syslog drain. For that test, the engineers helped stand up a syslog server that simply counts the logs on the receiving end. Again, we're able to look at this as a black box: set up the drain, and then, over on the actual drain, count the logs that are received.

So what does this look like? We ended up pipelining the tests you just saw and started running them every 15 minutes against a variety of foundations. We used three sampling rates. The first is a really slow drip of two logs a second. Then a steadier stream of logs, representative of the average log volume we have seen on our production environment, which is 1,000 logs per second; that's on the average-to-high side. Then we also decided to stress test the system with as many logs as we could produce, which tends to be around 2 to 4 million logs a second. You'll see some of the numbers here. It's hard to make out the two distinct lines for two logs a second and 1,000 logs a second, but the flat line across the top is those two rates, and you'll see that they are quite reliable. You'll also see a big spike in the blue line on the top chart. That's not abnormal; the stress test does end up dropping logs when you really hit the system with that much noise. Our goal is to provide high reliability at those slower log rates. We also use a weighted average here so that the numbers make a little more sense to an operator. The weights I selected were 70% for the two logs a second, 20% for the stream of 1,000 logs a second, and only 10% for the flood.
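As a formula, with $r$ as the delivery rate measured at each sampling rate (the weights are from the talk; the notation is mine):

$$R = 0.7\,r_{\text{drip}} + 0.2\,r_{\text{stream}} + 0.1\,r_{\text{flood}}$$

So if the drip and the stream both deliver 100% but a flood run only delivers 80%, the weighted score is $0.7 + 0.2 + 0.08 = 0.98$, or 98%.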
That way, if you get dinged by the flood being off for a particular run, it doesn't tank your overall average. You might be asking: what's up with that bottom one? That's a foundation that hasn't made the upgrade to gRPC yet.

It was an exciting moment for us when we were releasing a deployment where we had actually configured these monitors to run on a regular basis, not just from my command line. This is what it looks like when you upgrade to gRPC; that came out, I believe, in cf-release v253. You'll see on the left that even the two logs a second is a wavy line: even at a very slow rate, we would occasionally drop a log. After the gRPC upgrade, that flood of logs dramatically improved; you can see the jump there in the spiky line. And the two logs a second really just flatlined. We were really excited to see that. You'll see in the other charts, going back in time a little bit, that we've continued to see improvements as we release new functionality.

I mentioned our latency is not what we had hoped for. The reality is we hadn't really gone out there and measured it. We started to see indications that we were having problems, not only when we set up the monitor, but when certain other monitors that the Cloud Ops team maintains started to fail. Really, that's because of what I mentioned: the Traffic Controller needs to collect all the logs when a --recent call happens, and that can take a long time. We have some short-term and long-term plans to address this, and we are going to be making some improvements here. The main risk is that the CLI has a timeout of 15 seconds; things start to fail in the CLI as a --recent call exceeds 15 seconds, and we're starting to run out of that headroom in some of our deployments.

Last but not least is that syslog drain test I mentioned. It is a trickier test; it involves a little more waiting and the setup of a syslog server. You can tell by the dates that even right up to a couple of days before the conference we've been tuning that test so that it runs reliably.
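For reference, a bare-bones version of that counting server might look like the following. Real drains speak RFC 5424 syslog, often over TLS, but the black-box test only needs to count what arrives; the port and the newline framing here are assumptions, not the team's actual test code.

```go
// Sketch: a stand-in "counting drain" that accepts TCP connections, counts
// newline-delimited messages, and logs a running total.
package main

import (
	"bufio"
	"log"
	"net"
	"sync/atomic"
)

func main() {
	var received int64

	ln, err := net.Listen("tcp", ":6514")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("counting drain listening on :6514")

	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) {
			defer c.Close()
			scanner := bufio.NewScanner(c)
			for scanner.Scan() {
				// Compare this total with the number of logs the test app
				// emitted to get the drain's delivery rate.
				log.Printf("received %d messages", atomic.AddInt64(&received, 1))
			}
		}(conn)
	}
}
```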
We also integrated this into our release process. Every time we cut a release, we compare it to the previous release of Loggregator, and we do that every day. Each one of these red lines represents our latest development branch compared to the previous release version of Loggregator, and we compare them to see whether we're improving or slipping on our message reliability rate.

A couple of quick strategic items. Our North Stars for Loggregator as a product continue to be: first, apply no back pressure on producers. That is our sacred item and a waypoint we check in on pretty regularly; we just ran a back pressure test for a new gRPC API last week. Second, deliver as many logs as possible. Again, that's a waypoint we've been pretty focused on, and it has really guided us to update the protocols and write some of these monitoring tools. Those are the things we've been focused on over the last year. I mentioned latency; I added it here because I think it's a waypoint we haven't been paying much attention to, and we need to ensure we don't exceed the limits of the CLI. And metrics are another really important aspect of the Loggregator system. I didn't talk much about metrics, and we haven't done a ton of work around metrics over the last year, but improving the clarity and consistency of metrics is something Pivotal has made investments in, and Loggregator is resetting strategically to take a waypoint from that as well.

I'm going to walk through just a couple of quick tactical items, and then I'll go ahead and take some questions. Some new features are coming. I mentioned the gRPC endpoint into Metron; that should improve message reliability, and it actually secures the log transport fully in Windows as well. We are working on an operator playbook and capacity planning tools. We've been standing up an environment on the GCP IaaS and stressing it. We have some rules of thumb about how many Dopplers you need for how many logs per second, but a lot of assumptions have gotten lost in that testing, so we're resetting on those, and we'll be publishing a new white paper that outlines them. App developers getting access to metrics: this is a big gap, and something I've been hearing a bit of evidence for during the conference. If you're on an app development team, or you're an operator who would like to get metrics over to your developers, I'd love to hear from you, so find me after the talk. Improving latency, we already touched on. And noisy neighbor mitigation is another thing I added at the last second, because I've been hearing it in feedback from some of our users while I'm here; gRPC inbound to Metron gives us some new ability to manage that a little better.

Just a quick note: if you develop a service tile, we are taking something out of Metron. The syslog functionality in Metron is going to be removed on August 1st. If you have questions about that, you'll need to use what's called syslog-release, and we'd be happy to walk you through the details of what that takes.

Questions? Anyone? All right. Thank you, everyone.