Good afternoon. Welcome back from lunch. My name is Alan Douay. I'm the product manager for Loggregator. How many folks here know what Loggregator is? All right, lots of people. Anybody want to do a source code review for Loggregator right now? Just come on up here. Only joking. Today in this presentation, what I'd like to do is go over Loggregator at a high level, explain what it is and how it works at a very simplistic level, and then we'll look at each of the different components. Within the broader Cloud Foundry ecosystem, Loggregator is one of the components where you have to make choices about how much loss versus how much performance, how many resources you want to throw at it. So we'll go over some tuning tips, some experiences, maybe a little bit of an explanation about each component, and then some common problems. These are problems that we see quite often from an inbound customer request perspective. We'll open it up for Q&A at the end, although if there are any questions along the way, please feel free to interrupt me. I'm happy to field some questions. And because everything's better as a CF app, this presentation is actually pushed; you can go look at it live at that URL.

So to kick it off, how does Loggregator work? Well, easy. It works just like this. This is the boxes-and-lines diagram from our architecture document. One of the core problems we're trying to solve with Loggregator is this: outside of the Cloud Foundry ecosystem, when you have an app that crashes on a machine, you normally just log into the machine and go look at the logs. You want to see the standard error output. Maybe you have to restart the machine to get to it. Maybe the machine crashed, but what you see is what the application state was at that termination point. With CF, you don't have that, because the machines themselves are ephemeral. We get rid of them. If the app crashes, we kill that container and bring up a new container, and so there's no place to go back to. There's no going back to that old machine state. So what we do is create a system that takes the standard out and standard error streams, brings them out of the containers, and makes them available externally. That does a couple of things. First, it makes the logs available: you don't have to be on that machine to get this information. But secondly, it preserves them, so the state of the container isn't the deciding factor in whether you get to see what your applications were doing and what the environment was like at the time something happened.

It is fairly complex, in that we have multiple different things sending through Loggregator and multiple different outputs. The key thing that we do, though, is application logs. Within the broader diagram here, you'll see that we're sending metrics about the container, and we're also sending component metrics through. There are different outputs: there's a native syslog piece, there are nozzles. We'll talk a little bit about each one of those. But at the high level here, I wanted to go back to a more simplified view, look just at app logs, and show you how they work. Whether you're running DEAs or Diego, you have the same thing here: on each individual cell, you have one component called Metron. This isn't part of the containers; it's just on the host cell.
What happens is, if you're using Diego, Diego has a piece called Executor (great name), and it will start reading from the containers. It reads each line out of standard error or standard out, puts it into an envelope, and sends it to the Metron component on that cell. Metron then sends that packet out to a collection of servers we call Dopplers. The idea here is that with Loggregator you have two different things going on: one is sending information from these cells, and the other is people consuming from that middle tier, that Doppler layer, and bringing things out. That's an important concept to understand, because what we're trying to do when we talk about tuning is right-sizing that flow of information through the system: basically egressing the data out of the system as fast as you're putting it in, or finding the right balance. You can obviously over-provision on one side or the other, but that's where you really run into problems and you get what we call log loss.

Log loss is probably a big topic for Loggregator; we obviously have to answer the question of "why did I lose my logs?" When we designed Loggregator, the key concept that came into play for us, and you can see it in that very first picture, is that we don't want to put any pressure back onto the applications. The thing that often happens is that when an application is crashing, it's because it has resource constraints or environment constraints; basically the app is under stress. The last thing you want to do is say, "no, you have to write this thing out to this service endpoint," and that's really where we would end up if we said it's critical that all messages come through. So we don't want any of this to produce back pressure, to push a requirement back onto the application in order to service this need. What that means is we allow a lot of preemption. We allow the application to send this out in that 12-factor-app way: just log it to standard out, and that's a transaction the app doesn't have to manage or keep track of; it can just keep moving on. What will happen, though, is that in some cases, after that write, things go awry: that particular log can be lost in several different places. We'll talk about those in a little bit.

The other tenet we're trying to adhere to as we build Loggregator is that we want to get you the information as quickly as possible. That means that if there's something in the system that's slowing down or jamming up other pieces, we're just going to let that information go in order to prioritize the traffic sitting behind it. We don't want stuff to get backed up, so to speak, so we try to send information as quickly as possible, find the areas where there might be an issue, and just blow past them. And then the last piece is that within the system there's this concept that we allow multiple consumers to feed from the same information. So you might have a monitoring solution pulling metrics and information out, you might have an integration with something like Splunk pulling logs out, but you still want to see those logs when you do a tail, for example.
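To tie the flow and the no-back-pressure tenet together, here's a minimal sketch of the hand-off just described: read one line at a time from a container's stdout or stderr, wrap it in an envelope, and fire it at the local Metron agent over UDP without waiting for any acknowledgement. This is illustrative only; the real components speak the dropsonde protobuf protocol, and the JSON envelope, field names, and port below are simplifications, not the actual wire format.

```go
// Illustration of the line -> envelope -> UDP hand-off to Metron. The real
// protocol is dropsonde protobuf, not JSON; this only shows the shape of the flow.
package main

import (
	"bufio"
	"encoding/json"
	"io"
	"net"
	"os"
	"time"
)

type envelope struct {
	AppID     string `json:"app_id"`
	Source    string `json:"source"` // "STDOUT" or "STDERR"
	Message   string `json:"message"`
	Timestamp int64  `json:"timestamp"`
}

// forwardLines turns every line of the stream into its own envelope and sends
// it fire-and-forget, so the app never feels back pressure from logging.
func forwardLines(appID, source string, stream io.Reader, metronAddr string) error {
	conn, err := net.Dial("udp", metronAddr)
	if err != nil {
		return err
	}
	defer conn.Close()

	sc := bufio.NewScanner(stream)
	for sc.Scan() {
		b, _ := json.Marshal(envelope{
			AppID:     appID,
			Source:    source,
			Message:   sc.Text(),
			Timestamp: time.Now().UnixNano(),
		})
		conn.Write(b) // no ack, no retry: preemption over guaranteed delivery
	}
	return sc.Err()
}

func main() {
	// Demo: forward this process's own stdin as if it were an app's stdout.
	_ = forwardLines("my-app-guid", "STDOUT", os.Stdin, "127.0.0.1:3457")
}
```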
Within our system, we effectively syndicate the information so that the same information is available to multiple consumers. There are several tricks along the way to do different things, but these are the tenets that drive the architecture of what we're trying to do.

So when you look at those pieces and you're talking about what we call loss, or how to scale and right-tune the system so that you minimize loss but don't hugely over-provision one way or the other, there are really two things that are very helpful to know as an operator or somebody setting up the system. One is your application log rate per some time period: how many logs do you think your apps are emitting, for example, every second? The second is how you're consuming them out: do you run multiple nozzles, or do developers just casually come along tailing logs every now and then to see what the apps are doing? Do you have a syslog drain configured? The reason to know this is that each of the components has an impact across the system, as we'll go through. When you're looking at this, there are really three areas where you have options with Loggregator: the Metron agents themselves, which forward the load off of the Diego cells to Doppler; the Dopplers, and how many Dopplers you have; and lastly the traffic controllers.

Now within the system, the key piece at a high level is the Dopplers. They sit right in the middle of Loggregator, so to speak, and they're doing two different tasks: receiving information inbound from your Diego cells and moving it out. When we see loss, it usually happens at two different places. It can happen in more than one place, but the first place I really see loss happening is when messages get stacked up at the Doppler itself: your ingestion is faster than your egress, your consumption of logs coming out of Loggregator. That happens sometimes when you have, and how many people here run something like Splunk or have a syslog drain? Lots of folks. When you have that kind of syslog drain and you're pulling that information out slowly, then what will happen is that you'll start to see a very helpful log message in Splunk that says, hey, by the way, we dropped, like, 999 messages for you. You're like, why did you send me this one, then? But the idea here is that we made a conscious decision to say: look, too many of these things are being stacked up in the middle, so we're getting rid of the buffer so that we can let more stuff flow through. It's a signal that one of two things happened: you either got a spike in traffic beyond what your buffer could handle, or your consumer is just really slow and you need to make a configuration change. It's basically telling you to go look at how you can tune your Loggregator system.

I'm going to dive into Metron first, and then we'll go to Dopplers and the traffic controllers. With Metron we've done a lot of research, especially on virtual machines, to see how many messages we can pump into Metron before you start to see loss happen right there at Metron. We really start to max out on virtual hardware at about 8,000 messages per second.
For planning purposes, what I try to do is help folks target about 1,000 app logs per second per cell as the range where you want to look at how many Diego cells you need. You really only have two options here. One is you can increase your number of Diego cells, so that the number of tenant apps on any one cell decreases, and as it decreases, the load on Metron balances out and subsequently decreases. Oftentimes that's the answer. The second option, as I mentioned earlier, relates to the fact that every single line of standard error or standard out gets its own message. One way to decrease message traffic is to, in effect, trick the Diego cell into sending multi-line messages as one message package. On the next slide I'll show you what we're talking about there. But the key point is that if you're in about the 1,000-app-logs-per-second range on each of your Diego cells, you have some additional headroom, so spikes can be accommodated. There is a built-in price you're going to pay for some metrics flowing across: every cell emits container metrics every 30 seconds, and that's also flowing through Metron. So that 1,000 messages per second is just a good baseline; you're probably going to be sending about 2,000 messages a second at that rate anyway. And again, it's just a starting point. What we really ask you to watch is the Metrons: don't overflow them beyond that 8,000-messages-per-second point. You're going to start seeing loss beyond the 2.5% range there, and part of that is because Metron can't handle that rate on a lot of hardware. You'll start to run into things like the UDP buffer overflowing, so UDP will start dropping messages, and the Metron agent itself will start dropping them as they move through the process.

One of the key things here, I think we'll actually just jump into it: how many folks have multi-line stack trace dumps? And have you done anything about it yet? What you see with those stack trace dumps is that each line comes across as a separate message. Then, even if you're using Splunk or ELK or any other tool to try to reassemble the logs, what you see is logs that come across really choppy. In fact, they're hard to reassemble because you don't even know the sequence they came in. One of the key tenets of the system is that a Metron sends to Dopplers within its own zone. So when you're talking about the number of Dopplers, you want to make sure there's an equitable number of Dopplers for every zone. If you have three availability zones, for example, instead of running 13 or 14 Dopplers, run 15, so you have five in each zone. That gives you a better chance of keeping all of the messages within the same availability zone of Dopplers, but the Metrons will still naturally try to spread the load out across those Dopplers. What that means is that you might not even receive those log lines in sequence, because some of the lines might go to one Doppler and others might go to another Doppler.

Really, the strategy we've seen a lot of folks employ here (and apologies, we've been working pretty hard on the Loggregator side to take care of this multi-line issue, but part of it lives in Diego, so we're working with that team to overcome it) is that you can kind of trick it. This is what we've seen a lot of folks do: they use something like Logback, or before they write to standard out they add some sort of routine, to make the whole multi-line event look like a single line.
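Here's a small sketch of that kind of trick, assuming a Go app; the marker character and helper names are my own choices, not anything Loggregator requires, and Logback users typically achieve the same thing with a replace conversion in the pattern layout. The idea is simply that a multi-line stack trace becomes one stdout line, so it travels as one envelope, and the consuming end splits it back apart on the marker.

```go
// A sketch of the "send the stack trace as one line" trick described above.
// A panic's multi-line stack trace is collapsed onto a single stdout line by
// replacing real newlines with a marker the downstream consumer re-expands.
package main

import (
	"fmt"
	"runtime/debug"
	"strings"
)

const lineMarker = "\u2028" // any token your log consumer knows to split on

// logOneLine writes msg to stdout as exactly one line, however many
// newlines it originally contained.
func logOneLine(msg string) {
	fmt.Println(strings.ReplaceAll(msg, "\n", lineMarker))
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			// One envelope through Metron instead of one per stack frame.
			logOneLine(fmt.Sprintf("panic: %v\n%s", r, debug.Stack()))
		}
	}()
	panic("boom")
}
```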
The key thing to know, if you use a strategy of sending a multi-line message and recomposing it at the consuming end, is that in certain versions of CF we're going to truncate it for you anyway. We're going to split it up, because we only want messages of a certain size. We use UDP, and UDP has a 64k limit before it starts breaking up packets anyway, and so in CF 237, I think, we went ahead and increased the max packet size that the Executor will create and allow through. So in older versions multi-line messages will get truncated, but in newer versions you'll be able to send bigger and bigger messages through Metron. One of the things this strategy obviously does is reduce your message rate through Metron. Metron doesn't care that the package is size X; it cares how many messages it has to send. That's really its overhead, and so for a lot of folks this is the strategy they'd rather use: send the message whole as much as possible. We've been working on this for a while. Part of the limitation from a technology perspective has been the fact that we don't own some of the message assembly piece, the Diego team does, but we have been working on the Loggregator side to support bigger and bigger packages. There will be some limit, obviously; we don't want 10-gig log files going through the system, that would obviously be a problem. But this is a good strategy for folks who have this kind of problem and are trying to reduce their message rate through Metron.

And then there are the Dopplers. Dopplers are interesting. The key piece here, I think, is that a lot of folks use a rule of thumb of about four to one: four Diego cells per Doppler is roughly where you want your deployment sizing, and that's okay.
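Pulling the rules of thumb from the last few minutes together, here's a back-of-the-envelope sizing helper: roughly 1,000 app log lines per second per cell, container metrics roughly doubling the per-cell message rate against Metron's roughly 8,000 messages-per-second ceiling, about one Doppler per four cells, and Dopplers balanced evenly across zones. The numbers are the talk's rough guidance, the arithmetic is mine, and none of it is an official sizing tool.

```go
// Back-of-the-envelope Loggregator sizing from the rules of thumb in the talk.
package main

import (
	"fmt"
	"math"
)

func ceilDiv(a, b float64) int { return int(math.Ceil(a / b)) }

func main() {
	totalAppLogsPerSec := 12000.0 // assumption: your foundation's total app log rate
	zones := 3

	cells := ceilDiv(totalAppLogsPerSec, 1000)                // ~1,000 app logs/sec per cell
	perCellMsgs := totalAppLogsPerSec / float64(cells) * 2    // logs plus container metrics
	dopplers := ceilDiv(float64(cells), 4)                    // ~4 cells per Doppler
	if dopplers%zones != 0 {                                  // keep zones evenly balanced
		dopplers += zones - dopplers%zones
	}

	fmt.Printf("%d cells (~%.0f msgs/sec per Metron, ceiling ~8000)\n", cells, perCellMsgs)
	fmt.Printf("%d Dopplers (%d per zone)\n", dopplers, dopplers/zones)
}
```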
One of the things I start off with is this: if you know what your average rate through the system is, then you don't want the Dopplers to get much above about 4,000 messages a second. That's messages total, so that's the metrics coming across as well as the logs. The reason is that technically the Dopplers can take more ingress than that, but what we start to find is that on the consuming side you're not going to be able to pull it out that fast.

The Dopplers have an interesting feature. If you do a syslog configuration, that's cf cups with the -l flag, where you send logs directly to your syslog endpoint, Doppler will create what we call a sink for that: s-i-n-k, not synchronization. The idea is that it's a buffer, or a queue if you will. It allows us to put messages in there and hold them temporarily (it's completely a temporary store) while consumers pull information off. The idea is twofold. One, we let you use it for spiky traffic: if you have steady egress from the system, you're pulling information out of the Dopplers as fast as you're putting it in, and then you get a big spike, it's just a place where messages can pile up for a bit so that your consumer can keep taking them out continuously. The second piece is that we store it temporarily so you can see a backlog. Anybody ever run cf logs and tail some logs out, and you see the logs you previously had? We'll store, like, a hundred of them for you so you can see them. Those are the concepts there.

Now, every single sink gets its own buffer on every single Doppler, meaning that if you have five or six different consumers, you might have a bunch of these different sinks and their corresponding buffers. So you need to consider: do I need to put additional RAM or ephemeral disk on my Dopplers, just to give them some room, if I've got really, really large buffers? One of the big things that changed in recent releases is the default buffer size. The buffer size is per message count, not per size, and previously it was 99 messages; so if you're ever seeing that you lost 98 messages, you know that you've got 99 messages as your buffer count. Recently that's been upped to, I think, 9,999, just because we feel that actually helps folks by giving a little more head space. For the most part the messages themselves aren't very large, so they're not taking up a lot of space on the disk. If you do send multi-line messages, so you send those logs whole, then obviously you have to reconsider: can you take 10,000 messages at 60k apiece? Make sure you size the Dopplers appropriately so they can absorb that.

But I want to stress that the whole queuing, buffering piece is really not meant to be long-term storage; it's meant to accommodate spikes. If your consuming end is slow, it won't ever matter: we're always going to end up filling the buffer, and for us, when we fill the buffer, we just drop the whole buffer. We kill it all and then let the stuff behind it come through. That's the methodology we're using right now. What it means is you do get that log message that lets you know that it happened.
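A toy model of that per-sink buffer behaviour may make it concrete: each sink buffers by message count, a consumer drains it, and on overflow the whole buffer is thrown away and replaced with a single notice. The sizes, names, and notice wording here are illustrative only, not Doppler's actual code or log text.

```go
// Toy model of a Doppler sink buffer: bounded by message count, drained by a
// consumer, and dumped wholesale when it overflows.
package main

import "fmt"

type sink struct {
	buf     []string
	maxMsgs int
}

func (s *sink) push(msg string) {
	if len(s.buf) >= s.maxMsgs {
		// Overflow: drop everything queued and leave a single notice behind.
		dropped := len(s.buf)
		s.buf = s.buf[:0]
		s.buf = append(s.buf, fmt.Sprintf("log output too high; dropped %d messages", dropped))
	}
	s.buf = append(s.buf, msg)
}

// drain hands whatever is buffered to the (possibly slow) consumer.
func (s *sink) drain(consume func(string)) {
	for _, m := range s.buf {
		consume(m)
	}
	s.buf = s.buf[:0]
}

func main() {
	s := &sink{maxMsgs: 3}
	for i := 1; i <= 7; i++ {
		s.push(fmt.Sprintf("app log line %d", i)) // consumer never drains: overflow
	}
	s.drain(func(m string) { fmt.Println(m) }) // prints the notice plus the newest lines
}
```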
But there are no guarantees about what we just dropped: whatever was in there, that's what was dropped, and that's per buffer, per sink.

So in order to avoid that, what we hope to get people to do is look at three things. First, look at the number of Dopplers you have per zone. In a couple of cases I've looked at, folks had four Dopplers in two zones and three Dopplers in one zone, and it just so happened that the log loss they were seeing came from the zone with three Dopplers: the ingest was effectively happening faster across those three Dopplers than across the Dopplers in the other zones. So balance them out; make sure you don't have an imbalance there. Secondly, look at your buffer size from a sensibility perspective. If it's at a hundred, you could probably do more than that, but is it really just accommodating spikes, or is your throughput consistently too slow on the consuming side?

If your throughput is consistently too slow, you have two options. One is you can decrease each Doppler's ingestion rate by increasing the number of Dopplers: the Metrons will spread the load, and the average rate of ingest into each Doppler will effectively go down. That doesn't necessarily solve the problem, because your consumer obviously still has to go through all the information from all the Dopplers, but it does mean each of those buffers has more time before it reaches its maximum and gets dropped. The second thing you can do, if you have the option, is create two nozzles that use the same subscription ID, which effectively splits the traffic. What happens there is they pull from the same sinks, but each randomly pulls roughly half the traffic, so it effectively doubles your egress of information out of the system. Of course, it means you'll have to figure out a way to reassemble that stream on your consuming end. These are some successful strategies we've talked through with other folks. It's really a balancing act, and unfortunately there's no one answer; what you're looking at is a combination of those different things: how do I get the right balance of flow through the system, using the buffers only as that accommodation for spikiness?
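On that last point, here's a minimal sketch of reassembling the two half-streams on the consuming end, assuming each message carries a timestamp. It simply merges both halves and sorts; a real consumer would re-order within a bounded time window rather than buffering everything, and all names here are made up for illustration.

```go
// Recombining two nozzle halves (same subscription ID) back into one
// timestamp-ordered stream on the consuming side.
package main

import (
	"fmt"
	"sort"
	"time"
)

type logMsg struct {
	TS   int64 // nanosecond timestamp carried on the message
	Text string
}

// mergeByTimestamp drains both halves, then returns the messages in time order.
func mergeByTimestamp(a, b <-chan logMsg) []logMsg {
	var all []logMsg
	for a != nil || b != nil {
		select {
		case m, ok := <-a:
			if !ok {
				a = nil
				continue
			}
			all = append(all, m)
		case m, ok := <-b:
			if !ok {
				b = nil
				continue
			}
			all = append(all, m)
		}
	}
	sort.Slice(all, func(i, j int) bool { return all[i].TS < all[j].TS })
	return all
}

func main() {
	now := time.Now().UnixNano()
	a := make(chan logMsg, 2)
	b := make(chan logMsg, 1)
	a <- logMsg{TS: now + 2, Text: "line 3"}
	a <- logMsg{TS: now, Text: "line 1"}
	b <- logMsg{TS: now + 1, Text: "line 2"}
	close(a)
	close(b)
	for _, m := range mergeByTimestamp(a, b) {
		fmt.Println(m.Text)
	}
}
```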
And then lastly, the traffic controllers. I think some guidance has gone out in the past about how many traffic controllers per Doppler, and that's okay, it's a decent rule of thumb, but in general you want to look at how many consumers you have. If you're familiar with a reverse proxy, that's effectively what a traffic controller is: it presents one endpoint, but it creates connections to every single sink on every single Doppler. As a result, if you have a ton of traffic controllers but only one or two nozzles, you're basically making the Dopplers pay an extra overhead cost for managing each of those connections when you don't need to. In fact, you can probably help the Dopplers out by scaling down your traffic controllers. There are limits, obviously: you don't want to constrain them too much, because they are proxying and managing that output for you. But look at how many different consumers you have. If you're just running one nozzle and people occasionally come along and tail with cf logs, usually a couple of traffic controllers are enough. So those are the high-level pieces for the traffic controller.

I'm going to stop here, because the next pieces are just common problems. Are there any questions, or any areas I can clarify for anyone? Yes?

[Question from the audience about prioritizing log messages.] Yeah, it's a good question. We get this request a lot, so it is actually in the backlog: prioritization of log messages, call it a classification scheme, where we'd say priority one, priority two, priority three, then try to somehow sequence them out and drop priority threes first, then priority twos. The challenge on our end is that there's probably a bit of work to do before that around whether we can even segment the stream. For example, right now you're getting message loss, but you don't know whether those messages are in fact logs or metrics coming from other parts of the system; in fact, they're probably a combination of both. So one of the other requests here is to split apart those pieces, so that when you have log loss, you know it's log loss or metric loss. Our first attempt at this will likely be to split logs and metrics into different streams and prioritize logs first, rather than introducing something where you have to pick the classification of the logs. So that's some future stuff we're looking at. Any other questions? Yes?

[Question from the audience about sinks versus drains.] Yeah, sure. They're actually synonyms for us. The sink is really that visual representation, and I almost apologize to say this, of a sink with a faucet: you turn on the hose, and that's Metron sending into the Dopplers, and then the drain is effectively pulling that stuff out. In fact, I think we even call it a syslog drain just to reinforce that visual. It's just like a sink: if you can't drain as fast as you're putting in, you're going to overflow, and for us, the moment we overflow, we dump the whole sink out and give you a fresh one. That's really what it's meant for; it's not meant for longer-term storage. Anything in there is going to be pulled as soon as you have a consumer ready to pull it. So I sometimes call it queues, a buffer, a sink; my apologies, but I think in the documentation we officially call it a sink, s-i-n-k.

I'll start here with a couple of the most common issues that come to us from a support perspective, support issues or support escalations. One of the challenges in a system with this many components is that they have to have a way to find one another. We don't hard-code it. In fact, when you roll or bring up another set of Dopplers, we want all the other components to dynamically find those new instances; we don't want to throw everything away, and we don't want you to have to roll Diego cells, for example, just to get more Dopplers. So we use etcd as our service discovery component, and most of the time, I'd say probably close to 50 or 60% of the issues that get escalated to us, it ends up having to do with some breakage along this route, where one component can't find another. I'll give you a specific example.
One time, an etcd node came up and didn't join the cluster, so it created its own cluster. Now a Doppler comes up, says "hey, I'm new," goes to Consul DNS, gets an etcd IP address, goes there, registers itself, and finds that it's the only one there. And now some Metron agents, even though that's just one of ten Dopplers in the system, two or three Metron agents start to say "hey, I'm also redirecting now," and they start to flood that one Doppler. The classic symptom is someone saying, "for some of my apps I'm getting everything, but I'm just missing tons of logs from these other apps." We always end up saying, go check your etcd cluster health, and they always come back and say, "well, etcd is fine." You ask, "how many nodes do you have?" and they say, "well, two." That's the problem right there.

So one of the things to check is these other systems. The etcd one has, I think, a couple of manifestations. There's that one; sometimes keys get missed; sometimes an event in etcd, a change like "a new traffic controller has come online" or "a new Doppler has come online," doesn't get published or pushed to the other pieces. Oftentimes what you'll hear from a support perspective is somebody saying "just go roll that Doppler" or "just go roll that traffic controller," and then it suddenly works. Nine times out of ten, when that happens, this is what's behind it: the dynamic configuration isn't quite synced, so the components don't know about one another, and that's what's causing the flow problem. We're obviously spending some time here; like all the components, from a service discovery perspective we're trying to make this more robust and consolidate it down. But this is probably the number one piece: if you're managing a Loggregator deployment and you see issues, start here, because it will probably save you some time.

The other one is we get a lot of questions because people see this big message that comes in and says "hey, we've just dropped 9,999 messages," and it's almost always because of a slow consumer. Most of the time when we look at this, it's something like: you're using Splunk with a small syslog forwarder, but you're sending a lot of information its way. There are a couple of different strategies for that. You can move up to a heavy or medium syslog forwarder, something that can pull information faster. Most folks don't know this, but you can also spread the load: you can send this to a load balancer, an F5 or something like that, and spread it across multiple consumers behind the scenes, and that will usually speed up the writes. Technically, what's happening on the syslog drain piece is that we have a routine writing out, and it waits for the response back over that connection before it writes the next message. That's how we know when something is slow: we can say, "hey, we just waited a really long time on these writes, and now the queue has backed up." And again, based on that simple architecture, what we do is drop the whole buffer and try to bring in the latest, newest messages behind it, in order to catch up.
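The "slow consumer" behaviour described here boils down to the drain writer being a simple blocking loop: one write per message, and the next write doesn't happen until the previous one completes. Here's a rough sketch of that shape, assuming a plain TCP syslog endpoint; the address, framing, and function names are illustrative, not Doppler's actual drain code.

```go
// Sketch of a blocking syslog-drain writer: each buffered message is written
// in turn, so a slow or stalled receiver stalls the whole loop, which is why
// the upstream sink buffer fills and eventually gets dropped.
package main

import (
	"fmt"
	"net"
)

func drainTo(addr string, messages <-chan string) error {
	conn, err := net.Dial("tcp", addr) // e.g. a Splunk forwarder or an F5 VIP
	if err != nil {
		return err
	}
	defer conn.Close()

	for msg := range messages {
		// One write per message; if the receiver is slow, this blocks and the
		// sink buffer feeding `messages` starts to back up.
		if _, err := fmt.Fprintf(conn, "%s\n", msg); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	msgs := make(chan string, 8)
	msgs <- "example log line"
	close(msgs)
	_ = drainTo("127.0.0.1:5140", msgs) // hypothetical local syslog endpoint
}
```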
When you see this message, most of the time we're going to ask you for two pieces of information. Is it because of a spike; do you think it was anomalous? If so, consider either spreading your load across multiple Dopplers or increasing your buffer size. Or is it continuous? Then we'll work with you on how to get egress out of the system faster. Even though you might spread the load across multiple Dopplers, the best way to go is to look at it from the back end: if you can get data out of the system faster, you won't see this message. It's normally a lot easier to do that, and it's a lot better across the board.

[Question from the audience.] Yes, yes it is. Yes, that's exactly right: we do those writes that way for the syslog drain. If you're using something like the firehose or the syslog nozzle, that's a little different, but it's effectively the same kind of thing: the faster you can get it out, the better for you.

If you ever need to get hold of somebody, or you just want to chat with the Loggregator team, we're happy to do that. We're based out of Denver, Colorado. We're always interested in hearing feedback about Loggregator; we're kind of deep in the guts of CF, so we don't always get to meet everybody, so this is fun. Feel free to Slack us; that's usually the best way to get hold of anybody. We have some good documents, and we try to document as much as we can, at least in the GitHub repository; it's probably one of the better places if you ever want to look at the source code and documentation. We put up a decent amount of information there, so feel free to look through it and reach out to the team. I'm going to be here for the rest of the day, so if anybody wants to chat about Loggregator, let me know; I'm happy to do so. And right on time. Cool.