Great, I'm excited to walk you all through Log Cache. I've got a bunch of content, so I'm going to get started. Why are my slides not... there we go.

Just a quick overview of our agenda: I'll start with the reasons and motivations for why we built Log Cache. Log Cache was built to solve some specific problems, and we'll take a look at what those problems are. We're going to do some live demos, so wish me luck, hopefully the Wi-Fi cooperates. I also want to do a bit of feature comparison between the user experience that Log Cache can provide and the user experience that Loggregator provides. The reason for those feature comparisons is to look forward a little bit and talk about some architectural plans we have for Loggregator.

The place to start is an issue that exists right now, and this issue is especially apparent if you have a large foundation. When you execute any of these three commands, you cf push your application, you request recent logs from your application, or you want to see the container metrics for your application by running cf app, that sends a request to the traffic controller. The traffic controller does the naive thing here: it goes to all of the Doppler components, retrieves the logs or the metrics you're interested in, brings them back into the traffic controller, and then has to order them or select the most recent metric, depending on your request. You can see how that doesn't necessarily scale. It also has to wait on the network connection to each Doppler.

If you're on a large foundation and you've ever seen cf app return zero for container metrics, that's the Loggregator system triggering a circuit breaker. The metrics in that case aren't actually zero; it's that we couldn't make sense of the responses we got from the Dopplers, and to avoid a timeout and an error, especially in cf push, we return zero in that circumstance.
We added that patch maybe a year ago, because we saw those cf push failures starting to creep up on large foundations. But when we added it, we knew this was something we needed to go back and address properly.

So here's the key evidence we had that the problem was out there. Zero being returned by cf app is one symptom. It's also the case that on a large foundation, when you run cf logs --recent, the chances of missing logs increase with the size of your foundation: there's a good chance we didn't get to go through all the logs and return all the recent ones to you, so the chances of that data set being incomplete go up.

We also found, as we started thinking about how we were going to build a new interface to retrieve container metrics, that developers would have an easier time interfacing with the Loggregator system if they had a RESTful interface. The Loggregator system is really based on a push mechanism, and if you've ever developed a nozzle for it, you'll know that managing a push API can be challenging. One of the things I've said about Loggregator: Loggregator is great, all you have to do is be able to consume an infinite amount of data forever. That's the case with any push API. But we had this theory that if we provided a RESTful interface, it would be a lot easier for app developers to go in there and develop an application or an automation that reads from the cache and performs an action.

We also had this hypothesis that we could improve the command line experience. The command line experience for logs and metrics is pretty limited on Cloud Foundry. Like I mentioned, really the only ways you can interact with the logs and metrics for your application are by running cf logs, or, and a lot of people don't actually realize this goes through Loggregator, the cf app command, which also pulls from Loggregator to get those container metrics.

As we embarked on developing Log Cache, there's an anecdote I wanted to call out. We had this hypothesis around the CLI, and one of the things we found to be a really effective tool as a team was to think about our project as a full-stack project: to develop not only the CLI but also a standalone BOSH release that we could put into our production environment. Our production environment, which is Pivotal Web Services, runs cf-deployment, and Log Cache at the time was not part of cf-deployment, so we went ahead and deployed a new release called log-cache.
We configured it with the appropriate certificates and went the full CI route, like Pivotal likes to do: when I pressed accept on a story in our backlog, the pipelines would pick up the latest commit from our repo and deploy it to production on Log Cache. We were the only users of that system, so we could take a few more chances than we might with an actual live production environment. But it was a really refreshing experience for us, because we got to control that full stack: developing the server-side components and then developing the CLI along with them.

So I'm going to go ahead and pull up my live demo. The demo I'm going to give is of the Log Cache CLI. You can Google the Log Cache CLI and find the install commands; I've already got it installed, and you'll see that I'm already logged into a foundation with a couple of apps.

The Log Cache CLI gives me access to a couple of new commands. The main command I'm going to start with is cf tail. The cf tail command is designed around the Unix tail command and has a similar set of flags. Right away, when I run cf tail on this app called log-spinner, you'll notice that this looks a little different from cf logs. First, the default behavior is different: I got a recent-logs-type result without providing a flag. We think that in the default situation, most developers are interested in "tell me what just happened recently," not necessarily in opening a stream, which is what cf logs does. Again, thinking fresh about the user experience gave us the perspective of what we would do if we were designing this from the ground up.

You'll also see that instead of logs, I got a bunch of metrics for my request, and that's because this app has been sitting idle for a little bit and no logs have been produced. So I'm going to go ahead and curl the application. log-spinner is an app we use in all of our testing; it allows you to curl the app and produce some logs. Now I'll run that tail command again, and this time I'll add a type flag and just look at those logs. You'll see that I've limited the result to the log output. It captures the application logs and any of the associated CAPI logs or router logs for that application.

You can also do things like ask for a specific number of logs. It is a cache, so there's only a limited amount of data in it: I specified a hundred and only got about a dozen logs, because that's all that's available in the cache at this time.

Going through a couple more of the flags: I mentioned streaming logs, and those are always good. So let's open a stream to Log Cache, and let me curl the app again so we can watch in action what the following experience looks like. And it looks just like cf logs.
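For reference, here's a sketch of the commands used in this part of the demo. The plugin name, flag names, and envelope-type values are from my recollection of the Log Cache CLI at the time; check cf tail --help for the exact spelling in your version.

```sh
# Install the Log Cache CLI plugin from the community plugin repository
cf install-plugin -r CF-Community "log-cache"

# Default behavior: show what happened recently, then exit (no stream)
cf tail log-spinner

# Limit output to log envelopes, and ask for up to 100 of them
cf tail log-spinner --envelope-type LOG --lines 100

# Open a stream, like `cf logs`
cf tail log-spinner --follow
```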
It's a little-known fact, but the cf logs command actually batches and delays logs by about 300 milliseconds, and it does that because out-of-order logs are hard to deal with and don't provide much value. Diego puts a timestamp with nanosecond precision on each log, so sorting the logs into a perfect order is something we're able to achieve. The catch is that you have to wait, because our system doesn't necessarily know that it's going to retrieve the logs in order. So the cf logs experience, if you have the right CLI version, will wait and sort; if you happen to be seeing out-of-order logs with your CLI, definitely upgrade it. That fix got undone in the CLI at one point, so we went ahead and added it back in. Log Cache solves this problem by keeping all of the logs in order server-side, so every request to Log Cache comes back with a guaranteed order. That's just one of the improvements the interface provides.

So that's the one-for-one on how the cf tail command can replace cf logs. But let's take a closer look at what those metrics look like. I'm going to stop this and tail log-spinner again, and this time I'm going to specify a type of metrics; I'm also going to format the output as JSON and pipe it into jq. Now of course you'll see the results in JSON format, and what you're actually looking at is the underlying envelope structure that Loggregator uses. This is really the self-documenting approach of our RESTful interface for Log Cache: what you see at the command line is the same interface you can get from a web app. Log Cache accepts OAuth tokens, so you don't have to worry about mutual TLS; you can interface with it using the same RESTful patterns that most web application developers are familiar with. Like I said, that was our hypothesis: that we could develop these RESTful interfaces and an ecosystem of tools could develop around them.
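As an illustration of that point, here's a hedged sketch of reading from Log Cache's HTTP interface directly. The log-cache.<system-domain> hostname, the /api/v1/read path, and the envelope_types parameter reflect my understanding of the Log Cache gateway; treat them as assumptions and verify against your deployment's documentation.

```sh
# cf oauth-token prints a complete "bearer ..." header value
TOKEN="$(cf oauth-token)"

# Log Cache keys everything by source ID; for an app, that's the app GUID
GUID="$(cf app log-spinner --guid)"

# Read recent GAUGE envelopes (container metrics) for that source ID
curl -H "Authorization: ${TOKEN}" \
  "https://log-cache.<system-domain>/api/v1/read/${GUID}?envelope_types=GAUGE"
```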
Now I'm going to switch gears from the app developer experience and look at what an operator can do with Log Cache. I happen to be logged into this foundation as the CF admin user, and an additional command the Log Cache CLI gives me is cf log-meta. log-meta takes a look at my OAuth token and gives me the appropriate view across the entire cache. Since I'm logged in as an admin user with the doppler.firehose scope, I have access to the firehose view, so when I run log-meta I get a list of effectively every component that makes up cf-deployment. By the way, if you've ever bummed around trying to use cf nozzle: don't use cf nozzle. It can damage your Loggregator system and cause dropped messages. This system, since it's reading from a cache, is harmless to your Loggregator throughput. You can take a look at what's in the cache and how much duration you have for each of those components, and then use that same tail command to look at a component's metrics.

This is a first for the platform: bringing command-line observability to operators so they can quickly look at metrics. You don't have to bounce over to your Datadog or your New Relic; you can look at the metrics right here. One of the other powerful things is that this is the same format that's available to the app developer, so if you're developing automations, or you've developed an alerting protocol within your organization, you can apply it to your foundation components as well as to the applications, and even the service instances, that exist on your foundation.

All right, that went well; that was it for the live demo. So I'm going to jump back into slides and go through those user experiences once more.

Recent logs: we took a look at that, and it's now our first-class experience, no flags required. The old Loggregator system was actually age-limited, which was kind of a weird contract: once logs got to be, I think it was an hour old, we just started getting rid of them. The new Log Cache system is instead based on log volume: if you exceed the buffer, I think the cache size is around 10,000 envelopes, we start dropping the older logs. That means we provide fair sharing whether your app is really noisy or slow at emitting logs; we treat all apps the same.

Follow logs: we took a look at that too. I didn't do the shoot-out where we could compare them side by side, but I've done that a couple of times and it's really hard to tell the difference between Log Cache and cf logs.

We took a look at container metrics, where we've expanded the capability quite a bit. A couple more points on that: if you're using the Metrics Forwarder with a Spring app, you can get those metrics into Log Cache as well. If you bind the Metrics Forwarder to your Spring app, all the custom app metrics you produce through the Spring actuators will also appear in Log Cache (the binding step is sketched below, after this recap).

We took a look at component metrics; that was the log-meta view, where we looked at the Dopplers and that new command-line experience. I don't really see it as a replacement for the firehose, but it does solve some of the same problems, so there's a little bit of parity there with how component metrics are sent through the firehose.

And we didn't take a look at service instance metrics; I didn't have a demo set up for that. But service instances that use a particular piece of tooling called the service metrics forwarder will also get their metrics sent in such a way that the app developer has access to them. That's a new feature as of the recent cf-deployment versions that include Log Cache.

As I mentioned, we had this hypothesis that developing a RESTful interface would empower app developers to build more automation, that a RESTful interface would be easier to deal with for things like autoscaling, alerting, or charting. To an extent, that turned out to be true. The Pivotal autoscaler that we ship uses Log Cache; it was the first product to go out the door and use Log Cache in a production capacity, it's worked great, and it was really easy for the team to develop against.
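On the Spring point above, here's a minimal sketch of the binding step, assuming a Metrics Forwarder service instance named my-forwarder already exists in the space; the instance name and the marketplace offering it comes from are hypothetical here.

```sh
# Bind the (hypothetical) Metrics Forwarder instance to the Spring app
cf bind-service my-spring-app my-forwarder

# Restage so the app picks up the new binding credentials
cf restage my-spring-app
```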
But what we found as we rolled out the first iterations of the RESTful interface is that the app development teams wanted a more robust way to query this time-series data. When we started scratching around and looking at what was out there, we found that time-series storage and querying is something that is well established in our cousin community, the Kubernetes community.

So an experimental feature of Log Cache is the ability to query it with PromQL. If you're not familiar with it, PromQL is the Prometheus query language, a query language that is especially designed for time-series data. When you're querying time-series data, it's pretty common to want to apply a standard set of calculus or arithmetic to the results of your query. A really common pattern is that you don't want to look at a counter going up in a slope on a hill; you want to look at the rate of change of that counter. For example, you want to know the rate of drops occurring in the Loggregator system, not the total number of drops. PromQL provides a query language that will execute those functions (a sketch of such a query appears below, after this section).

That hypothesis is just now being borne out in some of our newest products at Pivotal. This is a screenshot from the PCF Metrics team and the work they're doing with Log Cache: a real dashboard of basic container metrics being pulled out of Log Cache via PromQL and charted using some existing charting libraries.

It also enables some pretty cool applications. I want to put this one out there; it's a credit to one of my team members who put it together. I won't tell you where to find it, but if you're good at Google you can find it, and I'll also tell you it's not necessarily approved for production. It's a pretty cool routing technique that uses PromQL as a definition language for determining a canary strategy. By that I mean you're able to define a plan in terms of a PromQL statement, and that statement might say something like: as the rate of 200 responses increases, start routing traffic to the new application. This allows you to route traffic only if the new version of the application is successfully returning responses. That's just the tip of the iceberg: you can do all kinds of plans, you can do 50/50 traffic, you can put a custom metric into your application and route based on that. It's one popular application of PromQL, and it's a benefit we're starting to see that we can bring to the Cloud Foundry community. Chip mentioned building bridges, and PromQL is a really powerful bridge to standardization of time-series data on Cloud Foundry.
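Here's a hedged sketch of the rate-of-drops example as an instant query against Log Cache's experimental PromQL endpoint. The /api/v1/query path mirrors the Prometheus HTTP API, and the dropped metric name and source_id label are my assumptions about how Loggregator metrics surface in Log Cache; verify both against your deployment.

```sh
# Rate of change of the Loggregator 'dropped' counter over the last 5 minutes,
# scoped to the doppler source ID
curl -G -H "Authorization: $(cf oauth-token)" \
  "https://log-cache.<system-domain>/api/v1/query" \
  --data-urlencode 'query=rate(dropped{source_id="doppler"}[5m])'
```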
So I'll admit, as we got all this ready and started this experiment with PromQL, I found myself saying: wait a second, did we just build a time-series database? And the answer is yes, we did build a time-series database. Of course, I'm familiar with the adage "don't write your own database," so there was a part of me that felt, wait, what did we just do? I felt a little bit hoodwinked by my engineers. But the important question to us was: did we solve the problem?

Going back to the reasons why Loggregator, excuse me, Log Cache, was built: it was really those original problems of container metrics and recent logs. There's still a little bit of integration being untangled between the Cloud Controller and the cf CLI to make sure that all the hooks that currently exist to talk to Loggregator talk to Log Cache instead. But in terms of those two problems, we can confidently say we've solved them. As for testing our hypothesis, is a RESTful interface a powerful building block?, I feel we achieved positive results there as well. It took some twists and turns we didn't expect: I definitely didn't go into this thinking either that we were building a database, or that we were building a database that was going to speak the query language of Prometheus. But as we evolved it, we realized that speaking the query language of Prometheus has huge advantages for services that want to be portable across both Cloud Foundry and Kubernetes. So it was a little bit of an unintentional circumstance, but more and more we're starting to look like we have a component that acts like the Prometheus server.

That's all well and good, but if you take a step back from the problem of Loggregator and large foundations, we're still hearing a lot of concern from the community. I think almost everyone who booked time on my schedule at this conference came to me with concerns because they've reached 40 or maybe 50 Dopplers, and they're starting to see that it's really hard to keep Loggregator within the recommended SLO of 99% reliability.

There's a reason for that. One of the problems with the shared architecture of Loggregator is the number of connections the Dopplers need to manage, which follows this formula: connections = (number of log API VMs, i.e. your traffic controllers) x (number of Doppler VMs) x (syslog drains + nozzles + open log streams + 1). The plus one is in there because Log Cache is itself kind of like a nozzle. If you do the math: with 40 or 50 Dopplers and 10 traffic controllers, that's 500, times a number that could be as large as the number of applications on your platform, plus a number that could be how many app developers are using your platform right now. The number of connections can get into the tens to hundreds of thousands, and Doppler garbage collection can't keep up (a worked example follows below).

So if you are getting into that 40-to-50-Doppler range, there are some scaling techniques we've seen help. The mantra before has always been "add more Dopplers," but as you get to those higher numbers there are different knobs you can start twisting so you don't stress this equation too much.
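To make the scaling pressure concrete, here is that formula worked through with the speaker's Doppler and traffic controller counts; the drain, nozzle, and stream counts are hypothetical placeholders chosen only for illustration.

$$
\begin{aligned}
\text{connections} &= (\text{log API VMs} \times \text{Doppler VMs}) \times (\text{drains} + \text{nozzles} + \text{streams} + 1) \\
&= (10 \times 50) \times (150 + 40 + 10 + 1) \\
&= 500 \times 201 = 100{,}500
\end{aligned}
$$

Even with modest per-term numbers, the product lands in six figures of connections, which is the garbage-collection pressure described above.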
But the shared architecture is really at the heart of what's causing this problem, so we wanted to think about why we have a shared architecture at all. When we thought about our feature set, we didn't find anything in it that actually required a shared architecture. Despite the name Loggregator, there's not actually much aggregation service that the Loggregator transport mechanism provides; the aggregation is left to the downstream consumer.

One of the places we saw as a starting point for a shared-nothing architecture is syslog drains. Go back to that equation: I mentioned that the number of syslog drains could be as large as the number of applications on your platform, and you can even set up two or three on a particular application. So that is a way we can dramatically change the numbers in that equation. (For context, the current per-app drain setup is sketched below, after this roadmap.)

Something we just incepted, I sent the feature proposal to the cf-dev mailing list maybe two weeks ago, is what we're calling agent-based syslog draining. The concept here is similar to systems like Fluentd, where the agent itself talks directly to the end destination. There's some tricky engineering here around managing state between CAPI and the Diego cells where the agent is going to live, but we think we have a handle on how to manage that state in such a way that we won't take out the Cloud Controller. Just taking away the syslog drains is going to take a ton of stress off the Loggregator system, so we think that will have huge gains on its own. But it's not enough for us to completely move away from this shared architecture, and that's why I was highlighting the feature parity we have between the Log Cache CLI and cf logs: once we move the syslog drains to an agent-based approach, the next step will be for us to send things to Log Cache using an agent-based approach as well.

When you look at the features now in terms of a shared-nothing architecture, this is the full feature set of what Loggregator does. Syslog drains: we're going to start there and move those to an agent-based draining approach. We're under development on that now, so in the next three or four weeks I'm hoping we'll start some private tests; I know there are some eager volunteers who would also like to run those tests. That's something we're in flight on now, and we're hoping to land it in the next major cf-deployment release. I didn't really mention this, but we also plan to include a configuration that doesn't require an app developer to specify the destination, and instead allows the operator to specify a single destination for all the logs. That's popularly consumed in the open source world today via the firehose syslog nozzle, and we can provide that same log transport using the agent-based approach.

We talked about recent logs and how the Log Cache component can provide the equivalent behavior there; the same goes for follow logs and for container metrics. Once we've really done those first five items, I think we'll have effectively reduced that equation to hundreds of connections.

The last piece in that equation is what we're going to do about metric delivery. We think an agent-based approach makes sense for metric delivery from components as well. That said, we also think we're a long way from reaching the scaling limits of just transporting metrics from our components through the firehose, and there are a lot of nozzles out there that we would break if we completely changed the metric architecture. So that one's a little bit further down the road; like I mentioned with Fluentd, there may be a plugin architecture we could go for there. But we think these next five items on our roadmap are going to really help with the large-foundation challenges.
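On the per-app drain setup mentioned above, this is the standard cf CLI flow for what agent-based draining would eventually carry; the destination URL here is a placeholder.

```sh
# A user-provided service whose only job is to carry the drain URL
cf create-user-provided-service my-drain -l syslog-tls://logs.example.com:6514

# Bind it to the app; logs for log-spinner now flow to that destination
cf bind-service log-spinner my-drain
```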
It's also the end of an era in terms of my time on Loggregator. Some of you may already know Johannes: Pivotal has nominated Johannes to take over as PM for Loggregator. He's actually been working on Cloud Foundry since 2013, so he's not new to the community, and he's been living in Germany for the last couple of years, so many of you may already know him. He's been doing an outstanding job; I couldn't be handing this over into better hands. Really, last but not least, I just want to say thanks to the community, and for the opportunity. I have really enjoyed working on Loggregator and have learned a ton. That's it. Any questions? Any questions for Adam?

Audience: There is a notion of a counter... a counter, yes, counter. Sorry for my Basel accent. Do you need any change to your code to get this counter, with something like Micrometer.io?

Adam: There are some compatibility challenges between the counter and Micrometer. We recently released a change to the Loggregator agent, and I want to make sure I get this right: I think it will now receive only a total and ignore a delta. I think that is the specific change. You'll need the latest Loggregator agent release to consume it, and that just came out recently, within the last month or so. I believe that addresses the problem; that was the intent of the change.

Moderator: Okay. I think we have time for one more.

Audience: Hi, thanks for the talk. Since you've built a new database that supports PromQL, did you consider using TSDB, which is the storage engine that works fine with PromQL in Prometheus?

Adam: We're definitely starting that experimentation now. Like I said, we came to the realization: wow, we just built a database, and not only did we build a database, we made it interoperate with a query language that is already popular in the observability community. So we're definitely looking at bringing Prometheus onto the platform. There are subtleties to things like counters that do have some impact on how we'd think about doing that, but we definitely see a future where you could potentially bring multiple different storage backends for your metrics, whether that's Prometheus, InfluxDB, or a proprietary data service.

Moderator: Okay, I think time's up, unfortunately, but Adam will be hanging out.

Adam: Yeah, hit me up with questions. Thank you very much. Thanks.