Hi there, I'm Patrick Stevens. I'm a software engineer at Couchbase, and I started back in February with a mandate to try and improve the observability of Couchbase. Prior to that I was working primarily in the defence domain for almost two decades, where getting logs and monitoring is quite important and can be quite difficult at times. I'm going to cover some of the issues we came across. I've got a longer blog post that covers them in much more detail, with nice worked examples, and I'll put the links up; here I'm just going to give you a flavour of what we've been doing, with some simple examples to lead you in. So: a quick summary of the problems we had and what we had to solve, what we did, and then an overview of some tips and tricks.

So how bad could it be? What do we need it to do? I'm not going to go over it in huge detail, but there are essentially two main constraints. First, we don't want to change any of our logs: there's loads of stuff already out there, loads of existing support tooling we can't change, and we have to be pragmatic. We don't have the resources to refactor all our log statements to standardize them and make it all wonderful while we're also delivering functional work. Second, we did want to integrate into standard pipelines: customers want their logs to go into their own tools, so we adopted CNCF tooling. The idea is that if we take the industry-standard approach there will always be a way to integrate, and it's a bit of future-proofing too: as things change, there's a migration path from the current incumbent to the next, rather than a bespoke solution you'd have to replace wholesale with something completely different.
Just on the Couchbase side: the container that runs is a single container, but it's multi-process, writing logs into different files in different formats. None of these logs go to standard output by default, and there's additional tooling that goes in and extracts them. It also has to run on-premise as well as in Kubernetes, which is one of the other reasons we went with Fluent Bit, coming as it does from a nice embedded background. There are other constraints too: resource and security constraints are vital in some customers' domains, financial or healthcare for example, and we need to support different endpoints. We want something that can send to S3, to cloud analytics, to Datadog, whatever you want, without us having to write that code; then if a customer wants a bespoke endpoint we can say "this is how you do it", and it's just a configuration option.

Now I'll cover what our logs look like. It all comes from the open-source repo, which is linked there, and I use these examples for continuous integration. There's a test directory with all the log material in it, with known input and expected output. This is our audit log, or an example of some of its messages: a simple JSON format, so it was one of the first ones we did, because it worked well. And here's the expected output, so during continuous integration I can run this and just diff against it, which makes a very simple, quick check.
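To make that check concrete, here's a minimal sketch of the idea (the field names in the sample records are invented for illustration, not the real audit schema): parse each line of actual and expected output as JSON and compare them, optionally ignoring volatile keys such as parse-time timestamps.

```python
import json

def logs_match(actual_lines, expected_lines, ignore_keys=()):
    """Compare two JSON-per-line log outputs, ignoring volatile keys."""
    if len(actual_lines) != len(expected_lines):
        return False
    for actual, expected in zip(actual_lines, expected_lines):
        a, e = json.loads(actual), json.loads(expected)
        for key in ignore_keys:
            a.pop(key, None)
            e.pop(key, None)
        if a != e:
            return False
    return True

# Hypothetical audit-style records that differ only in their timestamp.
actual = ['{"timestamp": "2021-10-12T10:00:01Z", "name": "login", "id": 8192}']
expected = ['{"timestamp": "2021-10-12T09:59:58Z", "name": "login", "id": 8192}']
print(logs_match(actual, expected, ignore_keys=("timestamp",)))  # True
```

In the real pipeline this is just a diff in CI; comparing parsed JSON rather than raw text has the side benefit of being robust to key ordering.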
We've got some more logs. This is a Java one, and as you can see, someone decided four characters was enough for the log level. We've got some standard Java stack traces, not too bad, and then a humongous Java thread dump. That's one log statement, and it must stay one log statement; as you can see it's quite long, and how useful it is I'm not entirely certain, but we can have a look.

This is one of the problematic ones: the Eventing log, which has multiple formats in the same file. You can see there are different timestamp formats, the log levels are in different cases, one of the lines doesn't have a log level at all, and there's some multi-line content as well. Those three lines cover several of the problems we have to solve in a single multi-format log, and in fact I ended up just ignoring the timestamps and using the time of parsing. Here's a similar log: again a different timestamp format, log levels all in uppercase this time, but one line without a level, and some bigger multi-line sections. It's very similar to the previous log, and those two are parsed with the same regex, although they actually go to different streams.
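To sketch how you can cope with a multi-format file like this (the line formats and level spellings here are invented to mirror the problems above, not taken from the real logs): try a list of regexes in order, default the level when a line doesn't carry one, and normalize the variant level spellings into one constrained set.

```python
import re

# Two invented line formats sharing one file: an ISO timestamp with a level,
# and a slash-date timestamp with no level at all.
PATTERNS = [
    re.compile(r'^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\S*)\s+'
               r'\[?(?P<level>[A-Za-z]+)\]?\s+(?P<message>.*)$'),
    re.compile(r'^(?P<timestamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+'
               r'(?P<message>.*)$'),
]

# Map variant spellings (four-character Java levels, odd cases) to one enum.
CANONICAL_LEVELS = {'DEBU': 'DEBUG', 'WARNING': 'WARN', 'ERRO': 'ERROR'}

def normalize_level(level):
    upper = level.upper()
    return CANONICAL_LEVELS.get(upper, upper)

def parse_line(line):
    for pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            record = match.groupdict()
            record['level'] = normalize_level(record.get('level') or 'INFO')
            return record
    # No pattern matched: keep the raw line, mimicking the tail plugin's
    # default behaviour of passing it through under a catch-all key.
    return {'log': line}

print(parse_line('2021-10-12T10:00:01.123Z debu Connection pool resized'))
print(parse_line('12/10/2021 10:00:01 starting service'))
```

First-match-wins keeps the configuration simple, at the cost of pattern order mattering; the catch-all branch means a format you've never seen still flows through rather than being dropped.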
Here's one of the logs from our large Erlang code base. This is where you generally get a lot of multi-line content, but one good thing is that there's a nice regex for the start of each line. As you can see, though, even in this quite small example, the multi-line entries can cover quite a few lines and have embedded data in them. I had some problems with some of these Erlang ones, mostly on the regex side. The next one, another Erlang log with more multi-line content, had a problem where the first-line parser couldn't handle a closing bracket followed by a newline, so I had to tweak it slightly. There were little problems like that throughout the work.

The other thing we have is rebalance reports. These are big JSON documents, 447 kilobytes in that example, all on one line, and one big problem is that there's no newline at the end. The tail plugin doesn't flush until it hits a newline, so I had to do some extra work to pre-process these logs and basically add a newline, which is a bit frustrating, but that's where we are: we couldn't change those logs, and they're actually in a read-only mount point as well.

So what did we do? I'll give you a summary of how we got through it. One of my first jobs was to add testing: we were handling quite a few different logs, and I wanted to make sure that as we iterated we weren't breaking anything that already worked. And one of my key points: if you can avoid writing brittle regexes, do. It's much easier to standardize your logging on the log-producing side than to write some horrible regex that will probably break the next time someone changes the logging.
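The pre-processing step for those reports can be sketched like this (the function names and staging layout are mine, for illustration; the real watcher does considerably more housekeeping): copy the report out of the read-only mount, then append a newline if the file doesn't already end with one, so the tail plugin will flush the final line.

```python
import os
import shutil

def ensure_trailing_newline(path):
    """Append a newline if the file doesn't already end with one."""
    with open(path, 'rb+') as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return
        f.seek(-1, os.SEEK_END)
        if f.read(1) != b'\n':
            f.write(b'\n')

def stage_report(source_path, staging_dir):
    """Copy a report out of the read-only logs mount and fix it up.

    The staged copy is what the log shipper tails; the caller is
    responsible for cleaning staged files up afterwards, so the
    temporary copies don't pile up.
    """
    staged = os.path.join(staging_dir, os.path.basename(source_path))
    shutil.copyfile(source_path, staged)
    ensure_trailing_newline(staged)
    return staged
```

Most of the real complexity is in the lifecycle management around this, deciding when a report is complete, when a staged copy can be deleted, and not re-staging files that haven't changed, rather than in the newline itself.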
I link out to my blog post there, but the trick is this: the tail plugin puts unparsed lines under a default key called log, so if your regex never produces a log key, you can easily tell when it fails to parse. Fluent Bit doesn't really show you regex failures without digging into metrics and the like, so none of my regexes use the log key; if I see one in the output, I know the regex has failed. I started with the audit log you saw, the simple JSON one, so no regexes: a good first starting point. It was also what some of the security-conscious customers wanted, to forward all their audit logs to different endpoints for their security accreditors. It put the whole framework in place, and then I could iterate.

I added a load of regression testing. As you add more logs you find more problems, and there are corner cases in all of them; those aren't real logs, they're examples of the worst bits of each log. That lets us iterate, and as we upgrade Fluent Bit we can check we're still parsing everything, because regression testing did uncover a few parsing issues. Use regex testers; there are a lot of them. Calyptia have just added one which is really good, but it wasn't available when I was doing this; Rubular is the one usually recommended. They work well to validate your regex, but translating it into the Fluent Bit configuration sometimes needs extra escaping, and sometimes things just don't work. Also, take examples of those logs and do everything locally; don't debug in a massive Kubernetes stack.
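As an illustration of the no-default-log-key point, a parser along these lines (the name and regex are simplified examples, not one of the real Couchbase parsers) only ever emits timestamp, level and message fields, so any record that turns up downstream still carrying the tail plugin's default log key is a regex that failed to match:

```ini
[PARSER]
    Name   couchbase_simple
    Format regex
    # Named groups only, and deliberately no <log> group. Timestamps are
    # left unparsed here (no Time_Key/Time_Format), so Fluent Bit stamps
    # each record with the time of parsing instead.
    Regex  ^(?<timestamp>\S+)\s+\[?(?<level>\w+)\]?\s+(?<message>.*)$
```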
So we added a watcher process. We had a requirement for dynamic configuration reloading, which is something Fluent Bit doesn't currently do: when the configuration changes, or credentials or certificates change, we want to reload it. I reused the KubeSphere approach, which is now the Fluent Operator approach: a simple watcher process that restarts Fluent Bit when the configuration changes. That was really quite simple and straightforward, and it brought other benefits later on, like injecting extra metadata as variables, and it let me do the rebalance-report processing I touched on before. One other key requirement for our deployment is that we never restart the pods unless we absolutely have to, because the databases really don't like that. If changing the log configuration, say to send logs somewhere else or rotate credentials, meant waiting for your database to restart, that wouldn't go down very well, so it's all about localizing those changes. The KubeSphere approach was pretty straightforward: I took their code, extended it, added the Couchbase specifics, and we've had it deployed for about six months now; it's all going pretty well.

Then a general overview of what else we did. The logs are wildly inconsistent, but they follow a similar pattern: every log has a timestamp, and pretty much every log has a level (info, warning, and so on; some didn't, like the audit log, but I decided those get one fixed level for all of their statements). The message is then everything that isn't those other bits. Keep it simple and generic, so we can parse everything with it. We then tag by file name, so we can do processing and routing to different endpoints per file, which was quite key, particularly for the audit log: a lot of the time people just want to see logs, but audit logs may have to go to a security monitoring tool or similar.

We added common processing as well. You saw the levels were wildly disparate, info written in different cases and so on, the four-character DEBU rather than DEBUG, so I added processing to standardize them into a constrained enum, which works really well downstream: you can filter on one value rather than all its variants. The same goes for the file name. The full name includes the whole path, which is a pain when you're using it in Grafana, so I added extra processing to chop it up into smaller pieces we can filter on. And we added extra metadata: our pods include configuration information, saying this pod is configured like this, here are some variables, and it's nice to inject that into the logging, along with things like the version you're running, so you can do more analysis when the data comes through. This is just showing you the file name chopped from that massive path, which is hard to see in Grafana, into individual file names, with the original still all there; you can also see the levels have all been constrained to a single set of values.

Now some tips: what went well, what didn't, and maybe some bodges too. What went well? Community support was a great one for me. I've got it at the end, but a lot of what I've shown you, and how I did it, came from other people's examples. Fluent Bit just worked. We only had one issue, found during regression testing while upgrading, with that humongous JSON: it failed on the really humongous documents after a pause, but that was fixed very quickly, because I had a nice regression test for it straight away. I've contributed all these parsers back as well. The material we took from KubeSphere, now in the Fluent Operator, is a great way to handle dynamic configuration reload, among a lot of other things. For the log-level standardization, someone on Slack pointed me at the technique: it's quite a long set of filters, but they're very simple and just apply sequentially. Another one: Couchbase does redaction of its logging, basically hashing strings that could be sensitive, and I demonstrated how you could do this in flight using Lua filters. Lua filters seem to be the answer whenever you can't do something any other way, though it might be better to start with a simpler approach where one exists. So everything just worked; it was a good experience, and in fact one of our customers requested this capability while I was developing it, and they were already using Fluent Bit, so it fed in seamlessly. I'd definitely say contribute and engage with the community; I've had a really good time, and people have really helped me out, with things like the levels. One suggestion that came up recently, which I've not added yet, is numeric versions of the levels, which make alerting and querying much easier in Grafana, Prometheus, or whatever stack you're using.

What went wrong, or what to do when things go wrong? I touch on this in the blog post linked at the bottom. Regexes are a nightmare: you had one problem, you wrote a regex to solve it, now you've got a hundred problems, so try to avoid them. There are lots of tools to help you out, but none of them are perfect; the Calyptia tool is probably the closest, because it runs the same regex engine under the hood (I think it runs Fluentd in a container to do it). And simplify. I see people in the community channels saying "here's my output in Kibana and it's not working", and the answer is: no, start right at the beginning. Test locally, with known input, to standard output, and see what it's doing there before you add the whole downstream stack, which can be pretty difficult to manage. Is there a problem with the destination? Is the data not being sent at all? It might be unrelated to Fluent Bit, so confirm what you think is true at the earliest point you can. And look at the logs straight out of Fluent Bit rather than somewhere downstream: they'll tell you things like the path in your environment variable being wrong, your wildcard not matching, or missing permissions, all fun things you won't discover until you look. To be honest, I don't think I had any real problems with Fluent Bit itself; it was mostly what I was doing, or my misunderstanding of the configuration, and I've tried to improve the documentation a little as a result. It's the same principle as never suspecting a compiler bug when it's probably your code that's wrong: there are occasionally specific problems with specific releases, but generally it's been working, and it was just me that had it wrong.
Bodges. The rebalance reports are a massive bodge; hopefully it will be fixed in Couchbase Server at some point in the future, but we still have to support the existing versions. Essentially there's a huge amount of complexity just to manage adding a newline to a file. We mount the logs volume read-only to make sure we can't change anything, so I have to create a temporary copy, add the newline, and manage those copies, making sure I don't fill the disk up with them and cleaning them up when I'm finished. It's just a lot of complexity. Multi-format logs I touched on before: they have wildly different timestamps, and timestamps don't parse properly with a single regex. You can write a regex to extract the timestamp, but it then has to conform to a single strftime-style format (%Y and so on); you only get one format per parser. In the end, since I don't care much about the difference between when a line was logged and when I parsed it, I just ignore the timestamps and use the time of parsing by default. That does affect my CI/CD: for those logs I have to ignore the timestamp value when I diff, whereas the other logs always parse the real creation time and can be diffed directly. I also wanted to add optional variables, but with the wrong filter Fluent Bit just exits if they're not set. I switched from the modify filter to the record_modifier filter, which lets me say: if any of these variables are present, put them in the output, and just ignore the warnings about the ones that aren't set. It works really well. The variables are set as environment variables by my watcher via the Kubernetes Downward API, and once Fluent Bit starts up it just keeps injecting them; it's actually helped me out quite a bit with debugging.

If you want to see more, well, my time's up, so let's move on to future work, or rather what we're doing now. One thing I wanted to touch on: I went to FluentCon EU, and there were loads of suggestions and feedback from that, which was really good. I had been a bit concerned about performance, but loads of people were using Fluent Bit this way, so I've adopted it more heavily. We're rolling out on-premise deployments now, basically using my container and configuration and deploying them on-premise. I want to add performance monitoring to CI/CD, so that when I make a change, with known input, I can see the impact of that change on processing performance. That's a bit tricky, but Calyptia have also introduced a custom plugin approach that looks quite useful for feeding that back. There are new multi-line, multi-format options in 1.8, though I'm still using the 1.7 functionality, and there are loads of good ideas and customer requests in the backlog that we're slowly working through. This is just the first part of our overall observability work; I'm hoping to get tracing and metrics in there as well.

For more details: the repo is completely open source under the Apache 2 license, so help yourselves. The Couchbase Operator documentation goes into a lot more detail, and that blog post on tips and tricks should be quite useful: it's based on this presentation but expanded with a lot more detail, so you can work through the specifics rather than just hearing me talk through them. So thanks; I think we'll be moving into Q&A now. I'm presenting virtually, but I'll be joining shortly. There are some contact details as well, so feel free to ping me, and I'm on the Fluent Slack too. That's where we are, and I'll move to Q&A now.