Hello, good afternoon. My name is James, and I'm here to give a talk about a project where we moved a rather interesting system of ours over to Fluent-based logging. A quick background about myself: I work at Asana, where I'm a technical lead on the input topology pod. I've been there since 2019, mostly working on stability and scalability.

A quick background on the system we had to work with. It wasn't quite logging per se; it was event logging, which is much the same thing. The goal was to let developers working on the application send events and reliably get them to where they wanted them to go, so things could be processed elsewhere. Think notifications, for example: when you create a task in Asana, you want to send a notification out to a bunch of people, and that happened through this system.

In effect, the system worked like this. We had a bunch of EC2 nodes, and those nodes were each running some set of Node.js processes. Those processes would write their events into a sink, the component we'll go into detail on later, and the sink handled writing them to Kinesis. The reason we did it this way initially is that the Node processes didn't have a very long lifetime, so we cycled them relatively often. Having an external process here gave us durability and some resilience during network outages, and so on.

The API we exposed to developers for this was simple: you call the event logger with the category you want to log and pass in the data. Additionally, you could configure where you wanted your events to go: you passed in the name of a Kinesis stream and set the event category to direct to it; otherwise, everything just went to the default stream, which was pretty cool.

This system as a whole processes 3.3 billion events per day, and almost all of those events are genuinely important, so we needed to be really sure about the durability aspect here. Out of this, we gathered four requirements for the system. It had to be flexible: we needed to be able to send basically anything we wanted to it, with minimal effort from developers. It had to be high performance: able to handle the load and get those events into Kinesis as fast as possible. It had to survive Node process crashes and outages, like an EC2 node going down, or even an availability zone. And it had to be easy to configure, because we expected basically anyone at the company to be able to do it.

In the past, we had this architecture built out with Scribe. For those of you not familiar with Scribe, it's a system developed by Facebook back in, I think, 2009, and they killed it off back in 2014. But we still had it, and it was finally time to get rid of it. Scribe handled the buffering part and the receiving of messages, but it didn't have the ability to pump things into Kinesis.
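To make that developer interface concrete, here's a minimal sketch of what an API like this could look like. The names (EventLogger, logEvent, configureCategory) and the routing details are hypothetical stand-ins, not Asana's actual implementation:

```typescript
// Hypothetical sketch of a developer-facing event logging API like the one
// described above; names and details are illustrative, not Asana's code.

type EventData = Record<string, unknown>;

class EventLogger {
  private routes = new Map<string, string>();
  constructor(private defaultStream: string) {}

  // Route a category to a specific Kinesis stream; unconfigured
  // categories fall through to the default stream.
  configureCategory(category: string, kinesisStream: string): void {
    this.routes.set(category, kinesisStream);
  }

  // Send an event under a category; the payload is any JSON-serializable data.
  logEvent(category: string, data: EventData): void {
    const stream = this.routes.get(category) ?? this.defaultStream;
    // In the real system, the event is handed to a local sink process that
    // owns the Kinesis writes; here we just show the routing decision.
    console.log(`-> ${stream}`, category, JSON.stringify(data));
  }
}

const eventLogger = new EventLogger("default-stream");
eventLogger.configureCategory("task.created", "notifications-stream");
eventLogger.logEvent("task.created", { taskId: "12345", notify: ["user-1"] });
```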
What you could do there was implement a custom Scribe server to support this protocol, and we had implemented a custom JVM application called the Scribe Kinesis sink, which handled taking those log messages and pushing them out into Kinesis. But as I mentioned, we needed to move away from this. Mostly, this was part of a bigger project to move this whole system into Kubernetes, and Scribe being dead since 2014 made that very hard: it was a C binary, it needed to be compiled specially, it needed a bunch of work. And a JVM application doesn't really help when you're trying to run very small services and small pods. So we decided to get rid of it.

When looking at our options, there were effectively three that we really considered. We could let the Node process write directly to Kinesis; there is a Kinesis library for Node. We could write log files out and then use Fluentd or Fluent Bit to tail and parse those files and write them into Kinesis. Or we could use the forward port. We looked at the pros and cons of these approaches and ended up selecting Fluentd with the forward port.

The reasons were mostly these. Direct to Kinesis doesn't survive Node process crashes: if there was anything left in the process that you hadn't gotten out to Kinesis, because of a network outage or anything else, it was going to be lost. So that was more or less off the table. The log-files-plus-Fluentd approach was a little more enticing, and arguably simpler. But the main thing that made it hard for us to work with was the low ease of use: going back to the earlier slide, we really wanted that category configuration to be able to do anything, sending various tags and so on, and with file tailing, at the time at least, every file got exactly one tag. We would have needed a separate file for every single category, which was difficult. Using the forward port and the Fluent protocol allowed us to send these messages directly to Fluentd or Fluent Bit with the category encoded as the Fluent tag. So that's why we went with that. It has some cost to durability, because now you need to make sure the message gets to Fluentd before the process dies, but hopefully that's mitigated a little by the process being local to the EC2 node.

So that's the approach we went with. Let's talk about how we built it. We set this up initially using Fluentd, simply because we were already using Fluentd for logging, so it was pretty natural to just drop in a forward port config and see what happens. We set it up with disk buffering, of course, so that when it couldn't flush events, it would save them and try again later. And this turned out to be a pretty good swap for the existing system.

So as you passed a log in, it got tagged as a Fluent record, formatted according to the Fluent protocol: on the slide, this is the tag as it ends up, this is the timestamp it gets, and this is the log message itself. And for our deployments, we had a template configuration, which we used to generate the Fluentd configuration from our JSON setup.
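For illustration: in the forward protocol's simplest mode, a single event is a three-element array of tag, event time, and record. With a hypothetical category and payload, it looks roughly like this (shown as JSON; on the wire it's msgpack):

```
["task.created", 1667245261, {"taskId": "12345", "notify": ["user-1"]}]
```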
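And here's a minimal sketch of the kind of Fluentd config this describes: a forward input plus a Kinesis output with a file buffer. The port, paths, stream name, and region are placeholders, and the output assumes the aws-fluent-plugin-kinesis plugin is installed:

```
# Minimal sketch: forward input -> Kinesis output with disk buffering.
<source>
  @type forward
  port 24224
</source>

<match **>
  # Provided by the aws-fluent-plugin-kinesis gem.
  @type kinesis_streams
  stream_name default-stream
  region us-east-1
  <buffer>
    # File buffering: chunks survive restarts and are retried on failure.
    @type file
    path /var/log/fluentd/kinesis-buffer
  </buffer>
</match>
```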
In order to test this, we really didn't want to lose any events. So what we decided to do was set this up side by side: we still had Scribe and everything running, but we added this config to Fluentd, and we also made a test Kinesis stream where we wrote everything. So every single event that the Node process had, it now wrote both to Scribe and to Fluentd, and all Fluentd did with it was put it into a throwaway Kinesis stream. What this did for us was that it allowed us to roll everything out exactly as the system was supposed to work, but without relying fully on it. And through this, we caught quite a few bugs.

The big one we ran into was that once we were at about 500 events per second, we saw the graph drop events like this. The metric here is emitted by the Node process, so: as perceived by the Node process, how many events did I fail to send out when I died? We saw random spikes of 40 events, meaning that when a process died, there were 40 events left in its queue that never even made it to Fluentd.

In inspecting this, one thing we realized was that this system had really spiky loads. Some developer would write an action that suddenly generated 10,000 events, and all of a sudden this one Node process would need to send thousands of events into the Fluentd cluster, and then it would drop to zero right after. When we looked into why we were losing these events, it turned out that the way the library had implemented its send queue, it wrote one event, one Fluent record like this, at a time, and it would wait for that entire write to complete, for the operating system to report that the write had fully been written to the socket, before it could send the next event. That had quite a huge performance cost.

So we basically decided to make a faster library. We implemented a different library with a somewhat different architecture, one that lets you control how many events you write at the same time. The forward protocol supports three or four different ways of sending events: you can batch them together, or you can send them one by one, and we wanted to be able to test which one worked for us. It also supports things like acknowledgments, and a bunch of other stuff. If you're interested in this, I highly recommend checking it out.

And with that, Node was actually able to keep up, so we didn't see as many dropped messages. We were able to go from 500 events per second on the last slide to 30,000 events per second during peak time, which was about 240 events per second per host. But what we ran into here was a slightly larger problem: at this load, we saw Fluentd itself run into issues. It started using about two gigabytes of memory and losing a lot of events. Effectively, the moment we rolled out to 30,000 events per second, this graph jumped: when we were at 20,000, it was at zero, and at 30,000 it went up to 80,000 out of 300,000 events. So we were dropping a relatively huge share of events, over 20%, and this just turned out to be the performance limit of Fluentd.
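To illustrate what made the new approach faster, here's a minimal sketch of the batching idea: instead of awaiting the OS-level write of every single record, you pack many records into a single PackedForward message, [tag, entries, option] per the forward protocol spec, and do one socket write per batch. This is an illustration of the technique, not our actual library; it assumes the @msgpack/msgpack package:

```typescript
import { Socket } from "net";
import { encode } from "@msgpack/msgpack";

// Sketch of a batching forward-protocol sender (hypothetical names).
class BatchingSender {
  private queue: Uint8Array[] = [];
  constructor(private socket: Socket, private tag: string) {}

  log(record: Record<string, unknown>): void {
    // Each entry is an independently msgpack-encoded [time, record] pair.
    const time = Math.floor(Date.now() / 1000);
    this.queue.push(encode([time, record]));
  }

  flush(): void {
    if (this.queue.length === 0) return;
    // Concatenate entries into one buffer; msgpack encodes a Uint8Array
    // as a bin value, which is what PackedForward's second field expects.
    const total = this.queue.reduce((n, e) => n + e.length, 0);
    const entries = new Uint8Array(total);
    let offset = 0;
    for (const e of this.queue) { entries.set(e, offset); offset += e.length; }
    const message = encode([this.tag, entries, { size: this.queue.length }]);
    this.queue = [];
    // One socket write carries the whole batch.
    this.socket.write(message);
  }
}

// Usage sketch: connect to the local forward port and flush in batches.
const socket = new Socket();
socket.connect(24224, "127.0.0.1", () => {
  const sender = new BatchingSender(socket, "task.created");
  sender.log({ taskId: "12345" });
  sender.log({ taskId: "67890" });
  sender.flush(); // both records go out in a single write
});
```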
There were a bunch of approaches we could have taken here. Fluentd has this thing called workers, so we thought about setting that up. But before we did, someone just said, let's try out Fluent Bit and see what happens. So we did. It actually took less than a day to throw it onto our infrastructure and try it, and we used the same setup, with filesystem buffering. The only special thing we needed to do was that the default C plugin didn't have the Kinesis KPL aggregation we needed, so we had to set up the custom Golang plugin for that. And we saw no performance issues. So we rolled it out fully, with all of our events getting double-written to both Scribe and Kinesis, and we saw no dropped messages from the perspective of Node: it managed to flush all of its events into Fluent Bit.

At this point, we were a little bit sad that we had to introduce a new system, since we now have both Fluentd and Fluent Bit. But we decided to go with what we had. There's still a pending task to dig into whether we can improve the performance of Fluentd, but I'm actually now pushing for Fluent Bit.

Now that we had done this, we needed to actually roll it out and start relying on it. What we did was make it a per-category thing: for category A, we could set our configuration to write it to Fluent Bit while everything else still went to Scribe, and then we slowly increased the set of categories that got written to Fluent Bit. We worked with the consumer teams to ensure the rollout went well; they monitored their downstream systems to make sure the log volume didn't drop massively all of a sudden, and so on. And this went pretty well.

We did have a couple of issues with Fluent Bit; it wasn't quite free. The main problems were actually in the Kinesis plugin. It had some issues with how it did KPL aggregation, which ended up triggering corruption bugs in Fluent Bit, so we ended up with corrupt chunks on disk. Turning checksumming on in Fluent Bit when you're using disk buffering is really useful: it keeps a CRC checksum per chunk and checks, hey, is this chunk still OK? If it's broken, it just throws the chunk away, instead of what we saw happening, which was a crash loop. We also fixed the upstream bugs in how the plugin computed partition keys for KPL aggregation. And with that, we managed to roll it out fully; it's just running in production now. We've since also managed to move most of this system over to Kubernetes, which this really helped with.

Here's a quick timeline of the work. And a couple of learnings from this work for me personally. Being able to do the double-write thing allowed us to be really calm during the rollout: oh no, it broke; fine, we just turned it off. Getting genuinely high performance involved a lot of edge cases, some of which ended up being pretty hard to reproduce. And going in and fixing the open source code was really easy, and it also helped us build a very deep understanding of the system. Yeah, thank you.
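Tying the Fluent Bit pieces together, here's a minimal sketch of the kind of config described in this talk: a forward input, filesystem buffering with chunk checksumming, and the Go Kinesis output with KPL aggregation. Paths, the stream name, and the region are placeholders, and the Go plugin (amazon-kinesis-streams-for-fluent-bit) has to be loaded separately via Fluent Bit's external plugin mechanism:

```
[SERVICE]
    # Filesystem buffering; storage.checksum makes Fluent Bit verify a CRC
    # per chunk and drop corrupt chunks instead of crash-looping on them.
    storage.path      /var/lib/fluent-bit/buffer
    storage.checksum  on

[INPUT]
    name          forward
    listen        0.0.0.0
    port          24224
    storage.type  filesystem

[OUTPUT]
    # Go plugin (amazon-kinesis-streams-for-fluent-bit); stream and region
    # here are placeholders. aggregation enables KPL-style aggregation.
    name         kinesis
    match        *
    region       us-east-1
    stream       example-events
    aggregation  true
```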
There's time for questions, I think. Ah, no, there isn't. I'm on Slack if you want to ask me anything. Maybe one question, if anyone has it. Anyone have any questions? We can do one real quick. Yeah.

[Audience] Maybe it's only a part of the big picture that you explained, and by the way, great work, but I was interested in the disk buffering. You're on Amazon, right?

Yes.

[Audience] And the disk buffering, how does it work?

I can follow up with you in more detail after this, but in effect, what we did was set it up to cache forever. If anything happens, we collect chunks indefinitely. We do have a cleanup process that cleans these up eventually, but most of the time we wanted high durability. In Kubernetes, of course, we do regular deployments and so on. I'm not sure if that answers your question.

[Audience] Yeah, it's OK. Thank you.

Anything else? Anyone else got a question before we wrap up? The wrap-up session will take about five minutes. OK, awesome. Thank you so much.