Hello, my name is Matthew Fala, and I'll be your presenter today, discussing the Fluent Bit event loop system and some of the optimization efforts we took recently. I've been working on a PR that was just approved and merged into Fluent Bit 1.9. It changes the event loop system, transitioning from a FIFO event loop to a priority queue event loop, and we'll get into exactly what that means in a bit. We wanted to share our findings, the improvement metrics and statistics, and also what you can do with the Fluent Bit community, alongside AWS, to join in on the optimization efforts. A little bit about myself: I work for AWS, Amazon Web Services, on the ECS observability team. We work heavily with Fluent Bit and Fluent Bit integrations, building images on top of Fluent Bit to help our customers send logs easily to Amazon services, and wherever else they'd like to send their logs or metrics. I studied at the University of Southern California, where I got my bachelor's in computer science. I'm thrilled to be presenting our findings on the Fluent Bit event loop today. The title of this presentation is "Fluent Bit Event Loop: Demystify It and Join In on the Optimization Efforts." I hope that after this talk we'll all be up to speed on what the event loop is, what optimization efforts are being taken to improve Fluent Bit, and what can be taken on in the future. All right, let's begin with what I think is the best starting point: the customer impact. Why should we care about the event loop, and about optimizing it, in the first place? Let's start with statistics.
Our customers are telling us they're seeing around 50 broken pipe errors and 49 connection timeout errors within a span of five minutes, just appearing in their Fluent Bit logs. That's not good. Each of these errors triggers a retry in Fluent Bit, and it means a genuine network error is occurring. That's roughly 100 errors in five minutes, which degrades performance. When does this happen? In low-throughput cases things are relatively stable. But as soon as you start sending logs at around 40 megabytes per second or more — for some of the bigger companies, say one-kilobyte logs at 40 megabytes per second — you start to see the errors on the right-hand side: connection timed out, broken pipe, and connection initialization errors. So we took to the whiteboard to figure out what could be causing these problems. At first we thought it might be API issues — maybe the place we're sending our logs, like Firehose, was having network problems. But then we realized that other customers are using Firehose without Fluent Bit and not seeing these problems, so we ruled that out. Second, it could be a timeout problem: maybe Firehose is working and we're just not waiting long enough. But the logs clearly say requests are taking 10 seconds or more, and there's no way Firehose could genuinely take 10 seconds to complete a network request. Sure enough, we verified that it doesn't. In fact, when we profiled Fluent Bit, we found that the network calls complete in around 100 milliseconds or less.
However, when code in Fluent Bit makes a network call, it actually takes around 10 seconds to get past the line of code responsible for making that call. So the actual network activity completes in about 100 milliseconds, but the code takes around 10 seconds to move on to the next line. What's going on? A little hanging around a network call is expected, but here it hangs for something like 10 seconds, and the errors appear right afterwards. Here's a small diagram of what we found is happening: some code is running, the code gets suspended to make a network call and waits, the network call completes really quickly — but then it takes over 10 seconds for the code to get back into the running state. And that's the problem. Here are some metrics we were collecting. The blue represents broken pipe errors — you can see they're very heavily clustered — and the silver is the connection timeouts, also heavily clustered. Either way, that's a lot of errors occurring. To really get to the bottom of the issues and how to resolve them, we needed to look into how Fluent Bit manages code: how it suspends code and resumes code. To do that, we have to get down to the very base layer, and that's the event loop — it's how Fluent Bit handles running code. That's what this presentation is about: demystifying the event loop so we could resolve these problems, and then joining in on the optimization efforts. We hope to guide you through it so you can join in as well.
A quick summary of what we want to cover. First, event-driven programming, which is fundamental to how Fluent Bit's paradigm works in terms of events and code. Next, a thought framework for the event loop that we've been using here at AWS. Then the event loop improvement itself, followed by the network stability improvement charts — some metrics — and then problems with the priority event loop, so we can look at the shortcomings of the current solution and some future work we can all take on to further optimize the Fluent Bit event loop. All right, great. Let's begin with the event-driven programming paradigm. This is the paradigm Fluent Bit uses to make blocking-style network calls without hogging the CPU. In a language like Java or Python, when you make a network call or some other blocking operation, what you'll usually see is that the thread blocks and doesn't allow anything else to run until the call completes. That's what we see right here: a synchronous networking operation. The code just blocks, waits for the network call to complete, and then resumes. In that time the thread isn't doing anything — it's just sleeping, waiting. In Fluent Bit, however, we see something a little bit smarter, closer to JavaScript-style, Node.js-style networking: when running code stops on a network call, some other work begins to run on the CPU, so the CPU isn't idle. Once the network call completes, an event triggers the code to be resumed, and when it's able to resume, it just starts running again.
This is very beneficial for IO-heavy code: with a lot of network calls taking place, you're not leaving the CPU idle every time you make one. All right, let's look at exactly how this works. The code runs for a little bit, a network call gets made, and when the network call finishes, this triangle thing appears — that's your event. And the event triggers the code to run. That's how a standard network call is made. Now, the key concept of event-driven programming — and this slightly more complicated scenario shows it — is that there are always events, and the events trigger the code to run. Whenever you have code that needs to run, you always make an event first; the event gets triggered, and the triggered event starts the code. Here you can see we have two events: the "start code" event and the "start other work" event, same as the previous slide. Both events are triggered at first, but only one of them gets processed. It starts the code. The code hits a network call, so it gets suspended, waiting on that network call to complete. Now Fluent Bit realizes no work is being done, because the code has been suspended, so it decides to look through all the other events. Is anything else triggered? It finds the "other work" event. Great — this event tells it to start up the other work, so it goes ahead and runs the other work. Now, while that other work is running, the network call finishes, which generates a new event called the network event. That doesn't mean much to Fluent Bit at this point, because it's busy processing the other work until that work completes.
The other work completes, and all of a sudden Fluent Bit says: no work is being done, let's look at our events. In this case it sees the network event. Great — what does this tell us to do? It tells us to resume the code, so it resumes the code. And that's it — that's our more advanced example of the Fluent Bit event loop. The key concept is that all the code is triggered to run by these events. If you want to run some new code, just make an event, let it get triggered, and when it gets triggered the code will run. A really key part of this is that all of it runs on one single thread. It's an awesome paradigm for managing asynchronous operations on a single thread, without having to manage multi-threading or anything complicated like that, and without letting the CPU sit idle. All of this is why Fluent Bit is extremely efficient. You can look online for other documents on event-driven programming to learn more. Okay, so that's event-driven programming. What is the event loop? That's the next section, diving in a bit. The event loop is the layer that everything else — all the code in Fluent Bit — runs on top of. Think of it as the fundamental base layer that schedules everything: it implements the system we talked about where events trigger code to run. Let's look at the different responsibilities in Fluent Bit, because Fluent Bit has an event loop and also the code that sits on top. The code's responsibility is, essentially, to be all of Fluent Bit: all the plugins, the core of Fluent Bit — everything is in this code section. The code is also responsible for registering events to the event loop, and for removing those events from the event loop once they get triggered.
The event loop's job, however, is to monitor those events — say, a network call event — and when they get triggered, like when the network call completes, to run the code associated with that event. So there's a bidirectional connection: the code puts events into the event loop, and the event loop puts events into code. The bridge connecting these two sections is the events themselves. An event ties together a code pointer and an event file descriptor. File descriptors are how the operating system can monitor different files — and not just files, because in Linux everything is a file, so you can monitor things like network calls, timers, timeouts, anything the operating system needs to watch. On Windows it might be something similar, like sockets, but in Fluent Bit we have an abstraction and call all of these things file descriptors. These events are added to the event loop. Now, on the code side, just be aware there are different types of code: callbacks, core code that's embedded into the event loop, and coroutines. We really care about optimizing the coroutine case, so that's mainly what we'll look at in the next few minutes. The coroutines include all the plugin code — input, output, and filter code — and that's mainly what we need to optimize. There's a bit of an issue with discussing how to improve the event loop, and it's that things get really complicated very quickly, because you're talking about coroutines, yields, connection timeouts, ready lists, and a lot of different pieces working together to create the system that everything else is built on top of.
At AWS, we found that the best way to think about the event loop is with an analogy. It lets us spitball ideas for improving the event loop without having to dive into the implementation and the nitty-gritty details — looking at it from a high level that lets us conceptually target the best solutions, and then work out the implementation details once we've chosen a path to optimize the event loop. So let's take a look at this analogy; we'll go through it piece by piece. There are some people at a desk filling out forms, and they make calls to their friends when they need answers for the forms. Going through it one by one: there are these people, just waiting around — the people represent the coroutines. Then there's a desk. The desk represents the CPU; a person needs to be at the desk to do work. Then there are forms. Forms represent the code: when you're at the desk, you can be filling out your form, adding answers. But what if you don't know an answer? Then you have to step away from the desk, call your friend, and get the answer — that's the phone-a-friend part there at the bottom. Where are they going to do this? In the phone booth. Stepping away from the desk and calling your friend represents making a network call, and the phone booth represents the not-ready list: a list of the coroutines that are making network calls and aren't waiting in the desk line.
And lastly, the desk line represents a first-in, first-out event loop ready list. When you finish making your network call — when you finish calling your friend and have an answer for your form — you can't write it down immediately, because you don't have the desk; you don't have the CPU. You need to stand in line for the desk, and once you get to the very front of the line, you'll have access to your code, which is the form, and you can write down the answer and continue your work, filling out other questions on the form. All right, so, going through an example: one person is filling out his form at the very front of the line. He gets to a question that requires him to phone a friend — that's like a network call. So he steps away from the desk into the phone line. The code is still around; it's just no longer being processed by a coroutine at this point. The next person waiting in line steps up to the desk, grabs his code — his form — and begins filling it out; when he hits a network call, or finishes, he'll step away from the desk. Next, someone who's phoning their friend gets an answer — that's the network call completing — and what they do is stand back in line. Great, so that's the whole example. In Fluent Bit terms, we've already covered it, but essentially the coroutines wait in line for the desk to get the CPU; when they make a network call, they step away; and when the network call completes, they step back in line and wait for the desk so they can write down their answer and continue their work. All right, so, obstacles — there are a couple of things we have to keep in mind. The first is that these coroutines have a problem remembering their answer once they get it from their friend.
When they stand back in line, if the line takes too long — say, over ten minutes — they forget their answer. In Fluent Bit terms, this is like saying that after you finish your network call, while you're waiting for your turn to write down the answer and finish processing it, if you wait too long there will be a broken pipe or connection timeout error, which is not a good thing. So we need to make sure the line is short enough that you can still remember your answer when you get to the desk after phoning your friend. The second constraint, or obstacle, is that person switching takes time. When one person leaves the desk and a new person takes the desk, that switch takes a bit of time, so we don't want to add more person switches; what we have right now is fine. Okay, so here's the problem with the first-in, first-out event loop, which we're looking at right here. You can see it's first-in, first-out because when you stand in line, you have to wait for all the people in front of you to get to the desk — to get to the CPU — before your code can run again. And here's the problem: once this desk line gets really long, then when you finish phoning your friend you have to stand in a really long line, maybe ten minutes' worth, and by the time you get to the desk you've forgotten your network call's result. Hence the broken pipe or connection timeout problem. So where are these people coming from? How can this line get so long? The answer is that all these people are inactive coroutines, queued by the input plugins. In Fluent Bit, the input plugins generate the data — all the logs that get flushed by the output plugins.
Well, with the old Fluent Bit — Fluent Bit 1.8 — what would happen is that all of these coroutines would be waiting to be sent over to the desk line, waiting to go from inactive to active. And unfortunately the old policy, in Fluent Bit 1.8, was to take all these people immediately: every time there's a flush, move everyone who's inactive — who was just generated by the input plugins — to active, into the desk line. And that causes the desk line to get really long. You can imagine that if you're creating 100 different input chunks of data, 100 different coroutines on the input plugins, then immediately those 100 coroutines become active, and that's a lot of coroutines being processed at the same time. And that's what we saw: sometimes around 100 or more coroutines being processed at the exact same time. Because there are so many coroutines in flight at once, this line gets extremely long, and people forget their answers by the time they get to the desk. How we can resolve this is with a new policy: people in the hold line — the input plugins' output — must wait until everyone in the desk line has cleared. That keeps the desk line as short as possible. If there's anyone waiting in the desk line, we will not admit anyone from the hold line into the desk line. However — look at the next section — if the desk line is empty, which might still mean there are people processing network calls in the phone booth, then we can start admitting people from the hold line. To implement this, we need a priority queue. So, on to the new policy — this is the same slide.
You can see that now we keep the desk line extremely short. People get to the desk, make a network call, and when the network call completes they go back to the desk line — and we don't allow anyone else to join this group of coroutines until everyone at the desk has cleared. So the desk line stays very short. To implement that, we really just need two priorities: inactive coroutines get a lower priority for reaching the desk than the people already waiting in the desk line. The active coroutines get the higher priority and get prioritized. Just by doing that, we implement this policy. So how does this look if we revisit the event loop? Well, if you remember, before, all the events had the same priority. Now we're going to add priorities on top of that. You can see we've added numbers to these events. Looking at the right-hand side and the left-hand side, the people represent coroutines — remembering our event-driven programming, these events trigger code, or coroutines, to run, so each of these events has a person attached to it. Now we can prioritize the people. The people who are inactive get an event linked to them with a priority of 2, so they don't get priority. And the people who are active — who are making network calls, who are already standing in the desk line — get a priority-1 event. When these events get triggered, the loop will always make sure to process the higher priority first. And you can see that when an event is chosen — when it gets triggered — you send one of these people to the active coroutine spot at the desk, the top section above the event loop, which is running the code. All right, so let's take a look at the changing responsibilities.
For the code responsibilities, everything's the same except that the code also needs to set an appropriate priority on the events. This is not mandatory, though — it was implemented so that you don't have to choose a priority; if you don't, it just uses a default priority, which matches the previous behavior, and everything works the same. Next is the event loop: everything's the same, except that if several events are triggered — several events are active — we process the higher-priority events first. And lastly, for the events: everything's the same, except that instead of just the code pointer and the file descriptor being monitored, we also add the priority, which is optional. Great, let's take a look at the improvements, now that this PR was implemented and merged into 1.9. We did a lot of testing against 1.8.11, under a bunch of different slowdowns — adding delays to Fluent Bit to see how it would react, how the network problems would be affected on slower machines. And across the board we're seeing tremendous improvement: 96 to 100% improvement on the connection timeouts, and the active coroutines being reduced by 50 to 81%. These are tremendous results. Next up is some tangible error data. We had 51 broken pipe errors in a span of five minutes; that got reduced to two. We had 51 connection timeout errors in a span of five minutes when sending logs at 40 megabytes per second, with one-kilobyte logs; that got reduced to one in five minutes. Tremendous results. The coroutines went down from 15 to 7 on average. And you can see these errors — this is the graph we showed before, with tons of clusters of errors.
You can see that the clusters are always preceded by these large numbers of coroutines — and this is all in 1.8. A high number of coroutines yields a high number of errors. Well, now the number of coroutines has gone down. The axes are a little different: the old axis went up to 60, the new one up to 18, so the coroutines are being kept at a much lower level. Because of that, there are far fewer errors — in fact we only see three, which you can barely make out in the bottom right-hand corner of the graph. So again, just comparing these two graphs, there's a tremendous improvement from 1.8 to 1.9. Quick summary: the network errors got reduced by 96 to 100%, the active coroutines got reduced by about a factor of three, and the error clusters were removed. Really great stuff. Okay, so, in the interest of time, we're going to skip through the remaining improvements and problems and get into the related domains of development. What can we do in the future for the event loop? So far we've covered event-driven programming paradigms, problems with the event loop and their solutions, what the event loop is, and how these changes to the event loop impacted Fluent Bit in a profound way, improving network stability. Now let's get into some related domains of development — what's left to be done? Well, if Fluent Bit really wants to be as performant as possible, we need to utilize as many threads as possible. Recently there was a transition from a largely single-threaded application to a multi-threaded application with workers.
That means output plugins can run on different workers, which run in different threads. That doesn't yet work for input plugins or filters, though. On the future roadmap — and there's been a lot of talk in the community about this — is adding worker support for input plugins and also filter plugins. To do that, we really have to take the event loop into consideration, because you can think of every single worker as having its own event loop: every thread that runs asynchronous code needs its own event loop. So to make these changes, we'll need this fundamental concept of the event loop in our minds — which, hopefully, this discussion has given you some insight into. The next part is reducing events. Another key to this event loop system is to avoid too much event pollution: if there are too many events in the event loop, it becomes very inefficient. There are some events right now that don't really need to be there, whose duplication we can reduce — mainly the DNS timeouts and some of the other timeouts. At the time the DNS timeouts were implemented, there wasn't a good mechanism to coalesce these events. But since then, some code was written that brings together all the cleanup functions, and that's used for things like the network timeouts. Some talk that's been happening in the community is about bringing together all these DNS timeouts, so you don't have all these events sitting everywhere — one for every single DNS call, one for every single network event — in order to make things a bit more efficient. Because the fewer events we have, the better: that's less of the person switching at the desks that we talked about. All right, that's all we have time for. I'm really looking forward to the Q&A right after this.
And thank you so much for the time you're spending to listen in on this discussion. I hope it helps give you a fundamental understanding of the Fluent Bit event loop and how you can join in on the optimization efforts with this mental framework in mind. Now, if you'd like to stick around, we'll have a follow-up session that goes into the details we skipped over, plus a deep dive discussing exactly how we implemented some of these changes: the bucket queue, the task pipes, the multi-threading between the different threads and the different workers — getting into that just a little bit. We don't have time for it in this session, so we'll post a link, maybe in the Q&A, to the full talk. If you have any questions, my GitHub handle is my first and last name, Matthew Fala, with no spaces, all lowercase. Please feel free to reach out to me. I look forward to what Fluent Bit has to offer the community and all its customers going forward, with all of the network improvements that have been made and can still be made in the future by fully optimizing the event loop. So thank you so much for joining us today, and I look forward to our Q&A right after this.