Good afternoon everyone. Thank you all for joining me today. In this session I'll explain how we used Fluent Bit and OpenSearch to implement a live logs feature for one of our products. My name is Nilushan Costa and I work at WSO2 on an observability team; our team is responsible for the observability features of one of our products, called Corio. I will cover why we needed a live logs feature, the architecture we chose for it, and our experience using Fluent Bit and OpenSearch to build it.

First, let's look at the enterprise software development landscape today. There are many different companies in the world, and although not all of them are software companies, they all need some sort of software to provide a digital experience to their users. Universities, airlines, and insurance providers are not technically software companies, yet they still need to build business applications to serve their users digitally. When they do that, there are two things to consider: the applications they are going to build, with their business use cases, and the platform where they are going to deploy them. What happens in today's world is that many organizations end up focusing far more on the platform than on the application features themselves. Let's take a small example. Say a company wants to provide digital experiences to its users and deploy them as microservices on top of Kubernetes. They need to pay a lot of attention to the platform: the Kubernetes cluster itself and how they are going to manage its upgrades.
Then, for a production-grade deployment, they really need to think about security, CI/CD pipelines, and, if they are deploying APIs, things like API management. When they focus on all these platform-specific concerns, the time and resources they could spend developing their application get directed towards the platform instead.

To solve this problem there is the concept of platformless. Platformless does not mean there is no platform; it means the organization is given a platform that another company has already built, so they don't need to focus on the platform and can concentrate on building their own applications. This is similar to serverless. Serverless doesn't mean there are no servers; there are servers, but developers don't need to worry about them. Developers write their program and hand it over to a serverless service to deploy and run. Similarly, with a platformless system, an organization can focus on its business applications and simply hand them over to the platformless service to run and manage.

Our company has developed a product called Corio, which is a service for going platformless, and our team manages Corio's observability features. As a platformless service, we expect users to come and deploy their services on Corio, and we as the platformless provider have to supply all the platform capabilities, one of which is observability. As you all know, when you run an application in production you need to be able to observe it to understand what is going on within the system and, if there are problems, to work out how to solve them. There are several pillars of observability, and two key ones are metrics and logs. Corio already provides both to users.
The way we do this is that when users deploy a program on Corio, we don't ask them to add any libraries, change their code, or attach any agents. Corio takes care of it and provides these observability features out of the box. One of those features is logs: since we started Corio some time back, we have shown users the logs of the programs they deploy.

However, there was a problem. When users deploy their programs, they run as containers within Corio, and when a container generated a log message it took a long time for us to fetch it and show it in the Corio UI, sometimes 30 seconds to a minute. This made debugging on top of Corio very difficult. As developers we tend to add logs in order to debug our programs: say we deploy an API, add a log line, send a request, and use the logs to check whether it produced the expected outcome. But imagine sending a request and having to wait 30 seconds to a minute for the log to appear. That makes debugging difficult, and that's exactly the problem we faced on Corio.

What you see on screen right now is a high-level picture of how we collected the logs. When users deployed their programs in Corio, they deployed them as components, and these components ran as containers within a Kubernetes cluster. We then used a monitoring agent provided by our cloud provider to collect these logs and push them to a log service, which was also provided by the cloud provider. Finally, we used an internal API, which we call the logging API, to fetch the logs from the log service and show them to users in the Corio console.
The delay happened between the logs being collected and their appearance in the log service. When we were debugging this problem, we went through our cloud provider's documentation, and according to it the average latency to ingest data could be anywhere between 20 seconds and three minutes. That was where the problem was, and because of it users did not get the experience we hoped they would.

So during an internal meeting we decided to build a live logs feature that would give developers on Corio a `tail -f`-like experience. We had two goals. First, the logs should load in under five seconds; we were targeting one to two seconds, but it had to be less than five anyway. Second, the logs had to remain within the data plane cluster itself. What is this data plane in Corio? Corio has two planes: a control plane and a data plane. When someone deploys a program on Corio, it goes into the data plane. We wanted the logs to remain within the data plane for compliance reasons. We didn't want the logs to flow through other planes, because they were sometimes in different regions; for example, we had a data plane within the EU region and the control plane in another region, and we didn't want data flowing between regions. With those goals set, we evaluated several solutions. We checked several cloud providers, and we looked at observability vendors providing solutions for logs.
After evaluating several tools, we finally decided to use Fluent Bit and OpenSearch and deploy them ourselves. Our current architecture on Corio looks like this: the Corio data plane is a Kubernetes cluster, and within that cluster users deploy their programs onto different Kubernetes nodes. Fluent Bit, running as a DaemonSet on these nodes, collects all the logs and sends them to OpenSearch. When a user requests to see their logs in the Corio console, the request goes from their browser directly to the data plane through an API gateway, and an internal Corio system service called the logging API queries OpenSearch to fetch the logs. Everything remains within the Kubernetes cluster, and we use Fluent Bit and OpenSearch to handle it all.

As I explained earlier, Fluent Bit runs as a DaemonSet, and we mainly use its tail input and OpenSearch output plugins. We also enrich the logs with Kubernetes metadata, because it helps us with filtering. The default option for this enrichment is to talk to the Kubernetes API server, but in order to avoid overwhelming the API server we configured Fluent Bit to get this metadata from the kubelet instead. We have also implemented log throttling using Fluent Bit. The reason is that Corio has several pricing tiers, each with a fixed amount of logs that users are free to generate; if a component generates more logs than that, we throttle it, and we do so using settings within Fluent Bit itself. Furthermore, this live logs feature acts as a temporary cache within Corio, and to make the cache work we publish logs to a new index every hour.
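To make those pieces concrete, here is a minimal sketch of what such a Fluent Bit configuration could look like. This is not our production config; the paths, tags, hostnames, rates, and index prefix are illustrative assumptions, but the plugins shown (tail input, kubernetes filter with kubelet-based enrichment, throttle filter, OpenSearch output with hourly indexes) are the ones I just described.

```ini
[INPUT]
    # Tail container logs from the node's filesystem.
    Name              tail
    Path              /var/log/containers/*.log
    Tag               kube.*
    multiline.parser  cri
    Mem_Buf_Limit     10MB

[FILTER]
    # Enrich records with pod/namespace metadata. Use_Kubelet makes the
    # filter query the node-local kubelet instead of the API server.
    Name              kubernetes
    Match             kube.*
    Use_Kubelet       On
    Kubelet_Port      10250

[FILTER]
    # Rate-based throttling; per-tier quotas would be enforced with
    # settings like these (numbers are made up for illustration).
    Name              throttle
    Match             kube.*
    Rate              1000
    Window            5
    Interval          1m

[OUTPUT]
    # Logstash_DateFormat with an hour component yields a new index every
    # hour, e.g. corio-logs-2024-03-19-14.
    Name                opensearch
    Match               kube.*
    Host                opensearch.logging.svc
    Port                9200
    Logstash_Format     On
    Logstash_Prefix     corio-logs
    Logstash_DateFormat %Y-%m-%d-%H
    Suppress_Type_Name  On
```

A DaemonSet deployment means one such Fluent Bit instance runs per node, each tailing only the containers scheduled on that node.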
So let's say someone deploys a program right now and it generates logs: those logs go into an index named with the current date and hour, say 19-March-14, and logs published within the next hour go into a new index for hour 15. The reason we do this is so that we can delete old indexes, which we handle within OpenSearch. OpenSearch is deployed as a StatefulSet, and most of the time we run it as a cluster of three pods; to manage the indexes we use the Index State Management (ISM) feature. As I explained earlier, we keep logs within OpenSearch only as a temporary cache, so logs published by Fluent Bit are kept here for just 80 minutes. When an index is older than 80 minutes, we delete that index and query logs only from the remaining ones. That is simply because we want this to be a temporary cache, and the ISM feature takes care of it. All of this has been automated using post-start scripts that we provide to OpenSearch: once an OpenSearch cluster is set up, the post-start script configures the index management policy and applies the 80-minute expiry to these indexes.

So what were we able to achieve by using Fluent Bit and OpenSearch? Based on our experience, logs now show up in the console within about two to three seconds, which is what we wanted to achieve. We were also able to ensure compliance: logs generated by Corio components never leave the Kubernetes cluster where they are generated. Even when users fetch logs, the request goes from the browser directly to that Kubernetes cluster through an API gateway, and the logs never flow through any other Kubernetes cluster.
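As a rough illustration of the 80-minute expiry, an ISM policy along these lines could be applied by the post-start script (for example via a `PUT` to `_plugins/_ism/policies/delete-live-logs`). The policy name and the `corio-logs-*` index pattern are assumptions for the sketch, not our actual values.

```json
{
  "policy": {
    "description": "Delete hourly live-log indexes once they are older than 80 minutes",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          {
            "state_name": "delete",
            "conditions": { "min_index_age": "80m" }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [ { "delete": {} } ],
        "transitions": []
      }
    ],
    "ism_template": {
      "index_patterns": ["corio-logs-*"],
      "priority": 100
    }
  }
}
```

With the `ism_template` in place, every newly created hourly index matching the pattern is attached to the policy automatically, so no per-index setup is needed.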
We were also able to implement filtering features using OpenSearch queries: whenever users want to filter their logs, we convert those filters into OpenSearch queries and fetch the matching logs. Finally, within Corio we have something called project-level observability. What this does is show logs from multiple components running within a Corio project together; I will show it to you shortly in my demo. Because it fetches logs from multiple components at once, it makes things like microservice debugging much easier.

So what has our experience been so far with Fluent Bit and OpenSearch? We deployed this live logs feature to production towards the end of last year, and at the moment it is running in more than 15 Kubernetes clusters spread across Azure and AWS. The largest cluster has more than 50 nodes; when I checked this morning it had scaled up to 55. In all these clusters, OpenSearch and Fluent Bit run independently, and the major reason we are continuing with this setup is the stability of both tools. Throughout this period of using them, we never had any major problems: once deployed to our clusters, they kept on working, and we haven't experienced any major issues with either of these two pieces of software.
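To illustrate the filtering I mentioned, here is a sketch of how a logging API might translate user-facing filters into an OpenSearch query body. The field names (`kubernetes.labels.component`, `level`, `log`, `@timestamp`) are illustrative assumptions, not Corio's actual schema.

```python
def build_log_query(component=None, level=None, text=None,
                    since=None, until=None, size=100):
    """Translate user-facing log filters into an OpenSearch bool query body."""
    must = []
    if component:
        # Restrict to a single component via its Kubernetes label.
        must.append({"term": {"kubernetes.labels.component": component}})
    if level:
        must.append({"term": {"level": level}})
    if text:
        # Free-text search within the log line itself.
        must.append({"match_phrase": {"log": text}})
    time_range = {}
    if since:
        time_range["gte"] = since
    if until:
        time_range["lte"] = until
    if time_range:
        must.append({"range": {"@timestamp": time_range}})
    return {
        "query": {"bool": {"must": must}},
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": size,
    }
```

The service would then POST this body to the `_search` endpoint across the current hourly indexes (for example via a wildcard such as `corio-logs-*`).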
So let me quickly show you this live logs feature in action. The platform I mentioned earlier, Corio, is what I have loaded on my screen right now. Within Corio there is a concept of organizations and a concept of projects. To deploy a program, we need to create a project, which I have already done here, and within a project we can deploy any number of components, such as APIs, manual tasks, and so on. I have already deployed a simple program that implements two API endpoints and logs a message whenever it receives a request; it was created simply to demonstrate this logs feature during the session. I have built and deployed it into the development environment (Corio supports multiple environments), and within Corio we have the observability view, which is where the live logs feature is used. Initially this view fetches logs from our historical logs API, which means it goes to the cloud provider's logging service to get the logs, but once those logs are loaded, the console automatically switches to the live logs endpoint.

To show that in action, let me split the screen. Corio has a feature for testing APIs, so I'm going to use it to send requests to these endpoints, and on the left-hand side I have loaded the runtime logs view. When I send a request to one of these endpoints, it generates the log message, and within a few seconds we can see the message in the runtime logs view. Previously it took a long time for a generated log message to appear here on the left, but now, with the live logs feature, the message shows up within about two seconds. So let me send a
request again and show it. There we go. Within this runtime logs view we are actually getting logs from multiple sources, the API gateway and the program itself, so it's easy to debug. Let's try another endpoint, the failure endpoint. By design this returns an HTTP 500, and you can see the logs appearing in the runtime logs view. All of this is powered by Fluent Bit and OpenSearch; by using these two tools we were able to achieve the log latency we wanted, and we are now able to provide a better user experience to Corio users. Thank you all for listening to this session. If you have any questions, I'd be happy to take them now.