Today we will talk about the case study of the EO4EU platform and how we try to achieve multi-cloud observability. This was the work of three individuals, but unfortunately Lucia is not with us. So, who are we? Next slide, please. I am Arman, and I come from ECMWF, where I work as a cloud computing engineer. And I have my colleague here, Francesco. Do you want to? Hi, everyone. I'm Francesco Maria Contrera. I work for Cineca, which is a high-performance computing company. We also have cloud facilities, and I work mainly with the cloud.

So our journey started with the EO4EU project, which is a European-funded initiative to create a platform that lets people access Earth observation data, so you can search, process, and also visualize this data with advanced tools. It is a very large platform, so we decided to have a fairly complex architecture: multiple Kubernetes clusters on top of two different clouds. One is hosted at Cineca, and the other is WEkEO, which is managed by different partners; one of the key partners is ECMWF.

You know that when you have a very distributed platform, you have to face many challenges. If something can go wrong, you know that it will, so you have to deal with errors and faults, but you also have to allow developers to debug their applications, their microservices, in a very distributed platform. This is where observability comes in, and you know that it has pillars like metrics, logs, and traces. When we started our journey, we found that there are some standards, both in terms of architecture and in terms of tools, that we could use along the way.

Starting with the metrics: in the simplest case, you can think about a single-cluster setup, where you can use tools like Prometheus, which is the de facto standard for collecting metrics, and then Grafana for visualizing them in a very nice dashboard. Obviously, when you have to scale, you have to consider high availability, and maybe also how to keep this data for a long time so you can retrieve it later for analytics. You also do not want to go to every single cluster to look at the metrics separately; you want a single access point that gives you visibility over the entire platform.

So in the multi-cloud case we added another tool, which is Thanos. Thanos extends Prometheus while keeping compatibility, so in practice it solves all the previous issues: for example, you can use the same Prometheus data source in Grafana, just point it at the Thanos deployment, and it just works. Thanos adds a sidecar next to the Prometheus server, so you can scale your Prometheus servers, and at the same time it uploads the time-series data to S3 object storage, so you can retrieve that data whenever you need it. You can keep all the other Thanos components in a separate cluster: the Querier, your Grafana instance, and all the other components live inside an observer cluster. In a sense this gives you some high availability: if an entire cluster goes down, you still have visibility over the rest of the platform. For example, a few days ago we had some maintenance on our cloud, so some machines were not available, and we were able to see that from our dashboard.
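To make the sidecar setup above concrete, here is a minimal sketch, assuming the prometheus-operator is in use; the instance name and the secret holding the object-storage configuration are hypothetical, not taken from the actual EO4EU deployment.

```yaml
# Sketch only: enable the Thanos sidecar on a prometheus-operator-managed
# Prometheus and point it at S3-compatible object storage.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s                      # hypothetical instance name
  namespace: monitoring
spec:
  replicas: 2                    # more than one replica; the Thanos Querier deduplicates
  serviceMonitorSelector: {}     # select all ServiceMonitors in this example
  thanos:
    # The operator injects the Thanos sidecar container into each Prometheus pod.
    objectStorageConfig:
      name: thanos-objstore      # hypothetical Secret containing the objstore config
      key: objstore.yml          # key with the S3 bucket/endpoint/credentials YAML
```

The Querier, the other Thanos components, and Grafana would then sit on the observer cluster and reach the sidecars remotely, as described above.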
At the same time, as I said before, we can keep the data for a very long time, and we can also downsample the oldest data so that we can speed up the visualization of older data in Grafana. Obviously, there are some drawbacks in this setup. For example, the sidecars consume more resources, and they only upload the data every two hours, so the Querier has to query the sidecars directly. To do that, we added another sidecar, an Envoy sidecar, in front of the Querier so that we can also encrypt the communication between those two endpoints. So this is it for the metrics. For the other two pillars, I want to leave the stage to my colleague. Thank you very much.

As we said, that was the metrics, and the next pillar is the logs. For the single-cluster case we also looked at the standards, and the standard was pretty clear: it is often referred to as the EFK stack, with Elasticsearch, Fluentd, and Kibana. In our case we just went with OpenSearch instead of Elasticsearch, and used Fluentd together with Fluent Bit agents. So the standard for the single-cluster case was more or less clear. But moving on to the multi-cluster case, next slide please, we had to define some standards of our own in order to keep things tidy, because now we have an observer cluster, which hosts Thanos and OpenSearch, and we have the production clusters that send logs to it. To keep things tidy on the OpenSearch cluster itself, we came up with some index naming conventions to send specific logs to specific indices; you can see examples at the bottom of the slide. We also used field mappings, index state management, and templates based on the application type, such as, for example, ingress-nginx. We tried to adopt a dynamic approach where the logging operator manages all of this, Fluentd, Fluent Bit, and the cluster flows, so that when we push a new ClusterFlow or Flow, it is automatically propagated to all the clusters we have and we can see the logs in OpenSearch. Obviously, we also have backups on object storage, and we use OpenSearch as a data source in Grafana as well. Next slide, please.

The next pillar in our story was the traces. This was, let's say, the hardest point for us to figure out, coming from our humble beginnings. But we realized that OpenTelemetry has gained a lot of support and has become the de facto standard nowadays. It also has processors that enrich the incoming data; we are mostly working on Kubernetes, so this was a nice add-on. We already had some experience with Jaeger, and the good thing about OpenTelemetry is that it supports a broad variety of backends, so you can write data to Jaeger, or to Prometheus and Thanos, and these are the tools we were using anyway. So it was a nice fit. You can see the architecture for the single-cluster case, where the OpenTelemetry Operator manages the OpenTelemetry Collectors as a DaemonSet; the collectors get the data from the pods that emit tracing information and send it to the Jaeger collector, where we store it, keeping the data in memory, and we have Jaeger Query to visualize the data, all of this managed by the Jaeger Operator. Next slide, please.
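As a rough illustration of the single-cluster pipeline just described, here is a minimal sketch of an OpenTelemetry Collector configuration; the Jaeger service address is an assumption, and it presumes a Jaeger collector recent enough to accept OTLP directly.

```yaml
# Sketch only: DaemonSet collectors receive OTLP from instrumented pods,
# enrich spans with Kubernetes metadata, and forward them to Jaeger.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  k8sattributes: {}              # adds pod, namespace, and node attributes to spans
  batch: {}
exporters:
  otlp:
    endpoint: jaeger-collector.observability.svc:4317   # hypothetical Jaeger OTLP endpoint
    tls:
      insecure: true             # demo-style; a real deployment would use TLS
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp]
```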
But in the multi-cluster case, this did not scale very well. So we said, OK, we can deploy the Jaeger collector in each cluster, together with the OpenTelemetry collector, so that we still send the tracing information to Jaeger, but we use OpenSearch as the backend storage: each Jaeger collector writes its data to the OpenSearch we already have in the observer cluster. So it fits really well. From there we can still have Grafana and Jaeger Query, and we can add Jaeger as a data source in Grafana, which enables some more specialized dashboards and visualizations as well, so that was a nice add-on. We also set some standards there to keep things tidy in OpenSearch, so that queries stay reasonably fast. But there were some issues, like multi-tenancy in Jaeger, for which there is an open issue that you can see at the bottom of the slide: you cannot actually write to different indices, so it is a little less tidy than the logging side. Next slide, please.
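A minimal sketch, assuming the Jaeger Operator, of what pointing each cluster's Jaeger at the observer cluster's OpenSearch might look like; the endpoint, index prefix, and secret name are hypothetical, and OpenSearch is addressed through Jaeger's Elasticsearch-compatible storage backend.

```yaml
# Sketch only: per-cluster Jaeger writing spans to the central OpenSearch.
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production                # separate collector/query, external storage
  storage:
    type: elasticsearch               # OpenSearch reached via the ES-compatible backend
    options:
      es:
        server-urls: https://opensearch.observer.example:9200   # hypothetical observer endpoint
        index-prefix: tracing-cluster-a                          # naming convention to keep indices tidy
    secretName: opensearch-credentials                           # hypothetical Secret with ES_USERNAME/ES_PASSWORD
```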
So our journey basically started from the humble beginnings, as I said. We figured out the best standards for the single-cloud case, and then we tried to apply that logic to the multi-cluster use case that we had. We had challenges integrating the various tools, because we are talking about OpenTelemetry, Jaeger, Thanos, Prometheus, Fluentd: there are a lot of tools you need to consider, you need to deploy them, and then there is the management phase of keeping them all running, with version updates and day-2 operations. This is still an open topic. Another challenge was multi-tenancy, because it is always hard to deal with. We also had some issues with the documentation: there is amazing documentation out there, but there is still effort that needs to be spent, and we were actually honored to contribute a couple of pieces of documentation back to the open-source ecosystem. And we had some issues with networking, as Francesco mentioned, for example in the Thanos case.

To tackle these, we came up with some design principles and set some ground rules. We chose an open-source ecosystem and open licensing, so that we can plug and play the toolkit that we have. We also tried to automate everything with GitOps and infrastructure as code, so that we can deploy and manage everything en masse. We have a Kubernetes-first approach, although we do have some resources on OpenStack as virtual machines. And we have a kind of decision path for managing Kubernetes resources: if there is an official operator for something, we start with that; if there is no operator, or it is not mature enough, we go with the official Helm chart; and only if neither of those exists do we implement a custom solution. This keeps things tidy and efficient to manage. I must say, we started with GitLab CI/CD and Terraform, and we tried a lot of things, but it was not agile enough for us to deploy all of this, so we ended up using GitOps tools. Next slide, please.

Introducing the Open Observability Stack. By leveraging the open-source ecosystem that is available to us, we came up with this idea of an observability stack. I must say, this is not a product, but an umbrella project with documentation and some code available that you can use, so that administrators and developers alike can build their own solutions, just like we did. And it is automated for multi-cloud deployment. We had this multi-cloud, multi-cluster case as our use case, so we tried to keep that in focus, so that you can take the same methodology we used and apply it to whatever multi-cluster setup you may have. Next slide, please.

And how do we do that? By using a tool called Fleet, by Rancher. It is a GitOps tool with several components, but I want to highlight three of them: cluster labels, cluster groups, and the Fleet controller cluster. If you are using Rancher to manage your Kubernetes clusters, the Fleet controller comes embedded, which was a nice add-on for us since we were using Rancher anyway, but you can also install the Fleet controller cluster yourself. Then there are cluster labels, which are simply labels you set on the clusters so that you can build groups out of them. In our case, we used two distinct labels, observer and observee, to identify which clusters are the observer clusters and which are the observee clusters that need to be observed. Next slide, please.

By adopting this logic, we actually simplified a lot of things. When we need to deploy things from scratch, it is quite easy, because the cluster is created with a specific label that instructs Fleet to deploy the relevant parts of the observability platform. Once you build a new cluster with the label, it auto-enrolls as an observee, so the newly provisioned observee cluster gets the tools we mentioned in the previous slides automatically. It also automates configuration changes and upgrades: if I decide to add a new cluster flow for ingress logs or for specific application logs, it will be propagated to every cluster that carries the label; all I need to do is change the Git objects and push a commit. It also ensures consistency, because I do not want a cluster flow in one cluster sending different logs than another cluster that already has the newest config. And the last pillar is the plug-and-play architecture, because I know people have invested in different technologies at some point and want to stick with them; this approach lets you plug in your own set of tools and apply the same logic to them. Next slide, please.

And this is how we ended up with the EO4EU Observability Platform: you have the observer and observee logic, with the observee clusters running the different workloads and a centralized observer cluster that holds the monitoring, logging, and tracing data sent by the observees. This gives us a single access point for operators and developers alike. It includes the tools we mentioned, and we actually provisioned a cluster just now to show it. Next slide, please. Live demo. I hope it will work, but if not, we have some backup images anyway, so it will be fine. Thank you very much, and I will give the microphone to Francesco for the live demo.

Thank you. So, OK, this is our home dashboard. Practically, as you can see, it is just a summary of our system. On Rancher we have multiple clusters, and we can control Fleet from this same dashboard.
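To illustrate the Fleet mechanics described above, here is a minimal sketch; the repository URL, paths, label key and value, and OpenSearch endpoint are all assumptions rather than the actual EO4EU configuration. It shows a GitRepo that targets observee-labelled clusters, together with an example of the kind of ClusterFlow/ClusterOutput manifest such a repository could propagate to every matching cluster.

```yaml
# Sketch only: Fleet GitRepo targeting clusters labelled as observees.
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: observability-observee
  namespace: fleet-default           # namespace where Rancher registers downstream clusters
spec:
  repo: https://git.example.org/eo4eu/observability-stack   # hypothetical repository
  branch: main
  paths:
    - observee                       # bundles deployed to observed clusters
  targets:
    - name: observee-clusters
      clusterSelector:
        matchLabels:
          role: observee             # assumed label; set on each cluster at creation time
---
# Example content under observee/: a logging-operator ClusterOutput/ClusterFlow
# that ships ingress-nginx logs to the observer cluster's OpenSearch under a
# dedicated, dated index (endpoint and naming are assumptions).
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: opensearch-ingress
  namespace: logging                 # control namespace of the logging operator
spec:
  elasticsearch:                     # OpenSearch is wire-compatible with the ES output
    host: opensearch.observer.example
    port: 9200
    scheme: https
    logstash_format: true
    logstash_prefix: logs-ingress-nginx   # index naming convention per application type
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: ingress-nginx
  namespace: logging
spec:
  match:
    - select:
        labels:
          app.kubernetes.io/name: ingress-nginx
  globalOutputRefs:
    - opensearch-ingress
```

Pushing a change to such a repository is then enough for Fleet to roll it out to every labelled cluster, which is the consistency property described above.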
It is just loading, but from here we can see what is happening inside the system when we update our configuration, which is propagated to the clusters, as Arman said before. From this we have a summary of all the clusters and the whole platform, and we have dashboards for visualizing the data in a finer way, so we can go into specific clusters.

We also decided to add something new: we deployed a new cluster a few minutes ago. By just asking Rancher to create the cluster, the GitOps part is automatic because, as Arman said, Fleet is integrated in Rancher, so the Fleet agent and all the other components are deployed automatically. You just have to create the cluster and, as you can see, all the metrics, logs, and so on are flowing. So it is straightforward to keep consistency across all the clusters, as Arman said.

This is just a general dashboard with general information about the clusters, but we also have finer-grained ways to look at the system. Here, for example, for NGINX we have both logs and traces. It is still processing, but we can choose one cluster and then see the logs on one side and the traces on the other. We can select a single trace by using a request ID; for example, let's take this one. Oops, sorry. By selecting one trace, we can see that there is now just one trace here, and I can go and inspect the trace data. So we have visibility over the entire system, and we can also look at the observability components themselves; for example, I can look at the logging. Oops, something did not happen here, sorry; maybe it was the new cluster. Just refreshing... OK, no. Sorry, this is the beauty of a live demo. We also have something about Thanos here. So this is just an overview of our platform. We have a few minutes for questions if you want to ask something. Do you want to say anything? Can we go back to the presentation? Yes. OK, thank you. Thank you very much, everyone. Thank you.

Testing, testing. Hi, thank you for the presentation. Given that a couple of the components you mentioned, Fluent Bit and the OpenTelemetry Collector, are both going to support the full OpenTelemetry protocol, how do you foresee the evolution of your stack? Would you consolidate that functionality into just one of them, or would you continue to use both for tracing, metrics, et cetera?

This is a question I was asking myself as well. The OpenTelemetry Collector can use Prometheus, and it can use OpenSearch as a data backend, but we had some investment in Jaeger, and therefore we said, OK, let's keep it, because this is a tool that we know. I also see a lot of development going in that direction, and I do not want to pick any tool, but I can see us evolving in that direction at some point. It also depends on the use cases and requirements that you have, because Jaeger has some nice features too, like the dependency graph of the services, which is really easy to visualize from a developer perspective. So I am inclined to say: if that road goes well and we can migrate to it, why not? But for the time being, we have some investment in Jaeger anyway, so we decided to keep it. This is a question that I have asked myself as well, thank you.
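Purely as a sketch of the consolidation the question is about, and not part of the actual EO4EU setup (endpoints and paths are assumptions): a single OpenTelemetry Collector can already handle both signals, for example tailing pod logs with the filelog receiver alongside the OTLP trace pipeline shown earlier.

```yaml
# Sketch only: one collector covering logs and traces, roughly what replacing
# Fluent Bit with the OpenTelemetry Collector could look like.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  filelog:
    include: [/var/log/pods/*/*/*.log]   # container logs, as Fluent Bit would tail them
    start_at: end
processors:
  k8sattributes: {}
  batch: {}
exporters:
  otlp/traces:
    endpoint: jaeger-collector.observability.svc:4317   # hypothetical, as before
    tls:
      insecure: true
  otlphttp/logs:
    endpoint: https://logs-gateway.observer.example      # hypothetical OTLP-capable log backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp/traces]
    logs:
      receivers: [filelog]
      processors: [k8sattributes, batch]
      exporters: [otlphttp/logs]
```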
Hello, great talk, by the way. My question is: how do you make sure that your observability cluster is itself still functioning properly? There are two parts here, because obviously if it is still running you can check it externally, but if you have misconfigured, say, Alertmanager, and it is down, you have basically lost the Alertmanager that should notify you that something has gone wrong.

Yes, I know. I mean, you can have another observer cluster to observe the observer cluster, but this is a chicken-and-egg question: you need to observe the observer as well. You can achieve it with simple logic; maybe you can have an agent-based monitoring solution enabled so that it checks the nodes themselves or the endpoints that you have. But at the end of the day, this becomes something like a pet: you need to take into account that the lifecycle of that cluster is vital for all the other things that you observe. The observer cluster does not need to be a single cluster, this I can say; you can scale the observer cluster as well, but you must also take into account the complexity of doing that. So it is kind of a trade-off. You can say, OK, I am going to keep it simple when observing the observer, and if something goes wrong, I have a problem that I need to take care of, with maybe just a simple webhook to a Slack channel or whatever alerting mechanism you have, so that you can alert the sysadmins. Or you can go wild and have multiple observers in different network segments, so that you get some kind of high availability in that sense. But it depends. If this answers your question: we keep it simple, and if something goes wrong, we treat it as a pet and we try to fix it. All right, thank you.

Hi guys, great talk, by the way. Just a question related to the Thanos architecture: is there any technical reason to use the Thanos sidecar feature instead of, you know, the Thanos Receiver, relying on Prometheus remote write?

Yes, we thought about that. Obviously, when we started this project, we did not know exactly how many clusters we would have to manage, so we were trying to save some resources on the observer side, which is heavy in resource usage. In this case we decided to go with the sidecars, which are well supported in the Prometheus Operator that we used, and we just deployed all the other components on the observer cluster. So we have just one Querier, or two if we want to replicate, on the observer side, and we leave the sidecars to do all the work of retrieving the metrics. In a sense we are reducing the pressure on the observer, at least for the metrics, because logs and traces, and in particular logs, take a lot of resources on the observer side, so we can reduce the pressure a bit. Thank you for your question.
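For contrast with the sidecar approach described in the answer, here is a minimal sketch of the Receiver-based alternative the question refers to, again assuming the prometheus-operator; the Receive endpoint is hypothetical.

```yaml
# Sketch only: each Prometheus remote-writes samples into a Thanos Receive on
# the observer cluster, instead of exposing sidecars for the Querier to pull from.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  remoteWrite:
    - url: https://thanos-receive.observer.example/api/v1/receive   # hypothetical Receive endpoint
```

This moves the ingestion load onto the observer cluster, which is exactly the resource trade-off the answer mentions.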
Hello, thank you for the presentation. I want to know how long it took to go from the humble origins, single cluster, single cloud, to the multi-cluster setup you have now, and how many people are on the team?

Well, OK, if you go back to the first slide, you can see we have three people working on this, dedicated. Unfortunately, Lucia is not with us, but it took the three of us to build this from scratch. A lot of time went into figuring out the correct path to take, and that actually took longer than the implementation phase itself. So we invested a lot of time in figuring out what we needed to use; that took maybe a couple of months of proper research. Then we spent a couple of weeks to actually implement it and put it to use. So, a couple of weeks plus two months. Thank you very much.

Just one thing: we worked on this from Cineca, both Francesco and Lucia. And something very important is that we are a very heterogeneous team; there are also many partners in the project. These technologies gave us the possibility to configure and develop our platform without replicating all the roles for accessing the clusters every time. This is very important, because Fleet gave us the possibility to develop very fast, very quickly, without having to think about all the roles and access to the clusters.

Any other questions? Are you able to share the cost of this stack compared to the overall clusters you are monitoring? Can you repeat? Say the whole set of clusters, with availability and everything, is 100% of the cost: what does the observability stack alone represent of that overall cost?

In terms of the infrastructure cost of maintaining it: we are using community clouds in this project, which is funded by the EC Horizon programme, so we have, let's say, quote-unquote unlimited resources that we can use; we can use the cloud and WEkEO resources, let's say, without paying anything. So for that reason we do not have that kind of metric available right now to give you. But to give you a sense of scale, the observer cluster is ten nodes: three masters, and the rest are worker nodes, and every node has something like four to eight cores and eight gigabytes of RAM, or something very similar. Sometimes we scale it up, sometimes we scale it down, depending on the load that we have, but this is more or less the size. The biggest problem is the storage, because you need to allocate a lot of storage, and I do not know how that would scale in a public cloud where you actually have to pay for it. Thank you very much. Thank you, everyone, for having us. Thank you so much. Thank you.