Hello everyone, my name is Augustin. I'm a principal engineer at Amadeus, and I'm also a Prometheus team member. Today I will co-present with Albon, who is the observability architect at Amadeus. We're going to tell you a story about observability at Amadeus. But before that, I have some boring stuff to do: presenting the company. I will be super quick. Who took a flight to come here? Just quickly. Alright, thank you very much, you pay my salary. Why? Because Amadeus is behind every flight booking, every ticket, every bag you check in. It's Amadeus that routes your baggage. So thank you very much. That's it for the company; you can read the slide later if you want. Now let's talk about observability, and Albon will take over.

Right, thanks. We're here for OpenShift, right? So back in 2016, Amadeus decided to look at the cloud, like many of us here. Some context: Amadeus existed before the internet, so we still have some super weird protocols over TCP, etc. Observability back then was a homemade solution. For the observability pillars: logs, we master logs, we ingest 3 petabytes per day today. But metrics? Well, we counted logs to create metrics, and that was not good. And traces? Well, we tried to filter logs belonging to the same transaction and display them in a traces-like UI. This was completely impossible to deploy on the cloud, because it's a huge homemade solution that nobody knows how to install from scratch anymore. So yeah, we had to build something else.

So we started with an exploration, let's say, with four OpenShift clusters, and we deployed Prometheus on them with the basic exporters. And this was fine. As you can see at the very bottom, we were using some quite old versions: OpenShift was in v3.5, Prometheus was still in v1, and Grafana in v4. And we had four data sources manually added to Grafana; that was fine at that scale.
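That early setup can be sketched as a minimal Prometheus scrape configuration pulling from the basic exporters; the job names and targets below are illustrative, not Amadeus's actual configuration:

```yaml
# prometheus.yml -- minimal configuration in the spirit of the early
# exploration: one Prometheus per cluster scraping the basic exporters.
# All targets here are illustrative placeholders.
global:
  scrape_interval: 30s

scrape_configs:
  # Host-level metrics (CPU, memory, disk) from node_exporter.
  - job_name: node
    static_configs:
      - targets: ['node-1:9100', 'node-2:9100']

  # Cluster-state metrics from kube-state-metrics.
  - job_name: kube-state-metrics
    static_configs:
      - targets: ['kube-state-metrics:8080']
```

Each such per-cluster Prometheus then becomes one Grafana data source, which is why four clusters meant four manually added data sources.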
And then we started to instrument our code base, the famous protocol. So we started to get way too many time series. Another goal was to also monitor our private data center and the legacy workload, in order to compare the response times of the applications running on the cloud with the same applications running on, let's say, legacy vApps. So we implemented Thanos on top of the Prometheus stack, to horizontally scale Prometheus, and we added more and more clusters. Thanks to Thanos we have a single data source, and yeah, this kind of hub cluster monitoring a few spoke clusters.

In 2021, Amadeus signed a Microsoft partnership, which means that we are now deploying a lot and a lot of clusters. So today we have about 70 OpenShift clusters on Azure public cloud and about 10 on premise. This was made possible by OpenShift v4. That was the big enabler, because of the automatic installer of OpenShift and the whole operator approach that let us deploy that many platforms. And so across all of these platforms, we now have one billion active time series.

How is that possible at scale? Well, we have everything as code, and we've also built what we named an end-to-end bootstrap. I will go through the tools we're using for this, but at the very top we have an operator that triggers the installation of a full Azure subscription, the OpenShift clusters, as well as the internal middleware and applications that Amadeus needs to deploy its software. The entry point is ServiceNow: the operator takes a few inputs, like which region you want your Azure subscription to be located in, the cluster size, how many OpenShift clusters you need, how many nodes, things like that. And then everything is automated with Ansible.
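The hub-and-spoke layout can be sketched as a Thanos Query on the hub cluster fanning out to the Prometheus/Thanos Sidecar pairs on each spoke; the endpoint addresses below are illustrative placeholders:

```yaml
# Sketch of the container spec for a Thanos Query deployment on the
# hub cluster. Each --endpoint points at a spoke cluster's Thanos
# Sidecar StoreAPI (addresses are illustrative).
containers:
  - name: thanos-query
    image: quay.io/thanos/thanos:v0.30.2
    args:
      - query
      - --http-address=0.0.0.0:9090    # the single data source Grafana uses
      - --grpc-address=0.0.0.0:10901
      # One StoreAPI endpoint per spoke cluster:
      - --endpoint=thanos-sidecar.spoke-1.example.com:10901
      - --endpoint=thanos-sidecar.spoke-2.example.com:10901
```

Grafana then needs only this one Query endpoint as a data source, and queries are fanned out, deduplicated, and merged across all clusters.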
So we have an internal tool that we named AUD; basically this enables vNet peering, network automation, and firewall automation. Then we're using Terragrunt and Terraform to build all the Azure managed services we're using, so Key Vault, the subscription, etc., and to run the OpenShift installer as well. Once OpenShift is installed, we register the cluster into some Argo CD servers, and thanks to cluster generators and Helm charts in Argo CD, we're able to automatically trigger the deployment of applications on the newly created OpenShift clusters.

And so as I said, we have everything as code. On this hub cluster we have also built an operator to deploy alerts, to push the alerts onto each new cluster so it gets automatically monitored. And so, yeah, we've built a UI to create Prometheus rules, and we open sourced, let's say, only the parser: Augustin wrote the parser and gave it to Prometheus, so that's why you now have nice auto-completion in Prometheus today. Unfortunately, the tool itself is not open source; it's too big and, well, there's too much Amadeus-specific stuff inside, so we can't open source it.

And for dashboards, we've tried to have dashboards as code. There are a few tools that exist, such as Grafonnet, to build dashboards as code. We also followed a presentation given at PromCon 2019 in order to have change management on your dashboards. But in the latest Grafana release there is a regression on the snapshot API, and Grafana, let's say, said: well, what you're doing is not a good use case for Grafana, and we won't fix it in the next release. So that's why I would like to introduce Perses.

Thank you. So I will talk about Perses. Albon already talked about why we are moving to open source to have a dashboard solution, but I will talk more about the roadmap and the motivation, why we are doing this.
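The Argo CD step can be sketched with an ApplicationSet using the cluster generator, so that every cluster registered in Argo CD automatically receives the same Helm-based applications; the names and repository URL below are illustrative:

```yaml
# Sketch of an Argo CD ApplicationSet: the cluster generator emits one
# Application per cluster registered in Argo CD, so newly bootstrapped
# clusters get their workloads deployed automatically.
# Names and the repo URL are illustrative placeholders.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: monitoring-stack
spec:
  generators:
    - clusters: {}            # one entry per registered cluster
  template:
    metadata:
      name: 'monitoring-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://git.example.com/platform/monitoring-chart.git
        targetRevision: main
        path: chart           # a Helm chart in the repository
        helm:
          values: |
            cluster: {{name}}
      destination:
        server: '{{server}}'  # filled in by the cluster generator
        namespace: monitoring
```

Registering a new cluster in Argo CD is then the only step needed; the generator picks it up and renders an Application for it on the next sync.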
So, the first motivation for Perses: if you look at the CNCF landscape, in the observability area there is absolutely nothing to display your data. You have many backends to store your metrics, to store your logs, to store your traces, but nothing to display them. So that's the first idea. And I would also like to thank Chronosphere and Amadeus for sponsoring this project, because without Chronosphere and Amadeus, I wouldn't be here to talk about it, of course.

We want to be GitOps friendly. What does it mean? It means we want full static validation of the dashboard, which is represented by a JSON or YAML file. For that, we are providing a CLI and a Cuelang schema. And with that, you can fully validate your dashboard in a CI with whatever technology you want: GitHub Actions, GitLab, whatever you want.

We want to be fully compatible with Kubernetes. What does that mean? It means deploying dashboards with CRDs, custom resource definitions. And later, we also want data source discovery, which means that instead of creating yet another HTTP configuration, we will just provide the Kubernetes discovery, and then hopefully Perses should be able to connect to Prometheus, Thanos, whatever data source you want.

And also, because it can be painful to edit a JSON file, even if I hope you will find the data model great, you should be able to run Perses locally to edit your JSON file with the UI, and it will of course persist the data directly to your local file, which you can then commit, create a pull request with, et cetera.

We want to be embeddable. Why? Because, well, we all have the same issue: we all want to display our data, but it's always a pain to install another tool to do it. So we said, okay, let's create a bunch of packages that you can import into your own UI to display the metrics, the traces, the logs. Today, Chronosphere is doing that; it's importing the packages.
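The static-validation workflow can be sketched as a CI job; the CLI name `percli` and its `lint` subcommand are assumptions here, so check the Perses documentation for the exact invocation:

```yaml
# Sketch of a GitHub Actions job that statically validates Perses
# dashboard files on every pull request. The `percli lint` invocation
# is an assumption -- verify the command against the Perses docs.
name: validate-dashboards
on: [pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint dashboard definitions
        run: |
          for f in dashboards/*.json; do
            percli lint --file "$f"
          done
```

The same lint step runs identically on a developer's laptop, so a dashboard that passes locally also passes in CI before the pull request is merged.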
Well, because they built it, so it's super easy for them to import. But now, Red Hat is considering these packages to improve the OpenShift console, the web console, sorry. And there is also an ongoing discussion to add them to Prometheus and PromLens, and why not to Thanos, and perhaps to Alertmanager as well.

The full roadmap of Perses is to support everything in the observability area. We started with Prometheus, to display the metrics. Later, we want to display the traces, and the logs after that. If you are interested in this project, it's available on GitHub, and we have a chat available on Matrix. And if you are eager to contribute, we have a contribution guideline. And, yeah. Thank you very much.

Just one more. Sorry about that. Just to conclude: well, end-to-end automation is key if you want to manage a lot of clusters. That was our journey, but if it starts for you now, it's way easier. Basically, everything as code. There's the Prometheus Operator that is available in OpenShift directly, and OpenShift monitoring with user-defined projects for your own application metrics. If you have multiple clusters, well, there's what Red Hat has named ACM, Advanced Cluster Management; that's the exact same technical stack I've just explained. And, yeah. Observability is not Amadeus's business. So that's why we try to contribute to the open source projects, and we would of course advise you to do the same. And, yeah. That's it. Thank you.