Hello everyone. My name is Martin, I'm a software engineer at Red Hat, and I work on the Factory 2.0 team. Okay, and my name is Julia, I also work on the Factory 2.0 team as a software engineer at Red Hat. Can you hear me? Yes, okay.

So the purpose of our project is to automate and improve the release pipeline process in Fedora and internally in Red Hat. What you see on the slide are the services that we develop and maintain, at least the major ones. We want to explain briefly what our project does, so that when we describe the monitoring we built for these services in the last quarter, you can follow what we are talking about. Some of these microservices are open source and also run in Fedora, but not all of them, because some exist only internally. You could say the project is divided into two big branches, which we decided to call, for today, the container release pipeline and the gating pipeline.

The first one is the container release pipeline. When containers took off, companies started to use them and wanted to incorporate them into their releases, and Red Hat was no exception. This pipeline is designed to rebuild containers when there is a CVE, meaning a security fix, in some RPM: we need to find out which containers are affected and then rebuild them. That's the main gist of it. Well, not to do the rebuild itself, but to start the process that rebuilds them; that's mostly done by the first microservice. And also to move those artifacts through our release pipeline: create tickets for the QE teams, make sure everything is tested, and then move them to release. This diagram shows how it works; I won't go into details because we don't have time for that, it's just so you can visualize it. The second one we simply call the gating pipeline.
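The "find which containers are affected, then rebuild them" step just described boils down to a set intersection between the RPMs an advisory fixes and the RPMs each image ships. This is an illustrative sketch under assumed data shapes, not Freshmaker's actual interface:

```python
# Sketch of the "which containers need a rebuild?" step. The names and
# data structures here are hypothetical stand-ins: we assume we already
# know which RPMs each container image ships, and which RPMs a security
# advisory (CVE fix) touches.

def containers_to_rebuild(container_rpms, advisory_rpms):
    """Return the container images that ship at least one RPM fixed by
    the advisory, i.e. the images that must be rebuilt."""
    affected = set(advisory_rpms)
    return sorted(
        image
        for image, rpms in container_rpms.items()
        if affected.intersection(rpms)
    )

container_rpms = {
    "httpd-container": ["httpd", "openssl", "glibc"],
    "postgres-container": ["postgresql", "glibc"],
    "tools-container": ["bash", "coreutils"],
}

# An advisory fixing openssl should trigger only the images shipping it.
print(containers_to_rebuild(container_rpms, ["openssl"]))
# -> ['httpd-container']
```

In the real pipeline this selection is only the trigger; the rest of the services then shepherd the rebuilt artifacts through testing and release.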
The gating pipeline is the process to build and release RPMs in Fedora and internally in Red Hat. Basically, the pipeline has many steps, and you want tests between them: you want to be sure that your RPM, or whatever software artifact, can go from one step to the next. Our services, the ones we implemented in Factory 2.0, make sure that it can go to the next step. That's why we call them gating; I'm not sure that's the exact term, but I see each of them as a gate: if the gate is closed, the artifact cannot go to the next step. All these services talk to each other through messages, over a message bus.

This slide is the graphical representation: at the bottom you see the pipeline, and at the top our services. When we started putting our services in production, we noticed that messages were getting lost, or the cluster wasn't working, or an API wasn't working, and people started writing to us on IRC: "this is not working, did you know that?" And we were like: no. Why? Then you have to connect to the cluster and check the logs, and you don't know what is happening or when it started. It's not nice when a user has to tell you that something is wrong with your services; you should be the one who knows. So we decided to put monitoring in place, to detect issues, to know when they happened and why, and to have more information to debug the problem. In the end we realized that what we were missing is what's on this slide, we tried to find the technologies best suited to cover all these cases, and we are going to explain now what we did. The first one was health checks and metrics.
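In practice, exposing health and metrics data to a scraper like Prometheus just means serving a plain-text page (conventionally `/metrics`) that the server pulls periodically. A minimal stdlib sketch of the idea; a real service would use an official client library (e.g. `prometheus_client` for Python), and the metric names here are invented for illustration:

```python
# Toy /metrics endpoint in the Prometheus text exposition style.
# METRICS and its metric names are illustrative, not from any real service.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
from urllib.request import urlopen

METRICS = {"messages_processed_total": 0, "build_failures_total": 0}

def render_metrics():
    # Prometheus text format, simplified: one "name value" line per metric.
    return "".join(f"{name} {value}\n" for name, value in METRICS.items())

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

if __name__ == "__main__":
    # Serve on a random free port and scrape ourselves once, as
    # Prometheus would on its pull schedule.
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    print(urlopen(f"http://127.0.0.1:{port}/metrics").read().decode())
    # prints:
    # messages_processed_total 0
    # build_failures_total 0
    server.shutdown()
```

The pull model is why this is such a good fit for containers: the application only keeps counters in memory and answers HTTP; Prometheus decides when and what to scrape.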
We were thinking about what to use for that; as you know, if you've done monitoring before, there are really a lot of solutions to choose from. Our reasoning was: all our microservices are containers running on OpenShift, so Prometheus was really the best choice. The benefits of Prometheus are that it's very easy to integrate with containers and microservices: the only thing you need to do to get metrics out of your application is add a new RESTful endpoint and configure which statistics you want to export. The metrics are kept in memory on the application side, and Prometheus uses a pull mechanism to collect them from the microservices. Another thing is that it's really good friends with Kubernetes and OpenShift: there are several exporters designed not only to monitor your applications but also the internals of your Kubernetes or OpenShift cluster. About exporters: these are additional microservices that act as a middleman between your microservice and your Prometheus instance; they connect to your container and expose additional metrics, like MySQL database metrics, system metrics, storage metrics, a lot of stuff. Everything already exists, and most of it is quite quick to set up. Prometheus also has client libraries in, I think, ten languages already.

The next part was logs: we needed at least some information about our logs. For that we use an ELK stack, but to tell you the truth I will go through this really quickly, because right now we only use the ELK stack for two of our services, so it's not much; mostly we aggregate logs there and search through them. So what's ELK?
ELK is an acronym of three open-source projects: Elasticsearch, Logstash, and Kibana. Elasticsearch is a search engine: you put in your data, like logs, and then search through them with different queries and statistical functions and so on. Logstash is a data pipeline: you can feed it multiple sources and it will crunch and store your data. And Kibana is for visualization.

After those two technologies we noticed we still didn't have everything we wanted. We have a lot of services that talk through messages, and we had no way to find out whether the messaging was working, whether messages were being sent and received properly. We have the message bus, and we didn't know how to check it. We couldn't do that with Prometheus: maybe you can see that your consumer is up, but you don't see whether the messages are flowing and whether they are correct. So we decided to implement a couple of E2E checks, as we call them, and we're going to explain them now.

The first one is about the pipeline I showed you before. It's a Jenkins job that goes through many steps, the most important one being emitting messages on the message bus. These messages contain fake but realistic data, about some fake package, and after sending each message we check that what we expect actually happened. For example, I emit a message about a new test result, and after this message I expect to see that result in the database of my service, so I check that. If I don't see the item, or I don't see the message on the message bus, something is wrong. The good thing about this is that all the stages are separate, so I can see exactly where the problem is.
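The fake-message check described above is essentially a chain of poll-with-timeout steps: emit a message, then wait until the expected effect shows up in the next service, stage by stage. A minimal sketch; `emit_fake_result` and `result_in_database` are hypothetical stand-ins for publishing on the bus and querying a service's REST API:

```python
# Poll-with-timeout helper: the failing stage is pinpointed by which
# wait_for() call times out.
import time

def wait_for(check, timeout=60.0, interval=2.0):
    """Poll check() until it returns a truthy value or the timeout
    expires. Returns the value on success, raises TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = check()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"gave up after {timeout}s waiting for {check.__name__}")

# --- stubbed pipeline stages (illustrative, not the real services) -------
FAKE_DB = {}

def emit_fake_result():
    # Real check: publish a message about a fake package on the bus.
    FAKE_DB["fake-package-1.0"] = "PASSED"

def result_in_database():
    # Real check: query the service's REST API for the fake result.
    return FAKE_DB.get("fake-package-1.0")

emit_fake_result()
print(wait_for(result_in_database, timeout=5, interval=0.1))  # -> PASSED
```

Retries with a generous timeout also absorb normal message-delivery lag, which matters in practice, as the speakers mention later when discussing delivery delays.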
This checks several services, so I immediately see which one is not working. It runs every 10 minutes and sends an email if the job fails. The downside, though I think every technology has one, is that it's developed with Jenkins and runs in Jenkins, so if Jenkins has a problem we cannot avoid it. For example, sometimes we get an alert that isn't true just because Jenkins wasn't able to pull the code for the test; we get an alert for that, but it's not a real alert, and there's nothing we can do about it. Maybe I will just add that when we run this Jenkins job and send the data, this is done on production. Yes, on production, not on staging. So when you see an alert from production you actually get worried, and then it's nothing, which is not nice.

The second one is the E2E monitoring, or as I call it, the E2E probe. This is still more or less a work in progress. The main idea was this: with the Jenkins check we send artificial data to our services and see how they behave, but I was thinking that maybe we could use the data that really flows through the pipeline. So the plan was to write reusable test playbooks, tests or playbooks and monitoring solutions with Ansible, and then schedule them in the right way, so they execute against your pipeline when they are triggered by some message. For example, as Julia already said, we get a new test result; this test result comes in and is published on a message queue, which tells us: here we go, here's a new message. That triggers my scheduler, which runs the Ansible playbooks in a configured way and waits for the other events that should appear in the pipeline. That way you can check every point in your pipeline with real data and see whether it's working or not.

Another benefit I found is that when you describe your pipeline, or a process in your pipeline, as an Ansible playbook, you find out whether your pipeline is too complicated, because if it's hard to write down, it's probably not a good design. For example, I found out that some of our configuration is hardcoded in the code, so when I wanted to replicate it I had to go and read the code. I think there is an initiative to move this configuration to a normal file where you can easily see what it is.

About the listener component of this probe: it's written in Go and started as a Prometheus exporter. The downside is that it's not really good if you have high-volume queues with lots of messages, because then the application cannot keep up. But what I tested, 30 to 50 messages per second, is still fine.

Here we have some links to the services that are open source, which you can find in Fedora, like Greenwave and Freshmaker; I would say they are the biggest services in their respective pipelines. And of course there are also links to Prometheus. Today we only explained the open-source solutions we decided to use; we also used another one, but we're not going to talk about it, because it's not nice to use non-open-source stuff. Don't use non-open-source stuff. This is the easiest way for containers to get monitoring up and running. I would also like to mention that another team helped us with this, and they have a talk later at 12:30 about holistic monitoring and more about SLIs and SLOs, so if you're interested, go there. I think that's it.

I have a question. The Jenkins check with the fake messages, is it something that you trigger with a timer? Yes, in Jenkins you have the cron feature and you can set it up like a normal cron job. And the job runs every...
Every 10 minutes; actually I think I changed that, maybe it's 15 now, because we had the issue that sometimes the messages take some time to be delivered, sometimes there's a lag of maybe a minute, so we added some more retries, and now the job can sometimes take a while to run.

Also, we had a problem with Prometheus in OpenShift. If you have multiple pods for one application, they sit behind a proxy, so the problem was: you have multiple replicas and you want to scrape the metrics from all of them, but there is a load balancer in between, so we didn't know which replica we were actually scraping. In OpenShift every project has its own namespace and its own local network, and it's the same in Kubernetes. We were thinking how to resolve that, and in the end we went with a solution where the project running the Prometheus instance is allowed to see into the local networks of the other projects in our cluster. Of course, to just deploy Prometheus you don't need admin rights, but to do this you need admin rights on your cluster, so that's a little downside. There is another solution, but it's worse, I would say: Prometheus uses a pull model, it always fetches the data, so you could do the opposite and have the pods push their data. But then you wouldn't know if one of the pods went down. So that's just one thing we had problems with.

Maybe I can explain the E2E probe diagram. The pipeline runs from left to right. When Errata Tool has a fixed CVE it sends a message, and then the next microservice takes over: Freshmaker listens for messages from Errata Tool, and on the right message it starts rebuilding the containers that it knows should be rebuilt. The idea of the E2E probe is: I have an exporter listening on the UMB, and the trigger for the whole check is a message from Errata Tool. When a message from Errata Tool arrives, the scheduler listens for it, identifies the message it wants, and then takes its configuration, where several Ansible playbooks are configured for different points. The first point is the entry point, the trigger. The second point is: did Freshmaker start the rebuild according to this message? So there is a playbook that repeatedly queries Freshmaker's REST API to see whether Freshmaker started the rebuild for this message. There is of course a timeout, which you can configure. If the rebuild starts and is successful, fine, it moves on to the next point. The next point is that when Freshmaker finishes the rebuild, it sends another message to the UMB, the message queue, so I check for that message: another listener listens on some queue, and when the message comes it is taken and injected into a playbook, and the playbook evaluates it. Why did I use Ansible? I didn't want to write this in a general-purpose language, because Ansible is standardized; it's not always easy, but at least I have a standard. Another thing with Ansible is that you can also just write a playbook without any scheduler, run it, and check your pipeline to see if it's okay or not. And so on and so on, it repeats itself all the time, and of course when the whole run finishes successfully you have a record: all the logs and all the messages are saved. If it fails, or the timeout runs out, it fails and you also have the record. I will stop at this point; the problem was that I had a lot of work, and I wanted to get it finished before this talk, but...

This message bus, what do you use? UMB is an internal name in Red Hat; it's a normal message queue, ActiveMQ. In Red Hat we have ActiveMQ, and I think in Fedora it was ZeroMQ. Is that okay?

It was not a short question, but I wanted to ask if you could talk a little about how the second check works: what it checks, what triggers it, and how it works. Yes, I can try in one minute. It checks Greenwave, ResultsDB, WaiverDB, datagrepper and the message bus at the same time. In the first step it checks that Greenwave is okay, that the API is working and the data are consistent. Then it emits a message on the message bus about a new test result, and I query the REST API to see that the result is there. Then I check that Greenwave saw that new, passing result, so Greenwave should now say: all your tests are satisfied. Then I create a new result, but this time a failing one, which triggers Greenwave again, now saying: your tests are not satisfied, you cannot go to the next step. And then I create a waiver that waives the previous failing test, and Greenwave should wake up again and say: okay, now everything is fine again, you can go to the next step. And that's it. Is that good?
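The gating behaviour this last check exercises (pass opens the gate, fail closes it, a waiver reopens it) can be sketched as a small decision function. This is a toy under assumed data shapes, not Greenwave's real API; the test-case name is just an example:

```python
# Greenwave-style gating decision, simplified: the gate opens when every
# required test either has a passing result or a waiver.

def decision(required_tests, results, waivers):
    """Return (go, unsatisfied): go is True when every required test
    passed or is waived; unsatisfied lists the blocking tests."""
    unsatisfied = [
        t for t in required_tests
        if results.get(t) != "PASSED" and t not in waivers
    ]
    return (not unsatisfied, unsatisfied)

required = ["dist.rpmdeplint"]

# 1. a passing result arrives -> gate opens
print(decision(required, {"dist.rpmdeplint": "PASSED"}, set()))
# -> (True, [])
# 2. a failing result arrives -> gate closes
print(decision(required, {"dist.rpmdeplint": "FAILED"}, set()))
# -> (False, ['dist.rpmdeplint'])
# 3. a waiver for the failing test -> gate opens again
print(decision(required, {"dist.rpmdeplint": "FAILED"}, {"dist.rpmdeplint"}))
# -> (True, [])
```

The E2E check walks exactly these three states in order, which is why a single run exercises the result store, the waiver store, and the decision service together.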