Welcome everyone. Thanks for joining us for this session. My name is Guillaume, this is Etienne. We are both DevOps engineers at a French tech company called Octo Technology, based in Paris. So, does anybody run Prometheus in production in this room? Okay, a few of you. Does anybody run it on an architecture with no containers, just plain servers? No one? Okay. So today we are going to show you how we run Prometheus, how we configure it, and how we use it in our daily job at what we call a French FedEx company; it's the equivalent of FedEx here in the US. So that's us. You can find us on Twitter and LinkedIn, and you can find us at the event for the rest of the week.

A little bit of context about the French FedEx company. The main goal is to move a package from point A to point B and to track every event of the package's lifecycle. So every time a package is moved or touched, it is scanned, and an event is generated and injected into our architecture. Our architecture then computes and aggregates these events to generate relevant business data: for example, you can know where your package is almost in real time. Then we store and serve these results to our customers. In terms of figures, it's almost 2 million packages delivered per day. These packages generate around 20 million events in our architecture, and we have almost 200 processes running 24/7 to compute those events.

To sustain this, we use a messaging-based architecture. It is composed of almost 90 servers; there are no containers, it's all instances running on a private cloud on Nutanix infrastructure. Every time a package generates an event, that event goes into Kafka topics. Then you have Spark Streaming applications reading these topics from Kafka, computing the events, and storing the results in customer databases. Behind this, we have some front-end APIs for our customers to retrieve the results and the events, and to re-inject messages into the Kafka topics. The whole infrastructure is managed with Ansible. And at the beginning of the project, we didn't have any monitoring: we were blind on the infrastructure. We could not see whether it was working properly or whether any of the components was failing.

So the first thing we did was to set up a monitoring infrastructure and retrieve some metrics. The first layer we began to watch was the system metrics: we set up the Prometheus node exporter on every server of our architecture to retrieve system metrics like CPU, RAM and so on, deployed with Ansible again. The second layer we watch is what we call the middleware layer. This is where we monitor each of our components, like Kafka, Spark, Cassandra or ZooKeeper, and we have several different exporters: the JMX exporter for Java-based middleware like Kafka or Spark, specific exporters like the Elasticsearch exporter or the HAProxy exporter, and of course some home-made exporters that we wrote ourselves. And the last layer that we ended up monitoring was the application layer, that is, the Spark Streaming applications and the APIs developed in Scala. This is where the developers publish their own metrics to give us business data.

The monitoring architecture we set up is simple: we have only one Prometheus server, plugged into an Alertmanager and a Grafana for dashboarding. Then beyond this we have all of our communication and DevOps tools: Slack for receiving Alertmanager notifications, mail, and Jenkins to trigger jobs remotely. Our Prometheus scrapes all of these endpoints, and in terms of figures again, we have almost 250 endpoints that are scraped every 15 seconds, more than 15,000 metrics in our Prometheus, and around 15 gigabytes of data scraped every day by the Prometheus server.

So, as Guillaume said, the first step when you want to start building a monitoring infrastructure is to get some basic metrics about your systems, metrics relative to your CPU or your RAM for example. For that, Prometheus provides different exporters, and the first one is the node exporter. This is a binary written in Go, and just as simple as that, you run it on your server and you get basic metrics about it: RAM, CPU, file system. You also have the ability to extend what the exporter collects by using collector flags. Our systems run CentOS, so everything is wrapped in systemd, and with the systemd collector flag we are able to scrape metrics relative to systemd, for example the states of the different units. It's very useful to have monitoring on systemd.
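To make that concrete, here is a minimal sketch of running the node exporter with the systemd collector enabled. The exact flag syntax depends on the node exporter version (this is the 0.15+ style; older releases used a -collectors.enabled list), and the unit name is just an example:

```sh
# Run the node exporter with the systemd collector enabled
# (it is disabled by default; flag style is node_exporter >= 0.15).
./node_exporter --collector.systemd

# Metrics are then served on :9100/metrics, including per-unit states:
#   node_systemd_unit_state{name="kafka.service",state="active"} 1
#   node_systemd_unit_state{name="kafka.service",state="failed"} 0
```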
So the second layer that we want to monitor is the middleware layer. To do that, you have two different ways. The first way is to use a specific exporter: there is an Elasticsearch exporter, an HAProxy exporter, a ZooKeeper exporter. These exporters scrape the component's own metrics and then translate them into Prometheus metrics. The second way, when you have a Java component, is to use a Java agent. This agent, the JMX exporter, is provided by the Prometheus project: you attach it to your component, and when the component runs, it exposes the metrics relative to your component, and then Prometheus scrapes them. And you have metrics from your Java middleware in your monitoring system.
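As an illustration, here is roughly what attaching the JMX exporter agent looks like; the jar path, port, and config path are placeholders:

```sh
# Attach the JMX exporter to any JVM: -javaagent:<jar>=<port>:<config>.
java -javaagent:/opt/jmx_prometheus_javaagent.jar=7071:/etc/jmx_exporter/my-service.yml \
     -jar my-java-service.jar

# For Kafka specifically, the agent is usually injected through KAFKA_OPTS:
export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent.jar=7071:/etc/jmx_exporter/kafka.yml"
```

The agent then serves the component's JMX beans as Prometheus metrics, here on port 7071.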
So now we are monitoring the systems, we are monitoring the middleware. The last layer is monitoring your application, and that's your devs' job. Prometheus provides different client libraries: you can find libraries in Ruby, in Go, in Python and more. These libraries give your developers the ability to define their own metrics and to expose their own business data directly from the application. In our case, our developers build applications on Spark Streaming, so they use the Scala client library, and with it they are able to inject metrics directly into Spark. Then, when we scrape the Spark middleware, we get the developers' metrics into the system.

So in our stack we also use the Alertmanager. The Alertmanager's role is to handle the alerts fired by Prometheus rules: when a rule fires, the Alertmanager matches the alert using its labels, and you can send the alert to different receivers. There are receivers for Slack, for example; you can send email; and you can send to custom webhooks. Using a webhook, we trigger Jenkins jobs remotely, and if the Jenkins job runs, say, an Ansible playbook or a Chef recipe, you get the beginnings of self-healing directly in your platform. We'll show a sketch of that routing configuration in a moment.

So now we have metrics on all three layers of our infrastructure; we have thousands of metrics, and it's time to build some dashboards. To do that, you have to query your Prometheus, and Prometheus provides different functions to query your data. We particularly love two of them. The first one is topk. topk is very useful to extract the top series when you have a lot of endpoints. The second one is predict_linear. This is very useful when you want to extrapolate data using simple linear regression: for example, you are able to say whether in six hours your file system will be full or not. That is very useful when you want to prevent outages in your system.
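Two hedged examples of those queries. The metric names are the old node exporter names from the Prometheus 1.x era (node_cpu, node_filesystem_free); newer node exporter versions renamed them, and the label matchers and thresholds here are illustrative:

```
# Top 5 instances by CPU usage, across all our endpoints:
topk(5, 100 - avg by (instance) (rate(node_cpu{mode="idle"}[5m])) * 100)

# Will the file system be full within six hours? Extrapolates the last
# hour of free-space data and fires if the prediction goes below zero:
predict_linear(node_filesystem_free[1h], 6 * 3600) < 0
```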
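And here is a minimal sketch of the Alertmanager routing we described above, sending everything to Slack and routing self-healing alerts to Jenkins as well. The channel, URLs, label, and token are placeholders, not our real configuration:

```yaml
route:
  receiver: slack-ops
  routes:
    # Alerts carrying this label are also sent to the Jenkins webhook.
    - match:
        selfheal: "true"
      receiver: jenkins-restart
      continue: true          # keep routing so Slack is notified too
receivers:
  - name: slack-ops
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX    # placeholder
        channel: '#monitoring'
  - name: jenkins-restart
    webhook_configs:
      # Placeholder for a Jenkins remote build-trigger endpoint.
      - url: http://jenkins:8080/job/restart-app/build?token=SECRET
```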
So, that was a lot of theory, and obviously we don't have access to the production environment from here, so we prepared a little demonstration. You can see the schema here: we have a little Python application which is reading from and injecting into a single Kafka node, and behind this we have our whole monitoring infrastructure: one Prometheus server, one Grafana, again for dashboarding, and the Alertmanager plugged into Jenkins and into Slack. On this architecture we have several endpoints, some of them there to simulate system metrics. All of the code for this demonstration is available in this GitHub repository, and we will run you through it.

So the first step when you set up your Prometheus is to define the endpoints. You have different ways to define them, and in this demo we defined our endpoints using a static configuration file (you'll see a sketch of it after this walkthrough). In the configuration we have defined different endpoints, so we get metrics from the different layers: metrics from our business application, the Python application; metrics from the Kafka middleware; and metrics from the node exporter.

Now that we have the endpoints and the metrics, we are able to build some dashboards in our Grafana. On one dashboard we can have different metrics, different curves, different graphs relative to each layer of the system.

In this demo we have a little Python application, so I will show you how simple it is to get some metrics directly out of a Python application. Don't judge me, I'm a newbie in Python. I used the Python client library and I defined a metric; I increment my counter, and then the library exposes the metrics on an HTTP server (a minimal sketch of this follows below). So in my Prometheus, when I look for metrics, I can find the metrics coming directly from my Python application.

This Python application injects data into Kafka and reads it back, so we are able to see the behavior of Kafka when it is hit by my application. And every 45 seconds, the application raises an error. This error is caught by my Prometheus and raises an alert: you can see in Prometheus that every 45 seconds the alert is triggered and sent to the Alertmanager, and the Alertmanager then remotely triggers a Jenkins job. This Jenkins job simply restarts the application and sends a notification into our Slack. Yes, so you can see in Slack that the application is running and that it triggered some alerts. So you can see that it's very simple to get quite an advanced feature.
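Here is a minimal sketch of such a static configuration; the job names, ports, and targets are reconstructed for illustration, not the demo's exact file:

```yaml
# prometheus.yml -- one static scrape job per layer of the demo
scrape_configs:
  - job_name: python-app        # application layer: our Python business app
    static_configs:
      - targets: ['localhost:8000']
  - job_name: kafka             # middleware layer: Kafka via the JMX agent
    static_configs:
      - targets: ['localhost:7071']
  - job_name: node              # system layer: the node exporter
    static_configs:
      - targets: ['localhost:9100']
```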
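The Python side, in the spirit of the demo, looks roughly like this; the metric name and the processing loop are illustrative:

```python
import time
from prometheus_client import Counter, start_http_server

# An application-level counter; name and help text are illustrative.
MESSAGES = Counter('demo_messages_total',
                   'Messages read from and written to Kafka')

if __name__ == '__main__':
    start_http_server(8000)   # the client library serves /metrics here
    while True:
        MESSAGES.inc()        # increment on each processed message
        time.sleep(1)
```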
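And a sketch of the kind of alerting rule behind that self-healing loop, in the Prometheus 1.x rule syntax we were running at the time (2.0 moved rules to YAML). The unit name is a placeholder, and the selfheal label is the hypothetical one matched in the Alertmanager sketch earlier:

```
ALERT DemoAppDown
  IF node_systemd_unit_state{name="demo-app.service", state="failed"} == 1
  FOR 30s
  LABELS { selfheal = "true" }
  ANNOTATIONS {
    summary = "demo-app is down; Jenkins will restart it via Ansible",
  }
```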
So right now we have full-stack monitoring on our architecture: we are monitoring the three layers, application, system and middleware, and we have alerting and self-healing. But it is about much more than watching dashboards and having metrics: it allows us to do other things in our daily jobs and it changed our way of working. As you saw, with self-healing and alerting we got rid of some manual actions, so we can focus on delivering more value and on improving the overall architecture.

With some of the queries we showed you, we can predict a failure and correct it before the outage happens, which means less business loss. Of course, it also means faster troubleshooting and root-cause pinpointing, because we can use the dashboards and the metrics we have instead of logging in to servers and doing manual steps. The fourth point is tuning and improving the middleware configuration: because we can see all of our middleware working live in our dashboards, we can see which one is overloaded and which one is underused, and tune the configuration to improve overall platform performance. And because we can see the whole logical flow of an event through our architecture, we can see which component is a bottleneck or not. Maybe your Kafka looks slow, but it is because the reads into your Cassandra are really slow. So we can improve the overall platform performance.

So that was the ops side, but the dev side and all the other teams are also using it. As an ops, obviously I use my metrics and my dashboards every day for everything. But when we built the system, we decided to open the dashboards to every person involved in the project: the developers, and the business team too. Giving your developers the ability to watch the dashboards brings some real benefits. The first benefit is improving the testing of the application: as a developer, I can see the impact of my release directly on the system by looking at the dashboard. It's a good way to improve testing, to benchmark the application, and globally to improve the quality of the code in your company. The second is that when, as a developer, I can see the dashboards and the status of the infrastructure, I am more confident and I have more visibility into the platform. Having the relevant metrics, the relevant insight from the platform, is a good way to involve top management, to involve everyone, and to give a shared view of your system. And then, since with Prometheus your developers are able to inject their own metrics directly, they can build their own dashboards. So this is a good way to give your developers more responsibility.

So everyone is using it, and it's easy to set up. But we ran into some problems along the road, and we wanted to share with you some tips and tricks that we learned. The first one: it's easy to set up, as I said, but it's more difficult to master, because you will have a lot of metrics and you will not know what to do with them. It's time-consuming to continuously improve your metrics, your triggers, and your alerts so that they stay useful. Sometimes, when you troubleshoot something, you will find that the metric is not in your dashboard, or that it's not the right dashboard, or that an alert was never configured. This is work you need to do continuously: keep the right metrics, adjust your thresholds, and prevent what we call alert fatigue. At the beginning you will want an alert on everything, so you will see red things blinking everywhere; but over time you will not watch them anymore, because they are useless or not meaningful alerts. So you have to continuously improve and change what you already have.

The second tip is an obvious one, okay? All of your applications, servers, and monitoring platforms need to be in sync in terms of NTP. Otherwise, you will have problems with the timestamps on your metrics. And the second point is more important: the clock of the browser you use and the clock of the Grafana server must not be too far apart. Otherwise, when you query a specific time range and the clocks are not in sync, you may have no data points available, because the requested time range will not be the same on your browser and on your Grafana server. So be careful with this: mind your NTP. This is a true story, it really happened on our project: you think everything is down, but okay, it's just the clocks that are far apart.

The third thing is what we have called the scraping-interval trade-off. Again, this is a true story. On the left, you can see that at the beginning we set a scrape interval of five seconds; the scrape interval is how often our Prometheus queries our endpoints, so our 250 endpoints. You can see that the CPU consumption was at almost 75%, sometimes peaking at 100%. This is because we use Prometheus 1.5, not Prometheus 2.0 yet. So what we chose to do was to set a higher scrape interval of 15 seconds, and you can see that it reduced the CPU consumption almost by half. This is why we call it a trade-off. If you have a low scrape interval of five seconds, you have almost real-time data, because the lag between the data on your server and the data in your Prometheus is only five seconds; but it is CPU-intensive if, like us, you have only one Prometheus server, so you need a robust Prometheus server, or a bigger architecture with aggregation, HAProxy, and so on. On the contrary, if you use a higher scrape interval, you get what we call the metrics lag effect: with 15 seconds, the data in Prometheus is 15 seconds behind the real data on our servers; but it uses a lot fewer server resources, as you can see on the graph, and you can sustain more endpoints with the same amount of resources, one Prometheus server in our case. So choose your scrape interval wisely, based on your business requirements and the number of endpoints in your architecture.
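In configuration terms, the whole trade-off is one line in prometheus.yml; the numbers here are the ones from our story:

```yaml
global:
  scrape_interval: 15s   # was 5s: near-real-time, but ~2x the CPU on Prometheus 1.5
```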
And the last tip, the golden tip, and what we want you to remember from this presentation: Grafana and Prometheus are not used only by ops or infrastructure people. They are used by everyone: the dev team, the business and management teams, everyone on your product team. So open them to all of your teams right from the beginning, so that you have the right metrics for everyone, the right alerts, accurate data in your dashboards, and correct queries, and so that everyone can be involved and use it.

So the next steps on the roadmap of our project. The first is obviously to upgrade to Prometheus 2.0 to improve the performance of the whole monitoring system. The second is high availability: for now we run a single Prometheus, so we want to explore advanced features, and especially high availability. As Guillaume said, every day we are continuously improving our triggers, our thresholds, and our alerts, and we will keep going that way. To improve investigation, we will start to set up Grafana dashboard drill-down, so it will be a way to troubleshoot faster and more efficiently. We have the beginnings of self-healing with our Alertmanager and Jenkins, so we will keep improving it, because we are lazy people. And as we said, our infrastructure is running on Nutanix, and yet we do not monitor that layer. When you want to benchmark your application, your system, it's better if you can see each layer, and for now we don't have that one. So we will add some exporters on that underlying infrastructure, because sometimes the underlying infrastructure can cause you problems too, and if you don't watch it, you will never know what happened. True story again.

So, the takeaways from our presentation. The first one: with Prometheus you are able to have a lot of metrics, a lot of data, so do not hesitate to instrument your whole infrastructure, and you will have full-stack monitoring. The second: it is very easy to start an exporter, to start Prometheus, and to get your first data. It's very fast to deploy, but it's long to master: you continuously have to improve your dashboards, your triggers, and your thresholds. Work with your users: do not hesitate to share your data with everybody, with every person involved in the project; you will get a lot of benefits. People will be more confident in your platform, the business team and top management will be more involved in your project, and it will improve the quality of the developers' code, because the developers will be more confident in your platform and can see directly the impact of their releases. So having full-stack monitoring will change the way of working for the ops team, and for the dev team too. Prometheus has an exporter for almost every component, so you will probably find the exporter that you need. And if you don't find it, you can easily code your own, because the Prometheus format is just a key and a value and you have a metric. So if you don't find your exporter, do not hesitate to write your own and open-source it for the whole community. So thank you, that's all from us. If you have any questions, we have, I think, maybe five or ten minutes.

What triggers the job remotely? It's the Alertmanager, which matches the rules fired on the Prometheus side, and then Jenkins runs an Ansible job. So, Ansible. No, the application raised an error, so the systemd unit status changed to failed, and with the node exporter we can watch the service status. So when it failed, an alert was triggered in the Alertmanager, and this alert triggered a Slack notification and a remote job in Jenkins: just an Ansible playbook to start it again. No, that's just basic. No, as we said at the beginning, we use Ansible to deploy the architecture, but also the monitoring infrastructure. So we use Jinja templating with the Ansible inventory to populate the targets and to generate all of the YAML configuration files.
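As an illustration, that templating looks roughly like this; the group name and port are invented for the example, ours differ:

```yaml
# templates/prometheus.yml.j2 -- rendered by Ansible from the inventory
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
{% for host in groups['all'] %}
          - "{{ hostvars[host].ansible_host | default(host) }}:9100"
{% endfor %}
```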
In the demo, it's just a systemd restart. But in real life we run Ansible playbooks for different cases: for example, we have a job that checks whether a file system is full or not, and a playbook that runs on the target server to clean the logs. So again, there is an Ansible playbook behind it. Windows? No, nothing, sorry: all of the architecture is on CentOS 7, we don't have anything else. Yeah, inside. Yeah, inside, but deploying... not now, but we are asking ourselves this question: maybe we should deploy it outside and not on the same platform. Thank you guys.