Let's start. Hello everybody. We're here to talk about DepC, a tool built at OVH to compute the QoS of our customers and of our infrastructure. I'm Nicolas, and with me is Antonio; we are two Python developers at OVH, and we built DepC. First of all, the good news: the project was released on GitHub this week for the first time, so you can retrieve it at this URL under the OVH account. Don't hesitate to take a look, use it, and give us your feedback so we can improve it.

During the next 25 minutes we are going to explain the objectives of the project and why we decided to build DepC. Then we'll talk about why this project needs a dependency graph and what the link is with computing the QoS. Then, the most interesting part for you as Python developers: how we compute the QoS of our customers using different methods, namely the rule QoS, the operation QoS and the aggregation QoS. And to finish, we'll see how we use Airflow to schedule tasks every night and compute the QoS of our customers each day.

So, first part: the objectives. Antonio and I are Python developers in the web hosting team of OVH, a French cloud provider. Our team manages around 5 million websites, distributed across around 14,000 servers. Most of the time everything is okay, of course, but sometimes one or more services crash, and even with our monitoring systems, some users can be impacted. So the objective of the project is simple: we need to compute the QoS, the quality of service, of every website, all 5 million of them. The QoS is a simple percentage: 100% means everything was okay; less than that means it was not. And in case of a bad QoS on a particular day, we also need to find the root causes.
So we need to find which servers among those 14,000 impacted our customers, and why. That's the objective. The next part, presented by Antonio, will be about the graph part of the project.

Okay. The first idea for the project was to check the code of every website hosted by OVH, but that's complicated, and it's not really handy. Sometimes it's even the customer's own code that is broken, so it's not our infrastructure. And of course, as Nicolas said, in DepC we want to know the root causes. So the idea of DepC is to use a dependency graph to find them. For example.com, we can imagine this kind of architecture: when all the nodes are working fine, the website is up; but if the database is not working anymore, the website will be down. So we can both find the root causes and also get a QoS for example.com.

To store the graph in DepC, we use Neo4j, which is a graph-oriented database, and Kafka, which is a message queue, to handle the messages. Especially for the web hosting team, we populate the graph in real time: we receive orders constantly, and the infrastructure keeps moving. So we have a Python consumer which handles the messages, validates them, and formats them in order to populate the Neo4j database with this kind of graph. We currently have millions of nodes and relationships in DepC, and it is growing.

Now, let's talk about the QoS. Thank you. Okay. So, Antonio explained why we need a dependency graph in the DepC project, and how we handle the messages and turn them into nodes in Neo4j. Now you can ask us: what is the link between a dependency graph and the QoS? How do we compute a QoS from some nodes in our Neo4j database? In fact, there are two cases.
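To give an idea of what that consumer does, here is a minimal sketch of the validate-and-format step. The message shape, node label and Cypher query are invented for the example and are not DepC's actual schema; the Kafka polling and Neo4j driver calls are left out so the function stays self-contained:

```python
import json


def format_message(raw):
    """Validate a raw Kafka message and turn it into a Cypher query
    creating both nodes and their relationship in Neo4j."""
    message = json.loads(raw)
    for field in ("source", "target", "relation"):
        if field not in message:
            raise ValueError(f"missing field: {field}")
    # MERGE is idempotent: replaying the same message twice
    # does not duplicate nodes or relationships.
    return (
        f"MERGE (a:Node {{name: '{message['source']}'}}) "
        f"MERGE (b:Node {{name: '{message['target']}'}}) "
        f"MERGE (a)-[:{message['relation']}]->(b)"
    )


query = format_message(
    '{"source": "example.com", "target": "filer01", "relation": "DEPENDS_ON"}'
)
```

In a real consumer, this query would then be run against Neo4j for every message pulled from the Kafka topic.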
We have lots of nodes in Neo4j, and for certain nodes we have probes running and checking things: the CPU, the RAM, the ping, the response time of the ping. Lots of checks to see if the server, so the node, is okay. These results are sent to a time series database. You already know what a time series database is: from it we can get metrics telling us whether our servers are okay or not. That's the first case, and we will see together how we compute a QoS from it.

The other case is when the node cannot be monitored. For example, a website cannot be monitored by a probe checking the HTTP status code, because we cannot guarantee the code: maybe the user's own code raises an error. In this case we will not use the time series database; instead, the node will use its parents' QoS to build its own QoS. That's where the operation QoS and the aggregation QoS come in.

So let's start with the rule QoS, using this simple schema. We want to compute the QoS of the website here, which depends on just two dependencies: a storage server and a web server. Later we'll see that we also want to compute the QoS of a group of websites, using a label named "offer". For the filer part, we use a rule QoS.

Here is the graph I wanted to show you before, a picture showing some dashboards, taken from the official website. The idea of the rule QoS is really simple: we analyze some data points, and if these data points are above a given threshold, we say that during this period the QoS was not delivered to our user, so we decrease the QoS. Here is a server named filer01 and its response time for the ping, and we can see there are two problems. This is a representation of some data points; of course we compute the QoS every day, so there are just a few, say ten, data points on the slide, but there are lots of data points in one day.
And we can see that for six data points, the response time is above a given threshold, so the QoS must decrease during this time. We have another type of check, between an interval: it's the same method, but we also analyze a bottom threshold.

Now we'll see some Python code. The idea is to convert the previous picture, a time series containing values, into a time series containing boolean values: just two values, True and False. When there is a problem, I transform the data points into False values. How do we do it? Here is the table representation. Not really interesting, but you can see that we transform some values into a boolean representation.

And here is the module we use. Surely you know pandas: it's a module to analyze and manipulate data using data structures like Series and DataFrames. Here we use it for the first step (we'll show you the other steps later). We start with a dictionary where the keys are timestamps and the values are the response times of the ping. We turn the data points into a pandas Series, and then we apply a simple function to each value: just a lambda saying that everything above the given threshold, here 20, is converted into False. So we transform a time series of values into a time series of booleans.

And here is the result: in the DepC interface, you can see that we analyzed a metric from the time series database and converted it into a QoS. But wait: right now I just have boolean data, and a QoS is not booleans, it's a float number. How can we transform it into this kind of number? Antonio will show you how.

Okay, thanks. So in order to get a QoS, we have several steps. First of all, just imagine the website: we don't have metrics for it, actually.
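That first step, turning a time series of response times into booleans with a threshold, can be sketched like this; the data points and the 20 ms threshold are invented for the example:

```python
import pandas as pd

# Hypothetical ping response times in ms, keyed by timestamp (seconds).
datapoints = {0: 12.5, 60: 18.0, 120: 35.2, 180: 41.0, 240: 15.3}

# Turn the dict into a pandas Series indexed by timestamp.
series = pd.Series(datapoints)

# Everything above the threshold (20 ms) becomes False: the check failed.
booleans = series.apply(lambda value: value < 20)
```

The result is a Series of True/False values on the same timestamps, ready for the next steps of the computation.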
We have metrics for the filer and for Apache, but we don't have metrics for our website, and we want its QoS, because these are customer websites. So the QoS of the website is a combination of the QoS of the filer and of Apache: here, the AND operation between both of them, because when one of the dependencies is not working, the website will be down. We have the booleans for the filer and for Apache, so we can compute the booleans for example.com like that.

So in pandas we have the two Series, for the filer and for Apache, and we want to merge them into a DataFrame. Just consider this subset of data: we don't have the same timestamps for the values, so when we merge the two Series, there are some gaps, with NaN values, because the probes don't send their metrics at the same time. It's a real-life example, actually. We don't want gaps, so in DepC we apply a forward-fill and a back-fill operation on the rows. As you can see, in the previous table there was a NaN value, and the forward fill just propagates the False into the next rows; and when we have a NaN in the first rows, the back-fill operation puts the True there. Then we don't have any more gaps. Good. And we can apply the AND operation between the columns to get the booleans for example.com.

So in pandas we have a DataFrame with full rows, no gaps, and the AND operation. Next, we want to keep only the state changes: since the values are just True and False, we can reduce the amount of data, especially for our next computation; I will speak about that later. Then, in order to compute the QoS, we want durations, so we apply a group-by operation and a sum, and we get the total duration of the False periods and of the True periods. It's a very small subset here, but in DepC we mostly compute over 24 hours. Now, with the True duration and the total duration, we can compute the QoS, the real deal.
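The whole pipeline just described (merge, fill the gaps, AND the columns, keep only the state changes, sum the durations) can be sketched in pandas like this; the timestamps and probe values are invented for the example:

```python
import pandas as pd

# Hypothetical probe results: True = check OK, False = check failed.
# The timestamps differ because the probes do not report at the same time.
filer = pd.Series({0: True, 30: False, 90: True}, name="filer")
apache = pd.Series({10: True, 60: True, 100: False}, name="apache")

# Merge the two Series into one DataFrame; missing timestamps become NaN.
df = pd.concat([filer, apache], axis=1).sort_index()

# Fill the gaps: forward-fill propagates the last known state,
# back-fill covers the leading rows that have no previous value.
df = df.ffill().bfill()

# AND between the columns: the website is up only when both deps are up.
website = (df["filer"] & df["apache"]).astype(bool)

# Keep only the state changes to reduce the amount of data.
changes = website[website.shift() != website]

# Each state lasts until the next change (or the end of the window,
# here the last observed timestamp).
durations = (
    changes.index.to_series().shift(-1).fillna(website.index[-1]) - changes.index
)
up_seconds = durations[changes].sum()
qos = 100 * up_seconds / durations.sum()
```

On this toy data the website is up 40 seconds out of a 100-second window, so the QoS comes out at 40%.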
A few minutes ago I talked about the rule QoS, and we just finished showing boolean time series and what the link is between a boolean time series and a real QoS. In fact, when you want to check the QoS of a server, you don't just check the ping like we showed: you also check lots of other things. In that case you have lots of metrics, and you can transform all of these metrics into boolean metrics and then apply an AND on them, for example. And then it's the exact same algorithm we applied to compute the QoS of the website.

We also have the OR operation available in DepC, and the RATIO and ATLEAST operations. Now, the aggregation QoS is for the offer, because sometimes we don't want the AND, the OR, the ATLEAST or the other operations I showed you: we want the average operation, for example. There is no rule saying that if one website of the premium offer is down, every website of the premium offer should be considered down; that wouldn't be relevant. So we can compute the average of the nodes' QoS, and also the min and the max. That's for the offer, and it's very useful. Now I will let Nicolas talk about Airflow.
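The aggregation step can be sketched with plain Python; the offer and QoS values below are invented for the example:

```python
# Hypothetical daily QoS (in percent) of three websites under the same offer.
websites_qos = {"site-a": 99.8, "site-b": 97.5, "site-c": 100.0}

values = list(websites_qos.values())

# Aggregate the children's QoS instead of AND-ing them: one site being
# down should not mark the whole offer as down.
offer_qos = {
    "average": sum(values) / len(values),
    "min": min(values),
    "max": max(values),
}
```

The average gives the overall health of the offer, while the min points at the worst customer experience within it.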
We can also have some tasks running in parallel, with another task waiting for their results. The best practice is to have idempotent tasks. So here is the web server of Airflow.

Okay, so we've seen together how we push data into the graph database, and how we compute the QoS of each node depending on whether the node has data to analyze in the time series database or not; if not, we analyze its parents' QoS. Now it's time to assemble all of it and use a scheduler to compute the QoS of our customers every night, because in the web hosting team we compute the QoS for each day. We use Airflow as the scheduler. Airflow is a task scheduler, a bit like a crontab, written in Python and made by Airbnb; it's now in the Apache foundation. It manages workflows, or pipelines if you prefer, with their dependencies: as you can see on the right of the screen, some tasks must be done before other tasks.

Why did we decide to use Airflow to compute our QoS? For many reasons, and for this feature of Airflow: if any task fails for one reason or another, we can fix the problem and retry it using the CLI or the web server. We also have access to this web UI and to the logs of everything, because during the night we sleep, of course. (No, not yet. We don't have on-call duty, but we hope to have it. Yeah, of course I don't sleep.)

Okay. So how does Airflow know which tasks to launch, and whether it must launch a rule QoS, an operation QoS or an aggregation QoS? In fact, it's the user who tells DepC how to do it, using a simple JSON configuration, either in the web UI or through the API. The user tells us he needs a rule QoS for Apache and the filer, an aggregation for the offer, and an operation for the website using the filer and Apache nodes.

And to finish, just a few notes about how to create a DAG in Airflow. A DAG is a group of tasks which will be executed one after the other. Here is a really simple DAG; if you want the complete code, you can of course go to the GitHub repository and look at the /scheduler folder. Here we just declare a DAG. The start date is the first of the year, so if I launch Airflow today, it will backfill every task for every day back to the first of January. Then I declare my DAG here, named compute_qos, and here, I think you recognize it, a cron expression saying my DAG must be launched every night at 1:15. The tasks here are not really important, it's just an example, but in the GitHub repository you can see the complete code for computing the rule QoS, the operation QoS and the aggregation QoS.
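The user configuration could look something like this. The label names and the rule/operation/aggregation syntax are invented here purely to illustrate the idea; see the DepC repository for the real format:

```json
{
  "Filer": {"qos": "rule.Ping"},
  "Apache": {"qos": "rule.Ping"},
  "Website": {"qos": "operation.AND[Filer, Apache]"},
  "Offer": {"qos": "aggregation.AVERAGE[Website]"}
}
```

The point is that the user declares, per label, which of the three QoS methods to apply and on which dependencies, and the scheduler derives the tasks from that.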
And here, if it's okay, yeah, here is the interesting part: how to create a DAG using Airflow. Over there, I create four tasks for my four labels: Apache, filer, website and offer. And I create a PythonOperator. In fact, when you use Airflow, you want to launch tasks doing some job: you may want to launch a bash script, send a mail, append a row in a database, and for each kind of need, Airflow provides an operator. In our case, in DepC, we just use the PythonOperator. You can see over there: I create my tasks using the PythonOperator, which provides an attribute named python_callable. In this case, the python_callables are compute_rule, compute_operation and compute_aggregation.

The cool feature of Airflow is that DAGs can be created dynamically using Python code. Here it's sort of static, because I wrote the callables by hand, but imagine that Airflow parses our DepC code and the user configuration, and creates this code dynamically.

And to finish, here is how we tell Airflow the order of the tasks. These two tasks will be launched together, so in parallel, using some workers for example; once they are done, the website task can be launched, and after the website task, the offer task. Here we use the bitwise operators, but you can also use methods like set_upstream or set_downstream. And the result in DepC is the following: over there is the Airflow interface, and we see our DAG with its four tasks: Apache, filer, website and offer. The DAG is launched every night for every team. Here it's just for the hackme team, but we have the web hosting team, the storage team, etc. at OVH, and each team has its own DAG.
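Putting the pieces together, a DAG file along these lines would work. The task ids and the placeholder callables are invented for the sketch; this is not DepC's actual scheduler code, which lives in the /scheduler folder of the repository:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def compute_rule(**context):
    """Placeholder for the rule QoS computation."""


def compute_operation(**context):
    """Placeholder for the operation QoS computation."""


def compute_aggregation(**context):
    """Placeholder for the aggregation QoS computation."""


dag = DAG(
    dag_id="compute_qos",
    start_date=datetime(2018, 1, 1),  # Airflow backfills from this date
    schedule_interval="15 1 * * *",   # cron: every night at 01:15
)

# One task per label: Apache, filer, website and offer.
apache = PythonOperator(task_id="apache", python_callable=compute_rule, dag=dag)
filer = PythonOperator(task_id="filer", python_callable=compute_rule, dag=dag)
website = PythonOperator(task_id="website", python_callable=compute_operation, dag=dag)
offer = PythonOperator(task_id="offer", python_callable=compute_aggregation, dag=dag)

# Apache and the filer run in parallel; the website waits for both,
# and the offer runs last.
[apache, filer] >> website >> offer
```

The last line uses the bitshift composition Airflow provides on operators, equivalent to calling set_downstream by hand.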
And here is the result in the DepC interface, displaying the evolution of the QoS of the filer, of Apache, etc. So we are done with DepC, and if you have any questions, we'll answer them with pleasure. Thank you.