The next talk is about Graphite at scale. Welcome.

Hello everybody. Today I'm going to talk about Graphite and what we are doing with it at Criteo. First, let me introduce myself: I work on Graphite with the observability team at Criteo. Before that, I was working at Google, so yes, like the two previous speakers, I was also doing monitoring there.

So, who here is running Graphite in production? Still plenty of people using it, so this talk might be useful for you.

Today we're not really going to talk about Graphite itself, but about BigGraphite, which is what you may want to use when you have a lot of data. As you might have realized, Graphite is a nice tool when you just want to collect metrics from a few applications: given a small set of metrics, you can push them to Graphite, display graphs of them, and build dashboards on top of it; that basically works perfectly.

Let's look at how it works internally, because it's important for understanding the rest of the talk. Graphite is divided into two components. The first one is carbon. That's the daemon you send metrics to: all of your applications and servers periodically, say every minute, send metrics to carbon, which then persists them to disk. The default database behind carbon is called Whisper. Each datapoint is a metric name, a value and a timestamp, for example `my.metric 42 1360000000`, and each metric gets one Whisper file. So periodically your host sends a hundred or so metrics (CPU usage, memory, everything), carbon receives them over TCP, UDP or whatever transport you want, and writes each metric to its own file. So that's what it does.
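The carbon plaintext protocol described above is simple enough to sketch. Below is a minimal illustration in Python; the metric name, host and helper names are just examples for this talk (2003 is carbon's default plaintext port):

```python
import socket
import time

def format_carbon_line(metric, value, timestamp=None):
    """Format one datapoint in carbon's plaintext protocol:
    '<metric path> <value> <unix timestamp>\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{metric} {value} {int(timestamp)}\n"

def send_to_carbon(lines, host="localhost", port=2003):
    """Ship a batch of datapoints to a carbon-cache or carbon-relay over TCP."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall("".join(lines).encode("utf-8"))

# One CPU datapoint for one host:
line = format_carbon_line("servers.web01.cpu.user", 42.5, 1360000000)
```
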
You also have carbon-relay and carbon-aggregator, which sit in front of carbon-cache, the daemon that actually writes to disk. You can use the relay, for example, to duplicate a set of metrics to another cluster, and the aggregator to compute things like sums over one minute. But we won't go into those here.

So now you have an application sending metrics to carbon; how do you make graphs out of that? The component that does that is Graphite Web. Graphite Web is basically a web application that reads these files back from the file system and either renders a graph directly, or returns the data so that another frontend can graph it. Graphite Web has a UI, the one with the metric tree and the graph area, and, more importantly, an API. The API has two main endpoints. The first one is used to find metrics, which is particularly useful when you want to autocomplete something, for example. The second one returns the points you ask for: in this example you ask for the sum of my.metric.* over the last 10 minutes, and it gives you back the sum of all these metrics over that window. So that's Graphite Web, and that's it.

So how do we use Graphite? We're relatively big, not as big as Amazon, but still: we have six data centers and tens of thousands of servers and applications, all emitting metrics. We're at on the order of a hundred million metrics now; we read around 220,000 points per second and write around 800,000 points per second. One more important thing: when we started to scale Graphite, we already had thousands of existing dashboards, written by some thirty teams distributed across two continents. You can't just say "Graphite doesn't work anymore, switch to something else." That's impossible.
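To make the two endpoints concrete, here is a sketch of how a client might call them; the host and the `format=json` choice are illustrative, but `/metrics/find` and `/render` are Graphite Web's real endpoint paths:

```python
from urllib.parse import urlencode

def find_url(base, query):
    """Build a /metrics/find request, as used for autocompletion."""
    return f"{base}/metrics/find?{urlencode({'query': query, 'format': 'json'})}"

def render_url(base, target, frm="-10min"):
    """Build a /render request returning datapoints for a target expression."""
    return f"{base}/render?{urlencode({'target': target, 'from': frm, 'format': 'json'})}"

url = render_url("http://graphite.example.com", "sumSeries(my.metric.*)")
```
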
So the constraint for us was: fix the scaling issues, but keep these existing dashboards working exactly as before. We've seen the two components of Graphite; here's the overall architecture. In a simple setup you'll have just your application, carbon, Graphite Web, and that's it. Since we're pretty big, we can't just do that. So in each data center we have three carbon relays listening for points from every application in that data center and forwarding those points over the WAN to the main Graphite cluster. In the main cluster, a relay distributes the points across multiple carbon-caches, which write to the database, which is then read by Graphite Web.

Why a centralized cluster? Because when one of your users wants to see how a service is doing, you don't want them to have to know which data center's UI to look at. So you centralize everything in one place, which makes things harder, because you have a bigger cluster to take care of. Also, a slightly different point: what do you do when one of your data centers fails? You fail over to a second one. So the carbon relays in each data center duplicate the traffic to two different destination data centers, and the user just queries whichever of them is up.

So what's wrong with that? That's the architecture we already had a year and a half ago, and it did not work so well for us. The software was working, but we were spending a lot of time keeping it working, for several reasons. As we've seen, Whisper writes one file per metric. So what do you do when you want to add a server? You need to move some of the files somewhere else and merge them back; everything is complicated. There are tools to do that, but it used to take us one week just to add one server. That's not a reasonable amount of time.
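The relays distribute metrics over the downstream carbon-caches; a common way to do that (and what carbon's consistent-hashing routing is built around) is a hash ring, so that each metric always lands on the same cache. A toy version of the idea, not carbon's exact algorithm:

```python
import hashlib
from bisect import bisect

class HashRing:
    """Toy consistent-hash ring in the spirit of carbon-relay's
    consistent-hashing routing (illustrative, not carbon's implementation)."""
    def __init__(self, nodes, replicas=100):
        # Place `replicas` virtual points per node on the ring.
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, metric):
        # First ring position at or after the metric's hash, wrapping around.
        idx = bisect(self.keys, self._hash(metric)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
```

The point of the ring is that adding a node only moves a fraction of the metrics, instead of reshuffling everything.
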
The clustering, at the time (it's better now), was not very nice either. Let's imagine you manage to shard all your data on ten nodes. Then one of them stops answering, and you get a 500 for every request, because that one node is not working. It's better now, but it used to be like that, and it's not optimal: the more nodes you have, the more likely you are to have one failing. And some queries that look simple at first will just overload all the nodes. Also, as you've seen, we have two mirrored data centers, and sometimes one of them goes down and comes back up later. We wanted to be able to reconstruct the missing data so that users would not see a gap. The tooling to do that exists, but again, it would take one or two days to merge everything back, even on SSDs and very nice hardware.

So it worked, but we did not like spending all our time babysitting it and running scripts. And the thing is, most of these issues have already been solved by actual distributed databases. If you look at Cassandra, Riak and the like, they have all solved these problems already, so why should we try to solve them again? People have built this kind of pipeline with databases already: you write your points to the database, you get them back, and you're done. If you need more capacity, you just add nodes to the database; it rebalances in the background, and you don't see anything. That looked good, and that's why we started the BigGraphite project.

So first, databases have solved this already, but maybe somebody had built a Graphite on top of one before us? People have tried. There were existing time-series databases we could have used, but none of them offered real Graphite compatibility.
Most of them were also single-data-center, and we wanted some kind of cross-data-center redundancy. There was Cyanite, which also used Cassandra, but at the time it used Elasticsearch as well, and we wanted to maintain only one database. It also did not behave exactly like vanilla Graphite, and, as we've seen, with thousands of dashboards we want the exact same behavior; we don't want anything to change, so Cyanite was not a good fit for us. Another database claimed Graphite compatibility, but again, it was not exact compatibility. We looked at maybe ten time-series databases, and we found nothing that covered all of this. So we started working on something ourselves.

The goal was an architecture like this: carbon receives the metrics and writes them directly to Cassandra, and Graphite Web reads them back from Cassandra. The idea is that when one of the components becomes overloaded, we just add instances of it without having to do anything else. Say we need more UI capacity because a lot of people are hitting the frontend: we just add one more Graphite Web. Or one more carbon-cache, or one more Cassandra node; for the operator, you add a machine and that's it. No more week-long rebalancing. That's the idea.

So how do you do that? How do you swap Graphite's database for Cassandra? The good thing is that recent versions of Graphite come with plugin support, and writing plugins for Graphite is not super hard. On the carbon side, you basically need to support three operations: one to check whether a metric already exists in the database, one to create the metric if it doesn't, and one to add datapoints to the database. So you have a plugin class where you need to implement these three methods.
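The three carbon-side operations can be sketched with a toy in-memory backend; the class and method names here are illustrative, not the exact plugin interface:

```python
class InMemoryDatabase:
    """Sketch of the three operations a carbon database plugin must provide:
    check whether a metric exists, create it with its metadata, and append
    datapoints. Names are illustrative, not the real plugin API."""
    def __init__(self):
        self._metrics = {}   # metric name -> metadata
        self._points = {}    # metric name -> list of (timestamp, value)

    def exists(self, metric):
        return metric in self._metrics

    def create(self, metric, retentions):
        self._metrics[metric] = {"retentions": retentions}
        self._points[metric] = []

    def write(self, metric, datapoints):
        self._points[metric].extend(datapoints)

db = InMemoryDatabase()
if not db.exists("my.metric"):
    db.create("my.metric", [(60, 7 * 24 * 60)])  # 1 point/min, kept one week
db.write("my.metric", [(1360000000, 42.0)])
```
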
And for Graphite Web, you need to implement two things. The first one is finding the list of metrics matching a glob: say the user asks for my.metric.*, you need to return the list of metrics that match. The second one is, given one metric and a time window, returning the points for that metric. Graphite Web takes care of all the rest. So it's quite simple, right? You implement these few methods and you have Graphite on Cassandra, or on whatever else.

So what do you store in your database? Two things. First, the points. A point is a pair of a timestamp and a value, the value being a float. But, as with Whisper's retention, you need to store these points at multiple resolutions. You could store one point per second for one year, but that would take a lot of space. So what we do instead is store one point per minute for one week, then one point per hour for one month, and then one point per day for one year. This way people keep long-term data without it costing too much. That's the write path for points. On the read path you get a metric, a time window and a resolution. It's important to optimize the storage for the read path, which does not have exactly the same access pattern as the write path.

The other thing you need to store is the metric names, to be able to browse them. And not only the names, but also the metadata associated with each metric: the resolution, when the metric was last written, the aggregation and retention settings, that kind of thing.

We decided to go with Cassandra because we already ran Cassandra, but we could have done it with Riak or another database; Cassandra worked well for us. Now I'll try to explain how you can do this with Cassandra. It's not as simple as you would think. So, first, points. The main thing you store in a time-series database is the points. It could be simple, right?
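The multi-resolution retention above boils down to re-aggregating points into coarser buckets. A minimal sketch (averaging is just one possible aggregation; the retention tuples mirror the policy from the talk):

```python
from collections import defaultdict

# Retention policy from the talk, as (resolution_seconds, duration_seconds):
# 1 point/minute for a week, 1/hour for a month, 1/day for a year.
RETENTIONS = [(60, 7 * 24 * 3600), (3600, 30 * 24 * 3600), (86400, 365 * 24 * 3600)]

def downsample(points, resolution, aggregate=lambda vs: sum(vs) / len(vs)):
    """Aggregate (timestamp, value) points into buckets of `resolution` seconds."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % resolution].append(value)
    return sorted((ts, aggregate(vs)) for ts, vs in buckets.items())

minute_points = [(0, 1.0), (60, 3.0), (3600, 5.0)]
hourly = downsample(minute_points, 3600)
```
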
You could have one table mapping a metric to a set of points: the partition key would be the metric, the timestamp would be the clustering key, and then the values. That works, as long as you don't have a lot of points for the same metric and you don't try to expire old points. Because if you do, you get a lot of tombstones; I won't describe that problem in detail here, but basically that row would keep growing for a year, and we didn't want that.

So we did something slightly more complicated, but still not bad. Instead of having one row per metric, we divide the points into blocks of a fixed duration, which we can configure. One row holds a block's worth of points, and after that duration we simply start a new row. When we want to expire old points, this lets us drop a whole row at once instead of deleting cells inside the row, which is much more efficient. Another thing we do is have one table per resolution, which lets us use different settings per table: the points from today we want to cache aggressively, while the points from last year we might put on slower disks or compress harder. So, one table per resolution.

In the schema it looks like this: you have the metric and the start of the block as the partition key, because you don't want to repeat the metric name all over the place, then the offset within the block as the clustering key, and the value. That works pretty well. It gets even more complicated when you add replication and multiple clusters writing to the same database, but that's the idea. So that's it for points: we know how to write and how to read. Let me go back.
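The block-and-offset layout can be sketched as follows; the table definition and the numbers (60-second resolution, 1440 points per block, i.e. one day) are illustrative, not the project's exact schema:

```python
def block_position(timestamp, resolution, points_per_block):
    """Map a timestamp to its (partition, clustering) coordinates:
    the start of the time block this point belongs to, and the point's
    offset inside that block."""
    block_span = resolution * points_per_block
    block_start = timestamp - timestamp % block_span
    offset = (timestamp - block_start) // resolution
    return block_start, offset

# Illustrative per-resolution table, in CQL:
SCHEMA = """
CREATE TABLE points_60s (
    metric text,
    block_start bigint,
    offset smallint,
    value double,
    PRIMARY KEY ((metric, block_start), offset)
);
"""

pos = block_position(86520, resolution=60, points_per_block=1440)
```

Expiring a day of old data then means dropping whole `(metric, block_start)` partitions rather than deleting individual cells.
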
When you write, you append one value at the end; when you read, you select the metric, the block you want, and the range of offsets you want inside that block, and that gives you the points.

But as you might remember, Graphite's plugin system also asks you to return the metrics matching a glob, and for that you need an index somewhere. Basically, a Graphite metric name has components separated by dots, and you're allowed to put wildcards on the components. For example, fish.*.nemo.* matches fish.something.nemo.foo. You need an index that supports that kind of query. It looks easy, but when you have 100 million metrics, the names alone are on the order of 10 gigabytes; you can't just grep through them, you need something smarter.

The first idea was to use Elasticsearch, and it would have worked; as a matter of fact, we have tools in the repository to import all the metric names into Elasticsearch. But it was hard enough to migrate to one new database, and we didn't want to learn how to operate two of them. So we tried to do it directly in Cassandra. Cassandra's secondary indexes are not the best, but there is some support for this.
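Graphite's glob semantics, where a star matches within a single dot-separated component and never across dots, can be reproduced like this:

```python
import fnmatch

def glob_match(pattern, metric):
    """Match a Graphite-style glob against a metric name, component by
    component: '*' matches exactly one path component, never across dots."""
    pat_parts = pattern.split(".")
    metric_parts = metric.split(".")
    if len(pat_parts) != len(metric_parts):
        return False
    return all(fnmatch.fnmatchcase(m, p)
               for p, m in zip(pat_parts, metric_parts))

ok = glob_match("fish.*.nemo.*", "fish.something.nemo.foo")
```
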
Starting with Cassandra 3.4 there is SASI, the SSTable-Attached Secondary Index, which lets you index columns and run prefix searches on them. So what we've done, given a path, is split it into its components: in this example, component 0 is "fish", component 2 is "nemo", and we also mark the end, so we know the length of the path. We store each component in its own column and index each of them individually. So if somebody asks for fish.*, we look for rows where component 0 equals "fish" and the length is 2. You express constraints on all the indexed components, and Cassandra returns the intersection of all these constraints. You can read more about how this works in the SASI design doc.

Some more examples. For a plain metric name with no wildcards, you just express an equality constraint on every component. For a star, you simply leave out the constraint on the component where the wildcard is, and that gives you your results. The thing is, Graphite does not only support stars: it supports braces (one of several alternatives), character ranges like 0 to 9, single-character wildcards, and so on, and SASI doesn't support those. So what we did at the beginning was to replace the complicated expression by a prefix where possible. Say the first component is match{ed,ing}: we would select rows where component 0 starts with "match", ignore the rest of the pattern, and apply the full pattern as a post-filter.
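The translation from a glob to per-component index constraints might look like the sketch below. The constraint representation is made up for illustration, but the idea follows the approach just described: equality or prefix per component, a length constraint, and wildcard components left unconstrained:

```python
def glob_to_constraints(pattern):
    """Turn a Graphite glob into per-component constraints suitable for a
    component-indexed table (a sketch of the SASI approach).
    Returns (path_length, {component_index: constraint})."""
    parts = pattern.split(".")
    constraints = {}
    for i, part in enumerate(parts):
        if part == "*":
            continue  # pure wildcard: no constraint on this component
        if part.endswith("*") and "*" not in part[:-1]:
            constraints[i] = ("prefix", part[:-1])   # e.g. 'match*'
        elif "*" not in part:
            constraints[i] = ("eq", part)            # plain component
        else:
            # e.g. 'a*b': constrain on the prefix, post-filter the rest
            constraints[i] = ("prefix", part.split("*", 1)[0])
    return len(parts), constraints
```

The length constraint matters: fish.* must match two-component names only, so the query is "component 0 = fish AND length = 2".
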
Later, we improved on that: in this case we expand the alternatives, one query where the component is "matched" and another where it's "matching", and run both in parallel. That's the second thing we did, and it worked pretty well. As long as we had, I think it was around 15,000 metrics, it worked well; beyond that it stopped working so well, and sometimes we would spend multiple seconds finding metrics. That doesn't sound long, but when you have 100 graphs to display and each of them takes 5 seconds, it's not really optimal. So what we've done lately is, instead of using SASI, use the Lucene index for Cassandra: we convert the Graphite expression into a Lucene expression and let Lucene do what it knows how to do. We're still not using Elasticsearch, so we don't have a second system to operate, but we get a much better index, and it all stays inside Cassandra.

So what can you do with it? We wrote all that, we deployed it, and now all our Graphite clusters are running on BigGraphite. We have a lot of capacity; it's absorbing one million points per second on about 20 nodes, which is a lot. What it's not yet good at is storage efficiency: we're still at around 16 bytes per point after compression, mostly because we haven't tried to optimize that yet. As you may remember, the most important thing for us was not to lose a week to add one node, and that worked: we're running about 20 nodes per cluster, and at some point we realized we had partitioned wrongly. We just wrote a new Chef recipe, re-installed all the machines one by one, and we were done, which was pretty nice compared to spending weeks reorganizing everything. We also don't have two clusters anymore, we have three: imagine somebody does a drop keyspace by mistake; since it's replicated, we would lose everything. So we have two replicated clusters plus one that is isolated, as a backup.
Since the backup is isolated, restoring from it is something where you're on your own for an hour or two. So, how can you use BigGraphite? First of all, you install it, create the schema, import the data from Whisper, set the storage finder in Graphite Web and the database plugin in carbon, and you are done. That's not the most efficient way to convert a Whisper-based cluster, though. What you can do instead is write to Cassandra while reading from both Cassandra and Whisper: there is a BigGraphite-plus-Whisper plugin that merges the two. You do that, import the historical data in the background, and as soon as you are done you remove the Whisper files and you're fully on Cassandra. So it's pretty easy to migrate. There are some links here; the most interesting one is the document in the project that explains how it works and why we made these choices.

One more thing worth mentioning: at Criteo we also use Prometheus, and we want our users to use either Graphite or Prometheus, but to be able to find their Prometheus metrics in Graphite. So we built a tool called graphite-remote-adapter, which writes Prometheus metrics into Graphite and lets Prometheus read them back from Graphite. It's kind of a bridge for users. Things we still want to work on: we need better write buffering, because today there is a delay between the moment a point is written and the moment the user can read it, and we need to work on that. And we need to optimize reads; it works, but it could be faster.

One last thing that might be interesting, since this is a monitoring talk: how do you monitor your monitoring? We run everyone's monitoring on Graphite, so we wanted to make sure that it works. We did two things. For Graphite Web, every minute we try to create metrics and read them back, and we check whether that succeeds.
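The dual-read migration trick, reading from both backends and merging the results, can be sketched as follows; the finder functions here are stand-ins for the real Whisper and Cassandra backends:

```python
def merged_find(pattern, finders):
    """Query several backends (e.g. Whisper and Cassandra during migration)
    for metrics matching `pattern` and merge the results, deduplicating.
    A sketch of the dual-read idea, not the actual plugin."""
    seen = {}
    for finder in finders:
        for metric in finder(pattern):
            seen.setdefault(metric, True)
    return sorted(seen)

# Stand-in backends: the old cluster still has 'a', the new one has 'c'.
whisper = lambda pattern: ["my.metric.a", "my.metric.b"]
cassandra = lambda pattern: ["my.metric.b", "my.metric.c"]
result = merged_find("my.metric.*", [whisper, cassandra])
```

Users see the union of both stores, so dashboards keep working while history is back-filled into Cassandra.
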
We also store the latency for reads and writes. And every minute we send points from every data center and check that they are written correctly; if they aren't, we mark them as lost, so we get the percentage of points we are losing. We also use the timestamp as the value of these probe points, which lets us measure how long it takes for a point to get into the database: that's the write delay, the time it takes for a point to become readable. So those are the measurements we have, and basically we target 99% availability for Graphite, about one second of query latency, and fewer than 0.5% of points lost or later than the target delay, which is around two minutes. Beyond that, we get paged. Any questions?

Thank you very much. We have three minutes left, so please stay seated and be quiet so that we can hear the questions and answers.

First, thanks very much for the talk, really cool to see Graphite, which once was unscalable, working again. I just wanted to know: did you take a look at the stuff the guys from Booking.com built, like carbon-c-relay and the rest of their carbon tools?

So, we talked to them, and when we did, it looked like what they built has a lot of differences and does not behave exactly like vanilla Graphite, so we did not go with it. But if you don't have as many existing dashboards as we do, that's probably something that works well; I think it does work well, it just doesn't behave exactly like vanilla Graphite.

I have one question. You mentioned 20 terabytes of data points, which is amazing. How many different time series does that represent? And do you have any way of quantifying how many time series are written but never read a single time?

So, we have 20 terabytes before replication; with replication it's more like 80, since the replication factor is 4. That's about 100 million time series. And we do know which ones are being used, because we record the last time each time series was read.
We don't update that record on every read, because we don't want to pay that cost every time, but we do record it with a timestamp. So we are not computing the statistics yet, but we could, and we could use them to advise teams to stop emitting some of these metrics.
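The self-monitoring described earlier, probe points whose value is their own send timestamp, can be sketched like this; the function and the numbers are illustrative:

```python
def probe_stats(sent_timestamps, readback, now):
    """Compute the loss percentage and the worst observed delay for probe
    points. `readback` maps timestamp -> stored value; since each probe's
    value is its own send time, `now - value` bounds the ingestion delay."""
    lost = [ts for ts in sent_timestamps if ts not in readback]
    loss_pct = 100.0 * len(lost) / len(sent_timestamps)
    delays = [now - value for value in readback.values()]
    return loss_pct, max(delays, default=0.0)

sent = [100, 160, 220]
readback = {100: 100, 160: 160}  # the 220 probe point is not readable yet
loss, delay = probe_stats(sent, readback, now=230)
```

Alerting on `loss` and `delay` against the targets from the talk (points lost, target ingestion delay) is then straightforward.
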