Hi, my name is Vladimir Smirnov. I work as a system administrator at Booking.com, and today I want to tell you about our Graphite stack and how we store millions of metrics per second in our Graphite infrastructure.

First of all, why might you want to store your metrics for longer periods of time? One reason I can think of is capacity planning: when the end of the quarter or the end of the year comes, you want to understand how your service grows, what needs to be done, what to buy next quarter and so on, and for that you want historical graphs of how performance is affected by your users. Another big reason is troubleshooting and post-mortems: you want to keep the data so you can understand what actually happened when something broke or didn't work well. And another reason is visualizing data about your business or your service, business kinds of metrics. Those are, I think, the biggest reasons, and of course there are many more.

At Booking.com we chose Graphite several years ago, I think in 2011 or so. So what is Graphite? Graphite describes itself as doing three things really well: kick ass, chew bubblegum, and make it easy to store and graph your data. Basically, Graphite is a set of components tied quite tightly together. It gives you an easy-to-use API: you can send metrics using a plain text protocol, so you can just run bash with echo and netcat and the data is sent, and you can query that data over an HTTP API and get JSON back. That simplicity is one of the reasons Graphite became popular and developers really like it. It's also modular, so you can replace or tune components individually, and if you need to swap something out, you can do it easily.

But when you try to store your performance and monitoring data for real, you end up with a fairly complex setup, because for high-availability reasons you want several data centers or availability zones. You'll have a set of front ends, a set of back ends, a set of storage servers and some metric producers, so the scheme gets complex, and most probably you'll end up with something that looks like this.

There were several problems with this schema, at least for us. First of all, carbon-relay was a single point of failure, so you either need some sort of centralized one or you need to modify your applications to do the failover themselves. It was also really hard to scale, because back in 2011 the built-in clustering was not working well: when you added more servers, it became slower and slower. And if you have a server-wide or data-center-wide failure, you end up with servers holding different sets of data, and it's up to you to synchronize them, which was sometimes tricky.

So we decided to fix all those problems one by one for ourselves. All the stuff we do around Graphite is open source and available on GitHub; the links will come later. First of all, we decided to deal with the single point of failure. We wrote a daemon called carbon-c-relay, which acts as a load balancer. It understands the Graphite line protocol and can do failover and load balancing, and later we implemented the aggregation part of the carbon stack in carbon-c-relay as well.
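Just to make it concrete what that line protocol looks like, here is a minimal sketch in Go of sending a single data point to a relay listening locally. The metric name is made up, and 2003 is just the conventional carbon plaintext port; this is an illustration, not part of our stack.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// The Graphite plaintext protocol is simply:
	//   <metric.path> <value> <unix timestamp>\n
	// 2003 is the conventional carbon plaintext port; the metric name is invented.
	conn, err := net.DialTimeout("tcp", "localhost:2003", 5*time.Second)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	fmt.Fprintf(conn, "servers.web42.cpu.user %f %d\n", 23.5, time.Now().Unix())
}
```

From a shell it's literally the same thing: echo one line in that format and pipe it to nc, which is why it's so easy for developers to adopt.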
Carbon-c-relay can also do some more advanced hashing and so on. To simplify our workflow, we also decided to put carbon-c-relay as close as possible to the actual metric producer: all our servers have carbon-c-relay installed locally, and we just tell our developers, if you want to send data, send it to localhost on this port or to this unix socket, and we'll take care of getting it to the storage, even in case of failures. Then we set up centralized carbon-c-relay boxes, several per environment, which handle the actual load balancing, the copy to the second data center and all the replication.

So what is carbon-c-relay? It's a daemon written in C, and by now it's not a really simple one. It's also very fast: it can route more than one million points per second using only two CPU cores. For our cases the original carbon-relay was not scaling well, and carbon-c-relay turned out to be more than ten times faster than carbon-relay for our workload. It can also do aggregations, and in case of network failures it can buffer your data for some amount of time. You can limit that buffering to keep the OOM killer from coming and killing some of your daemons, but it does help sometimes.

Another problem we decided to fix was the discrepancy between the data on different replicas. The main problem is that if you have some sort of geo-based load balancing and a failure in one data center, your users in one region will see one set of data and users in another region will see slightly different data, and people don't really like that: they come and bug you asking why their graph is broken. So we created a set of daemons that try to heal the data and present the user with a complete series. The daemon queries all the storages that contain the metric in parallel and returns the first response that has no gaps (no null points), so users don't notice our problems and we get more time to actually fix the storages.

This required more than just writing one daemon. We have one daemon, carbon-zipper, running on the front end, which does the querying. At first we tried querying carbon-cache's own cache, but we saw performance problems with that, so we wrote carbonserver, another small daemon that just reads the data from disk and serves it to carbon-zipper. Later this approach showed one problem: because carbonserver was a separate daemon, it had no access to the cache, so there was a small delay before data became visible. For most users that was not a big problem, but when you start doing monitoring, you want to see your data as fast as possible. So last December we wrote a module for go-carbon, an open-source implementation of carbon-cache in Go; the module implements the carbonserver functionality and allows us to serve data from the cache as well. This stack is written in Go and it's very fast: when we were querying carbon-cache's cache we got something like 80 requests per second per server, and now, in our benchmarks, we see values of around two thousand requests per second.

Another problem we saw was metric distribution. We had set up several small clusters, of four to eight or maybe ten servers each, for different purposes.
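To illustrate the healing idea from a moment ago, here is a rough sketch of the merge step: take the responses from all replicas for one metric and fill the gaps in one with points from the others. This is a simplified illustration of the principle, not the actual carbon-zipper code.

```go
package zipper

import "math"

// mergeSeries merges the responses from several replicas for the same metric:
// it starts from the first replica and fills every missing point (NaN) with a
// value from another replica that has it.
func mergeSeries(replicas [][]float64) []float64 {
	if len(replicas) == 0 {
		return nil
	}
	merged := append([]float64(nil), replicas[0]...)
	for i, v := range merged {
		if !math.IsNaN(v) {
			continue // this point is already present
		}
		for _, other := range replicas[1:] {
			if i < len(other) && !math.IsNaN(other[i]) {
				merged[i] = other[i] // heal the gap from another replica
				break
			}
		}
	}
	return merged
}
```

The real daemon also has to fetch all replicas in parallel and deal with timeouts, but the user-visible effect is exactly this: the first complete series wins.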
Coming back to metric distribution: at some point, when we started receiving a lot of metrics, we saw a really interesting pattern, where the busiest server received around 20% more load than the least busy one. We started to think about how to fix this, and we found that the root cause was that our metric names were not playing well with carbon's consistent hashing algorithm, so we started looking around for something better. We found an interesting white paper from Google about the jump consistent hashing algorithm. We implemented it, and it gave us an almost even distribution: the difference was less than 1% in this case. I won't go into deep details about how jump hash works; read the white paper, it will be much easier and much faster for you.

Another thing we tried to address was monitoring-related load. Monitoring can generate a lot of render requests: people tend to run heavy queries like "across all my servers, show me something like the CPU usage of the busiest one", and when they have thousands of servers, that becomes a problem for the front ends running graphite-web. When we started doing monitoring, we saw that people make a lot of requests and graphite-web uses a really large amount of CPU. We tried to fix it, optimize it, play with it, but in the end we started writing our own implementation of the graphite-web API in Go, called carbonapi, and that allowed us to serve a lot more requests. One of the significant wins of switching to carbonapi was that our average response time, on the same number of machines, dropped from around 15 seconds to less than one second. Because it was much faster than graphite-web for our cases, it also allowed more complex queries, and we started encouraging developers and other admins to do aggregations at query time: send the raw data and aggregate it afterwards using Graphite's math functions. With graphite-web, if you have thousands of metrics, it can take a long time even to fetch them, let alone render them. At some point we also split up our code base: all the expression evaluation, all the Graphite functions, are available as a separate library. The other parts are quite tightly coupled with our zipper stack and require a zipper, but you can use the library to implement your own Graphite-compatible API on top of your own storage mechanism.

Another thing we started thinking about is how to actually replicate the data. For example, you have eight machines in one data center and you want some sort of redundancy. You start sending metrics and look at the distribution: with replication factor two, the relay calculates the hash exactly two times, and your metric ends up on two different boxes. But we decided to find out whether there is a better way to distribute those metrics. Another option for those eight machines is to create two clusters of four inside the one data center, forget about replication factor two in favour of replication factor one, and just send two identical copies, one to each cluster. We also tried replication factor one with two separate clusters but a different hashing algorithm in each cluster.
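Since hashing algorithms came up a couple of times: the jump consistent hash from the Google paper ("A Fast, Minimal Memory, Consistent Hash Algorithm" by Lamping and Veach) fits in a few lines of Go. This sketch is just the algorithm from the paper; hashing the metric name down to the uint64 key (for example with FNV) happens before this and is up to the relay.

```go
package relay

// JumpHash maps a 64-bit key to a bucket in [0, numBuckets), following the
// algorithm from the Google paper by Lamping and Veach. Adding a bucket only
// moves about 1/n of the keys, and keys are spread almost perfectly evenly.
func JumpHash(key uint64, numBuckets int32) int32 {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}
```

That near-perfect evenness is exactly the less-than-1% difference between servers that we measured after switching.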
We were wondering which of these options is better in terms of data safety, the probability of losing data and so on. So we created a small program just to simulate failures: we fed it the list of metrics we have, simulated several failures and produced some graphs. For example, this one shows the amount of lost data in the worst case, when two servers that hold some overlapping data fail: with replication factor one you will of course lose 25% of the data, but with replication factor two you will lose only about 3%. The other graph shows the probability of losing any data when two servers fail. With replication factor two, any two servers failing means you lose something, that 3% of the data. With replication factor one, the probability of losing the two matching servers is around 14% (roughly: with eight servers there are 28 possible pairs of failed machines, and only the four pairs holding the same quarter of the data actually lose anything), but then you lose a whole shard. Over time we also had problems with hardware, as you always do with a large-scale setup, and that's why we switched to replication factor one with two clusters per data center.

What we currently have is quite a complex and big setup. We have 32 front ends handling several hundred requests per second. Because those requests are kind of complex, one request usually expands to somewhere between 10 and 1,000 metrics, so at peak that is on the order of 40,000 individual requests per second. We have more than 100 terabytes of data, and because of this we have several hundred storage servers: Graphite really likes SSDs, there is not much you can do about that, and SSDs are kind of small, which is why we have this many servers. In our setup, at this moment, we have more than 2.5 million unique metrics per second, and with four copies in total, two copies per data center across two data centers, that's 10 million data points per second actually being stored. And this value is growing really fast: half a year ago it was less than two million.

We also have a lot of plans. We're currently working on some sort of metadata-based search for Graphite, in a Graphite-compatible way, so: tags. Well, you want tags in 2017. We also want to find a replacement for Whisper, because while Whisper is a good balance in terms of read and write performance, in terms of space it always uses 12 bytes per point to store your data. And in 2017 we have a lot of white papers about compression algorithms and a lot of other systems that implement them, and we want to play with something that might allow us to reduce this to at most eight bytes per point, or maybe even less. One of the problems we still haven't solved is aggregators: we can't use aggregators for monitoring because they are not backed up, they are not redundant, and that's kind of a big problem for us at this moment. We're also thinking that maybe later in the future we might want to replace the Graphite line protocol between the components with something more efficient, because it makes no sense to convert the data from text to binary, then back to text again, and then back to binary to store it. Also, as we are working on carbonsearch at this moment, we already have an implementation, which has some small limitations. The syntax might look a bit weird, and it's also very coupled with our zipper stack.
But at least we can do some simple queries, like: give me the metrics-received counter for all the Graphite storages where the status is "live", for this particular data center. There are still a lot of limitations, though; for example, we can't support AND or OR in the syntax, or anything like that, yet. We also do not store the history of changes at this moment, but we are thinking about how to implement that. The stream of tags is a separate thing for us: we run a daemon that you feed with tags, either through Kafka or an HTTP API or something else, and it stores them in memory. At this moment that is how it works, but we are working on improving it.

We also have some interesting tests in terms of back ends. I have a really big list of potential replacements for Whisper. At some point I tried to count all the time-series databases, or databases that can be used for time series, and I think there are more than 30 of them in open source at this moment. I decided to pick a top five of them to experiment with. At the moment I'm experimenting with ClickHouse, and I get something like 2.4 million data points per second on a single-node machine in terms of storage, which is really nice, but there are still a lot of different things to solve. I also tried to play with InfluxDB, but as we have a Graphite-compatible stack, we need a really fast Graphite-compatible receiver on the storage side, and InfluxDB's receiver was not playing well, so I postponed that testing until I have more time. For ClickHouse, I simply found an open source project that had already implemented something, and it was easy to fix up.

And all our stack is open source. We try to develop all our Graphite-related stuff as open source projects from the start. They are stored on the GitHub accounts of the developers who started them, and we contribute there. Some of the projects are open source developed by other companies; some, like the zipper stack, are developed by us. So you can actually go grab it and play with it. Sorry, there are no Dockerfiles or Docker images for it at this moment, but I hope Go code is easy enough to install. So, any questions?

Thank you for your presentation. Have you ever evaluated Cassandra as a back end for Graphite?

Yes, I tried to experiment with several Cassandra-based solutions. I tried to play with KairosDB, I think, and with, well, I don't remember the name, something that was easy to set up; Cyanite, I think. But one of the main problems of at least some Cassandra-based solutions is that write performance might be good, but in terms of read performance Cassandra sometimes does not behave really well. And because we have around 40,000 render requests per second in total, it's important for us to answer as fast as possible. We tried to measure, for example, how fast we can get the data back, and at this point we got something like several thousand milliseconds of delay for the whole pipeline, and we're trying to keep it as low as possible for monitoring reasons.

Hi, thank you for the presentation, very interesting. I have two quick questions. First of all, how many people are managing this environment? How are you monitoring it? And how many servers are you using for this system?

Yeah, sorry, the microphone on me doesn't work.
So yeah, just to repeat: right now we have a team of two people, and there is a developer who initially developed parts of the zipper stack who sometimes helps us. There are also Hackathon projects at Booking.com; carbonsearch, for example, came out of a Hackathon: one of the teams decided they really wanted to have tags and search over tags, implemented it in a couple of days, and we're starting to use it in production now. But day-to-day management is two people.

In terms of servers, we have 200 back ends, 32 front ends and something like 16 relays. The logic behind the number of relays is that we need spare capacity: we have three or four relays per environment, depending on how big the environment is. Usually one relay can handle basically the whole load of our Graphite stack, but we want redundancy, so we try to keep the number at three at least. Because for us, Graphite is mostly about disk space at this moment; the real problem is the disk space, most of all.

And what was the other question, the monitoring? Well, we monitor Graphite with Graphite, basically. This is sometimes wrong, but it's what we have at this moment. We do try to separate it: for monitoring our stack we have a small cluster only for Graphite-related metrics and we do the alerting based on that, so we have a separate infrastructure for it, and that basically helps.

Two questions. The first is, could you please display the slide just before this one? For a moment, yeah, thanks. And then I'll ask the other question. You talked about replacing Whisper; there's been such a cacophony of proposed time-series databases recently, including Influx and a bunch of others. Can you comment on why none of these are what you want, or what your criteria are that make you reject all of these proposals that are on the open source market of late?

So, if I heard correctly, you are asking what our criteria are for something that would be able to replace Whisper, right?

Yeah, in a simplified way.

So basically, we want something that has a good balance of read and write performance. Right now our carbon setup is capable of writing something like 400K data points per second on a single storage node, or even more, but that's as far as we went in our testing, and we decided that's usually more than enough for us; otherwise it would again become a disk space problem. So our main problem is that Whisper is not space-efficient: 12 bytes per point is too much, and disks are kind of expensive, and servers are also not very cheap. So we want something that has good read performance and good write performance. And also, because we use Graphite really extensively, we can't migrate away from Graphite to something else in a single day, so we need to maintain Graphite compatibility for quite some time, and we need a good Graphite-to-database compatibility layer. And as we are a team of only two people at this moment, we don't have much in the way of development resources, so I'm first prioritizing those databases that already have something in open source that works more or less well for us.
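To put the 12-bytes-per-point problem into numbers: a Whisper data point is a 4-byte timestamp plus an 8-byte double, and the file is preallocated for the whole retention. Here is a rough back-of-the-envelope sketch, using an illustrative retention schema similar to the per-second one mentioned later in the talk and ignoring the small file headers.

```go
package main

import "fmt"

func main() {
	const bytesPerPoint = 12 // 4-byte timestamp + 8-byte double per Whisper point

	// Illustrative schema (seconds per point -> seconds kept):
	// 1s for a day, 60s for two weeks, 1h for ~3 months, 1d for 5 years.
	schema := [][2]int{
		{1, 86400},
		{60, 14 * 86400},
		{3600, 90 * 86400},
		{86400, 5 * 365 * 86400},
	}

	points := 0
	for _, a := range schema {
		points += a[1] / a[0]
	}
	fmt.Printf("%d points per metric: %.2f MB at 12 bytes/point, %.2f MB at 8 bytes/point\n",
		points, float64(points*bytesPerPoint)/1e6, float64(points*8)/1e6)
}
```

With a schema like that, every per-second metric costs on the order of a megabyte on disk; multiplied by millions of metrics and several copies, that is roughly where the hundreds of terabytes, and the pressure to get below 12 bytes per point, come from.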
Once we have something like that, I could try to implement a better layer for, for example, InfluxDB as a separate daemon or something like that, but that will take some time for me.

At the beginning of your talk, you said that collecting all the metrics is also good for capacity planning. But for capacity planning you don't look at a week or a month, you probably look at two years. Is that possible, and how? Because you would be reading possibly all the data.

So the question was how it's possible to do capacity planning if you store data for longer periods of time, right? Well, we actually use the retention periods really extensively. We try to push developers and people to define a good retention schema: decrease the precision of the data over time, do some aggregations and so on. There are several approaches to this. Some people are okay with the default Graphite behaviour; some prefer to do it themselves, because they know their data and they know how to actually reduce it. That helps a lot. Also, we use Grafana a lot to display the data, and Grafana can ask Graphite for a specific number of data points. Migrating to Go also helped us display this stuff a little bit faster than before. And for capacity planning you can afford to wait a minute or even two for your graph, because you usually don't need it every day, every five minutes or every hour.

Hi, if I want to collaborate with you on this project, do you have any detailed upstream reference on how to set this up with all of these different components?

About how to set this setup up: there is no guide yet. We have some plans, maybe, to provide a Docker image or something like that with a reference setup, but again, that takes some time and we usually have something higher priority than this. For most of the daemons there is a README file with example command lines and options. We are thinking about writing a proper guide on how to set this up, with some examples and so on, but for now there are only the README files.

Hello. If I understood well, carbon-c-relay is the only component written in C. Do you have any plans for further development there, going to Go, something like that?

Sorry, it's a bit hard to hear. So, yes, carbon-c-relay is the only component written in C here. It was actually also the first component, I think, that was written. We do have some plans to try rewriting it in Go and see if that works, at least in some cases, for example replacing carbon-c-relay on the client machines, because Go code is easier to deploy, easier to debug, and it's much easier to ensure that it's actually stable when it runs. So yes, we have some plans in that direction.

How are you sending metrics to the relay in particular? How are you encrypting the metrics on the wire, and things like that?

At this moment, all our components talk to each other using the Graphite line protocol when sending metrics. We use our own data centers and our own wires at this moment, so there is no encryption right now; it's mostly the Graphite line protocol. But again, we are thinking that it might be a good idea to replace it, or at least add optional encryption.

So, you said you had 200 machines with SSDs.
How do you manage resharding? You must have been adding many more machines; even at 10 machines you need to reshard, essentially. How do you deal with it? And, as a corollary to this, did you think about actually moving from SSDs to HDDs for old data? You won't need it that fast; data from a year ago, you don't care. Thank you.

Yeah, so how do we do resharding. For a long period of time we had a bunch of scripts written in bash, in a really hackish way, that basically sync the metrics to a new machine and use whisper-fill to actually fill in the gaps. That was the old way. Now Alexey, who is the second person managing the Graphite stack, has migrated this to buckytools, because buckytools has an implementation of the consistent hashing algorithm, the same one we are using in carbon-c-relay. That actually helps to speed things up and to do the distribution in a better way. And about storing the old data on HDDs or something like that: yes, we are thinking that that might be one of the next steps in the future, but first of all we want to replace Whisper with something more efficient before doing this. Because with Graphite it's a little bit tricky: you still need to store everything on disk, and for longer-term storage you would want to reduce the data, and there is no easy way, on the relays for example, to forward the already retention-reduced data to HDD-backed machines.

How do you manage the renaming or deletion of old metrics that are no longer used, or just because the developers changed the naming convention, things like that?

Well, that is also done in a kind of hackish way. Again, it's a set of bash scripts that basically run carbon-c-relay in a test mode, feed it the data, see on which server the metric lands, and then go over SSH and run rm -rf on the metric path. For renaming, we try not to do it at all and ask the developers to take care of it themselves. We have a special cluster of two machines as a playground where you can play with your schemas, do that kind of stuff and so on. In exceptional cases we have some mechanisms to rename things, but we try not to use them at this moment.

Talking about metrics and naming: you have 130 terabytes of data; how long a time window is that, one month, one year, more or less? And also, how do you keep developers from writing strange names? Did you create a nomenclature, documentation the developers have to follow, how often they have to send metrics, or how are you limiting them, so that they don't, for example, fill the entire disk with stupid metrics before you notice it?

Well, we have several retention schemas, depending on the data developers want to send. We have some metrics that are per-second ones; for those it's usually something like one-second precision for a day, 60 seconds for two weeks or something like that, then I think one hour for several months, like three months, and one day for five years, or basically forever. We try to keep it forever, but even then we need to define an end to the retention somewhere. And about the naming, we ask the developers to try not to do that.
Sometimes they do it anyway, so we have alerting: if the disk space is decreasing too fast, we receive an alert and investigate what is actually going wrong and who is trying to fill it up, because sometimes this happens. For example, one of the common cases is developers adding an MD5 hash of something to the metric name, which is different every time. Or, for example, in Perl code, at some point people were sending the memory address of a variable instead of its name. For those kinds of things we have rules that just filter them out, replacing the name with YOUR CODE IS BROKEN, PLEASE FIX IT in caps, or something like that. But we have no throttling, and we do not ask our developers to declare their metrics up front or anything like that.

That's it?