Thanks, Francesc. So, hi. My name is Matt Bostock. I'm a platform engineer working for Cloudflare in London. I'm interested in distributed systems and performance, and until September I was studying for a master's in computer science at night school for the last two years, which was hard but super interesting. So today I want to talk to you about building and designing a distributed data store, which I did as part of my master's final project. This won't be a talk about stuff I'm working on at work; it's really a personal project, and the project is called Timbala. That's the logo. That's actually the second logo, because logos are important. It was originally called Athens DB, so if you see references to Athens DB, that's why. Timbala is a distributed time series database, or would be if it was finished. Distributed systems are a huge topic and I won't be able to explain everything during this talk, but hopefully you'll be able to take something away. It's quite a broad talk, and I'm going to talk really about how I approached the problem, so it's more about the design of building a system like this, and doing that in Go, rather than going into very deep technical details. Just a disclaimer: Timbala is not production ready, so please, please don't use it for anything you care about in production.

So what is distributed? Why am I talking about distributed systems? Well, a distributed system is a system where you essentially have coordination between networked computers. That's how Wikipedia defines it, and I think the coordination part is really important, and it's really hard, as we'll find out. So why distributed? Why do we need a distributed system? First of all, we want to survive the failure of individual servers: if an individual server dies, we want the system to keep working. We also want to be able to add more servers to meet demand. This is a distributed time series database, so if we have lots of users sending metrics, we want to be able to scale up the number of servers to cope with that demand, otherwise known as horizontal scaling. There's a great list of fallacies of distributed computing that came out of Sun Microsystems, and I think it's got some great things in it, like "the network is reliable". If you're running a distributed system that tells you the network must be reliable for the system to work, be very distrustful of it. This kind of thing makes building distributed systems interesting.

The use case for Timbala was to create a durable, long-term storage system for metrics. I've worked a lot with Prometheus, and really what I wanted was a place to store my Prometheus metrics over a long period of time, say five to ten years. It was important that this system could store multi-dimensional metrics, so metrics defined using labels or tags, and it also had to be capable of storing more metrics than could be accommodated by a single commodity server. So why not just use the cloud, right? Everyone's using the cloud; you could just use something like S3 and put your data in there. Well, first of all, it wouldn't make for a very good master's project, so that was the first thing. But also, sometimes you want to run stuff on premise, maybe because you don't want to put your data in the cloud, and you have enough data that you need a system that can handle a reasonable amount of data.
But maybe it's not so much data that you want to run a system like, say, Hadoop, which is a big, complex system and can be difficult to operate. The other use case I had in mind was that the system needed to be highly performant, so it needed to be able to ingest a lot of metrics very quickly. I also wanted it to be really easy to operate. I think this is really important with distributed systems: you have to bear in mind how they're going to be operated in production, so being able to see what is happening in the system at any point in time was really important.

So what are the requirements? Well, first of all, the system needs to be able to shard the data, so it needs to be able to store more data than could fit on a single node. By sharding, I mean essentially spreading data across multiple servers by splitting it up into chunks. The system also needed to replicate data: we want to make copies of data so that if a single node dies, we still have a copy of the data on one of the other servers. I mentioned throughput. The system needs to be highly available for data ingestion specifically. I focused on high availability for the write path because when you're reading data, when you're querying the system, a human can always retry. You want to avoid that if at all possible, but if worst comes to worst, they can do that. If your write path is not available, then the data that you need to ingest is going to start backing up, you're going to have back pressure, and then when you try to catch up, you need to ingest your normal traffic plus the traffic that is backlogged as well. So we want to try to avoid that and really make sure that the write path is as available as possible.

And operational simplicity: it needs to be simple to operate and maintain. I wanted to keep the number of configuration options to a minimum, so fewer things to tune, which hopefully would translate to fewer things to get wrong. And also add good instrumentation, so logging, metrics and tracing, so that you can see, if a certain client hit a given server, which other servers that request had to traverse to be able to serve it. The other requirement I had was interoperability with Prometheus. I mentioned that the original use case I had in mind for this was to store Prometheus data long-term, so I wanted to reuse Prometheus's best features, which are the query language and its data model. It also has APIs already defined, so I didn't want to have to redesign those APIs when I could reuse them. That helped me to focus the project, and it allowed me to concentrate on the distributed part of the system, which was the part that was most interesting for this project.

It's easy to think about distributed systems if you have a really small amount of data because, you know, you could just put it on one box and it's not really a distributed system. So I think it helps to at least have a target to work with in terms of numbers. I just looked at Cloudflare's OpenTSDB installation, which is where we keep metrics for long-term storage currently. In mid-2017 we were storing 700,000 data points per second, so 700,000 metric observations ingested per second, and 70 million unique time series. And those are multi-dimensional metrics, so 70 million unique metrics.
So this kind of gave me a goal. I was never going to achieve those numbers for my master's project, but it at least helped me to think about what the constraints of the system would be and how I'd need to design it.

So how do you build a distributed system? Where do you start? You have all these servers talking to each other; it seems really complex and difficult. I had a hard time coming up with an MVP, and I didn't have a lot of time to build this because I needed to do it in the evenings. So I started thinking about ingestion versus querying, the read and the write path. That was one of the first places I started. And then I thought, well, how can I also reduce the scope of what I need to do? Reusing third-party code wherever possible. Reusing those Prometheus libraries was one of the most beneficial decisions in the project. I reused the PromQL query library, so I didn't need to reimplement my own query engine, and I reused the API code as well, so the system would be API compatible with Prometheus and any existing integrations would work with it.

So I came up with some milestones. The first one was to just get the system working on a single node: no coordination, no communication between servers, just being able to store data on a single server and then query it back out afterwards. The next milestone was to actually get the servers talking to each other, start sharding the data across the nodes, replicating the data so that we have enough copies to survive single node failures, and then also look at rebalancing data between servers so that we can recover from server loss. Then there were other things I wanted to do if I could take this beyond a minimum viable product. One of those things was read repair: the ability, as you're reading data, to see if a given server is missing some data and basically tell it, hey, you're missing data, you should have a copy of this. Hinted handoff is the ability to store data on behalf of another node in the system, so that if that node's down, you can hold onto its data until it comes back up and then send that data across to it. And the other thing I wanted to look at was active anti-entropy, which is a fancy way of saying having a background process that runs and tries to detect missing data in data that you might not read very frequently. Read repair only works for data that you're reading, and with metrics you're often reading just very recent data, so active anti-entropy would allow you to repair data that is maybe further back in time. But I was pretty sure I wouldn't be able to finish that for this project.

So I was like, okay, this is cool, this is really exciting, I'm going to start reading research papers. And this was really cool: I read all these things about NUMA and write amplification and how to work with SSDs and mmap and hashing and all this kind of stuff, and it was super interesting. But there's so much to work with here, and I needed to start small; I needed to get something working. So let's ignore that for now. Back to the essentials: what did I need to think about for the system to work? Well, the servers need to be able to talk to each other; they need to coordinate. Peter Bourgon wrote a blog post about his system for ingesting logs, called OK Log, and that was really influential to me in terms of framing the problem in terms of coordination between servers.
And then indexing: how do you know where your data is stored in the system? How do you store the data on disk? How do you know which nodes should be in the cluster? And when you know which nodes should be in the cluster, how do you decide where to send the data between them? And then finally, how will the system fail? Because it will fail at some point.

To try to understand the problem more, let's consider some of the traits, or assumptions we can make, about time series data. The first assumption I made was that data, once ingested, would be immutable, so basically the data would be append-only. There would be no need to worry about updates to the data, so we don't have to worry about updating a row, essentially, as in a relational database, or having to manage multiple versions of the same data. That helps relax some of the requirements, because we don't have to worry about managing all those versions. The other thing about metrics is that the data types can be really simple. Time series can include events, but in this case I was just focused on numbers, and numbers can compress really well. Prometheus 2.0 and above uses a variant of the Gorilla compression algorithm from Facebook, which uses double-delta compression for timestamps and XOR-based compression for the 64-bit float values. So it takes the difference between two consecutive values, and then it takes the difference between that difference and the previous one, and it uses that to compress the data. If you're interested, the Gorilla paper explains how that works.

The other thing to bear in mind with time series data is the tension between the read and the write patterns, and this is really important when thinking about designing the system. You essentially have continuous writes across the majority of your individual time series. In the case of Cloudflare, for example, where we had 70 million unique time series, we might have maybe 40 million of those time series being written to within the last five minutes. So you have a lot of updates across a broad range of series, but then when you're reading back from the system, you're often reading a given time range of data. So you have this tension between the write path, where you're touching many different time series at once, and the read path, where you're reading across time. That's one of the most interesting properties of storing time series data, and Fabian from the Prometheus project goes into this in more detail in his blog post.

So I looked at prior art, right? What exists currently? I won't go into detail now, but the main thing I drew from existing systems was the idea of storing data in columns, in column stores, and, from Amazon's Dynamo paper, the idea of using consistent hashing to determine where to place data in the system. I mentioned coordination, and being able to think about the system in terms of coordination was really helpful. The thing I realized was that if I wanted to support high throughput on ingestion, I wanted to keep coordination to an absolute minimum, because that would help reduce the complexity of the system and, as a result, make it more reliable, and also avoid coordination bottlenecks, because those could limit the ingestion throughput.
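As an aside on the compression point from a moment ago, here is a minimal, hypothetical Go sketch of the delta-of-delta idea. It is only an illustration of the principle, not the Prometheus or Gorilla code; the real implementations also bit-pack these values and use XOR encoding for the float samples.

```go
package main

import "fmt"

// deltaOfDelta encodes a sorted series of timestamps as the difference
// between consecutive deltas. For regularly scraped metrics the deltas are
// nearly constant, so most encoded values are zero and compress well.
func deltaOfDelta(ts []int64) []int64 {
	out := make([]int64, len(ts))
	var prev, prevDelta int64
	for i, t := range ts {
		switch i {
		case 0:
			out[i] = t // first timestamp stored as-is
		case 1:
			prevDelta = t - prev
			out[i] = prevDelta // first delta stored as-is
		default:
			delta := t - prev
			out[i] = delta - prevDelta // usually zero for regular scrapes
			prevDelta = delta
		}
		prev = t
	}
	return out
}

func main() {
	// Timestamps scraped roughly every 15 seconds.
	fmt.Println(deltaOfDelta([]int64{1000, 1015, 1030, 1045, 1061}))
	// Output: [1000 15 0 0 1]
}
```

Because the scrape interval is nearly constant, almost every encoded value is zero or close to it, which is where the space savings come from once the values are bit-packed.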
The other thing was to know which servers were part of the cluster at any given time. I could just do this statically; I could just tell each server what other servers exist, but that's going to be kind of painful, especially if you have a lot of nodes. And I also needed to know when a node fails, so being able to detect a node failure. For that, I used the memberlist library. It's used by HashiCorp Serf and also Consul. It's a Go library, and it uses the SWIM gossip protocol. By gossip protocol, I mean that the servers talk to each other over UDP and synchronize their state with each other that way, with an occasional TCP mechanism for reliably synchronizing state, and they basically tell each other that they still exist. SWIM has some really nice properties. One of them is that you can detect whether a node is still alive even if you can't access it directly: the nodes can snitch on each other, so you can indirectly detect whether a server is still alive. So that worked really well. The memberlist library was really easy to integrate; Peter Bourgon also uses it in OK Log. It was a lot easier than I thought to get this up and running, so that was really, really good to use.

So, indexing: how do you find the data quickly? You have all these nodes in the cluster; how do you know where your data is? I could use a centralized index. I've worked with the Raft protocol before, I knew a bit about consensus, and I started thinking, I'm sure I need consensus for this system. And then I thought about it, and I realized I would probably only need consensus for a centralized index, so that you have a consistent view of what should exist in the system at any one time. But to do that, you need to be able to coordinate between servers so that you can decide what that centralized index should be, and that's likely to become a bottleneck at high ingestion volumes, so I wanted to try to avoid it if possible. The other thing you could do is just use a local index and have each server know what it stores itself. The big disadvantage of this is that if you lose data, you don't really know exactly what you've lost, because if you've lost the index along with your data, then you don't know what you've lost. But maybe we could work with this; maybe we could do something.

Dynamo, Cassandra and Riak all use this idea of consistent hashing to determine where data should go, so I started looking at that, at data placement and how we place data on the different servers. Consistent hashing is essentially a way of placing items into buckets. Hashing is just a way of using maths to put items into buckets, and consistent hashing aims to keep the disruption to a minimum when the number of buckets changes. In our system, that translates to: when the number of nodes in the cluster changes, we want to keep the disruption to a minimum, and we want the amount of data that is displaced to another server to stay at a minimum as well. In this example, if we have five nodes in a cluster and one node fails, only a fifth of the data should have to move to another node, assuming that we're not able to replace that node immediately. So I looked into consistent hashing algorithms. There's a decision record on the GitHub repository for this project that goes into more detail, and I'll share these slides afterwards. Basically, the first algorithm I looked at was the Karger algorithm, and then I kept reading and looked at jump hash. I'd encourage you to look at the papers; I won't explain them in detail right now because I don't have the time. But I'll just show you the jump hash implementation, which I think is super elegant.
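The implementation shown on the slide isn't in the transcript, but the whole algorithm really does fit in a few lines. This is essentially the algorithm from the Lamping and Veach paper, in the same shape as Damian Gryski's Go port; treat it as a sketch rather than a drop-in library.

```go
// JumpHash maps a 64-bit key to a bucket in [0, numBuckets), moving only
// about 1/n of the keys when a bucket is added. The magic constant drives a
// 64-bit linear congruential random number generator seeded by the key.
func JumpHash(key uint64, numBuckets int) int32 {
	var b int64 = -1
	var j int64
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}
```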
Jump hash is an improvement over the Karger algorithm because it uses less memory and it's a lot faster. The thing that really stood out to me is that it uses a magic constant to drive a 64-bit linear congruential random number generator, and that's what makes it so fast. Damian Gryski has a Go implementation of this, and that's the whole jump hash algorithm.

The other thing I needed to figure out was the partition key: the hash function used in consistent hashing needs a partition key, so I needed to figure out what that partition key should be. The choice of partition key can have a big impact on how many nodes you have to query when you're querying data, and also on how many nodes are ingesting data at any one point in time. Again, I've gone into detail in the repository on how to choose this, but what to include in it was a critical decision in the system. We also needed to store three copies of each piece of data, and I achieved this by prepending the replica number to the partition key.

So then I was like, okay, well, I know where to put the data and how to index it, but how do I store it on disk? I started looking at different kinds of structures and different libraries, I started working with RocksDB, and storage is really hard; this could have been a project in itself. Luckily, around the same time, Fabian and the Prometheus team were working on this, and I thought, well, I'm already integrating heavily with Prometheus, so why don't I just use their library? The interfaces for that library were really clean to use. The conclusion I drew from this was that good programmers are lazy programmers, constructively so, because if you can reuse something, why build it yourself? So I solved the on-disk storage problem by using an existing library.

So this is the architecture that I came up with: no centralized index, to keep the ingestion throughput high. The only shared state is node metadata. Each node in the system has the same role, any node can receive data, and any node can be queried as well.

What about testing? Well, I found that integration tests were the most useful because I could quickly iterate on the system without having to worry about breaking my unit tests. I had unit tests as well, but the integration tests really gave me the most value. For the unit tests, one thing I looked at is how even the data distribution is across the nodes in the cluster: I wanted to make sure that no single node was storing more data than the others, and also that all replicas of the data are stored on separate nodes, so that they don't all end up on the same node. So I wrote unit tests to do this with little histograms. You can build histograms like this using the Go testing library really simply, just by repeating a character, and I used the standard deviation to measure what the distribution was between nodes. I used Go subtests to test this against different numbers of replicas and different numbers of nodes in the cluster. This was really helpful because when I changed the consistent hashing algorithm, I could actually see the difference and see the improvement in balance when I switched to jump hash. The other thing I looked at was data displacement: if I remove a node, how much data has to move?
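To illustrate the replica trick described above, here is a hypothetical sketch of how placement might be computed per replica, reusing the JumpHash sketch from earlier. The key layout (metric name plus a time bucket) and the helper names are my own illustration, not Timbala's actual code; the point is just that prepending the replica number gives each copy its own hash and therefore, usually, its own node.

```go
import (
	"fmt"
	"hash/fnv"
)

const replicationFactor = 3

// nodesFor returns one candidate node index per replica. Prepending the
// replica number to the partition key changes the hash for each copy, so the
// replicas are spread across different nodes by the consistent hash.
// Assumes the JumpHash function shown earlier and a hypothetical key layout.
func nodesFor(metricName string, timeBucket int64, numNodes int) []int32 {
	nodes := make([]int32, 0, replicationFactor)
	for replica := 0; replica < replicationFactor; replica++ {
		h := fnv.New64a()
		// Hypothetical partition key: replica number, metric name, time bucket.
		fmt.Fprintf(h, "%d|%s|%d", replica, metricName, timeBucket)
		nodes = append(nodes, JumpHash(h.Sum64(), numNodes))
	}
	return nodes
}
```

Note that two replicas can still hash to the same node; checking that they end up on separate nodes is exactly what the distribution unit tests described above were for.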
That displacement test helped me find a bug, because I was sorting the list of nodes alphabetically. I figured that would make things more deterministic, and determinism is a good thing, but in this case it didn't actually help, because it worked against the way that jump hash works. The jump hash paper says that its main limitation is that buckets must be numbered sequentially, and I was treating them as names of servers rather than just numbers of buckets. So my conclusion was that each node in the cluster needed to remember in which order it joined the cluster.

The other kind of tests I wrote were acceptance tests, and I did those by executing the binary itself. That allowed me to do things like test whether my command-line flags were still working. I also looked for certain canary metrics to make sure I hadn't forgotten to register metrics, but also other things like: can I query? Can I get the result of one plus one? Can I query data out of the system? I mentioned the integration tests as being the most effective. There was some crossover between those and the acceptance tests, but the integration tests focused more on the integration with other services such as Prometheus. I actually had Prometheus writing data into the system, and then I was able to query it back out again to make sure that worked. When I queried it back out, I used the official Prometheus client library, so I could be sure that it worked with both the Prometheus server and the official client libraries. I used Docker Compose for this, for portability, because it was really easy to get set up, and I had the Docker Compose integration tests running in Travis CI for every pull request. That actually worked really well, so I'd highly recommend it.

I also set up a benchmarking framework so I could see how the system was performing as it's running, and that allowed me to measure things like throughput and also CPU usage and memory. Again, the benchmarks were harnessed using Docker Compose. I'll just go through these quickly: I used pprof to help determine where I could speed things up. I think that's being covered in a later talk in more detail, so I'll skip through it. The last thing I want to mention, which I didn't get the chance to do but really wanted to, is failure injection. With Docker Compose you can use failure injection to test how your distributed system copes with failure: you can have a privileged Docker container that can stop nodes in the cluster and inject packet loss and latency, and that allows you to see how well the gossip protocol works and how well the system copes with failure.

So, conclusions. I think the greatest challenge in writing distributed systems is anticipating how they will fail and how they will lose data. The implementation is already hard in itself, but it's even harder to figure out how they're going to fail. The other conclusion I took away is: make sure you understand the trade-offs that your production systems are making, because they are making trade-offs. Finally, use dep. It's awesome, so thank you so much to Sam and the other contributors to dep. And if you're interested in reading more, all the links are up here, and I'll share the slides on the FOSDEM site afterwards. Thank you.

Before we do the Q&A, I want those two people that are taking care of the doors to go to the doors. So do not leave yet, just go to that door.
Marcelo, can you go to that door? Oh, perfect, okay. Because otherwise it's going to be chaos in a minute, because there's a huge amount of people outside already. Now you can start preparing, and in the meanwhile we have the Q&A. Thank you. Any questions? Yep. So the question was what tools I used to benchmark the system. I wrote a little tool that would generate load. It generated random metrics using a seed so that they would be generated deterministically, and the benchmarks run in Docker Compose: I'd spin up three nodes of the cluster, generate metrics, ingest those, and then I'd see how they performed in Prometheus. Yes? So I'm not sure I fully understood the question, but I think it was: does the system provide one interface to query all of the metrics? No, this system is designed to stand alone, so it integrates with Prometheus, but you could use it without Prometheus. It implements all the Prometheus APIs, essentially, the majority of them. Any other questions? Thank you.