So, most of us have been living the dream, staring at graphs like this for the better part of the last 15 years, and one of the problems with systems metrics as presented by RRDtool is that they take up an awful lot of screen real estate to not tell you very much. This is a scalar number, eight servers in this cluster, but there's all this logic to work out averages and transitions over time. One of the things that went into the core of the way RRD was designed was that once upon a time disk space was expensive, or at least a very limited resource. So we couldn't just record all the system data ever and keep it around forever, because we'd exhaust our disks. What was built into the system was something that compacts over time: you set a policy, and it retains data at high resolution for an hour, then averages it down and compacts it so that it maintains a five-minute average for a week, then compacts it again down to month and year and so on. The result was a constant-size database: at the moment you initialise the database, you know exactly how big that file on disk will ever be, and thus you can manage your resource ceilings. That was a good thing for systems once upon a time. The trouble is that it's a little on the lossy side. We had an event where apparently we had 7.8 servers in the cluster. That's obviously a result of averaging and smoothing, and it's a side effect of this kind of storage mechanism. So the question is, can we do something about that?

I would like to build a system that keeps the data, because the challenge we have is that if we want to do analytics of any form on data in the present age, using proper mathematical models, you need to give those models all the points. You can't pre-smooth them, you can't average them. You need to give the raw data to the people doing the math, because whether or not to throw away data is an analytics decision, and it shouldn't be one made in the storage system. So we need to store all the data raw, at source resolution. We want to give that data in its original form, not derivatives. This is something that percolated into the systems community a few years back; Jamie Wilkinson speaks very eloquently about it. We don't want to calculate rates. It's very tempting to store first derivatives in your systems database, but you end up doing it wrong, basically. Whose idea of time are you using? How many points went into a sample, if you're talking about events and you're down-sampling to once per second or something like that? You're potentially throwing away a huge amount of data, especially when we're looking for anomalies and it's the little transitions that actually end up being really significant. So we want to store raw data points, not first derivatives.

And the biggest thing of all is that I don't want to worry about disk space. On the one hand, no, it's not RRD-land anymore, and all hail to Tobi, but we can expand. Except that we now deal with problems like customers with 100-gigabyte MySQL databases in a single table, wondering why things don't work, or 27-terabyte NFS systems. We won't talk about that. The thing is, I want a system where I don't have to worry about whether the file system is getting too big, or how I'm going to grow this MySQL or Postgres database. I'd like something that would just scale horizontally without me needing to worry about it.
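Just to make the consolidation point concrete, here is a tiny sketch, with made-up numbers rather than RRDtool's actual consolidation function, of how averaging a window hides a transition and produces readings like "7.8 servers":

```haskell
-- A minimal sketch (not RRDtool's algorithm) of consolidation by averaging.
import Data.List (genericLength)

-- One sample per minute: eight servers, except one minute where a server
-- briefly dropped out of the cluster.
samplesPerMinute :: [Double]
samplesPerMinute = [8, 8, 8, 7, 8]

-- Consolidating the window to a single stored value loses the event: the dip
-- to 7 becomes an impossible-looking fractional server count.
consolidate :: [Double] -> Double
consolidate xs = sum xs / genericLength xs

main :: IO ()
main = print (consolidate samplesPerMinute)   -- 7.8
```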
And it would be nice if it was fault tolerant. Maybe I'm asking for too much here. So I turned to Ceph. At Anchor, we had a Ceph cluster. We'd been hearing about Ceph for some years, among others from Sage, here at the conference. Even before I joined the firm, they had already set up an experimental Ceph cluster. I'm like, hey, this is terrific. I'm pretty enthusiastic about Ceph; it's good to see that you guys are too. Let's use it.

Many of you will be familiar with Ceph, but some aren't. If you have worked with or just learned about Ceph, chances are you've worked with one of these three things. There's RadosGW, which is an S3- or Swift-compatible front end giving HTTP access to objects. The place where Ceph really got traction was RBD, a block device backed onto Ceph. This powers virtual machines in quite a number of organisations, and it's fantastic because it means the block images behind a VM don't have to be co-located as local storage with the machines. So the whole problem of local storage, and worrying about how to get that local storage to another machine when you're trying to do a migration, just evaporates, because it's all effectively network-backed storage, which gives pretty good management efficiencies. That's actually all secondary to the real point Ceph was created for: the original goal was to create a better clustered file system, and it's called CephFS. Really, all the work that has been done in the decade since they first started has been trying to put in place a really solid, reliable object store to be the backing store behind that file system. That will come up again a bit later. What most people don't tend to see is that all of this is built on top of a library called librados. This is the client code that talks across the network to the storage cluster to get things to happen. For a while now, we'd been wondering: can you actually program against librados directly? It was suggested by people talking on behalf of the Ceph project that yes, you can use the API directly. But what did that actually involve? Was it practical, or was it just a nice idea that wouldn't really ever come to anything?

RADOS is this: a Reliable Autonomic Distributed Object Store, which is one of those names that at first glance looks really ugly, turns out to mean something, but in the end you don't care, because the thing is its own best description. So here's the picture that has been seen ad nauseam in Ceph talks. But again, just for anybody who doesn't know what an object store is: you've got lots of machines. A few machines are designated as monitors to enforce quorum, and otherwise there's just a large number of machines, each of which has spinning disks. One of the things that's a little different about an object store in this model is that every single spinning disk has a process in front of it called an object storage daemon (OSD), and they work independently. The fact that two OSDs might be in the same chassis is kind of irrelevant; they don't have anything to do with each other. They're individual processes. So you don't build RAID arrays on machines. You just manage individual disks, which has been unusual after all those years we spent building RAID arrays. This isn't meant to be a talk about Ceph, but I just want to describe one more thing about it, which is that it has a very elegant placement algorithm: you specify, as a policy, where things go.
The point is to articulate the failure domain. Say you have data that needs to be stored, and you have a data centre where you've got rows. The rows are quite expensive to connect, with very limited bandwidth between them, but inside a row you've got good connectivity, and the power domains are such that you expect maybe to lose power to one rack but not another. Not necessarily what your data centre would be, but it's a possible failure setup. So you would articulate a CRUSH rule which says: pick any one row, and then find me three racks, with one server per rack and one disk in each of those servers. That way, if you want three copies of the data, you've got that data spread across different failure domains. This is the kind of thing CRUSH is really good at.

Now, that kind of tree layout isn't terribly exciting, but the really novel part, and I think the part that made Ceph something we all sat up and took note of, is that this calculation is done by clients. The bottleneck that was emerging circa 2005 wasn't the ability to have object daemons, and wasn't the ability to put lots and lots of things on them. It was locating where an object lives. What machine do I talk to? Which disk do I talk to to get to a piece of data? The original solution was, well, you'd have some index servers that would know where all the objects are. That became a huge bottleneck somewhere between 500 terabytes and a petabyte, and everybody was crumbling as they tried to scale past that point. Sage's solution to this is a pseudo-random calculation: you feed the policy map into it, and it comes out with 7, 42 and 129. And the thing is that you, the client, can make that calculation, as can all the servers. So when 42 dies, they just re-crank the calculation and discover that 791 is the fourth one in that set, and they now know where the third copy should be. And it's not just the clients that know that; all the individual daemons know it too. So the moment the cluster state changes, every participant in the cluster can think: wait a minute, I've got objects, I'm the tertiary for this particular object, and I now need to make sure that that one over there knows it's now responsible for being the tertiary copy, because I'm now effectively the secondary, the old secondary is gone, and so is the former primary. They work in a peer-to-peer fashion to restore the level of redundancy. Balanced over a large cluster, this has some pretty nice properties. So that's our RADOS cluster.
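To illustrate the idea, not the algorithm itself, here is a toy stand-in for that CRUSH-style placement: placement is a pure function of the object name and the cluster map, so every client and every daemon can compute it independently and agree. The hash and the cluster map type here are invented for the sketch.

```haskell
-- A toy stand-in for the idea behind CRUSH (not the real algorithm).
import Data.Char (ord)
import Data.List (nub)

type OsdId = Int

-- Hypothetical cluster map: the OSDs that exist, and which of them are up.
data ClusterMap = ClusterMap { osds :: [OsdId], isUp :: OsdId -> Bool }

-- A deliberately crude deterministic hash; CRUSH uses something far better.
hashWith :: Int -> String -> Int
hashWith salt = foldl (\h c -> (h * 31 + ord c + salt) `mod` 1000003) 17

-- Pick n distinct, currently-up OSDs in the pseudo-random order implied by the
-- object name. Anyone holding the same ClusterMap computes the same answer,
-- and when an OSD goes down the next candidate slides into its place.
place :: ClusterMap -> Int -> String -> [OsdId]
place cm n obj =
  take n . nub . filter (isUp cm) $
    [ osds cm !! (hashWith attempt obj `mod` length (osds cm))
    | attempt <- [0 .. 1000] ]

main :: IO ()
main = do
  let allUp    = ClusterMap [0 .. 9] (const True)
      osd4Down = ClusterMap [0 .. 9] (/= 4)
  print (place allUp    3 "host71/cpu/cpu0")  -- three replica locations
  print (place osd4Down 3 "host71/cpu/cpu0")  -- same calculation after a failure
```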
So the premise is pretty basic. We gather metrics, we put them in Ceph, and later on people pull them out, do analytics, do work, and then we get pretty graphs, right? Good theory. The real question I would turn to is the final point I mentioned, which was: how do you build a fault-tolerant system? Well, fault-tolerant and distributed are pretty much synonyms at this point. How do you build a distributed system that's fault-tolerant, that can handle something like this? It turns out there are two properties of our data that make it a little different from a traditional database, because I know there are some database developers in the room, and some of them are rolling their eyes already going: what are you talking about, building another database? There's no shortage of time-series databases, even. But I think we have a couple of insights to offer. One is that, if you choose to, you can treat your data points as immutable. We're talking about systems metrics here, primarily: you gather the data point and that's it. It's never going to change. It has an identity, a label about where it came from; it usually has a timestamp associated, and a value, and that's it. There will never, ever be an update on that data point. That was a pretty big insight, because almost every database structure out there is concerned with updates, with updating indexes and all the side effects that happen when you have to make changes like that. We found ourselves with a data set that is immutable.

That correlates very nicely with another property, which is that if you want to build a system that is tolerant against failure, it really helps if your operations can be idempotent. That gives you the ability to not have your credit card charged three times, right? This is the core problem of the web: do not click this button twice. And it's kind of the only thing you need to worry about, because if you get that right, pretty much everything else falls out of it. In our case the question was: you want a system where at least one copy is written, so can you articulate a design where it doesn't matter if more than one copy is written? The final piece is to not carry any state in the system, because after all, there is no state except that which we make.

And that leads me to Vaultaire, which is the system we've been building. The way it's designed is that we have data points being collected, and I'm not trying to enforce a universal idea of time. The time is whatever time we get sent; it's just a timestamp. The identity is also assigned by wherever the data is coming from. Vaultaire itself doesn't care; it just namespaces it, hashes it and sends it on its way. I'll show you a bit more about that in a moment. The idempotency property, this was actually, I thought, very clever, and as always, I've learned enough by now to be wary whenever I have a clever idea: it's probably going to have consequences. But nevertheless, it was a good idea at the time, and it remains to be seen whether it still is in the long haul. The clever idea was: rather than trying to work out on write whether a point is a duplicate, work it out on read. When you read a whole bunch of points, you're going to pull them into some kind of a hash map or an ordered list. If you've got an ordering property, and timestamps are a pretty good one, then it doesn't matter if I read the same point five times as I'm scanning down a block of data; I can deduplicate just by sorting it into a set, or a map in this case. And while you're at it, if it's a sorted set backing a map, then you get your ordering property, because sooner or later the system's going to need to order the data anyway. But on the way in, I don't need to have it ordered. So suddenly I've got two things here: it doesn't matter what order points show up in, and it doesn't matter whether I write them twice.
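A minimal sketch of that dedup-on-read idea: fold whatever comes back off disk into a map keyed by timestamp, and ordering and deduplication fall out together.

```haskell
-- Points carry their own timestamp, so folding them into a map keyed by
-- timestamp both orders them and collapses anything written more than once.
import qualified Data.Map.Strict as Map
import Data.Word (Word64)

data Point = Point { timestamp :: Word64, value :: Word64 }
  deriving (Show)

-- Reading a bucket may yield the same point several times, in any order.
readback :: [Point]
readback = [Point 30 7, Point 10 8, Point 30 7, Point 20 8, Point 10 8]

orderedAndDeduped :: [Point] -> [Point]
orderedAndDeduped ps =
  Map.elems (Map.fromList [ (timestamp p, p) | p <- ps ])

main :: IO ()
main = mapM_ print (orderedAndDeduped readback)
-- Point {timestamp = 10, value = 8}
-- Point {timestamp = 20, value = 8}
-- Point {timestamp = 30, value = 7}
```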
That was the original design, and I apologise that this slide is blurry, but it's sort of fun to look back at. Oh, it'll be simple: we'll just have points, there'll be some kind of a bus, we'll have multiple ingestion daemons absorbing stuff and then writing down to Ceph, which will end up in buckets, and then things will come out the other side. That took an afternoon to write, and we were very pleased with ourselves. And that's what it looks like in production. It's a lot of pieces. Every single one of those boxes is a different server. Which brings me to the point that this isn't a small system you can go install on a laptop tomorrow, and I actually feel kind of bad about that, but it's a large system that has the properties we're after.

The points I'm talking about are idempotency and immutability. Immutability is a characteristic of the clients generating the data. The idempotency property comes in because the writer daemons, labelled ingestion here, just take stuff that comes in, whenever it comes in. If there's a connection loss, if a writer daemon fails, if there's a network dropout, if the brokers collapse, the clients retransmit. Inevitably, everything in distributed systems comes down to retransmit. And it can be a really long timeout: 10 minutes later, you can retransmit and it doesn't matter. You get patterns like, well, what happens if a point was written down and the acknowledgement was lost, right? The Two Generals problem. In our system we can get around that, because we don't care if it's written twice. Or eight times, or many more than that when I had real bugs. Because, I mean, you can imagine that things go really wrong here. I'll put a little teaser out there: let's say your timeout somewhere in the middle of the system is maybe 60 seconds. Well, surely points will get written down and acknowledged back within a couple of seconds, right? And in testing, that's exactly what it looked like. Think about that for a minute.

The other thing we did early on, and I thought this was clever too and was utterly wrong about, was that we used protobufs, both as the encoding on the transport and as the serialization format down to disk. Protobufs are a nice way to articulate data. They give you a light schema on top of what is otherwise a binary encoding. If you work at a place like Google where everything is encoded in protobufs, that's lovely; anecdotally, it's used as the interface format of virtually everything there. It is sometimes used as a storage format. I was hoping this would be a good way to let different clients and different consumers be agnostic about the implementation. I was working in a language that shall not be named yet, but I didn't want any of the other people to have to worry about that, and protobufs seemed a good way to create that abstraction. The trouble is that a whole bunch of these fields are variable length, and it's surprising how expensive that becomes. There are two things going on here. In the original design, if anybody has used collectd, for example, you'll know that every single data point gets sent along with host/cpu/cpu0 and then a value. You get labels like that. This can be generalised to field-value pairs: it could be the data centre it came out of, maybe something about the customer ID, anything like that. But you end up with these key-value pairs, which identify a source.
And although this was all very tight, we were even worrying about how many bytes to use for numerics versus more general types. The general encoding in protobufs for integers is variable length: it uses a bit per byte to say whether there's more to come, the rest as bits of storage, and it rolls along. So a small integer can be encoded in a couple of bytes, whereas a large 64-bit integer takes many more. The point is that since you tend to ship around small numbers, there's no reason to pay four bytes, or eight bytes for a word size, every single bloody time you want to encode the number 12. And on average that pays off pretty well in terms of over-the-wire efficiency. The trouble is that, frankly, at this point, the size over the wire isn't the problem. So I spent all sorts of time worrying about getting the variable encoding right, and implementing all of this, to optimise for something that really isn't the issue. Certainly currently, and maybe this will change again in a little while, but in a data centre, on a LAN between machines, over-the-wire bandwidth is not the bottleneck. The other thing you see here is that these fields articulate what is just a map of key-value pairs, and you're of course going to have to serialize that in and out.

So what we were doing, and this is the part that I thought would allow me to get away with creating a database without having to pay the decade or two of building a full database management system, was that the whole idea was to punt everything to Ceph. I said that there was no state: all the pieces in the system, other than Ceph itself, carry no state at all. There's no state in the brokers. The only state in the clients is whether or not a point has been acknowledged, so they maintain points that are in flight, but everything else goes down to Ceph and that's it. And so we encoded the location of points deterministically. There's a generation tag, and then that FM2XR7 part is an origin, a namespace: just a hash of something like the data centre and the IP address of the collector. And then a hash of the source. What I did there was serialize out that map, all the key-value pairs which uniquely identify this hostname71, cpu/cpu0 source, put that into a hash function, and come out with that: that is the deterministic address. And that's the final driver of this: determinism. The whole thing can be calculated. I don't need to maintain a whole index of what the addresses were, because every participant in the system can figure it out. And then finally, a timestamp to the nearest metric day. I made that up: 100,000 seconds, why not. Trying to work out what day it actually is was silly; just rounding to the nearest 100,000 seconds was easy. And that gave me names to then pass to Ceph, and the CRUSH hash function and the placement group algorithm would work out where it goes. Into the system it goes, and it's reliable, it's duplicated and mirrored. And that's it.
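Here is a hedged sketch of that kind of deterministic naming. The hash is a toy stand-in and the exact name layout is illustrative rather than Vaultaire's real format, but the shape is the same: sort the source tags, hash them, and round the timestamp down to the nearest 100,000 seconds.

```haskell
-- Illustrative deterministic object naming; not the real format or hash.
import qualified Data.Map.Strict as Map
import Data.Char (ord)
import Data.Word (Word64)
import Numeric (showHex)

type SourceTags = Map.Map String String

-- A toy hash standing in for whatever the real system uses.
toyHash :: String -> Word64
toyHash = foldl (\h c -> h * 31 + fromIntegral (ord c)) 5381

objectName :: String -> SourceTags -> Word64 -> String
objectName origin tags t =
  let canonical = concat [ k ++ ":" ++ v ++ "," | (k, v) <- Map.toAscList tags ]
      epoch     = t - (t `mod` 100000)       -- the "nearest metric day"
  in origin ++ "_" ++ showHex (toyHash canonical) "" ++ "_" ++ show epoch

main :: IO ()
main = putStrLn $
  objectName "dc1"
             (Map.fromList [("host", "hostname71"), ("plugin", "cpu/cpu0")])
             1393200000
```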
So, that gave us a system whereby we could write arbitrary amounts of data to Ceph and pull it out again. The writers didn't need to coordinate with the readers. And all that business I mentioned before about deduplication, and not having to worry about it: the magic that made that work is append. You get idempotency if you're append-safe, and there's an append operation in librados. So once we work out the name of the bucket where it goes, we just take the data point, serialize it to bytes, and append. If you append the same thing six times, out of order, it doesn't matter, because we sort out ordering and deduplication on read.
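A pure model of why append plus dedup-on-read gives you idempotent writes. librados isn't used here; a Map from object name to bytes stands in for the Ceph pool, and the append helper is invented for the sketch.

```haskell
-- Model of the append-based write path: duplicates are harmless because
-- ordering and deduplication happen on read.
import qualified Data.Map.Strict as Map
import qualified Data.ByteString.Char8 as B

type Pool = Map.Map String B.ByteString

-- Append the serialized point to the named object, creating it if needed;
-- this mirrors the semantics of an append in the object store.
appendPoint :: String -> B.ByteString -> Pool -> Pool
appendPoint oid bytes = Map.insertWith (flip B.append) oid bytes

main :: IO ()
main = do
  let p    = B.pack "t=30,v=7;"
      pool = appendPoint "dc1_abc_1393200000" p
           . appendPoint "dc1_abc_1393200000" p   -- retransmitted: no harm done
           $ Map.empty
  print (Map.lookup "dc1_abc_1393200000" pool)
```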
That was a good design. It had some problems. We burned a few things to the ground. The first thing we burned to the ground was the Ceph cluster. The second thing we burned to the ground was the server writing to it. Ceph makes a few promises. Or not promises: Sage would get up in front of us and tell us that you can have an unlimited number of objects, and that is more or less the case. I mean, unlimited is a big number, but the namespace of objects isn't the limit. So the scheme we had, where all these different objects labelled by namespace and hash address would pile up over time, wasn't a problem. The size of objects: they also tell us that's unlimited, and well, yeah, that's true, but they really probably should roll that back. The trouble is that both RadosGW and RBD stripe their data over four-megabyte chunks. When you write an object through RadosGW, it actually breaks it up, over whatever hash scheme, and then writes the pieces down in order. Same with the block device: they're four-meg blocks, and that makes perfect sense; it might as well be some size, pick one. The trouble is that down in the guts of the Ceph object daemons, things eventually end up being single-threaded. The concurrency of the system comes from its sheer mass; it's lazy parallelism. There are a lot of places within the system where it's nothing of the sort. Which is fair enough, but the trouble is that when an OSD is responding to a request, it's kind of busy doing that and only that. And there are cases, and it's not so much serving requests as doing replication, where certain operations happen one at a time. To be honest, every time we thought we'd pinned this one down, it turned out we were only half right, so getting an authoritative answer on any of this is actually a little bit tough. But our impression is that there are operations that happen one at a time. So if you go and dump 200 gigabytes into Ceph, into a single object, it'll take it, because it'll absorb that write as long as there's space out there. But when it comes time to replicate that object or read it back, things will kind of grind to a halt while it's serving that request and only that request. So it turns out, from the experiments we did, and this was up to the point where we became network-bound, that the sweet spot was between about 400K and 20 megs. The bottom of the curve was indeed at four megs. So it's not a hard-coded limitation, but it certainly is in the DNA of the Ceph code that four megs is a good size for objects. If you're writing something, use that size.

And then of course the final thing is, and nobody actually promised this, but I made an assumption which was pretty stupid, which was that I could just blast out asynchronous operations: you fire them out, they're in flight, and eventually you get an acknowledgement back via a completion mechanism. It's a primitive, low-level, sort of futures thing. That seemed all very coherent, and we read through the code and it all seemed pretty sound, and it's not that complicated; the synchronous mechanism actually uses the asynchronous mechanism under the hood. All fine and well? No. And it turns out this makes sense, because when you've got a block device, say, on top of Ceph and you write to a block over here, you dirty a block and it gets flushed, so this OSD lights up and its two friends have to synchronise, fine. In standard operating patterns you're over here, then you write to another block, and that OSD takes the write, talks to its two peers, if your replication level is three, and they get back to it and it gets back to you. That's fine; this is what's happening at random, spread strategically over the cluster, no problem. And if you've got a large read across a logically contiguous array, you might light up 12 of them at once, but again, across a large cluster with hundreds of OSDs, that isn't a problem.

The trouble was the number of objects: we had a quarter of a million when we started out. That represents a quarter of a million data points we were taking in per check cycle. And our sizes were fine; like I said, sometimes we overdid it, but in general most of our points weren't even getting past a couple of hundred K of data. We had allowed for some larger, higher-rate ones, and those would have had us around four megs. The trouble was that we found it was taking upwards of, well, there's a large error bar on this, but three and a half to five minutes per write cycle to get that quarter of a million points out. That translates to about 2,500 operations per second. Ops per second: that is the limitation in a Ceph cluster. Not object size, not number of objects, but the rate at which you can manipulate them. And it would appear that it scales linearly with the number of OSDs. We don't have hard numbers on that; we were kind of busy desperately trying to sort this out. It would be an interesting experiment to do. But it's one of those things that doesn't show up in development: it's all working great, but when we were launching this many points at it, it would take three and a half minutes to complete. Now there were two problems. I mentioned before about timeouts. Guess what happens when you've got 60-second timeouts on the clients waiting for that acknowledgement packet to come back. They're like: oh, you must have lost my write, I'll send it to you again. So we got up into the tens, maybe even close to hundreds, of millions of writes pending. Once we figured that out and stopped resending everything, we were able to drain it in the course of a day. Ceph absorbed it. We ended up with some pretty large buckets, but we were able to deal with it. The thing was, this was absolutely maxing out the cluster, because where I was waving my hands before and saying a couple here, a couple there, we were now writing to all the OSDs simultaneously.
And the reality is that no matter how distributed the system is, when you want to achieve consistency, sooner or later you have to linearize. It's a basic property of distributed systems. So underneath the hood, sooner or later one OSD is going to need to talk to another OSD to establish that it's got its copy of the object it's trying to write, and get acknowledgement of it, and that other OSD is busy. And so you wait. And so the whole thing just kind of ground to a halt and averaged out at a pace of around about 2,500 operations per second. And it was not a small cluster, by the way: it was on the order of seven machines with on the order of 60 or 80 disks in it at the time. So not huge, but not small. But the point was not to max out the cluster. This was an internal system; this is overhead. I can't have this system wrecking an entire production object store. I need that storage for other things. We had a couple of VMs backed onto it as well with RBD, and needless to say, this was doing very bad things to the performance of those VMs. So that had to stop.

The other thing that was happening, oh dear. So that's V1, a copy and paste here. So yes, I was saying we had 2,500 operations a second. The other problem was the machine doing the writing. We had to get a really big one: a 16-core machine with 32 gigs of RAM, and we were flattening it. I've never seen that many bars in htop cranked up over to the side. I've never seen 1600% CPU usage either. And it was regularly running out of memory. This thing was OOMing, then filling swap, then OOMing, and then gone. Which of course was another reason I was losing all these writes in flight: a write cycle would come along and I would lose the whole cycle.

So what was going on? It turns out that, well, we were working in Haskell. We are working in Haskell. And we were using the standard library's Map. Anybody who's worked in one of the languages where our roots are, a mutable language like C, is used to tree structures being mutable, and tree-based map structures are fantastically clean to implement when you can manipulate and change pointers. Haskell is a language which uses immutable data: when you make a change to an object, it creates a new one. It's a copy-on-write thing. That has certain properties, and you can create this in Java too, by the way; I've built them. It's called a persistent structure: having made a change, you have both the previous version and the next version of the data structure. Now, you'd think that would be wasteful of memory. It's not, because you amortize space over time: you're only copying the nodes from the change up to the root, which means it's actually quite cheap to store both trees. And the reason that's worth doing is that when you've got a concurrent system and lots and lots of threads, it means any given thread can always have a consistent, correct, not-in-progress version of that tree, or data structure in general, to work with. Which is fantastic, because it means all the problems you have about having to lock data structures in a traditional mutable program just go away. I have a copy of the tree, you have a copy of the tree, and we can both iterate over them, no problem, which is amazing: if you build a system that way, the only point you have to synchronise is the handover of the tree between threads. So it's a phenomenal thread-safety advantage, and it's just sort of a given in the functional programming world.
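A two-line illustration of that persistence property with the standard Data.Map: "updating" produces a new version while the old one stays valid, with the unchanged nodes shared between the two.

```haskell
-- Persistent (immutable) maps: the old version stays readable after an insert.
import qualified Data.Map.Strict as Map

main :: IO ()
main = do
  let v1 = Map.fromList [("host", "hostname71"), ("plugin", "cpu/cpu0")]
      v2 = Map.insert "datacenter" "syd1" v1   -- v1 is untouched; nodes are shared
  print (Map.size v1)  -- 2: the old version still reads consistently
  print (Map.size v2)  -- 3: the new version sees the change
```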
The problem is that, remember those field-value pairs that were a map identifying a source? Well, each point was on the order of 150 bytes, about 120 of which was this map. So I was bringing in those key-value pairs, which were, by the way, serialized in protobuf, building a map of them in memory in order to order them, serializing them out again in alphabetical order, just to then hash them and trim that down to become the address that was the label. All very deterministic, sure. But I was building up these maps and then throwing them away. In fact, I was taking something serialized, building it up, and then serializing it again, which was just kind of silly in hindsight. But it wasn't the effort of building those structures that was the problem; it was the garbage collection. We were generating tons and tons of garbage, and we actually had to do quite a bit of performance tuning, and hooray for perf top. I was worried because it seemed to be correlated with the write cycle, like, what's going on? And I was really worried that the CRUSH map calculation was what was burning me to the ground. CRUSH wasn't even taking 1% of the CPU, as advertised. It was all the GC. So there were a couple of ways we could have worked around that. It turns out there are mutable data structures in Haskell; Haskell allows you to drop into mutable state easily enough, you just need to isolate it so you can maintain the properties of the language elsewhere. We could have fixed it that way. We could have tuned the garbage collector. But that really wasn't the way ahead.

So, yes. Two things. One was that the protobuf structure itself is fine, but you're going to have to bring it into your language to do something with it, and virtually any language in the modern age is going to require you to convert it out of its native format. The guys who made something called Cap'n Proto, which is floating around, have actually been very clever: they've created a serialization format which doesn't have the variable-length thing going on and is laid out in memory exactly like a structure. Now, how you actually get the structure alignment to line up with the machine you have, I think that's a pretty big assumption, but nevertheless the basic idea is good. If you're a C programmer and you get a structure off the wire, you can just drop it into memory and then access it as a structure. That's pretty clever, because there's no manipulation required. In a higher-level language, Python, Perl, on up, you're going to need to turn it into that language's terms, which means taking the original serialized structure and then reformatting it in internal terms, and that creates all this fluff. So there's really no avoiding that, and that's what makes what was a cool idea not such a good idea.

Okay. So what are we going to do about it? The changes we made were, first of all, deterministic addresses, or rather pre-assigned addresses, excuse me. You can do a deterministic calculation to get an address, but I was just like: Word64 is pretty big. We will give you an address.
Either you can ask the system for the next one (we had an auto-increment in there, if you want a guaranteed clean next address), or for just an unused address (we had a mechanism to generate a random number and check whether it was already there, which across 2^64 is pretty good odds), or you can calculate it yourself and maintain it yourself. At least that was now happening outside the core system. For example, our legacy checks didn't have a way to reach back to the core system, so they did that calculation themselves, and we exposed the ability to make that deterministic calculation in the library.

The other change we made was to hash those resulting sources over 128 buckets. Pretty simple, really. It turns out that just writing to Ceph was fine; it was not the bottleneck. In fact, one of our temporary workarounds was just to take everything off the wire and write it straight to Ceph, and that was glorious. None of the CPUs twitched; Ceph didn't even notice. We were dumping everything off the wire into a cache bucket, basically, a temporary bucket, and it was great. The machine that was picking that up and then filing it according to the original algorithm was burning, but it really illustrated that we could take stuff at wire speed and get it into Ceph, and Ceph wouldn't even notice. So the data volume wasn't the issue. So we took that idea and mutated it a bit and said: for any source coming in, figure out which bucket it goes into and fire it in. The number of buckets is one factor in the equation, and the other is the length of the time window each bucket covers. It's no longer fixed to 100,000 seconds, because one of the problems there is what happens if you've got a high-rate source: it'll end up with a huge bucket, whereas low-rate sources would be very sparse. So the system adapts. And that's why the timestamps, now in nanoseconds, are rolling along here. In hindsight — well, this was done by my colleague Christian; we came up with a design on a Friday, I came in on Monday, and he'd already done it over the weekend, so I certainly wasn't going to complain. One thing I would change in hindsight is that when you do a rados ls — not that you want to be doing rados ls very much, but if you're doing diagnostics and you do a rados ls on the pool that has these bucket objects in it — it doesn't order very well. It orders by bucket, by hash slot, not by timestamp, and you really want to see them all together. So that's one thing I'll change in the V3 generation: put it back so that it lists out in order nicely.
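Here is a hedged sketch of that revised layout. The bucket count, epoch length and name format are illustrative (the real system keeps the bucket count and epoch length in Ceph objects so they can evolve), but it shows the shape: a pre-assigned 64-bit address is hashed over a fixed number of bucket objects per time epoch. The slot comes first in this toy name, roughly the situation just described where listing orders by hash slot rather than by time.

```haskell
-- Illustrative bucket-object naming for the V2 layout; not the real format.
import Data.Word (Word64)

numBuckets :: Word64
numBuckets = 128

epochLength :: Word64           -- nanoseconds per epoch; illustrative only
epochLength = 100000 * 1000000000

-- Which bucket object a point for this address and timestamp lands in.
bucketObject :: Word64 -> Word64 -> String
bucketObject address t =
  let slot  = address `mod` numBuckets
      epoch = t - (t `mod` epochLength)
  in show slot ++ "_" ++ show epoch ++ "_simple"

main :: IO ()
main = putStrLn (bucketObject 0xdeadbeef 1393200000000000000)
```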
The other thing we changed was that there are now two types of points. There are simple points, which are simply Word64 values, and extended ones, which are arrays of bytes. So we still have the ability to take in arbitrary-length data, because one thing we'd like to store is web server logs: having the ability to correlate systems data with what's going on in a web server would be very nice. So we did want the ability to take in arbitrary strings. And so that's what the objects look like in Ceph. There's still a deterministic lookup: once you can form the object name, you go get it and Ceph gives it back to you, no problem. And again, not maintaining state: the evolution of the number of buckets in a given epoch, and the length of that epoch, is maintained in two Ceph objects, one for simple points and one for extended, and it looks kind of like this. Again, just a very simple serialization format.

And then the data points themselves. The final change was one we worked out at exactly the same time that Yaron Minsky, who's the head of engineering at Jane Street Capital in New York, came down and gave some lectures in Sydney. He was talking about — it's not directly relevant here, but it's a fascinating insight — how if you want fast code, market-making, high-frequency-trading fast code, you can't allocate. You have to use fixed-size buffers: you fill a buffer, you process it, you empty it, and you refill it. No allocation. That's the way to make fast code. That's not my world, but it is for the people who make the markets work. So that was a nice reinforcement of the idea we had. For a long time we avoided creating file formats; we've had it drummed into us that we shouldn't create our own. But wait a minute: if you can articulate what the rules for a format are and what the offsets are, you can just use it. So we came up with something very simple. It's three words: one for the address, one for the timestamp, and then the payload, and that's it. So every 24 bytes is a data point. That's it. Could we make it smaller? Sure. We're not even doing much in the way of pre-allocation tricks, but it does mean it's fairly compact. That comes off the wire, we take it off the wire and write it to disk once we figure out where it's going, and that's it. Reading is then finding the bucket that the source has been hashed to, reading the whole bucket in and scanning down it. It's a very fast filter operation: if you want to picture it as a column, just run your finger down it, pull out the records that have your source, and there you go, then do the deduplication and ordering as always. That's what goes on for the simple data points. The extended data points come in much the same way, but when they go out to disk we write a simple point that includes the offset into what is just a long append-only file where the data is. So we say, okay, at offset 65330 there's a length prefix and then the data: your web server log line or whatever. One of the guys who works for us says, hey, can I put sound data in there? We're like, no. And hey, it's happening.
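To make the 24-byte record concrete, here is a minimal encoder: three 64-bit words for address, timestamp and payload. Field order and endianness here are assumptions for the sketch, not the published format.

```haskell
-- A sketch of a fixed 24-byte point record: address, timestamp, payload.
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Builder (toLazyByteString, word64LE)
import Data.Word (Word64)

data SimplePoint = SimplePoint
  { spAddress   :: Word64
  , spTimestamp :: Word64   -- nanoseconds
  , spPayload   :: Word64
  }

encodePoint :: SimplePoint -> L.ByteString
encodePoint (SimplePoint a t v) =
  toLazyByteString (word64LE a <> word64LE t <> word64LE v)

main :: IO ()
main = print (L.length (encodePoint (SimplePoint 0xdeadbeef 1393200000000000000 42)))
-- 24
```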
And in fact, I finally designed something that works. I pulled these numbers up an hour ago. This is the current size of our production vault. There aren't even a million objects yet, and we're running about one and a half terabytes. This represents about 10 months of every metric Anchor has collected for system metrics, and it also now includes, for the last three months, all the data we've collected on metering and measuring what OpenStack is up to. So Ceilometer's data is feeding in here as well. And it's rolling at under 100 operations per second. It oscillates, and there are some factors in there that scale O(n), but we're running about 77 operations a second and the machines aren't twitching: that's the 15-minute load average on the writer. And I'm pretty pleased about that. It means it's a viable system.

One quick digression, but back to Ceph, because I did say this talk was about running Ceph in production. There are two classes of load. We're putting all that load into the object store, but of course we're running the nodes of the system on OpenStack, on RBD, and that in turn is running on a different Ceph cluster, the one we just brought up for our production cloud. If you've never worked with Ceph, this will be a bit of gibberish; if you're an old Ceph hand, you'll recognise it. This is what Ceph looks like when it's healthy. There are some monitors, it tells you something about how much disk space is available, and it says that the placement groups, the units that Ceph hashes objects over before sending them to individual machines, are all happy and clean. When a single OSD is down, you get something that might look like this. It says, well, I'm missing one: only 167 of the 168 are up, and most of the placement groups are happy, but the rest are currently being served by their peers. Then it goes into other phases: it evolves from there until, after a period of time, it decides that no, it looks like that machine's not coming back, so it's going to need to worry about cascading the data on. Ceph's very clever in that if what's happening is a reboot, there's no reason to rebalance the whole cluster for every affected object if the disk is coming back and the machine is just unavailable for a little while. So that's the difference between a machine being down and a machine being out, and that's 10 minutes by default, but you can configure it, and you can also flag it not to do that if you're doing heavy cluster maintenance. So you see things like this in a fairly ordinary case. It comes up with HEALTH_WARN as the indication, but really that's probably something they should change a bit, because this is entirely nominal behaviour.

Unfortunately, it says the same thing when it looks like this. I have this on the wall at work, and I showed it to Tim and he's like, can I take a picture of this? Yeah, this was real, this happened. We were having some network trouble. What was happening was that InfiniBand and XFS were fighting each other for the highest-order allocations in the memory manager. It looks like we were one of the first people to hit this one, but since then we've heard other people squealing in pain, so at least it's not just us. We've now worked out some workarounds, but the problem was ultimately that InfiniBand would start dropping packets because XFS was trying to store too many extended attributes, because, anyway. So packets were getting lost, and another machine would assume that since it wasn't getting health-check reports, that machine must be down. So it would start telling all of its friends that that machine was down, even though it wasn't. Ceph copes with that, but this was happening en masse, and then it started happening to the monitors too. So the monitors would go away, which would trigger a monitor election. You shouldn't really get a monitor election more than about once a month, and that's just because it's bored. We were getting a monitor election every couple of minutes, and needless to say the whole thing came to a screaming halt. I never want to see recovery I/O looking like that ever again. But you know what? I'll tell you something. Every single VM was wedged: you're sitting there pressing Enter on your shells, wondering why nothing is happening. Everything was stopped, which from an availability standpoint represents loss.
Yeah, I mean, this thing was down. We were down in terms of any external consumer, but all of the VMs were safe. They were just blocked. Ceph is a consistency store, not an availability one. CAP theorem. Most systems nowadays are built as AP stores: you have lots of copies, you write an update, and eventually one will overwrite the other if it's last-writer-wins, or maybe if you're really smart you'll build a CRDT, which is a topic for another day, and if you're not really smart you'll have a shopping cart like Amazon's where things magically appear and disappear. Dynamo, Riak, Cassandra: these are all availability stores, and indeed S3 is as well. I mean, S3 is great: if you write to it once, it'll acknowledge that write, and done. Just don't update anything in it and expect a consistent answer any time this week, which is why you don't want to use S3 for backups. Or to build a file system on, anyway. I told you at the beginning that Ceph's RADOS, the object store, this entire system of peers talking to each other, was built to be the mechanism underneath a distributed file system. And the semantics of every file system we've ever built rest on the assumption that the file system code owns the blocks. This is why you don't hook one drive up to two different machines. The hard part of building a distributed file system is that you've got multiple machines all trying to hit the same blocks, right? So the RADOS abstraction presents a consistency model: they're using Paxos to guard the master map of what's available, and then the nodes themselves order writes to create consistency. Ceph will block if it can't complete a write. It will just sit there and stop. The whole thing grinds to a halt: hundreds of terabytes, as you saw there, suddenly unavailable. And that's the point. You kind of have to clench your fists hard and go: yes, this is what I wanted. Because we're used to languages where, yes, it compiles and runs; it doesn't work, and we monkey-patched it into insensibility, but it runs, damn it. Why are we working in a statically compiled, strongly typed language? Well, things don't compile when something downstream is broken in the build chain. That's the point. I don't want to be able to build this software if it's broken, and I'd prefer the compiler telling me that to our users finding out at runtime. CP versus AP is the same thing. So when we were able to figure out the problems I was alluding to on the previous slide and execute one of our workarounds, suddenly a whole bunch of queued shell commands went through, because, no, it's happy now: it's got its read back and it carries on. So the client VMs running the system never fell over, and likewise, similar behaviour in the object store. Our systems absorbing data have occasionally forgotten to write or fallen over, but once Vaultaire has accepted a data point, we haven't lost a single one in 10 months of running. And that's really not down to me so much as it is to Ceph being an amazing piece of engineering. If you can tolerate the consistency property, it's something you should really consider looking at. We've been in production for a while. As I said, the second version improved the write performance problem.
There is a corner case that we've hit, which is that a bucket is very fast for a single read: you just scan down it, pull that data point out, gather the points together and give them back. But the reason I built this monster was to do analytics on. And so the first thing our data scientists, PhDs, did was request all the data, which went O(n²) on me, needless to say. So a pretty obvious optimisation is to gather a bulk request, figure out which reads are going to a single bucket, and do all of those at the same time. That's going to happen next. And then we've got some further work to go.

The final thing — am I about to get cut off at the knees here? Really quick — all right, the final thing is caching. I talked about immutability of data being a big insight. The other insight into the system that I want to offer is that, to improve performance, it's obvious we can put caching in front of the thing, but the biggest insight we've had is that there are different types of data, so we don't just want an arbitrary cache. There are three kinds of data we care about. First and foremost, I want those bulk reads not to mess up the system. But there's also a live set of data, basically everything that's happened in the last couple of hours, eight hours or so. I need that all in memory, because I can build models that evolve over time, looking for patterns and doing alerting and calculations. And it turns out it's only going to be about 180 gigabytes: the last eight hours of data is 180 gigs, even if it's really exploded out with a 10x multiplier for memory layout. So we can keep that in RAM, no problem. So we're going to create a fast path across the top that doesn't go down: any point that comes in gets broadcast across to the live-set cache, so those points are available. The second set of data we care about is a very, very limited set: the few data points that matter for billing. If a customer wants to look at their bill, they want to look at their current complete bill, which is last month; they probably want to compare it to the month before that, fine; and we need to be accumulating the current set so we can tell people what their usage is in real time. That's all I need to care about there. I just need to make sure those stay available; I can't have them blocked behind a slow query on the data pool. So that's the second specific data set that will have a cache. And then the third one is to take the remainder of the traffic, the bulk reads, and cache it, because sometimes people ask for the same query multiple times, so there's no reason we can't accelerate that. But basically pass that through and keep it out of the way.
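A small sketch, with all names invented, of routing reads into those three classes so bulk analytic queries can't crowd out the live set or the billing data:

```haskell
-- Illustrative read-classification policy only; not Vaultaire's actual code.
import Data.Word (Word64)

data ReadClass = LiveSet | Billing | Bulk
  deriving (Show)

data Request = Request
  { reqStart          :: Word64   -- nanoseconds
  , reqEnd            :: Word64
  , reqIsBillingQuery :: Bool
  }

-- Billing queries go to their own cache, anything entirely within the last
-- eight hours is served from the in-memory live set, everything else is bulk.
classify :: Word64 -> Request -> ReadClass
classify now r
  | reqIsBillingQuery r            = Billing
  | reqStart r >= now - eightHours = LiveSet
  | otherwise                      = Bulk
  where eightHours = 8 * 3600 * 1000000000

main :: IO ()
main = do
  let now = 1393200000000000000
  print (classify now (Request (now - 3600 * 1000000000) now False))  -- LiveSet
  print (classify now (Request 0 now False))                          -- Bulk
```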
Now, I'm not going to follow Linus's rule on this one, but it's a lot of pieces. In hindsight, I probably should have ruthlessly suppressed the number of repositories this has proliferated out to. It basically matches mostly Haskell modules, but there are other languages in there too. The stuff I've been talking about today is the three at the top that are bold-faced. A lot of people have gone into making this what it is. I can't say it's an open source project in the sense of having a large community yet, but down at the bottom there, Aaron Zimmerman is the first person outside of Anchor who's called in and said, I'm trying to install it, can you help me out? So we've got users outside of the company, which is terrific. And a big shout out to Katie, whose talk yesterday you might have seen, who's built a visualization system which is a big consumer of this. And to people outside the company as well who have offered us help on the design and the indexes we'll build in the next generation. The last thing I'd mention is that I wanted to build analytics on this, and this is just a very quick example of one of the early outputs of the system in terms of being able to use the knowledge we've gathered: it's correlating the kinds of failure modes that happen in HTTP servers with system metrics, using a decision tree model. So that's what I had to say. Thanks for your time.

Okay. On behalf of LCA, please take this gift, Andrew. Thanks very much. And please thank Andrew again. Thank you.