 excellent to, you know, be able to grab lunch and just talk Rust the whole time without being the weird languages guy. So, yeah, Xavier Lang. I'm a software contractor based in San Diego and I've been working with a company called Viasat on some large-scale logging infrastructure. I dig the outdoors and I'm a proud cat owner. You have a lot of languages to choose from, so I spend a lot of time reading all over the internet about all the different things you can use. This is kind of my grab bag of what I like to talk about, not necessarily what I get to write my projects in all the time. And I'm just very lucky that I got to use the weird category. Some of my peers will come up to me and they'll just be, you know, what's Xavier up to now?

But yeah, like I said, proud cat owner. This is my cat, Lua. I tried implementing Lua a long time ago, and at the same time I got a cat. The cat hung around for a lot longer. Lua's kind of an inspiration that you should get out there and try new things, even if you might get hit by a car. See, that's the best reaction. She's really tough, you know? And even if you get hit by a car, everything's going to be okay, because you can just go home, chill out, watch some stand-up comedy.

So I got out there and I tried writing Graphite in Rust. Graphite is a suite of applications written in Python, which is, quote unquote, slow. I don't know what you would expect, but at Viasat we really liked Graphite. We were having a lot of luck with it. It's got a bunch of functionality, and it's got some weird performance configuration by default, so as we scaled up we kind of hit this and that bottleneck. But we still wanted to keep it.
I do a lot of system automation stuff, and Graphite is one of the more complicated components to install and manage. I don't like that, because it's really secondary to the core purpose of what we've been doing with the logging infrastructure. And there's just a little bit of, for me, not-invented-here. I thought, hey, I can do better than that. Give me some time to try writing an implementation of Graphite in Rust, whatever that means. Like I said, this is all built to support sort of a cloud installation and some other services.

So what is a time series? Graphite is a time series database; what does that mean? I just wrote some pseudocode here to show you: think of each data point as a tuple, a timestamp and a floating-point value. Then you just have a bunch of them right next to each other, like a vector. And once you've built that archive, your series is just a static slice of it. So in Graphite, once you've allocated all the room for your metrics, you're not going to shrink or grow that array of values, and Graphite's going to put that on disk and take care of it for you.

So what are we going to cover today? We're going to run through some Graphite concepts, the sort of architecture. We're going to go through some of the shortcomings. Then I'm going to talk to you about my implementation, which was meant to address those shortcomings, and then we'll talk about the new shortcomings that I generated.

So let's talk metrics. We're not talking analytics, we're not talking business intelligence, we're not talking high-level stuff. And we're not talking stats aggregation. There's a lot of work in all these fields, but that's not what we were using. We were using long-lived metrics.
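A rough Rust sketch of that pseudocode idea, a data point as a (timestamp, value) tuple, an archive as a fixed-size run of points, and a series as a slice borrowed out of the archive. All names here are made up for illustration, not the actual types from the talk:

```rust
type Timestamp = u64;

/// One sample: a timestamp paired with a floating-point value.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Point {
    timestamp: Timestamp,
    value: f64,
}

struct Archive {
    // Pre-allocated once; never grown or shrunk afterwards.
    points: Vec<Point>,
}

impl Archive {
    /// A series is just a borrowed slice of the fixed-size archive.
    fn series(&self) -> &[Point] {
        &self.points
    }
}
```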
So you have this cluster in the cloud infrastructure, you've got valuable data being stored, and these individual servers hang around for a long time. We're talking, you know, server A's CPU for two years. It's really valuable to be able to look at the samples of that CPU over time and figure out what's going on. Like I said, all these metrics are stored in pre-allocated space on disk, and the names for the metrics are really simple: one-dimensional, ad hoc. You make it up, you give it a name, and then you start filling in values for it. So the naming is really primitive. If you wanted to, you could measure stuff like: how many active members do you have in Rust? How many lifetime questions are people asking in Rust? You've got Go questions. What's really intuitive about Graphite's querying mechanism is that you can just put little wildcards in there, like that little star, and that'll give you both Rust and Go together. So even if you add new metrics later, your queries will just pick them up, and since you're plotting all these metrics it's really helpful to be able to just say, give me all these different series together.

The values in Graphite, like I said, are really simple: timestamps and floating-point values. One of the things about Graphite is that metrics are stored at per-second resolution. So you're not going to be able to store millisecond stuff; don't even try going any lower. And you've got a pretty wide range of values for the floating point, and that's that.

So what can you do with that simple restriction? I'd say you can do a lot. You can graph how your different servers are using their disk; here you can see a spike of disk usage. Here you can see the load in the cluster, and this is a good example of: we added more servers and then the load went down.
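A toy sketch of that wildcard idea. Real Graphite globbing is richer than this; here a `*` segment simply matches any single dotted name segment, and the function name is made up:

```rust
// Match a dotted metric name against a pattern where "*" matches
// exactly one name segment, e.g. "questions.*" matches both
// "questions.rust" and "questions.go".
fn glob_match(pattern: &str, name: &str) -> bool {
    let pat: Vec<&str> = pattern.split('.').collect();
    let seg: Vec<&str> = name.split('.').collect();
    pat.len() == seg.len()
        && pat.iter().zip(seg.iter()).all(|(p, s)| *p == "*" || p == s)
}
```

This is why queries keep working when you add new metrics later: the pattern matches whatever names exist at query time.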
So on the left half you've got higher load, on the right half you've got lower load; really straightforward. You can instrument your different services, so this is tracking the heap usage in the JVM, and you just have to get the metrics into Graphite and then you can start looking at stuff like this.

Right, so Graphite is actually three components: the web part, the UDP/TCP listener, and then the actual file format. Like I said, it's difficult to install; you do pip install or a virtualenv. Graphite-web is, for me, one of the less interesting components. I've written a lot of web servers before, I think Python's perfectly good at doing that, and it only has read-only access to the database; you don't actually insert anything through the web interface. And it's got a ton of features, which is not what I'm going for. I don't want to just rewrite all these existing features. So I took a stab at it, but really what's more interesting is the daemon for recording data points. That's carbon, the one you send the metrics to. If you're sending it a name it's seen before, it'll open up that file and put the values in there, and if it's a new metric, it's going to create the file right there on the fly. And then there's whisper, and this is where it gets interesting. This is bytes on disk; this is the archive and down-sampling system. It is a database, but it's not your Postgres backing store; it's way easier. None of those fancy features, and it gets you going pretty quick.

So, you know, Rust is fast. Maybe if I just write whisper in Rust, it'll be fast. It's true. So let's dig into whisper, because that's what I want to talk about: cool byte slicing, mutable reuse, all that good stuff. Here I've just sketched it out: you've got your cloud and it's writing into an archive, putting the data point in there, and then what happens is that you set up this whisper file to down-sample the data.
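For context on what carbon actually receives: its plaintext protocol is one sample per line, a metric name, a value, and a Unix timestamp, separated by whitespace. A minimal parser sketch for that line format, with error handling simplified down to `Option`:

```rust
// Parse one line of carbon's plaintext protocol:
//   "servers.a.cpu 0.5 1500000000"
fn parse_carbon_line(line: &str) -> Option<(&str, f64, u64)> {
    let mut parts = line.split_whitespace();
    let name = parts.next()?;
    let value = parts.next()?.parse::<f64>().ok()?;
    let timestamp = parts.next()?.parse::<u64>().ok()?;
    // Reject trailing garbage after the three expected fields.
    if parts.next().is_some() {
        return None;
    }
    Some((name, value, timestamp))
}
```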
You go from 60-second resolution, so you're sending it data every 60 seconds, down to hourly resolution, and then to, like, weekly resolution. That's your schema. I just made that one up, but you know: minutes, hours, whatever. And what ends up happening is that whisper down-samples the data on every write, so you end up with this sort of read explosion. You've got the first sample coming in, the unique value. You figure out where to put that in the first archive and you write it there. Then you have to read all of its neighbors, a big chunk, synthesize it down, and write it into the next archive. Then that in turn: you've got to read that chunk, summarize all that data, like an average, and move it on over again. And this is what was killing us. I didn't really realize how bad it was, but I had set up a really bad schema. So I could have spent more time in Python before diving into Rust, I'll admit. But, you know, I think I can make this pretty fast. Let's see what I can do.

Here's how the actual archives are laid out on disk. You've got your header. You've got some archive infos, these little things there. And then you've got a large swath of space for the first archive, a little bit less for the second archive, and a little bit less still for the third one. So you're just down-sampling like that.

So what does a whisper file look like in Rust? This is the sort of write-heavy perspective. You can open it, and you might get an IO error back. You can create a new one, providing a schema, and it'll lay out all the bytes for you. Or you can provide a mutable reference and start writing into the whisper file, and this will take care of writing through all the archives and doing that cascading functionality I was talking about. And then that's the interface, but what does an actual whisper file look like? Really simple.
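One step of that cascade can be sketched in a few lines: take the chunk of higher-resolution neighbors that was just read, and collapse it into a single lower-resolution value. Whisper supports several aggregation methods; average is the common default, and that's what this illustrative function does:

```rust
// Down-sample one chunk of high-resolution values into a single
// low-resolution value by averaging. Returns None for an empty chunk,
// since there is nothing to aggregate.
fn downsample_avg(chunk: &[f64]) -> Option<f64> {
    if chunk.is_empty() {
        return None;
    }
    Some(chunk.iter().sum::<f64>() / chunk.len() as f64)
}
```

The read amplification comes from having to do this at every archive level on every write: one incoming point triggers a chunk read and an aggregated write per lower-resolution archive.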
It's a file handle and some header information, so that I can do all the right sort of offsets. This is not at all how simple it looked in Python. I didn't have anything that concrete that I could just point at and say, this is what a whisper file is. So I'm pretty happy about that. And then I talked about the archive infos that track the metadata so that I can plan my writes. This one's pretty cool: you actually get most of these bytes out of the file, and here I'm using Rust to give me a bunch of derives so that I can easily create these archive infos in unit tests. Still, not much code.

And here's where the Rust and Python really diverged: using stuff like tuple structs. That's really handy. They're a great way of giving a sort of wrapper around the contained values. The code in Python was really mixing up how you could index into a file, how you could get the buckets, and how the buckets would wrap around in the archive. It was very confusing, so I removed a little bit of that confusion by giving those individual concepts names. You've got the archive index, which is kind of each point in the file, and then the bucket name, which is actually the normalized timestamp. So in whisper, you may have a 60-second resolution, but it'll actually accept any timestamp within that 60-second window and just normalize it into one bucket. And just getting that one piece of information, that that's how data was normalized in whisper: I spent so much time reading the Python code to figure it out. I might be a little bit slow, but if I had seen these types, I would have been very happy.

So I mentioned that there's some wraparound going on: you start writing into the archive on the left side, you go all the way to the end, and then you go back to the left side and keep on writing.
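The two tuple structs might look something like this. The names are illustrative, but the normalization rule is the one described above: every timestamp inside a resolution window maps to the same bucket:

```rust
// An index into the archive: which slot in the file a point lives in.
#[derive(Debug, PartialEq)]
struct ArchiveIndex(usize);

// A "bucket name": a timestamp normalized down to the archive's
// resolution, so all timestamps in one window share a bucket.
#[derive(Debug, PartialEq)]
struct BucketName(u64);

fn bucket_name(timestamp: u64, resolution_secs: u64) -> BucketName {
    // Round the timestamp down to the nearest multiple of the resolution.
    BucketName(timestamp - (timestamp % resolution_secs))
}
```

Wrapping the raw integers in distinct types means the compiler stops you from mixing up a file index with a normalized timestamp, which is exactly the confusion the Python code suffered from.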
And this is a good case study for how you're going to do that wraparound read. If you ask for, like, 10 samples and you are five indexes away from the end of the archive, what you need to do is read five, then read the first five, and put them into a contiguous buffer. That sounds like some slice reuse going on, right? So here's the function I wrote. The thing with the wraparound read is that it's usually in the context of that read-amplification down-sampling, so you need the high-res archive and the low-res archive, and in this API you provide the buffer for storing the whisper data points. Then you need a little bit more information: a point timestamp, and this anchor into the high-res archive. But if you give me all of that information, then I can tell you what archive index you're going to need, and I can give you a slice of that buffer you gave me and tell you to fill it up with bytes.

So this is sort of a contrived example, but what's also important to notice is the Option on the second return parameter. This indicates whether it's a wraparound read or not. If it is a wraparound, you're going to get two of these pairs of values, and if it's just a contiguous read, that second parameter is going to be a None and you don't have to use it. So this was kind of me playing around with the mutable borrows in Rust. Another thing to notice is the lifetime annotation. What I'm saying is that if you give me that buffer, I tag it with a tick-a, and then I also tag the returned buffers with tick-a. So just by looking at this type signature, you can say, hey, I think those two are related, and that's really helpful. And then what does it look like inside of the splitter? It uses this really powerful tool, split_at_mut, which I think has been mentioned before, but it's really cool.
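A simplified sketch of that splitter. The signature here is illustrative, not the actual API from the talk, but it shows the same shape: the caller's buffer comes in tagged with `'a`, and the one or two slices handed back carry the same `'a`, with the `Option` distinguishing a contiguous read from a wraparound:

```rust
// Split a caller-provided buffer for a read from a circular archive.
// `start_index` is where the read begins; `archive_len` is the total
// number of slots in the archive.
fn split_for_wraparound<'a>(
    buf: &'a mut [u8],
    start_index: usize,
    archive_len: usize,
) -> (&'a mut [u8], Option<&'a mut [u8]>) {
    let until_end = archive_len - start_index;
    if buf.len() <= until_end {
        // Contiguous read: the whole buffer is filled in one pass.
        (buf, None)
    } else {
        // Wraparound: the head is filled from the end of the archive,
        // the tail from the beginning. split_at_mut hands back two
        // non-overlapping mutable slices of the same buffer.
        let (head, tail) = buf.split_at_mut(until_end);
        (head, Some(tail))
    }
}
```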
I'm not going to spend too much time going into it, but that's kind of what powers the read functionality. So yeah, is it faster? Yeah. Well, I'm not surprised by that, but is it more readable? I think so. In Python it spends more time in userland, and in Rust it spends more time in syscalls, but it's this sort of API and this sort of communication that's a lot more interesting to me. I still have the same naive behavior as Python, where you open up and close whisper files all the time and you have to reparse the headers constantly, and it's got some poor buffer reuse, but it's still faster, and I think that's cool.

Another goal was to make it easier to distribute, and you get your statically linked binaries; they're a nice size. I'm not really doing anything fancy, just cargo build --release. And then where do I stand on features? For the carbon functionality, I'm building out the whisper cache so that I can keep these files open for longer. The big problem there is figuring out a good eviction strategy, and then making sure that you can fsync the files periodically so that the web server can read fresh data. And what does that carbon cache look like under the hood? It's got the root path to the file system where you're going to find all these whisper files, and then it's just got a hash map for the open files.

And then, like I said before, the graphite-web part is not really built out. I'm having issues with some of the HTTP libraries, and like I said, it wasn't the most interesting part, so I would have much rather spent my time on memory-mapping data. I've got some beta progress on that, and that's actually been really easy. I've never jumped into a programming environment where I felt like I could just import a crate and start memory-mapping files. I think that's really cool. I need to get better at DTrace. One of the good things is that when I do a debug build I can trace it and see what it's doing.
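The cache shape described above, a root path plus a hash map of open files, might be sketched like this. The struct and method names are made up, and eviction and periodic fsync are left out; the path mapping follows Graphite's convention of turning dotted metric names into directories with a `.wsp` file at the leaf:

```rust
use std::collections::HashMap;
use std::fs::File;
use std::path::PathBuf;

// A minimal sketch of the carbon cache: where the whisper files live,
// and which ones are currently held open, keyed by metric name.
struct CarbonCache {
    root: PathBuf,
    open_files: HashMap<String, File>,
}

impl CarbonCache {
    /// Map a dotted metric name onto the filesystem:
    /// "servers.a.cpu" -> <root>/servers/a/cpu.wsp
    fn path_for(&self, metric: &str) -> PathBuf {
        let mut path = self.root.clone();
        for segment in metric.split('.') {
            path.push(segment);
        }
        path.set_extension("wsp");
        path
    }
}
```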
I'm just not very good at writing that stuff. I feel like that's the next step in my journey as a fast-software person. And I need to get better at carrying buffers around, because I do have all this power to borrow slices out of buffers, and I don't see why I should be calling Vec::new all the time.

So, was Rust a good choice? I don't think it's a good idea to advocate rewrites. This was very much an experiment. We got it into production and it writes the data really fast. But I think part of it is we just got smarter about Graphite, and that made a big difference too. So it wasn't a pure software fix. But Rust has got the great tooling. It's got the ability to run your tests, and the benchmarking has been really handy. I love to clone down my repo onto new cloud servers and then run the benchmark suite, which does all the whisper writes, and just get a good idea of what that server is capable of. I mean, I wouldn't advocate that as a benchmark for whether a server is good or not; it's just cool functionality. Like I said, memmap's cool. Being able to take a file and put it as bytes in memory, that's the kind of stuff I signed up for. And having the compiler yell at me is good. And, you know, I'm happy.

Anyway, when we have as many metrics as we do, actually rendering them in the browser is the issue. We just have a ton of metrics. So you can start looking at different ways of doing it; server-side rendering to PNGs is one feature that Graphite, the Python version, already does. So it's always interesting. Yeah, so thanks.