I'm not sure where that happened; I haven't been able to find out yet. Anyway, my name is Brian Brazil, and I'm here to talk about evolving Prometheus for the cloud native world. If you don't already know me, I'm one of the main developers on Prometheus, I worked at Google for a while, and I've contributed to many open source projects. I'm also the founder of Robust Perception, which provides support and consulting around Prometheus. But enough about me, what are we actually going to talk about today? Firstly, I'd like to talk about how we got to where we are in terms of monitoring. Then I'd like to talk a little about Prometheus itself, presuming you know nothing about it. And then we'll look at how Prometheus has changed over time as the cloud native environment has evolved.

If you look at a lot of monitoring, and a lot of the attitudes to monitoring we have today, much of it is based on tools and techniques from the 90s and 2000s that were great at the time. I remember my first MRTG graph. I had it working after an hour or two, and I thought: yes, this is my CPU usage, and my network usage, it's amazing. And it was. We were also in a situation where we had not that many machines and not that many services, and they were lovingly cared for by sysadmins who treated them almost like their own children, what we'd call pets these days. Special cases were the norm: every service was special, and got special love, care and attention. Tools like Nagios come from this world where machines are pets and each service lives on one machine. We had the MySQL machine, the Apache machine, the mail machine. One machine, one purpose. And if any one of those machines deviated somehow, humans would jump on it, investigate it, fix it, and make sure it was absolutely perfect. That's loving and caring, but it also means that all this heroism and attention to detail was basically toil. You're jumping on all these things without much practical effect; you're basically keeping the systems running by feeding them human blood. Not literally, usually, though RAM can be quite sharp, so mind your hands.

As we move into the cloud native environment, we need a new perspective. It's no longer the case that we have one service on one machine, and that machine and service stay put until the machine finally dies, having had every part replaced several times. Instead we're in a world with systems like Docker and Kubernetes, where services are dynamically assigned to machines and can be moved around on an hourly basis due to autoscaling, new releases, or anything else like that. We're also going from a handful of monoliths to tens, or in some cases hundreds, of microservices. So not only are the services more dynamic, there are far more of them. The overall result is a much more dynamic system where things are churning and moving around a lot more, and there's much, much more to monitor. So it's different, and it's more difficult. That's the setting. So let's talk about what Prometheus is. Prometheus is a metrics-based monitoring system.
Yeah, that answer never really satisfies anyone on its own, but it is what it's about. In some of the previous talks you heard people talking about tracing, talking about logging, and so on. Those are all useful and important complementary techniques; you need all of them. But the important thing about Prometheus is that it doesn't care about every single request that comes in. If you had 100 requests in the last minute, Prometheus is not going to remember every single one of them. It is going to remember statistics, aggregations over time, about them. So it's going to remember: hey, there were 100 requests, in total they took two seconds, four of them hit the cache, three of them hit that weird code path, and one of them resulted in an error. So it has that overall view of which subsystems were hit and how long they took, and you can slice and dice it.

As well, Prometheus has a time series database at its core. A time series database just means a database that has time as a core dimension, which admittedly describes quite a lot of different systems with very different characteristics, but Prometheus is one of them. Prometheus also has a pretty powerful data model. If you're used to Graphite, you have the dotted-string names where you have to know the position of each field: okay, that position is the datacenter, that one's the application, that one's the region, and so on. In Prometheus, instead, it's key-value pairs with no fixed positions; nobody cares about order. So you can aggregate by those any way you want. If you want to aggregate by datacenter, you can. If you want to aggregate across a region, you can. If you want to say, actually, I don't care about any of the dimensions, just show me everything running the same binary because I think there's something weird with that binary, you can do that too.

Another handy thing about Prometheus is that the values aren't integers, they're doubles. Now, that might not seem like much of a difference: hey, I've got a 64-bit double versus a 64-bit integer, what's the difference? The difference is, think about latency. If you're restricted to integers, what units are you going to use? Seconds? Maybe milliseconds, microseconds, nanoseconds? It turns out that people choose all of these depending on their personal context, and then never put the unit in the metric name. So all you have is "latency". Okay, that looks like it's probably milliseconds. I hope it's not seconds, because the numbers would look pretty similar. And in case you're wondering whether this is theoretical: at one point Prometheus itself was using all four of those units across its own metrics. We're down to two now; we've only got microseconds and seconds left. With Prometheus, because we've got floating point numbers, we just say use seconds everywhere, and that problem largely goes away.

As well as that, you can do all the maths you want on those values: sum them, divide them, aggregate them. You can join two different time series together, you can take quantiles, you can do predictions, linear predictions, because when it comes to something like a hard disk filling up, static thresholds, whether in percentages or in gigabytes, don't work. A linear regression is better. It's not perfect, but it's better.
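To make that concrete, here is a minimal sketch of what instrumenting a service for Prometheus typically looks like with the Go client library. The metric name, label names, and handler path are purely illustrative, not anything from the talk; the key points it shows are the seconds-as-float convention and arbitrary key-value labels.

```go
package main

import (
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Duration is recorded in seconds as a float64, following the Prometheus
// convention, so the unit lives in the metric name rather than being guessed.
var requestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "http_request_duration_seconds",
		Help: "Time taken to serve HTTP requests.",
	},
	[]string{"path", "code"}, // arbitrary key-value labels, order doesn't matter
)

func main() {
	prometheus.MustRegister(requestDuration)

	http.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond) // pretend work
		w.Write([]byte("hello\n"))
		requestDuration.WithLabelValues("/hello", "200").Observe(time.Since(start).Seconds())
	})

	// Prometheus scrapes this endpoint; the application never pushes anything.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

From there, the rates, error ratios, and quantiles described in the talk are just aggregations over these labels at query time.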
And the important thing as well is that you can use all of this in graphs, but anything you can graph, you can also alert on. There is no division between graphing and alerting; it's all handled the same way. In fact, these are the Wi-Fi metrics for the conference network, which has been using Prometheus for about three years now. This is a screenshot from earlier, because unfortunately the Wi-Fi wasn't working in here last time I checked, but these are real SNMP metrics that we have in Prometheus. We use SNMP, of course; I originally got into that because I wanted to monitor my home switch, which turned into a bit of yak-shaving, but hey, it's pretty cool.

Another aspect of Prometheus is reliability. If you look at a lot of systems out there, they're clustered, they're complicated, and it's very difficult to get that stuff right. The core of Prometheus is a single binary, a single Go binary, and that's pretty much it. Each Prometheus server is independent; all it needs is local SSD. You can run it on its own and it just works. Now, people will ask: it's a single server, what happens when it can't talk to something for a while, isn't there going to be a gap? And the answer is yes, there will be a gap. It happens; deal with it. This might seem like a cost, but it turns out to be one of those things that's very difficult to get right. Let's say that either Prometheus or your servers got overloaded, and that's why data is missing. Then everything recovers, the cables are plugged back in, and we start getting metrics again. If we then say, right, let's backfill, we suddenly at least double the load, pulling all that old data back in, and we put that load on the services again. So you've just caused the outage you were trying to prevent. Backfilling is something we're very cautious with, because it can cause outages by increasing load on a service that's potentially just about to fall over. The simpler approach is to accept that blips are going to happen anyway, because of network weirdness or just unlucky timing, they happen every few hours or days, and to build everything to be reliable in the face of that, because failure happens; failure is normal.

Of course, because Prometheus is based on a single machine, you're limited to the capacity of a single machine. So for longer-term storage you do want some form of clustered storage system, and Prometheus has remote storage integrations which can send data out to such a system and read it back in.

Then we come to the more cloud native environments. We have dynamic environments where it's no longer the case that ordering a new machine takes six months and a whole stack of paper; instead machines can just appear as if by magic, thanks to sufficiently advanced technology. And you need to be able to detect those automatically; you can't rely on a human pushing out Ansible or Chef or whatnot. So Prometheus has built-in service discovery, which can talk to Kubernetes or EC2 or GCE or Consul, get all the machines and targets, and keep that list updated. So as there are new application rollouts, as there's autoscaling, even as entirely new applications are added, Prometheus can automatically pick those up and start monitoring them, if you configure it that way.
And even better, Prometheus is a pull-based system. Because we've got the list from EC2 of all the instances that exist, if one of them doesn't respond, we know it's down. A push-based system can't tell the difference between a system that is down and a system that simply doesn't exist; a system that's down just never talks to you. That's handy, because with bottom-up systems where instances announce "I exist", you'll eventually accumulate instances that are out there, that nobody talks to anymore, but that you're still paying for. You always want some form of reconciliation that detects those, and pull gives you that.

Another thing to keep in mind is heterogeneity, which I hope I've pronounced and spelled correctly. Normally, if you have a service on five or ten machines, the chances are the machines are reasonably identical. But if you suddenly have tens of microservices with tens of replicas spread across a hundred machines, those aren't so homogeneous anymore. Sure, it's the same instance type you bought from Amazon or Google or Microsoft, but in practice they have different CPUs with different processing power and different performance characteristics. As well as that, whatever is sharing the physical machine with you might be trashing your cache lines or something like that, which is going to make some of your instances slow. And if you've got all these instances spread around the place, say a hundred replicas of your service, and a few of them are slow, you can't alert on an individual instance being slow, because that's going to happen all the time and it's not a good use of your time; it's going to page you constantly, wake you up in the middle of the night, and you go, oh, that again. And it doesn't really matter, because it's not affecting the end user. As Tom was saying in his RED talk, we should be looking at what affects the end user and at your SLAs. So you don't care about an individual instance being slow; you care about whether the overall user experience across all instances is all right. Is the overall latency okay? Is the overall error rate okay? And that is an aggregation across instances. So instead of noticing that this one instance is a bit slow and waking everyone up, you say: actually, the whole service is within its SLAs, we're okay, and the dodgy instance can be dealt with in the morning, after you've had your tea or coffee.

And this brings us to a more general point: we're now in environments which are far more complicated than any one human mind can handle. You've got far more moving parts, you've got all these network overlays, the kernel is getting more complicated, the application is more complicated, you've got all these middlewares. There are so many moving parts, and they're changing so much over time, that alerting on everything that might cause a problem is just not practical. You can't do it. Even trying to enumerate and list everything that can go wrong is not going to work. So instead of looking at the things that could cause problems, look at your symptoms, look at what your users see, which is latency and error rate, or whatever your SLAs are for your systems.
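Here is a toy illustration of why you aggregate before alerting; the replica names, counters, and the 1% error-ratio SLO are all invented for the example. One unhappy replica trips a per-instance alert, but the service as a whole is still within its SLO, so nobody needs to be woken up.

```go
package main

import "fmt"

// instance holds a snapshot of one replica's request counters.
type instance struct {
	name              string
	requests, errors  float64
}

func main() {
	replicas := []instance{
		{"app-1", 1000, 2},
		{"app-2", 950, 1},
		{"app-3", 40, 12}, // one unhappy replica
	}

	// Per-instance alerting: pages for app-3 even though users are fine.
	for _, r := range replicas {
		if r.errors/r.requests > 0.01 {
			fmt.Printf("per-instance alert: %s error ratio %.1f%%\n", r.name, 100*r.errors/r.requests)
		}
	}

	// Symptom-based alerting: aggregate first, then compare against the SLO.
	var req, errs float64
	for _, r := range replicas {
		req += r.requests
		errs += r.errors
	}
	if ratio := errs / req; ratio > 0.01 {
		fmt.Printf("service alert: overall error ratio %.2f%%\n", 100*ratio)
	} else {
		fmt.Printf("no page: overall error ratio %.2f%% is within SLO\n", 100*ratio)
	}
}
```

In Prometheus you would express this as a query over the aggregated series rather than in application code, but the reasoning is the same: aggregate first, then compare against the SLO.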
And then, once a symptom-based alert fires, you can drill down into your services based on the rates and latencies of their calls: this service here, and within it this subsystem, that's where the problem is, and then you pull out your tracing, you pull out your profiling, you get your logs. So that's a view of some of the things Prometheus does, and can do, to help you deal with the cloud native world and all that dynamism and churn.

But Prometheus wasn't always this good at it. It started off pretty basic; that was four years ago now. It had very little in the way of service discovery. PromQL was not what it is today. And naturally, over the years, things have evolved. Prometheus 2.0, for example, which came out a few months ago, brought improvements in two major areas. One was the new time series database, which is far more efficient. The other was staleness handling, which I worked on, and which better supports instances going away. In older static environments, machines hardly go away at all, maybe one every few months. In EC2, that might happen a few times a minute, so you need to be a bit more careful, and the artifacts that causes become a real problem.

So, version 1 of the storage. At the start of Prometheus's life, really the first two years or so, everything was in LevelDB: all the actual time series data, and all the metadata. By metadata I mean the labels, the arbitrary dimensional key-value pairs, and the index over them, because we need to be able to say: hey, give me all the node exporter series for CPU. Then we need to find all those series and pull their data. So it's a two-step process: look up your index, then pull your data. All of that was stored in LevelDB. If Prometheus was shut down or killed, you could lose up to 15 minutes of data. In terms of performance, it topped out at around 50,000 samples per second, which was still, shall we say, competitive within the space at that time. That's quite a lot of samples. Just for context: if you had 500 machines, scraped every 10 seconds with a thousand metrics each, it could deal with that. So it worked nicely, even on a reasonably large system.

As to why you can't just use MySQL for metrics, at least not at this sort of volume: the problem is that when I'm pulling data from a machine or a process or whatnot, I'm getting a thousand metrics at a time, all with different names. But when I want to do a read, I don't want the thousand series that happened to be scraped two hours ago; I want this one metric, as a hundred samples across time. So everything I'm writing is the most recent value of many different series, whereas my reads go horizontally across time for one series. What you end up having to do is reorganise the data so it's efficient to read, with all of a given metric together on disk. That means you basically need to buffer up your writes a lot. Writing a time series database for this kind of metrics basically turns into having lots and lots of write buffering.
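As a rough illustration of that write-buffering idea, here is a toy sketch, not Prometheus's actual storage code; the type names and the fixed chunk size are made up for the example. Incoming samples are grouped per series in memory, and a series is only written out once enough samples have accumulated, so its data ends up contiguous on disk.

```go
package main

import "fmt"

// sample is one scraped value for one time series.
type sample struct {
	timestampMs int64
	value       float64
}

// bufferedWriter groups incoming samples by series so that each series'
// data ends up contiguous on disk, instead of interleaved scrape by scrape.
type bufferedWriter struct {
	chunkSize int
	buffers   map[string][]sample // keyed by series identity (its labels)
}

func newBufferedWriter(chunkSize int) *bufferedWriter {
	return &bufferedWriter{chunkSize: chunkSize, buffers: make(map[string][]sample)}
}

// add buffers one sample; only full chunks get written out.
func (w *bufferedWriter) add(series string, s sample) {
	w.buffers[series] = append(w.buffers[series], s)
	if len(w.buffers[series]) >= w.chunkSize {
		w.flush(series)
	}
}

// flush stands in for "append this series' chunk to its place on disk".
func (w *bufferedWriter) flush(series string) {
	fmt.Printf("writing %d samples for %s\n", len(w.buffers[series]), series)
	w.buffers[series] = w.buffers[series][:0]
}

func main() {
	w := newBufferedWriter(3)
	for t := int64(0); t < 7; t++ {
		w.add(`http_requests_total{job="api"}`, sample{t * 10000, float64(t)})
	}
}
```

The real storage engines did this with per-series chunks and checkpoints rather than a simple map, but the shape of the problem is the same.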
And that brings us to version 2 of the storage engine, which was in Prometheus 1.x; it was part of Prometheus 0.9, which we launched about three years ago in Berlin. It moved the time series data to a file per time series, which means a given series' data is all together on disk, and the writes were spread out, over something like six hours, rather than everything being written every 15 minutes or so. There was also double-delta compression, so we went down from potentially 16 bytes per sample, eight bytes for the value and eight bytes for the timestamp, to about 3.3 bytes per sample. And there are regular checkpoints, so losing 15 minutes of data was no longer a problem.

And over time, because we had this database for a few years, lots of other things were improved. We added some basic caching to the indexing. Facebook released the Gorilla paper with an even better algorithm for compression, so we adapted that and got down to about 1.3 bytes per sample, which is pretty good. And there were memory optimisations. Last year I was trying to do benchmarks, to be able to tell people this is how much CPU and how much RAM you need, and in trying to understand it, I ended up cutting resource usage by about 30% instead. So the benchmarks lost out, but those are decent resource savings. As well as that, in a later 1.x release I think it was, we made memory usage easier to manage, because if any of you have worked with databases, you know how there are five or ten or twenty knobs you need to tune while trying to guess how much memory will be used. We had about five at that point. Instead, you could just say: here's how much RAM I want to use for the heap, and it would adjust things and evict things to make that happen, which made administration far easier. So that's all good.

So the outcome of V2 was far more performance. The V1 storage engine was doing 50,000 samples per second; with the V2 storage engine, the ingestion record is about 800,000 samples per second. So that's a nice little boost. What's that, a 16x improvement? But it wasn't perfect. If you had to restart, it wasn't what you would describe as healthy, because at that scale it takes 40 to 50 minutes to checkpoint the data, whereas previously in V1 you were risking maybe 15 minutes, so potentially even more data loss. That was all fine back when we were at 50,000 samples per second and a checkpoint took 30 seconds, but as things got larger and larger, it became a real problem. And churn, which is to say new time series, new targets, new machines appearing and disappearing, affects the indexing. The indexing had a practical limit of somewhere around 10 million time series. Some people hit it at five million, others pushed it to 20, 30, 40 million, but you're going to run into problems somewhere around the 10 million mark. And that's across the entire retention period. So if you've got a Prometheus which you've set to keep data for three months, and you keep churning instances due to releases and so on, you can burn through that very fast. As well as that, we've got a file per time series, so if you want to delete the oldest 10%, you have to rewrite the entire file, because only XFS supports truncating the front of a file and we can't presume that exists. So that's a 10x write amplification factor, and it's known that this version of Prometheus killed SSDs at several sites. Yeah.
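For a feel of the Gorilla-style compression just mentioned, here is a tiny sketch of the delta-of-delta idea from that paper; this is not Prometheus's actual encoder, which also bit-packs these values into a few bits each and XOR-compresses the sample values, and the timestamps here are invented. With a regular scrape interval, the difference between successive timestamp deltas is almost always zero, which is why it compresses so well.

```go
package main

import "fmt"

// deltaOfDeltas turns absolute timestamps into delta-of-delta form:
// first timestamp in full, then the first delta, then the change in delta,
// which is usually zero when scrapes happen at a regular interval.
func deltaOfDeltas(timestamps []int64) []int64 {
	dods := make([]int64, 0, len(timestamps))
	var prevTs, prevDelta int64
	for i, ts := range timestamps {
		switch i {
		case 0:
			dods = append(dods, ts) // first timestamp stored in full
		case 1:
			prevDelta = ts - prevTs
			dods = append(dods, prevDelta) // first delta stored next
		default:
			delta := ts - prevTs
			dods = append(dods, delta-prevDelta) // usually 0 for regular scrapes
			prevDelta = delta
		}
		prevTs = ts
	}
	return dods
}

func main() {
	// Scrapes every 10s, with one slightly late sample.
	ts := []int64{1000, 11000, 21000, 31010, 41010}
	fmt.Println(deltaOfDeltas(ts)) // [1000 10000 0 10 -10]
}
```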
Yeah, that write amplification was quite something. The first trial on a Raspberry Pi killed the SD card, which wasn't too surprising given how cheap that flash is. But when production setups were also being hit by write amplification, that was not the best. As well as that, for LevelDB we were using the cgo implementation, and we got to the point in Prometheus where we started discovering things like compiler bugs, kernel bugs, and bad hardware. So we were at the stage where we were seeing weird stuff, corruption and crashes, which we think were inside LevelDB, but we couldn't really debug them because it's C code and we can't easily debug that from Go. That's not perfect.

So where can we go from here? We want something that deals with churn, that's more efficient in all its label indexing so we don't have this 10 million series limit, and that avoids the write amplification. It would also be nice to be able to take backups, because some users want that. If you have a Prometheus that realistically takes 20 minutes to do a checkpoint, that means 20 minutes of shutdown time, then you take the snapshot, then you turn Prometheus back on. Regularly taking your Prometheus down for 20 minutes is probably not what you want to be doing for backups. You can tolerate five minutes once a week or once a month, but 20 minutes a day isn't going to cut it.

This brings us to the version 3 storage, which is in Prometheus 2.0, and was written by Fabian. The principle is that storage is instead split into blocks, which are initially two hours long. A block is built up in memory and written out every two hours, and blocks are later compacted together. There is an index based on postings lists, the inverted index approach, and everything is accessed via mmap. Previously Prometheus was doing its own memory management; now we just let the kernel take care of it through the page cache. So Prometheus now just uses as much memory as it needs, and there's no heap sizing configuration required. There's also a write-ahead log for crashes and restarts, so a restart will in the worst case take about a minute. Which is pretty sweet.

The outcome is, well, it's still early days and larger setups are only starting to switch over, but millions of samples ingested per second is certainly possible. And if you look at other similar systems, not just time series databases but other databases too, and see how much performance they can get from a single machine, the numbers top out somewhere around one to four million samples per second per machine. So it does seem we're getting pretty close to the practical maximum of what you can actually get out of a single machine, which is nice to know. Read performance also improved quite a lot due to the new index, so query times have come down a good bit. Memory and CPU usage are down by a factor of about three, thanks to lots of micro-optimisation; it's been optimised to hell and back, basically. And the writes, because we're no longer writing to all those individual files, I/O is down by a factor of about a hundred, which lots of people have thanked us for. Which is good. So let's look at what Prometheus, and Prometheus 2.0 in particular, now gives you. It's based on years of monitoring experience, and the new time series database is far more capable than what came before.
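To show what the postings-list index mentioned above means in practice, here is a simplified sketch, not the actual tsdb code; the label pairs and series IDs are invented for the example. Each label name=value pair maps to a sorted list of series IDs, and a query intersects those lists to find the few series whose chunks actually need to be read.

```go
package main

import "fmt"

// A tiny inverted index: each label name=value pair maps to a sorted
// list ("postings list") of the IDs of series that carry that label.
var postings = map[string][]int{
	`__name__=node_cpu_seconds_total`: {1, 2, 3, 4},
	`job=node_exporter`:               {1, 2, 3, 4, 7},
	`instance=10.0.0.2:9100`:          {2, 7},
}

// intersect returns the series IDs present in both sorted lists.
func intersect(a, b []int) []int {
	var out []int
	for i, j := 0, 0; i < len(a) && j < len(b); {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

func main() {
	// node_cpu_seconds_total{job="node_exporter", instance="10.0.0.2:9100"}
	ids := intersect(postings[`__name__=node_cpu_seconds_total`], postings[`job=node_exporter`])
	ids = intersect(ids, postings[`instance=10.0.0.2:9100`])
	fmt.Println(ids) // [2] -> only this series' chunks need to be read
}
```

The real index also has to handle regex and negative matchers, but intersecting sorted postings lists is the core operation.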
As Prometheus has grown, it's gained service discovery to find what you want to monitor, and PromQL, which lets you alert on the things you actually care about rather than just the things you happen to be able to alert on. So you don't have to alert on CPU usage being high; you can alert on latency being high, which is probably what you actually care about. And this means that Prometheus is a pretty good choice as the core of your metrics monitoring. It has a community of thousands of companies using it. There are hundreds of exporters. There are client libraries for at least 15 different languages, a handful of them officially maintained. It is a big community, and it is only growing. And I can also announce that there is a book, which I'm writing at the moment, hopefully out within a few months. No news yet on the stage play or the action figures; that may evolve.

So at this point, I'd like to open it up so people can ask questions. We have about 10 minutes, I think, or maybe eight. Okay, question. You said blocks get compacted together. So the question is: I said we're doing compactions, and that implies write amplification, so how do we avoid that? And the answer is: yeah, you're right, but it's not as bad as previously. I think the data normally only gets written two or three times in total now rather than ten; I'd have to check the exact numbers. More questions? Next question. Who would play him in the film? So Tom's asking who would play me in the film. I don't know; if you grew your hair out, you might be able to play me yourself.

Brian, can you talk a bit about how work on Prometheus is funded, and what's next on the roadmap? Okay, so the question is how work on Prometheus is funded, and the roadmap. Prometheus is a pure community project; there is no one company behind it. So how is the work funded? For example, my company, Robust Perception, does support and consulting around Prometheus, so you can give us money for a support contract and I will hire more developers to work on Prometheus. There are other companies, like CoreOS right now, who have a product that includes Prometheus for its monitoring, so because they benefit when Prometheus does well, they're putting developers on it. Before that, Red Hat was putting developers on it. SoundCloud has had developers on it. GitLab is also integrating Prometheus into their product in some pretty cool ways, so they're hiring developers, although they're mostly working on their own integrations, plus there are a few other people with projects of their own. But the answer is largely the same as for most open source projects: it comes from corporate interest, whether they're using Prometheus internally and want to expand it, or they're providing support services and this is a way to advertise. In relation to the roadmap, there's talk about improving the UI and possibly rewriting it. We obviously want to make things more polished, maybe add TLS to the various endpoints, now that we have the engineering resources to do that, because the number of developers we have has been a bottleneck, especially with the growing community.

Next question. With Prometheus being designed to capture the state of systems and networks in the here and now, and this usually applying to services, systems and so on, what's the weirdest thing you've ever heard of being monitored or tracked with Prometheus?
So, what's the weirdest thing I've heard of being tracked with Prometheus? Satellites. It turns out there's a satellite company who have satellites in orbit, little things a couple of U in size that take photos, and they're tracking actual telemetry from the satellites with Prometheus. I think there's possibly a node exporter on some of those as well. We also know that Deutsche Bahn is running the node exporter on their platform signalling systems. Plus there are various home uses: of course someone's done Bitcoin, someone's done their local petrol prices, temperature and environmental monitoring has been done. People are doing everything with it. Yep, next question.

Okay, can you tell us a little bit more about the time series database that's being used in the newest version? Is this something you built yourselves, or is it based on some existing time series database? So the question is to talk more about the time series database in the new Prometheus. Okay. If you look at how these systems and databases are designed, there are only so many models for doing it. We're using a broadly similar design to other time series databases, the same sort of log-structured, block-based approach, because there are basically only one or two designs for these. So you've got your two-hour block. Prometheus is taking in all the data, assembling it and chunking it up in memory, doing all the buffering it needs, and writing it out. When a query comes in, Prometheus has all these blocks on disk. So first it goes to the relevant blocks, then it looks at the index in each, saying, say, give me the node exporter CPU usage, for each of those, and then it assembles that all together. And the most recent data goes into the head block, which is entirely in memory and is what's currently being written to. So it's a pretty typical database design. Fabian has a good talk, "Storing 16 Bytes at Scale", from the last PromCon, which will probably give you the most detail on that. Okay, time's up. Thank you very much, Brian.