So my name is Joel Barajak, I'm a senior systems engineer at OpenSystems, and I work in the observability team. What does OpenSystems do? We're a managed connectivity company: we offer managed network solutions. So it's quite interesting, I think, to be here at KubeCon, which is a Kubernetes conference, because we mostly work with these kinds of boxes, these kinds of devices, and we don't run Kubernetes on them. So maybe it's interesting to ask why we're here in the first place. We run quite a lot of these hosts: currently just over 10,000, situated all over the world, plugged into all our customers' infrastructure. But today I'm not going to talk about how we monitor those hosts; that's another topic entirely. If it sounds interesting, and it is interesting, there's a link here to a talk I gave at KubeCon EU back in April, where I go into more detail on that. Today we're going to talk about Thanos. At least I hope that's what everyone is expecting; otherwise I wrote my abstract very badly. In particular, I'd like to talk about scalability, resilience and performance of the write path. For us, all of our metrics are customer-facing, so if there are problems with the metrics, it comes back to us very, very quickly: customers complain. So we're really, really focused on making sure we keep every metric that's shipped to us from all of those devices. And at the end, if there's time, there's one weird trick that reduced our storage costs by quite a significant amount. It was an interesting debugging journey to work out what was going wrong, so hopefully we can get to that as well.

Hopefully most of you are aware of what Thanos is. If not, you'll know what Prometheus is; otherwise I'll do my best to accommodate everybody. Thanos is a framework built around Prometheus, and Prometheus, of course, is the de facto metrics backend for most Kubernetes clusters. I think if you spin up a Kubernetes cluster today, one of the first things you do is install the kube-prometheus-stack just to get basic monitoring in place. Thanos wraps around Prometheus and extends its capabilities. It offers a global query view, which is really nice if you have maybe tens or hundreds of Kubernetes clusters whose metrics you want in a single pane of glass; Thanos can plug into all of those different Prometheus instances. It offers unlimited retention: one of the drawbacks of Prometheus is that you can't really have retention beyond, say, a few months, because querying becomes unfeasible due to the decompression time. It's inherently Prometheus-compatible. And it also has some newer features like downsampling, which is what actually makes the unlimited retention workable, by reducing the time it takes to decompress long-lived samples.

So let's start with how we actually get data into Thanos. We'll begin where everyone is familiar, I think: a Prometheus instance writing data into a time series database. This can be anywhere. It can be running on a host somewhere, it can be running in Kubernetes; it's just Prometheus doing its thing. And the classic way of running Thanos is to run it as a sidecar.
There is a sidecar mode of operation for Thanos, which basically scoops up the blocks that Prometheus writes into its TSDB and exposes them via the Store API. The Store API is a concept Thanos introduced which lets the different components plug into one another. So we can plug a querier into the sidecar's Store API, and then we can fetch data from the sidecar as if we were fetching it from that Prometheus. In addition, the sidecar can upload those blocks into long-term blob storage, and then we get the long-term retention that was promised. This is really cool if you have multiple clusters, because, say we have Switzerland West, US East and Europe North, it's very easy to build a global view of those metrics just by pointing one querier at all of those sidecar Store API endpoints. And at the same time, all of that data gets flushed into long-term storage. It's a pretty neat system.

But sometimes we can't use the sidecar approach. There are a number of cases where we can't directly reach an external Prometheus instance. Maybe it's running on a host we don't have direct access to; maybe that host only has egress access, for example, so we can't get into it. That's actually our situation: we have 10,000 customer hosts out there, and we can't connect to them directly. They phone home, but we don't talk to them, at least not for the telemetry. In this situation it's not feasible to point a querier at the Prometheus, so we need something different. This is where the Thanos receiver comes in. The Thanos receiver is a component which exposes a remote-write-compatible API. Then it's very simple: we point Prometheus at this new receive component and tell it to remote-write the metrics there (I'll show a sketch of that config in a second). Prometheus keeps doing its thing, scraping the data and storing it in its local TSDB, but we no longer really care about that local TSDB, because we get the metrics via remote write. There is a component called the routing receiver, which is responsible for validating the metrics coming in, just verifying that everything looks okay before sending them on to the ingesting receiver. And that's the component which actually writes those metrics into TSDB format. So the ingesting receiver writes the metrics into a local TSDB, which it also stores. It exposes a Store API, which means we can query those metrics as they come in, which is another very powerful thing: we can see the metrics as they arrive from the external Prometheus. And it also uploads those blocks into blob storage. So we have a similar model for expanding to multiple clusters or multiple hosts: we basically point all of the Prometheus instances at one endpoint, and we get, again, the global query view and the long-term retention, just achieved in a different way.

In our Thanos cluster we're ingesting up to 200 million active time series at any given time. This requires 90 CPUs and almost a terabyte of memory, and the receive traffic into the namespace is about 450 megabytes per second, so almost half a gigabyte of data coming in per second, with about a quarter of a gigabyte per second going out.
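For reference, pointing Prometheus at a receiver is a one-stanza change. A minimal sketch, where the hostname and tenant name are made up; the port, path and header name are the receiver's documented defaults:

```yaml
# prometheus.yml -- minimal remote-write sketch.
# 19291 and /api/v1/receive are Thanos receive defaults;
# THANOS-TENANT is the default tenant header name.
remote_write:
  - url: "https://thanos-receive.example.com:19291/api/v1/receive"
    headers:
      THANOS-TENANT: "customer-edge-fleet"   # illustrative tenant ID
```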
So, I mean, it's not Google levels, but for our little company I think it's pretty respectable. Our setup, just to put it in the picture: we have our fleet of 10,000 edge devices, which are really Linux machines running Prometheus locally. We've also now started to move into the cloud, so we have Kubernetes clusters where we run customer workloads at multiple points of presence around the world, and these all need to ship their data home. We also have our central Kubernetes cluster, based in Switzerland, which is where we are, and we run an Istio public ingress gateway there. For collecting the central metrics, we just run Prometheus with a sidecar, because that's super easy; why wouldn't we? But for the customer-facing metrics, we need the push-based remote-write properties. The customer devices push their metrics in, they pass through an OTel collector where we do client certificate authentication, and then they're forwarded into different pipelines based on the tenancy. Tenancy is a topic we're going to cover here, but you can see the kinds of metrics we're collecting: proxy, firewall, wide area network, SD-WAN, these kinds of things. Things we really care about.

We want to make sure that we can scale to meet our workload: if we onboard a new tenant, or a tenant doubles or quadruples in size, our metrics backend needs to keep up. We also need it to be resilient. Outages are really bad; customers really complain if the graphs don't look good, which is to be expected, of course. We also want isolation between tenants, so that one guy can't ruin the party for everyone else; we have a basic quality of service for each tenant shipping metrics. And we really care about data availability. Obviously the long-term data should be available; it's in blob storage. But what I'm talking about here is the latest, freshest data. That has to be there, because we use it to calculate SLOs with recording rules, and if there are gaps in it, that can really be a problem.

Hopefully that sets the groundwork; now we'll dive in and get slightly more technical. Let's think about how we might deploy this Thanos receiver, and let's take a very naive approach: deploy it as a Kubernetes stateless Deployment. Is this a good idea? The answer is no. With the incoming metrics, what we need to do, of course, is load balance them across the Deployment, and we have to think about what strategy to use: round robin, random, maybe sticky sessions. Sticky sessions sound like a good idea, right? But each receiver maintains a local TSDB, and if you know a bit about the TSDB data structure, it's actually very bad to have the samples of one series scattered randomly across two different TSDBs. TSDB really likes consistent deltas between samples. So a plain load balancer doesn't work in this case, and this is why we have the routing receive component. Instead we take a stateful approach: we make these guys a StatefulSet, and we store the local TSDB persistently so that it survives restarts.
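To make the stateful approach concrete, here's a minimal sketch of an ingesting receiver as a StatefulSet with a persistent TSDB volume. Everything here is illustrative, not our production manifest: the names, image tag, storage size and objstore secret are all assumptions.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-receive-ingestor
spec:
  serviceName: thanos-receive-ingestor
  replicas: 3
  selector:
    matchLabels:
      app: thanos-receive-ingestor
  template:
    metadata:
      labels:
        app: thanos-receive-ingestor
    spec:
      containers:
        - name: thanos-receive
          image: quay.io/thanos/thanos:v0.32.0   # illustrative version
          args:
            - receive
            - --tsdb.path=/var/thanos/receive
            - --label=receive_replica="$(POD_NAME)"   # unique external label per replica
            - --objstore.config-file=/etc/thanos/objstore.yml
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          volumeMounts:
            - name: data
              mountPath: /var/thanos/receive
            - name: objstore
              mountPath: /etc/thanos
      volumes:
        - name: objstore
          secret:
            secretName: thanos-objstore   # assumed secret holding the bucket config
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi   # the local TSDB survives pod restarts
```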
And the routing receivers, we're going to organize into a hashring. Why would we want to do that? Let's say a label set comes in: CPU usage for a given host. What the routing receiver does is hash that label set and map it to a given receiver replica. This is a very good property, because now every time a sample comes in for that series, it will end up in the same TSDB. So this is good: it solves the problem of samples being scattered between TSDBs. If another metric comes in with a different host ID, maybe it gets mapped to a different receiver, maybe to the same one; it doesn't really matter. It's a way of ensuring that we can split the load between the receivers while keeping consistency within each TSDB instance.

Now, in this picture, how do we deal with scaling? If we add another replica to the hashring, how do we handle that nicely? And what happens if a receiver becomes unhealthy? We really need a system of hashring management, something that can manage the hashring for us. There are a few different ways to do it. One way, which is perfectly valid, is a static definition of the hashrings. And this is fine, right? You can have a Helm chart, define how many replicas you need, define how the hashring should be laid out, and deploy it. It just means that every time you want to change the hashring, or every time something changes, you need to do a new deployment. It's not so terrible. Another option is to use a controller. The controller watches the replicas in the receive StatefulSet and makes sure that if you change the number of replicas, or if one of the receivers becomes unhealthy, it can respond to that. And such a controller has been built. It's not part of the core Thanos project; it's part of the Observatorium project, but there are a few shared maintainers between the two. It's the thanos-receive-controller, and it's what we use in our production environment to manage the hashrings.

The way it works: we deploy the receive controller as a normal stateless Deployment. We have the routing receivers and the receive hashrings from before, and the first thing we do is label which hashring the receivers belong to; say, hashring=hashring-0, giving the hashring a name. Then we set up a very basic config file which just declares that we have a hashring called hashring-0, and we feed that into the receive controller; it's called base.json. The receive controller then generates the full hashring by looking up the endpoints of those receive pods. So the receive controller interacts with the Kubernetes API: get the pods with this label, map their endpoints into the hashring, and feed the result to the routing receivers. What happens when we scale? If a new receiver comes along, from horizontal pod autoscaling or because we changed the replica count, the receive controller recognizes this and updates the generated hashring file, and the routing receivers load it and start shipping metrics to the new endpoint.
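To give a feel for those two files, a sketch; the names, namespace and port are assumptions (10901 is the receivers' usual gRPC port in upstream examples). First the hand-written base file:

```json
[
  { "hashring": "hashring-0" }
]
```

From that, plus the pod labels, the controller generates something along these lines:

```json
[
  {
    "hashring": "hashring-0",
    "endpoints": [
      "thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901",
      "thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901",
      "thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901"
    ]
  }
]
```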
What happens if something becomes unhealthy? Again, there is a bit of crossover time, a bit of uncertainty, between when the Kubernetes API, or rather the kubelet, recognizes that the pod is unhealthy; it depends, of course, on how often your health checks run and how those are configured. But at some point that pod will be taken out of action, and the receive controller will recognize that the receiver should be removed from the hashring. So this is really nice: we have scalability, and we have resilience of the hashrings.

The key configuration, if you want to do this, and I'd say this is the most important thing to take away: there is a newer algorithm for consistent hashing in Thanos. It was introduced within the last year, I'd have to check; it's called the Ketama hashring. Previously, the receivers used a simple hashmod, where the only argument is the sharding factor, and if you've ever worked with that kind of hashing, you know that removing one instance from a hashmod scatters basically everything; there's no consistency as replicas are added to or removed from the pool. Consistent hashing solves this. We won't go into it fully here, but it makes the effect of adding or removing replicas from the hashring much less felt; it makes the ring much more stable. And on the receive controller itself, you should definitely run with two config options (I'll sketch the exact flags in a moment). The first is to only allow ready replicas. This says: if the pod is Running, that's not good enough, it also needs to be Ready. It sounds obvious, but we have to configure it. It means a receiver is only added to the hashring once it's fully ready, meaning its local TSDB has been completely created and spun up and it's ready to accept requests. The second is to allow dynamic scaling. In the default configuration, the receive controller does not update the hashring when things are taken out of it; if you enable this flag, and I think you have to be on one of the latest versions, it will also shrink the hashring dynamically. If you're at this point, then, let's say, you've reached the end of quite some internal work to get these hashrings stable. It was a lot of fun.

But things can still go wrong, right? And of course what happens is: you build a beautiful platform, and then you unleash users on it. Users keep sending metrics to our platform, and we don't really have much control over what comes in. So there's something we call the Perl hash incident internally. What happened is pretty simple: somehow a label value was set to a hash instead of the actual value referenced by the hash; someone forgot to dereference a pointer, essentially. This led to a big problem, because once that deployment was live, we suddenly had a very, very noisy neighbor within our hashring. This was basically a textbook cardinality explosion, and as we know, Prometheus, and Thanos by extension, does not like high-cardinality time series. So we very quickly ran into a failure cascade: one receiver keels over, its load gets split across the next two, but of course they're already keeling over too, and eventually you have a hashring which is completely unrecoverable. So one troublesome tenant can really lead to a full service outage. This is not ideal.
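Here are those flags as a sketch. The receive-side flag is documented upstream; the two controller flags are as I understand them from the thanos-receive-controller project, so verify them against the versions you actually run:

```
# On the routing receivers: pick the consistent-hashing algorithm
thanos receive --receive.hashrings-algorithm=ketama ...

# On the thanos-receive-controller deployment:
#   --allow-only-ready-replicas   add receivers to the ring only once Ready
#   --allow-dynamic-scaling       also drop unhealthy or removed receivers
```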
So how can we solve it? The obvious, or hopefully obvious, way to deal with it is to say: the troublesome tenant gets its own hashring. This is good, because when the troublesome tenant starts making trouble again, the outage is contained to its own hashring. We still get metrics from the 90% of good tenants, and for this guy down here we can go back to the service team and say: look, you're currently not ingesting metrics, you need to go and investigate. So that's good. This is called hard tenancy: you have physically separate infrastructure for handling the different metrics coming in. And actually creating it is pretty simple. All you need to do is create a second hashring in your JSON manifest and map tenants to the hashring they belong to (there's a sketch below). And down here you see we don't map any tenants, which says: if no hashring is found for a given tenant, send it to the default hashring. So that one is the soft-tenancy hashring, where everything gets mixed together, and the other is the dedicated hashring for the troublesome tenants. You may not want to juggle multiple hashrings, though; there is, of course, additional complexity there, because you have to manage multiple StatefulSets.

There is another option, called active series limiting. Again, this is a relatively recent feature of Thanos, I think since 0.28 or so. It allows us to look at the number of series coming in per tenant and limit dynamically when a tenant gets above a certain level. The basic idea: we know roughly how many series we can manage, based on the resources we've deployed for the receivers. Then we just need to query the current value of the head series metric. Each of the receivers, like Prometheus, exposes a metric for the number of series in the TSDB head, prometheus_tsdb_head_series; it pops up later. So we can count that live, and then we tell the routing receiver to limit a tenant whenever its number of active head series goes above a value we configure.

Let's see how it looks. Here's our hashring; here's our noisy tenant. We introduce a new meta-monitoring query which fetches those current head series values, and it's a query which looks exactly like this; this is exactly the query which is run. In this example you can see the SDP tenant has 28 million head series, the WAN tenant has 23 million, and bandwidth control has 9 million. Then we just need to configure how the routing receiver should limit those tenants. Of course, in this example we're in trouble, because SDP is only allowed 2,000 head series; we would do a lot of limiting on SDP. But this is where tweaking based on your operational knowledge comes in. This way we can really shrink the size of the metrics coming in. We also return a retriable error to the tenant trying to send too many metrics, so we can alert on that and go and investigate why the situation started happening.
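Both options are just configuration. As a sketch, assuming a tenant ID of "sdp" and made-up endpoint names, a hard-tenancy hashring file looks something like this; the entry without a "tenants" list is the catch-all default:

```json
[
  {
    "hashring": "troublesome",
    "tenants": ["sdp"],
    "endpoints": [
      "thanos-receive-trouble-0.thanos-receive-trouble.monitoring.svc.cluster.local:10901"
    ]
  },
  {
    "hashring": "default",
    "endpoints": [
      "thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901",
      "thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901"
    ]
  }
]
```

And active series limiting is driven by a limits file on the routing receivers. The shape below follows the upstream docs for recent versions; the URL and the numbers are illustrative:

```yaml
# Passed to the routing receiver via --receive.limits-config-file.
write:
  global:
    # Meta-monitoring: where to fetch current head-series counts,
    # and the per-tenant query to run.
    meta_monitoring_url: "http://prometheus-meta.monitoring.svc:9090"
    meta_monitoring_limit_query: "sum(prometheus_tsdb_head_series) by (tenant)"
  default:
    head_series_limit: 1000000   # tenants not listed below
  tenants:
    sdp:
      head_series_limit: 2000    # the deliberately low example from the slide
```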
And of course nothing stops us from taking a hybrid approach: you can have separate hashrings and also apply active series limiting per hashring if you want to. That's actually what we do in production, to keep each hashring healthy while also minimizing the effect of noisy neighbors.

Now, this is all well and good. We have scalability, we have resilience, and we have, let's say, happy customers. But there are still some issues. In our customer portal we have graphs which are fed from metrics, and one of those graphs, for example, is the availability of a given host. It's based on a recording rule which is constantly being evaluated on the data coming in. And when we have an outage it matters, because we have SLAs tied to those outages. If we report a lot of incorrect outages, that's not good for us; we have to go and justify to the customer that it wasn't really an outage, it was a metrics problem. So it was reported that there were some spurious outages, alongside some real ones. We went and looked into the data, and it turned out the rules were being evaluated but there were gaps: missed rule evaluations, periods when the data simply wasn't there, so the rules could not be evaluated.

So what's going on here? Here's our setup. Let's say the data required for a given rule is hashed to receiver one. The ruler picks up the data from receiver one, evaluates the rule, ships the result off to blob storage, and everything works fine. Now, if receiver one has an issue, that data can no longer be written, and the ruler can no longer evaluate; the rule evaluates against a missing metric. This gives us the gaps in the data which we saw. And yes, the receive controller is still doing its job, but there's this in-between period: the kubelet has to realize it needs to restart the pod, the controller has to take it out of the ring. It's not a perfect system; these things don't happen instantaneously, so it can happen that a rule is evaluated at the same moment a receiver is unavailable.
So we can actually make things better just by setting one config option: a replication factor. Specifically with the receivers, with the remote-write approach, we can replicate the data. This is what happens with a replication factor of three: the incoming time series still gets shipped to the same pod, but it also gets replicated to two other receivers, so we have three copies of the data. The ruler can then read from those three copies, and if one of the receivers goes down, that's okay: we still have two more copies, so the ruler can still proceed and evaluate its rules. It increases your failure tolerance, your tolerance to pods disappearing or pods being unresponsive.

You have to be careful, though, because with this replication comes the concept of quorum. For a given replication factor R there is a simple formula, Q = floor(R/2) + 1: the quorum is the replication factor divided by two, plus one. You've probably seen this before in other fields. With a replication factor of three we have a quorum of two, and what that means is that at least two receivers must acknowledge a write for that write to be considered successful; the routing receivers make sure at least two of the receivers succeeded. From that we can calculate the maximum unavailable receivers for a given replication factor: R minus Q, so one for a factor of three. You can use that to work out your fault tolerance and to plan your deployments, ensuring the minimum number of replicas is greater than or equal to the replication factor. So it's a very simple thing to do, just enable replication, but you should consider whether you really need it. In our case we really do, because the rules must be evaluated; we need a very high rate of successful rule evaluations. It doesn't actually increase your storage costs, because you can deduplicate the data later; we're not going to go into that today, there isn't enough time. I would love to do a whole talk just on the compactor, maybe next year. One thing you can do is use a PodDisruptionBudget which matches your max unavailable, just to make sure you're resilient in the face of nodes going down, cluster reboots, these kinds of things which happen especially with managed Kubernetes providers.
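Putting numbers on that, a sketch of the two pieces; the receive flag is the upstream one, and the labels match the StatefulSet sketch from earlier, so adjust to your own:

```yaml
# Receivers started with, e.g.:
#   thanos receive --receive.replication-factor=3 ...
#
# With R = 3, quorum Q = floor(3/2) + 1 = 2, so at most
# R - Q = 1 receiver may be voluntarily disrupted at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: thanos-receive-ingestor
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: thanos-receive-ingestor
```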
So everything was good. And of course, this is how it sometimes feels with Kubernetes and storage: especially with metrics, and we also do logs, the storage costs can just keep going up, so it really feels like this. And when your boss checks the Azure budget over breakfast, this can happen; this actually happened. We were asked, put lightly but firmly, to have a look at the cloud costs and bring them down. So we started to investigate. The costs didn't seem to correlate with pure storage, but they were coming from the storage account, and we traced it down to this component called the compactor. It turned out the cause of the storage cost increase was actually requests to the storage account: it wasn't pure storage, it was write requests, read requests, these kinds of things, and they were going up and they were pretty expensive. The compactor is the janitor of the blob storage, and it does a lot of housekeeping: it applies retention, which means it deletes old blocks, it does downsampling, and it does compaction of blocks as well.

But the key thing here is deletion: the compactor removes blocks which it deems no longer necessary. The reason a block is deemed no longer necessary could be that it's been marked for deletion, or that it's a partially uploaded block. A partially uploaded block is this: if a receiver has been uploading some data, a big amount of data, and gets interrupted during that process, you have an unfinished block sitting in blob storage. It usually gets resolved when the receiver reboots, because then it can finish uploading that data, but it goes into a new block ID, so you tend to have these partial blocks hanging around.

So I'm going to whiz through some logs; you don't have to read them all, and I've got three minutes, let's say. What usually happens: the compactor realizes we have a block it needs to remove, this is partial=1; it says it found the block, then it marked the block for deletion, and then it deleted the block, and we see this happy path. The puzzle was why the number of partial blocks kept going up and up and up; the amount of garbage in the block storage kept increasing. It looked like the compactor was marking these blocks for deletion, but they were not being deleted. So what was going on? When you go to the block storage, you see this nice directory view; it looks like you've got files and directories and you can sift through them. This is actually a lie. It's not real; the cloud providers do it to make you feel at home. Blob storage is a flat namespace, which means everything is a file and there are no directories, at least that's how it should be, that's how we expected it to be. It's really an illusion: if you had files named like this on your local machine, it would look like a hierarchical namespace, but underneath it's all flat. Now, our cloud provider, Azure, actually has a hierarchical namespace feature which can be enabled for blob storage, and it turned out it had been enabled by default by our own automation; our Terraform pipelines were causing us the issue (there's a sketch of the Terraform side below). So when we went to these partially uploaded blocks, they just looked like empty directories, and this didn't make sense, because that shouldn't exist in flat blob storage, right? What was happening: these were ghosts left behind by Thanos deleting blocks. The hierarchical namespace left behind empty directories, and of course Thanos has no way to delete those. It was really strange, really this situation: just empty directories sitting in blob storage. The picture kind of sums it up: the cloud was charging us for delete requests to completely non-existent blocks. This was literally thousands of delete requests every five minutes, and it just got worse and worse. The take-home message for the blob storage: if you're using Thanos, make sure you understand which settings have been enabled, because fixing this led to something like an 86% reduction in our storage account costs, just from transactions. It was crazy.
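For what it's worth, in Terraform's azurerm provider the setting in question corresponds, as far as I know, to a single attribute on the storage account; the names here are made up:

```hcl
# Sketch: storage account for Thanos blocks. The key line is
# is_hns_enabled: the hierarchical namespace must stay off for
# plain, flat blob storage.
resource "azurerm_storage_account" "thanos" {
  name                     = "thanosmetrics"   # illustrative
  resource_group_name      = azurerm_resource_group.monitoring.name
  location                 = "switzerlandnorth"
  account_tier             = "Standard"
  account_replication_type = "LRS"
  is_hns_enabled           = false   # the provider default; our automation had flipped it on
}
```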
So, summing up, there are four minutes left for questions. Thank you very much; this has been the Thanos write path, with some fun deep dives into cloud provider features. That's my talk, thank you.

With multiple Thanos receivers writing data into blob storage, is there a way to choose to write from only one of the receivers instead of all of them? Because when we're doing replication, most of the work done by the compactor is just deduplicating whatever was replicated, right? So how can we avoid that?

There's no way to say, okay, this receiver will write to blob storage and this receiver won't, and I don't think that would work anyway, because you can't really govern how the series are hashed between the receivers. You can't guarantee that one receiver will always hold a given series, because as replicas move in and out, the consistent hashing updates. So for the compactor, I think there's no way around it.

Two questions. First one: do you clean your data before you send it to the receiver? Clean the data? Clean the data before you send it to the receiver, like a recording rule, so you only ship scrubbed data. We don't do any cleaning, but the data does go through a collector pipeline where we stamp it with metadata, these kinds of things. So you basically send all the raw data from the Prometheus to the receiver. Have you ever considered using Prometheus federation? What's the difference between federated Prometheus and Thanos? Okay, so the reason we do it with Thanos is that we want to collect the data from all these different Prometheus instances; or do you mean actually using a central Prometheus instead of Thanos? Prometheus federation also has the ability to scrape from a different cluster, a little bit similar to how Thanos can receive from a different cluster, so I was wondering what the difference is between the two and why you chose Thanos. Okay, so one of the reasons for using Thanos is the long-term retention. For long-term retention of Prometheus data you really must use Thanos, just because if you need to go beyond a few months, it's not feasible to query that data from Prometheus. Beyond that, I'm not sure; I'd have to think a bit longer about just using plain Prometheus to receive the data. Okay, cool, thanks.

So what would you consider truly high cardinality? I saw two million, which seems obscene as far as cardinality goes, but if we were looking to track cardinality explosions, what would you say is a good starting point? I mean, it depends how much you want to scale; it really depends on your use case. For example, and it's not related to Thanos, but we also run Loki, and Loki again has this cardinality issue with chunks and streams. For Loki we hit almost a million streams, which for Loki is pretty huge, but for our use case it was absolutely necessary to slice by application as well as by host, so we needed cardinality horizontally across hosts and per host. It's a hard question to answer because it will depend on your system. Yeah, that's true; that sums it up. And what would you consider the bottleneck: when you're adding more and more tenants, what's the first component in Thanos you have to think about? The receivers, yeah. You need to be really careful with them; if they hit their memory limits, you can really get into this cascading failure situation. So we have alerts on, for example, 70% of the maximum head series. So it's the unique time series you're tracking? Exactly, yeah. No worries.

Hey, hi, I have a question regarding cardinality limiting. You said there is a component which queries the head series and then decides whether the cardinality is exceeded or not.
So does it mean that if the cardinality is exceeded, the whole data stream is blocked? Yes, effectively that's how you can introduce a cardinality limit, because each individual label set would be a separate time series; that's exactly how you do it. Okay, and is there a way to block only the new time series? For example, we have one million series for this tenant, and then it exceeds the limit by adding a thousand new series; do you reject only those thousand? I don't think so, because it's basically about what's in the current head block, within the two hours. We usually see this sort of ramping up and then a cut: once you reach the threshold while ramping up, everything after that gets cut, and you can't really distinguish which series get cut or not; it's purely about the number. Good, thank you very much.