Thank you very much. Can you hear me? Is it good? It's a real pleasure to be talking at the first ThanosCon. I was super excited when I saw that there was a ThanosCon at QCon. I was like, I have to be there. So this talk is called Receiver Deep Dive. I didn't really put too much thought into the name when I submitted the talk, as you can maybe tell.

So, a bit about me. I'm a systems engineer at Open Systems. The company offers, let's say, managed connectivity. So we do a lot of firewall, proxy, these kinds of services. And what's interesting about us, I think, is that we have hosts, or customer hosts, all over the world. So we deploy physical devices. And recently we started to get into the cloud and experience all the fun of merging legacy with new stuff. So this is quite an exciting time, I would say, at Open Systems in general. And we're big users of Thanos. Thanos is the new metrics backend which we're migrating to. We've been using it for about two, two and a half years now, maybe three years.

So a bit of historical context, let's say. This is me back in November in Chicago, giving another talk about Thanos Receive. It was a fun talk. At the time, I thought I understood the Thanos receiver. Now I'm not so sure. Then I had some nice time in Chicago, saw the bean, flew home, and about two weeks later, Thanos blew up. And it was caused by an unstable receive, let's say a runaway cascade of receive failures. And what was interesting is that the ThanosCon submission deadline was on the second of December, so this was about four days apart. It was very much on my mind. So I was like, okay, we're going to dig into this. This is maybe conference-driven debugging, if you like. So this is the story of that incident.

You probably know this graph. This is showing writes to Thanos. The green is good, the blue is, eh, and the red is bad. There's not much red except up here. We'll talk about what that is. There's a lot of blue. We'll talk about what that is. And then the green comes back, and we'll talk about why, or why not, that happened.

I don't think I need to spend too much time on this bit. These are the introductory bits, to make sure we're all up to speed on what we're talking about. The Thanos receive component was designed and introduced, like Saswata said, basically as a PoC in prod. It allows us to remote write metrics to Thanos. And this was not the original vision for Thanos, let's say. The sidecar was the original vision for Thanos. But we wouldn't be able to work with Thanos if we didn't have this component. For us it's very, very important, because we literally remote write 10,000 batches of metrics every 30 seconds, from 10,000 hosts around the world. This is how we are able to use and to scale our metrics.

So, we know that the receives form a hash ring. Label sets come in, they get hashed to certain receivers. Different label sets get hashed to different receives. So we have this hash ring, which is basically mapping which series get mapped to which receive replica. And the question is: what happens when things go wrong? What happens when receivers become unhealthy? How can we deal with this? Basically, we have a receive controller. This is how we work with Kubernetes: we have an operator which is looking at our receives, monitoring their health, and maintaining the hashring config for them. So quickly, how this works.
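Just to make the hash ring part concrete: the hashring configuration is a small JSON file mapping tenants to receive endpoints. This is a simplified sketch; the shape follows the Thanos receive hashring file format, but the tenant names and endpoints here are made up.

  # hashrings.json (illustrative sketch; tenant names and endpoints are made up)
  [
    {
      "hashring": "firewall",
      "tenants": ["firewall"],
      "endpoints": [
        "thanos-receive-firewall-0.thanos-receive-firewall:10901",
        "thanos-receive-firewall-1.thanos-receive-firewall:10901",
        "thanos-receive-firewall-2.thanos-receive-firewall:10901"
      ]
    },
    {
      "hashring": "probing",
      "tenants": ["probing"],
      "endpoints": [
        "thanos-receive-probing-0.thanos-receive-probing:10901",
        "thanos-receive-probing-1.thanos-receive-probing:10901",
        "thanos-receive-probing-2.thanos-receive-probing:10901"
      ]
    }
  ]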
This is the receive controller from Observatorium. We tell it to look for a hashring label on our receivers. It will then go and find the endpoints which belong to that label. It builds a hashring config, and we feed that into the routing receivers. The routing receivers will then distribute across that hash ring. When we scale, that config gets updated by the operator. That's all good. When something goes wrong, if we have dynamic scaling enabled in the receive controller, it will remove the component from the hash ring.

Now, what happens when we actually want to read the data? So we have multiple replicas, and we have a component, maybe the ruler, that wants to read that data. That's important because we may be alerting from this guy. If the receive goes down, the ruler can't get the data, right? That's a problem, because then our alerts are going to be screwed up. So how can we deal with that? We can enable replication. What we do is basically replicate writes to multiple receive components. When one of them goes down, as long as we have a quorum, so as long as enough of the receives agree that a write was successful, the ruler can still continue to function. So this is all good. This is how the quorum is calculated in Thanos. It's maybe a bit controversial; we can talk about that later. But currently, this is the way. So currently, replication factor two gives us a quorum of two. There's a little hint there as to what went wrong.

And we had a great talk just now about multi-tenancy. We also implement a multi-tenancy model at Open Systems. The reason is that, in the past, we've had problems with service teams, who are our actual tenants, maybe pushing some metric which tries to label a request IP and a domain and a user. Just a cardinality explosion, which can cause a failure cascade in our receives. So one tenant can take out many. This is the problem that hard tenancy aims to solve. So we have two specific pipelines for our hard tenants.

Okay, this is the summary. Receivers let us scale remote write. Replication makes the receivers more resilient to failures. And hard tenancy gives us isolated hash rings. In principle, everything should be fully isolated.

Okay, now we can dive into the incident. It lasted about four hours. Like I say, it was at the end of November last year. We were puzzled. We were absolutely puzzled during this thing. We had no idea what was going on. The resolution was just time and luck. Everything resolved, and we sort of worked out what happened afterwards. That was lucky.

Let me just show you how we are configured. Up here we have our metric sources. We have the edge devices, 10,000 of them all around the world, and we have some Kubernetes clusters which are just running Prometheus. Okay, they're all remote writing. But actually, on the hosts we have Telegraf, which is the agent from the InfluxDB world, and that agent is doing the remote writing. That's maybe a legacy thing. Then centrally we have an Istio ingress, and this is routing traffic, based on a tenant header, to the different hard tenancy pipelines. And over here we have a querier, a global querier. We plug that querier into all of the tenants, and then we can query stuff as we like, as if there were no tenants.

And this is the start of the incident. We receive a report, basically at 9 o'clock: hey, some of my metrics are missing.
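Before going into the incident, to give a feel for the write side of that configuration: the remote write config for the Prometheus instances on the Kubernetes clusters looks roughly like this. This is an illustrative sketch only; the URL, the tenant header name, and the tenant value are assumptions for the example, not our actual config.

  # prometheus.yml (illustrative; URL, header name and tenant value are made up)
  remote_write:
    - url: https://metrics-ingress.example.com/api/v1/receive
      headers:
        THANOS-TENANT: central-infrastructure   # the Istio ingress routes on this header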
The effect of that report was that alerts started to do weird stuff. They started to refire. Just weird stuff started to happen. So some metrics were getting through, but not all of them. We dug in and we see this. And this is not good. These are the receivers. And it's very strange, because we can see that they've all suffered a sort of catastrophic failure at the same time, multiple restarts. And it's across tenants, right? We've got different tenants here: we've got firewall, we've got central infrastructure, we've got bandwidth control. Multiple different tenants all kicked out at the same time. They were OOM-killed, of course. So what the hell happened? That's really weird. And that's also really concerning, because we were feeling secure that we had a hard tenancy pipeline. That shouldn't happen.

So what happened? The clue was in these metrics here. This is just the resource metrics. One of the queriers blew up. The querier spiked. We dug in and we find this series query hiding in there. This is the monster query. So what happened, basically, is the monster query came along. It didn't work, so they tried a few more times. And then the receives all died, across tenants, kaput. So yeah, that was surprising. I was like, yeah, okay. That made me question the whole hard tenancy thing, but let's go ahead.

And then Thanos got itself into a real mess. Like, this is the mess that Thanos got itself into. This is a Loki query for error logs from Thanos. And we see this kind of strange structure, a periodicity, but it's just not ending, right? Nothing's resolving. And what these errors actually are is out-of-bounds errors. So they're returning 409s, which is normally a non-retriable error. Okay, so these things should be being dropped, but they're not. And it's just not resolving. We were just observing and watching and trying things, and it just wouldn't go away. This was really, really strange. Just checking the time, I know it's lunch next, so I don't want to keep you.

Okay, so we dug in a little bit, and it turned out that actually the Telegraf components had the problem. Prometheus remote write, that was going fine; that's this curve on the graph here. Prometheus was okay, but Telegraf, those pipelines, all those tenants, they had issues. And all this blue is actually 409 errors. At the top, there's a tiny little amount of 500s, just a tiny amount. And you can also see some other interesting stuff here. The red on the left-hand side, this is the initial monster query that killed the receives for everyone. So there was some outage there, and we got some 500s, unable to push the request. And then an hour later, the Telegraf components started to have issues. Very weird.

So let's look a bit deeper at what's going on on the hosts. We have actually a very weird setup on the hosts, and again, this is all part of the legacy migration. This talk is messy, right? This is the real world. This is actual production, where things are not tied up in a nice bow. This is how we have to actually do things. So we have a Prometheus running on our hosts, which remote writes to Telegraf, which then remote writes to Thanos. The reason for that is that we actually still have InfluxDB in production. So we basically told Telegraf: hey, we've got this new backend called Thanos, can you also ship the metrics there, but keep shipping them to Influx? We still need you.
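Roughly, the Telegraf side of that looks like the sketch below. This is illustrative only: the plugin names, ports and URLs are my best guess at a typical setup of this shape (Prometheus remote writes into a Telegraf listener, and Telegraf fans out to both InfluxDB and the Thanos receive endpoint), not our exact config.

  # Illustrative Telegraf config; plugin names, ports and URLs are assumptions.

  # Prometheus on the host remote writes into Telegraf.
  [[inputs.http_listener_v2]]
    service_address = ":1234"
    paths = ["/receive"]
    data_format = "prometheusremotewrite"

  # Keep shipping to InfluxDB, we still need it.
  [[outputs.influxdb]]
    urls = ["http://influxdb.example.internal:8086"]

  # ...and also remote write to the Thanos hard tenancy pipeline.
  [[outputs.http]]
    url = "https://metrics-ingress.example.com/api/v1/receive"
    data_format = "prometheusremotewrite"
    [outputs.http.headers]
      THANOS-TENANT = "firewall"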
Telegraf also has an important function here, because it means that we have different metric buffers for each of these hard tenants. So Telegraf is actually doing an important thing: it's kind of a buffer between Prometheus and the hard tenancy pipelines. If one of the pipelines goes down, it will back up in Telegraf, but in principle Prometheus shouldn't be affected.

So, the Telegraf config looked okay. We were not retrying 409s, so Telegraf should have been handling the data okay. But there was some other interesting stuff lurking in the logs. We found this. What's happening here is that we're getting out-of-bounds errors from the receives. That should be a 409, but we get two of them in the same response, and so the overall error is a 500. That's weird. And actually, sorry, I put 429 on the slide, it's actually 409; that's a mistake on my part. It should be a 409. So 409 plus 409 gives us a 500. There's actually an explanation for that. I'm not going to go into it; there are some slides if you want to see. But it's because we were using replication factor two at the time, and in that case there's this strange condition where you cannot reach quorum, so the error is basically always a 500, regardless of what the underlying error is. But this was actually a red herring.

Does this make sense? This was the hypothesis, then: Telegraf is shipping metrics out of bounds. Thanos responds with a 409, but it's a nested 409, which gets interpreted as a retriable 500 by Telegraf, and the cycle continues. And that's what gives us this, right? Makes sense. Complete bull. It's not the case. Because actually, this is the response code from the receiver, right, the routing receiver. The routing receiver is doing its job. It's actually responding with 409. And in the Telegraf logs, most of the entries we saw were 409s. So Telegraf really was dropping the metrics. There were, like I say, some hints of 500s. They were there in the logs, it was happening, but it wasn't the primary cause of this incident. So, back to the drawing board.

Where are we? Yeah, this is what we tried. I mean, as all good Kubernetes operators, what do we do? We switch it off and back on again. We hope and pray that that's going to fix it. It didn't fix it. We tried restarting Prometheus on the hosts. That didn't fix it. We tried restarting Telegraf on the hosts. That didn't fix it. We got really desperate and started deleting data in the Thanos receive volumes. That didn't fix it. The metrics just came back, all out of bounds, all kind of unexplainable. And while we were doing this, the incident resolved itself. So great, we can go home. Problem solved, right? No.

What's going on here? I mean, it's a really weird case. This is, again, the monster query; an hour later, the 409 loop; four hours later, resolution. That's what we're here to discuss. Does anyone have an idea? Please come find me at the end.

So we, of course, go to GitHub and look into the issues. There is this open issue, still open, with kind of similar behavior. But then we looked, and this is us on top, and these are the graphs we get in the issue, and it's really not the same signature. In the issue, what's actually going on is that some of the forward requests that Thanos fans out are timing out. Basically, the appender cannot keep up with the incoming requests. It starts to fail, so receive starts to time out, that causes more requests to back up, and so on and so forth.
Then you get into another kind of loop. That's a completely different failure mode, so I don't think this is what was going on here. Actually, all of our requests were very, very quick on the receives. The receives were fine. It was something else. Why did it take an hour for this 409 infinite loop to kick in? I'm going to disappoint you: I don't have an answer for that today. I have a hypothesis, and I want to see if you have an idea. But that's really weird. That's super weird.

And another thing: this is from a tenant which was not affected by the receiver outage. Here, the tenant was fine at 9 o'clock. The receivers were fine at 9 o'clock. But it still got screwed up at 10 o'clock with the 409 loop. And what was going on here, this was actually the probing tenant. This is the tenant which is responsible for our connection-down alerts. So the SRE team was like, oh, we've got to fix this. They were restarting everything and trying to fix it, but it didn't work.

So, a hypothesis. Again, let's revisit this weird pipeline we have on the host. We have a Prometheus with a retention of four hours, which, again, is the duration of the incident. We have a head block, we have the chunks which are getting written and memory-mapped, and then blocks are cut to persistent disk. So that's all going on. Now at 9 o'clock, well, let's say at 8:50, this is the good scenario: everything is working normally. At 9 o'clock, a lot of stuff happens. The receives go down. The monster query comes in, kills the receives. That stops the remote write from Telegraf. Telegraf gets backed up. Telegraf can't handle any more metrics. Telegraf stops accepting remote write from Prometheus. And at the same time, there's some block cut over here, right? I think. So a lot of stuff happened at 9 o'clock. This is when the blocks get cut.

Now at 10 o'clock, I think Prometheus started to send old data to Telegraf again. So everything was healthy upstream, but Prometheus somehow started to send old data again. And I think this is the only explanation, because all of the tenants were affected. It wasn't just the tenants who had their receives knocked out; it was everyone. And I don't know why that would happen. That's something which I'd like to discuss. That would then kill remote write, because the 409s would cascade back down, and we wouldn't be able to ship anything, and we'd have this kind of infinite loop. And then, as we continue, at 1 o'clock the block falls out of retention, the data goes away, Prometheus just gives up trying to remote write, and everything goes back to normal.

Okay, I don't know if that's the case. We still need to investigate this more on our side. To me, it seems like that's the suspicious part of our pipeline, this Prometheus to Telegraf remote write part. We still need to think about what we can do about that. Maybe a solution would be to actually have one Prometheus per tenant, and then we have a fully separated pipeline for everything. But that's getting very, very complicated. We need to try and recreate this, let's say.

Now, what mitigation steps could we take? The first thing we wanted to do is make sure that this monster query can never get through again. So the monster query had to go. That's actually really easy. There's this really cool config option, store.limits, and that basically says to the receiver: don't try to fetch more than this number of series, otherwise you're going to hurt yourself.
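In flag form, that looks roughly like the lines below on the receivers. Treat this as a sketch to check against the Thanos docs for your version; the flag names are from memory and the numbers are made up, not a recommendation.

  # Illustrative; verify the flag names against your Thanos version, numbers are made up.
  thanos receive \
    --store.limits.request-series=100000 \
    --store.limits.request-samples=50000000 \
    ...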
So that was a very easy option to enable. But then the SRE team started to complain that their rules were failing, because they couldn't hurt Thanos anymore. So you have to find a balance between your users and the system that you're running. That's just one thing I would say when you implement this: make sure that the users are happy and the system is happy. But that works really well for us. Since then, we've had no kind of issues with the receives. It's been very, very good.

Another thing: we previously ran with dynamic scaling. Now, this is maybe more specific to our use case, but usually receives go away one at a time, usually when we roll out. And what happens when you lose a receive with dynamic scaling enabled is that it actually gets dropped from the hash ring. And that means that the receivers kind of redistribute all the metrics, right? Even though it's still a consistent hash ring, there are some metrics, some samples, which will go to different receives. And when the receiver comes back, it enters back into the hash ring. What we found is that it's more stable if you don't have dynamic scaling, and you actually run with a replication factor of three; then you can handle the loss of a receive. So then you don't have samples spread across all the TSDBs. And this ends up being more stable for our situation. But we're kind of lucky in that we can talk to the service teams. We can say, hey, we noticed there's a spike in your metrics, do we need to scale up? Is that going to be a permanent thing? Obviously, your mileage may vary.

Okay, where are we? We're getting to the end. For good measure: now, I didn't really talk about out-of-order writes, but this is a feature which I didn't know existed. I think it's actually a hidden flag. So this is a very useful feature which will enable out-of-order writes for incoming samples. It gives you some sort of window in which you can actually ship samples out of order. That's what it does, right? This is useful, especially as we're using Telegraf, and Telegraf doesn't really have any guarantees, I think, on the order of the metrics which are ingested. Or the order of the samples, sorry.

So yeah, I mean, this is wrapping up. I think the main takeaway is that building a resilient multi-tenanted pipeline is quite hard. I was confident back in November. I'm not so confident anymore. There's a lot to learn. That's a good thing, right? We learn from incidents. We can do our best to protect the receives. The receives must be protected. At the same time, I think what would be interesting is to revisit the bubble-up approach for errors. This is how the receives determine which error to return when you have replication enabled and you have multiple writes. There's maybe some work which could be done there to enumerate all the different possibilities of errors, rather than relying on some kind of bubble-up logic. That would be very interesting to dig into. And as we saw in the previous talk, I think it would be quite exciting to experiment with the new tenanted queries, because maybe then we can protect ourselves a little bit better. Anyway, thank you. That's been the incident. Let's go to lunch.

Thanks, Joel, for that amazing incident debugging. Any questions? Or answers. Answers would be appreciated. Yes, that too.

Did you say that you went away from replication factor two? Now we're on replication factor three.
Do you think replication factor of two should not even be allowed? Yeah. I had a slide that said that, and then I had to cut it. I think that we should solve that issue, definitely. That issue is still open. To give some background context: in the proposal for the receives, replication factor was discussed, and there a formula was given that would make replication factor two a kind of valid replication factor. So for replication factor two, you would have a quorum of one. Now, that's actually a bit different, I think, from how normal quorums work. But for Thanos, maybe it kind of makes sense that replication factor two should have a quorum of one. It would reduce costs. It would be sort of a reasonable approach, maybe. I think we can definitely revisit the issue. We should. Any other questions? Yeah, if not, maybe we can start with the closing email.
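To spell out the arithmetic behind that answer: the first line below is the quorum calculation consistent with what was described earlier in the talk, and the second is one way to write the proposal-style calculation recalled in the answer. Take both as a sketch rather than a quote from either the code or the proposal.

  current behaviour described in the talk:  quorum = floor(rf / 2) + 1
      rf = 2  ->  quorum = 2   (both replicas must ack, so losing one breaks writes)
      rf = 3  ->  quorum = 2   (one replica can be lost)

  proposal-style calculation (as recalled): quorum = floor((rf + 1) / 2)
      rf = 2  ->  quorum = 1
      rf = 3  ->  quorum = 2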