Hello, I hope everyone is having a great time at Prometheus Day this year. This is Ben Ye, and I'm an SRE at ByteDance. Today's topic is troubleshooting compactor backlog with ease. Let's get started.

First, let me introduce what the Thanos Compactor is. The Thanos Compactor compacts blocks on object storage in order to improve query performance. It also handles block downsampling and data retention. From the implementation perspective, the compactor is essentially a cron job: it runs, for example, every five minutes, and each round is called an iteration. In each iteration, the compactor performs three tasks in order, which means that if there is too much compaction work to finish, it cannot start downsampling and retention. So the backlog usually happens in phase one, the compaction phase.

Why does this happen? We can think about it as a message queue scenario. Here, the Thanos Compactor is the message queue consumer, and the producers are the Thanos sidecars, rulers, and receivers that upload blocks to object storage. In this case, the object storage is the message queue. If we scale out the producer side but don't scale the consumer side, much more data gets uploaded to the object storage than the compactor can keep up with. It falls behind, and eventually a backlog happens.

The key thing is to identify the backlog issue, and there are several ways to do that. First, the compactor itself exposes some very useful metrics. Two of them tell us the number of iterations and downsamplings performed. If these two counters remain at the same value, or increase only slowly, a backlog might be happening. And if you don't see any retention applied to very old blocks, the compactor might be too busy compacting blocks to ever reach the retention step.
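To turn those counters into an actionable signal, a minimal Prometheus alerting-rule sketch could look like the following. It assumes the metric name `thanos_compact_iterations_total`, which the compactor exposes in recent releases; the alert name, thresholds, and time windows are hypothetical examples, so verify the metric names and tune the windows for your own deployment.

```yaml
groups:
  - name: thanos-compactor
    rules:
      # Fires when the compactor has not completed a single iteration for
      # a long time, a hint that compaction work is piling up (backlog).
      - alert: ThanosCompactorIterationsStalled
        expr: increase(thanos_compact_iterations_total[6h]) == 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Thanos compactor completed no iteration in 6h; possible backlog."
```

The same pattern can be applied to the downsampling counter, alerting when it stops increasing over a window that comfortably exceeds one normal iteration.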
And the last point might not be that obvious, but if your compactor has a backlog, the performance of some long-range queries may be degraded.

Another way to identify the backlog is to use the progress metrics. Since the Thanos v0.24 release, four new metrics have been introduced, and they are very good signals for telling whether your compactor has hit a backlog or not: they represent the compaction progress. Please do give them a try; they are very useful in your alerts as well.

Next, let's talk about solutions for the backlog. To solve the backlog problem, we definitely want to scale the compactors. The easiest way is to scale vertically: we can add more compute resources to the compactor instances. Another option is to increase the compaction concurrency. The Thanos compactor provides two flags for this, one for compaction concurrency and one for downsampling concurrency. We can tune these flags to make a single compactor instance more powerful.

The other direction is to scale horizontally, and there are two ways to do that. One way is to shard by time. For example, we can run two compactors, where one takes care of blocks produced last week and the other takes care of blocks produced last month. In this way, we distribute blocks to different compactors by time. The other way is to shard blocks by their external labels, so that blocks from the same clusters are grouped together on the same compactor. In this way, we achieve the same goal and successfully distribute blocks across different compactor instances.

I think that's all for today's session, and I hope you enjoyed it. Thank you.
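As a footnote to the label-based sharding discussed above, here is a sketch of what a shard selector might look like. It relies on the compactor's relabel-based block selector (the `--selector.relabel-config-file` flag in recent Thanos releases); the external label name `cluster` and the regex are hypothetical examples to adapt to your own labels.

```yaml
# Hypothetical shard: this compactor instance only handles blocks whose
# external label "cluster" matches prod-us-.*; a second compactor would
# use a complementary regex so every block is owned by exactly one instance.
# Passed to the compactor, e.g.: --selector.relabel-config-file=shard.yaml
- action: keep
  source_labels: [cluster]
  regex: prod-us-.*
```

For time-based sharding, check whether your Thanos version supports time-range filtering on the compactor; otherwise the same relabel mechanism can shard on any external label that partitions your blocks.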