Hello everyone. My name is Ben, and today my talk is about TSDB compaction. A quick introduction about myself: I'm an SRE at ByteDance, working on Kubernetes and observability, and I'm a Thanos maintainer and also a Prometheus contributor.

First of all, I want to give big thanks to Ganesh for his great TSDB blog series. If you are interested in TSDB and want to learn more about compaction, please definitely check out his blog posts. In his blog post, he defines compaction as the process of creating one new block from one or more source blocks. That is a high-level overview of TSDB compaction, and in this talk, let's dive deep into it.

Because the definition mentions a TSDB block, I will first introduce what a TSDB block is. A TSDB block is created by writing the in-memory head block to disk every two hours, so by default there is no time overlap: different blocks cover different time ranges. You can think of this as a way to organize your time series data, partitioned by time. For example, when a query comes in, it only touches the blocks within the requested time range, so if you query the most recent data, it only hits the most recent block.

Each TSDB block is actually a small database by itself. It contains four parts: a JSON file to store some metadata, an index file to help you look up series, a chunks directory to store your data samples, and finally a tombstones file.

Let's start with the metadata file. It includes some block-level metadata like the time range, some TSDB stats like the number of samples, series and chunks, and some additional information about compaction. In other systems like Thanos, the metadata file is extended to include information such as the external labels or the downsampling resolution of that block.

Next, let's look at what the index and chunks are. We all know that in TSDB the smallest unit is a sample, which is just a pair of timestamp and value. Imagine we map all these samples onto a two-dimensional graph, where the x-axis is time and the y-axis is series. Then the graph would look like this, and you can see that each row represents one series. A series includes two parts. On the right-hand side we have a label set, which is just a combination of label pairs used to uniquely identify one series. For simplicity, you can think of all these series labels on the right-hand side as being stored on disk in a special format; this is called the index file, and it is used to look up your series faster at query time. On the left-hand side there are the data samples, and each row of samples belongs to one series. To store these samples on disk, they are encoded in a special format to save disk space; this is what we call chunks.

When we run a query, it first tries to match the series from the index file, and for each matched series it fetches the corresponding chunks and finally returns the results to the user.

So we have covered the index and chunks. I want to skip the tombstones file for now and talk about compaction first. In this diagram, we prepare two blocks to compact together. Block A and B are created from the head block, so they don't have any overlapping data by default. As you can see, they have some common series, like job equals http-server.
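To make the block structure just described a bit more concrete, here is a minimal Go sketch of the on-disk layout and the kind of fields meta.json carries; the layout comment reflects the four parts above, while the field and JSON names are illustrative rather than the exact Prometheus or Thanos structs.

```go
package main

import "fmt"

// Rough sketch of a block's layout and metadata, as described above.
//
// <block-ulid>/
//   meta.json      block-level metadata
//   index          series labels, for lookups
//   chunks/        encoded samples (000001, 000002, ...)
//   tombstones     pending deletions
type BlockMeta struct {
	ULID    string `json:"ulid"`
	MinTime int64  `json:"minTime"` // time range covered by the block (ms)
	MaxTime int64  `json:"maxTime"`

	Stats struct {
		NumSamples uint64 `json:"numSamples"`
		NumSeries  uint64 `json:"numSeries"`
		NumChunks  uint64 `json:"numChunks"`
	} `json:"stats"`

	Compaction struct {
		Level   int      `json:"level"`   // how many compaction rounds produced this block
		Sources []string `json:"sources"` // ULIDs of the original two-hour blocks
	} `json:"compaction"`

	// Systems like Thanos extend the file, e.g. with external labels and the
	// downsampling resolution of the block.
	ExternalLabels map[string]string `json:"labels,omitempty"`
	Resolution     int64             `json:"resolution,omitempty"`
}

func main() {
	var m BlockMeta
	m.MinTime, m.MaxTime = 1_600_000_000_000, 1_600_007_200_000 // a two-hour block
	fmt.Println("block covers", (m.MaxTime-m.MinTime)/1000/3600, "hours")
}
```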
Back to our two example blocks: besides the common series, they will also have some different series, like this one from a Kubernetes environment. We all know Kubernetes environments are very dynamic, so pods may be created and deleted at different times, and it's very possible that some series exist in block A but not in block B.

After this setup, let's start the compaction. The first step in the compaction phase is to merge the index files into one larger index file. You can see we have two index files from A and B, and we merge them into a larger one. During this process, the common series labels, such as the labels from the http-server and node-exporter jobs, are merged and deduplicated, because we don't want to keep two copies of them. The different series, like the Kubernetes metrics, are simply sorted and combined.

After the index merging is done, we want to merge the chunks. Let's switch to the two-dimensional graph again. What we do is iterate over all the series from the newly created index file on the right-hand side, and for each series we fetch the chunks from the different blocks and merge them together. The process looks like this: because the two blocks don't have any overlapping data, we just need to combine the chunks horizontally. For the series that only exist in block A, block B doesn't have that data, so there is nothing to merge. Finally, the result looks like this: we get a larger, newly compacted block, with the index and chunks merged together.

That is how compaction works in the most common case, where there is no data overlap. Another case is called vertical compaction: here block A and block B have some overlapping data, which happens a lot now that Prometheus supports backfilling. The compaction process is still the same: we merge the index files together, then go over all the series on the right-hand side and merge the chunks. The only difference is that we may end up with some overlapping chunks. Finally, we get a newly compacted block like this, with the overlapping chunks outlined here. For this overlapping data we need to rebuild the chunks and maybe do some deduplication, but this process is actually very naive in Prometheus right now, because the deduplication is only one-to-one, meaning only exactly identical chunks can be deduplicated.

I think that covers the whole compaction process, so next let's look at why we need it. Think about the query scenario. Before compaction, a long-range query has to match and hit all these small blocks, so we need to go through all the small index files and merge the results together before returning them to the user. If we compact them, we end up with only one larger block, so the same query hits just that one block and we don't have to do any result merging or online deduplication; query performance is significantly improved. Additionally, because the index files of the small blocks may contain many common series, as I mentioned earlier, compacting and merging them lets us save some disk space by deduplicating the index data.
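To make the merge step above more concrete, here is a minimal Go sketch, assuming simplified in-memory types rather than the streaming iterators the real Prometheus compactor uses; it shows how common series collapse to a single index entry while their chunks are concatenated horizontally.

```go
package main

import "fmt"

// chunk is a stand-in for an encoded chunk covering a time range (ms).
type chunk struct {
	minT, maxT int64
}

// block maps a series' label set (as a string key) to its chunks, sorted by time.
type block map[string][]chunk

// compact merges several non-overlapping source blocks into one new block.
func compact(blocks ...block) block {
	out := block{}
	for _, b := range blocks {
		for labels, chunks := range b {
			// Common series end up under one key (index deduplication);
			// their chunks are simply appended side by side (horizontal merge).
			// Series that exist in only one block are copied as-is.
			out[labels] = append(out[labels], chunks...)
		}
	}
	// With vertical compaction, chunks of the same series may overlap in time
	// and would have to be rebuilt here; current Prometheus only drops chunks
	// that are exactly identical (one-to-one deduplication).
	return out
}

func main() {
	a := block{`{job="http-server"}`: {{minT: 0, maxT: 10}}}
	b := block{
		`{job="http-server"}`: {{minT: 10, maxT: 20}},
		`{pod="kube-x"}`:      {{minT: 12, maxT: 20}},
	}
	merged := compact(a, b)
	fmt.Println(len(merged), "series,", len(merged[`{job="http-server"}`]), "chunks for the common series")
}
```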
Okay, finally, let's take a look at the tombstones file. Why do we need it? Imagine we want to delete some series data from a block. We can send a deletion request specifying the series matchers and the time range of the data we want to delete. For performance reasons, the data to delete is temporarily recorded in a file called tombstones, and it is actually removed at compaction time: while we iterate over all the series, we check whether a series and its chunks match an entry in the tombstones file, and if they do, we drop that data so the newly compacted block no longer contains it. Additionally, if the tombstones file covers a large enough amount of series data, a compaction is triggered specifically to clean up the data and reclaim disk space. So now you have a high-level idea of what a tombstones file is, but from the perspective of compaction it is still just another kind of modification to your index and chunk data. To me, the essential part of compaction is merging the index and the chunks.

So we have covered compaction itself, but before the compaction phase there is another phase called planning, which chooses the blocks we want to compact together. In this diagram, there are five blocks in the TSDB directory, and the planner chooses to compact three of them into a new block F. To choose the source blocks, it follows a planning algorithm. Right now the planning algorithm in Prometheus is fairly simple: it only checks the time range and the number of tombstones of each block, and this information can easily be found in the metadata file. I will not dive deep into the planning algorithm here, but ideally we could extend it to be more intelligent; in Thanos, we added a planner called the index-size-limit planner that limits the maximum index size of the compacted block.

I think that's all for the compaction of Prometheus TSDB itself. Next, let's look at some larger-scale scenarios in systems like Thanos or Cortex. I will not cover the Cortex compactor here, because it is built on top of the Thanos compactor with some additional features. A common Thanos deployment looks like this: we might have two clusters, each with one Prometheus and one Thanos sidecar. In Prometheus, we configure the cluster name as an external label, and the sidecar injects the external labels into the TSDB blocks and uploads them to object storage every two hours. Besides that, we run one central Thanos compactor against the object storage, and it works in four steps. First, because compaction needs to do some planning, and the planning step only requires the block metadata, the compactor has a separate job that fetches the block metadata from the object storage and downloads it to local disk. Second, once that is done, it can start planning using this local metadata. Third, if there is an available compaction plan, it downloads all the required blocks and compacts them locally. Finally, it uploads the new block to the bucket and deletes the source blocks.
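As a rough illustration, here is how one such compactor cycle could be sketched in Go; every function below is a placeholder standing in for the real object-storage and compaction code, not an actual Thanos API.

```go
package main

import "fmt"

// Meta is a stand-in for a block's downloaded meta.json.
type Meta struct {
	ULID   string
	Labels map[string]string // external labels, later used for grouping
}

// 1. Fetch only the meta.json files from the bucket to local disk.
func syncMetas() []Meta { return []Meta{} }

// 2. Decide, from metadata alone, which blocks should be compacted together.
func plan(metas []Meta) [][]Meta { return nil }

// 3. Download the chosen source blocks and compact them locally.
func downloadAndCompact(sources []Meta) Meta { return Meta{ULID: "new-block"} }

// 4. Upload the new block and delete the source blocks from the bucket.
func uploadAndCleanup(newBlock Meta, sources []Meta) {}

func main() {
	metas := syncMetas()
	for _, group := range plan(metas) {
		newBlock := downloadAndCompact(group)
		uploadAndCleanup(newBlock, group)
		fmt.Println("compacted", len(group), "blocks into", newBlock.ULID)
	}
}
```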
As you can see, steps two and three are exactly the same as in the Prometheus compactor; the Thanos compactor just adds steps one and four to make it work against object storage. There is, however, a small difference in the planning phase: planning is done separately per group, and each group is identified by the external labels attached to the blocks. Here we have blocks from cluster Europe and cluster US, and these blocks are planned and compacted within their own groups, because we don't want a super-large block that mixes data from all the clusters. So finally we get two compacted blocks, one for Europe and one for US. That's basically how the Thanos compactor works.

Next, let's see some extensions we made to the Thanos compactor to deal with our large-scale use cases. The first scenario is actually super common: we deploy two Prometheus instances in each cluster for high availability, and a new external label called replica is added to identify the source Prometheus of the blocks. The problem with this setup is data duplication, caused by the grouping mechanism: the external labels of the two replicas are different, so we get two groups for the two replicas even though they are in the same cluster. We end up with two compacted blocks for cluster US with completely overlapping time ranges. Obviously this is not good. First, we double the space usage by keeping two blocks with almost the same data. Second, at query time we have to do online deduplication, because the query touches both blocks, so query performance is not as good as with only one block.

To solve this, as I mentioned, we can use the built-in vertical compaction mechanism to merge the overlapping blocks. Thanos provides a flag for a deduplication replica label, and if we specify it, the replica external label is ignored during the grouping and planning phase, so all six blocks are planned and compacted together. This is good because we utilize the default vertical compaction mechanism to improve both space usage and query performance. But is it really good enough? What about the chunk data? The chunk data is still problematic: the chunks come from high-availability Prometheus pairs, and they cannot be perfectly deduplicated by the default vertical compaction, because the two Prometheus instances don't collect samples at exactly the same moment, so the sample timestamps from the two Prometheus servers are not identical. The default vertical compaction only does one-to-one deduplication, and with different timestamps it simply doesn't work. So we still end up with double the space usage for the replicated data, and it would be better to have a smarter vertical compaction algorithm to deduplicate this. What we actually want is something like this: keep the samples from only one replica, and that's already good enough for us.
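To illustrate the grouping idea, here is a small hypothetical sketch of building a planning group key from external labels while ignoring a configured replica label; the function and names are assumptions for illustration, not the actual Thanos code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// groupKey builds a compaction group key from a block's external labels,
// skipping any label configured as a replica label (the effect of the
// deduplication replica-label flag described above).
func groupKey(externalLabels map[string]string, replicaLabels ...string) string {
	skip := map[string]bool{}
	for _, r := range replicaLabels {
		skip[r] = true
	}
	var parts []string
	for name, value := range externalLabels {
		if skip[name] {
			continue // the replica label is ignored, so HA pairs fall into one group
		}
		parts = append(parts, name+"="+value)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

func main() {
	replicaA := map[string]string{"cluster": "us", "replica": "0"}
	replicaB := map[string]string{"cluster": "us", "replica": "1"}
	// With "replica" ignored, both blocks get the same key, so they are planned
	// and vertically compacted together.
	fmt.Println(groupKey(replicaA, "replica") == groupKey(replicaB, "replica")) // true
}
```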
This can be done by extending the deduplication algorithm used in vertical compaction. In Thanos, we added support for the same penalty-based deduplication algorithm used by the Thanos querier and applied it on the compactor side, and this reduces the space usage a lot.

Another challenge we tackled in Thanos is how to deal with data-manipulation requests against object storage. For example, we may want to delete some high-cardinality series, relabel some series, or rename some metrics. The problem is that these blocks live in object storage and are immutable by default, so can we modify the existing index and chunks easily? Since today's topic is compaction: Thanos actually uses the compaction mechanism to implement these requests. Let me use relabeling as an example. We have this kind of relabeling configuration: the first relabel rule renames the CPU metric to another name, and the second rule drops the label named code. In this case, we still run a compaction over the block we want to modify, and while we iterate over all the series on the right-hand side, we apply the relabeling to each series' labels. If a series' labels change, we take the corresponding action. The label-replace case is super easy, because we only need to rename the labels on the right-hand side, so only the index needs to change; we don't have to touch the chunks at all. The label-drop case is a bit trickier: once the code label is deleted, these three series collapse into one series, so we also need to merge the corresponding chunks, combine them, and do some deduplication. But anyway, in the end we relabel this TSDB block and get a new block out of the compaction.

So this is cool; why don't we have this support in Prometheus as well? Recently, I opened a PR to support this kind of use case in promtool, and the required change is actually quite small. First, I added one interface called Modifier, with a single Modify method, which is used to apply additional modification logic to the index and chunk data during the compaction phase. The Prometheus built-in block-writing method is then extended with a list of modifiers that apply this logic. To visualize it more easily, here is the graph again with the modifier: the modifier just works as a middle layer in the compaction process. During compaction we still merge the index files and chunks together, but we apply the modifications to them before writing the new block.
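As a sketch of the idea behind that promtool change, here is what such a modifier hook might look like; the interface, names, and the rename example are illustrative assumptions, not the exact code in the PR.

```go
package main

import "fmt"

// Series is a simplified in-memory view of one series during block rewriting.
type Series struct {
	Labels map[string]string
	Chunks []string // stand-in for the encoded chunk data
}

// Modifier can rewrite or drop a series before it is written to the new block.
type Modifier interface {
	Modify(s Series) (Series, bool) // returns the modified series and whether to keep it
}

// renameMetric rewrites the __name__ label; only the index side changes and
// the chunks are untouched. (Dropping a label would be trickier: series that
// collide afterwards need their chunks merged and deduplicated too.)
type renameMetric struct{ from, to string }

func (r renameMetric) Modify(s Series) (Series, bool) {
	if s.Labels["__name__"] == r.from {
		s.Labels["__name__"] = r.to
	}
	return s, true
}

func main() {
	var m Modifier = renameMetric{from: "node_cpu", to: "node_cpu_seconds_total"}
	s, keep := m.Modify(Series{Labels: map[string]string{"__name__": "node_cpu", "mode": "idle"}})
	fmt.Println(s.Labels["__name__"], keep)
}
```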
So, what can we achieve using this modifier? Let's do some brainstorming about downsampling. You can simply think of downsampling as a process of increasing the scrape interval in Prometheus. In the diagram, you can see we downsample data from 15-second to one-minute resolution, and after downsampling we have fewer samples. As you can imagine, in the downsampling case we only need to change the chunk data; we don't need to modify anything related to the index. So we could implement a downsampling modifier: with the configuration on the left-hand side, when we do the compaction we find all the series matching the configured matchers and just rebuild their chunks at the configured resolution.

The last brainstorming case is about dynamic retention. We could also implement a retention modifier: at compaction time, we check whether each series matches the required matchers, and the matching series are moved to a new block and kept for a longer time.

That's all for my talk today. Thank you everyone for listening, and I'm ready for the Q&A part. Thank you.

We're going to have a slightly shorter Q&A for this session to catch up a little bit on time, because we've run a little bit over. Does anybody have a question they'd like me to repeat for our virtual audience? All right, I have one then. You mentioned the two dimensions, vertical and horizontal, that you compact by. I think I'm correct in saying horizontal is your time axis and vertical is overlapping data, that sort of thing.

Yeah, but actually I think vertical and horizontal are essentially the same thing. In the graph, you can see that when we merge the chunks together, it's still a horizontal merge. If there's no time overlap, we don't need to do anything special, we just chain the chunks together; but if there is any overlap, we need to rebuild the chunks.

I was going to ask, do you see any other dimensions, besides those two, that you would like to...

Yeah, I don't think so. I think only two dimensions.

Well, thank you very much.

Okay, thank you.