Hello, and welcome to my talk about how we converted our Cortex data to TSDB blocks. I am Peter, and I work as a software engineer at Grafana Labs, where I work on the Cortex project.

For those of you who are not familiar with Cortex, here is a short introduction. Cortex is a horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus that provides a global view across all of your Prometheus data. Cortex features blazing fast queries and is 100% compatible with PromQL. Cortex is currently an incubating project at CNCF. In short, Cortex acts as a remote write target for Prometheus servers and allows querying the data back.

2020 was the year when we introduced a new storage engine into Cortex, called the blocks engine. This storage engine uses TSDB blocks, exactly the same as Prometheus itself does. In fact, we reuse the TSDB library from Prometheus to write the data. In addition to that, Cortex also reuses some components from Thanos, and we have optimized some of those components for the benefit of both Cortex and Thanos.

Why did we spend all this effort? TSDB blocks are simply much more efficient at storing the same data compared to the storage used by Cortex previously. In addition to that, we also get extra features that Cortex did not have before, like support for queries without the metric name, per-tenant data retention, or much simpler data cleanup from the long-term storage. Introducing the new data storage format poses the question of what to do with the old data stored in the legacy format, and that is what this talk is going to be about.

The original storage in Cortex is called chunks storage. It has two parts: an index and a store for chunks. On this slide, we see an example of three chunks. Each chunk contains samples for one time series for a specific time range, typically a few hours. In addition to those samples, a Cortex chunk has some metadata about itself, like which series it belongs to, stored as a set of labels, or its minimum and maximum time. Each chunk is stored separately, for example as an individual object in the cloud storage or an individual cell in Bigtable. As you can imagine, this generates a huge number of objects or cells. Due to replication, an individual sample may end up in multiple chunks, although the Cortex chunks storage does have some features to reduce this duplication.

For Cortex to be able to locate the correct chunks during queries, it uses the index. The index is stored in a NoSQL database, like Bigtable or DynamoDB. The index needs to have multiple features. First of all, it needs to allow efficient lookups based on time, because PromQL queries are always restricted to some specific time range; to make them run fast, we need to restrict the search to this time range only. The index must support search by specific label names and values. Next, the index must also support multi-tenancy, even though the index entries for all users are stored in the same database. How exactly the index looks in Cortex has evolved over time, and Cortex uses so-called schemas to describe each version of the index. On the picture, you can see a simplified view of so-called version 9 of the index, with entry types like label, label value, or series. Here is a fun challenge: how do you efficiently return chunks for Cortex's uptime metric with job names starting with Q?

Let's take a look at TSDB blocks now. A block consists of multiple files on disk. Each block has a unique identifier that encodes the timestamp when the identifier was generated, plus a random part for uniqueness.
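Prometheus implements these block identifiers as ULIDs. As a minimal sketch of the "timestamp plus random part" idea, using the github.com/oklog/ulid library that the TSDB code itself depends on, generating and inspecting such an identifier could look like this:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"

	"github.com/oklog/ulid/v2"
)

func main() {
	// A ULID combines a millisecond timestamp with random entropy,
	// which is exactly the "timestamp + random part" structure of a block ID.
	now := time.Now()
	entropy := rand.New(rand.NewSource(now.UnixNano()))
	id := ulid.MustNew(ulid.Timestamp(now), entropy)

	fmt.Println("block ID:", id)                     // e.g. 01F8X9Z3JB7Q2R4S5T6V7W8X9Y
	fmt.Println("created at:", ulid.Time(id.Time())) // timestamp recovered from the ID
}
```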
Inside the chunks subdirectory, there are so-called segment files, which are numbered; in this image, there are two. The block also has an index, stored in a single file. Metadata are stored in a small file in JSON format. And finally, there are tombstones, which Cortex currently doesn't use.

This is what the meta.json file looks like. It contains some information about the block, most importantly the time range that it covers, but also some stats and information about compaction.

Now let's take a look at segment files, those numbered files under the chunks directory. Segment files contain individual chunks. These are similar to Cortex chunks, but much simpler: they only contain samples. Chunks are stored one next to another in the segment file until they take roughly half a gigabyte of space, and then another segment file is started. Each chunk has a so-called reference number, which is basically the position, or the offset, of the chunk in the segment file, combined with the number of the segment file, 001 in this example. Encoding these two values together gives us the reference number.

The index in a TSDB block is a single file, and it is quite complicated internally for maximum efficiency, but the basic idea is simple. It contains information about all time series in the block, that means the list of labels and the list of chunks, or rather their reference numbers, together with the minimum and maximum time for each chunk. The index also contains all label-value pairs for fast lookup, and these label-value pairs are then mapped to series IDs. And that's basically it. Now you know what's inside your TSDB blocks. It's interesting to see that Prometheus does not encode the series type, like gauge or counter, in the index.

What is important about TSDB blocks is that they are standalone: each block is independent from any other block. Blocks can be merged together into larger blocks through the process called compaction. This helps to save disk space, because a big part of the index is typically the same, that is, if you have the same label names and values for longer than a couple of hours. To learn more about TSDB blocks, I can highly recommend the series of blog posts by my colleague Ganesh Vernekar, who maintains the TSDB library in Prometheus. The slides contain the link if you are interested.

So how are TSDB blocks used by Cortex? Cortex components put incoming data into blocks and upload them to object storage, like GCS or S3. We define our own structure for storing blocks: basically, each user has its own directory full of blocks. And we also put a little extra metadata into the meta.json file, but otherwise we just use plain TSDB blocks. In Cortex, we generate two-hour blocks at first, but later they are compacted into one-day blocks, that is, each block covers 24 hours of data.

We have seen what the Cortex chunks storage looks like and what TSDB blocks look like. Now, how do we convert our Cortex chunks to TSDB blocks? To generate blocks, we need to find the series and chunks that belong to each block. Since we want one block per day per user, that means finding series for a specific user that have samples in that specific day. Unfortunately, the Cortex index was not designed with these requirements in mind. You may be thinking: what? Isn't that what the index is for? Well, the index is designed to do lookups for label and value pairs, but not to find all series for a user. In fact, it's even difficult to quickly find all users. In addition to that, Cortex uses different versions of the index.
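As a small aside on the reference number mentioned earlier: conceptually, the TSDB library packs the segment file number and the byte offset into a single 64-bit value. Here is a sketch of that idea (my own illustration of the encoding, not the exact Prometheus code):

```go
package main

import "fmt"

// chunkRef packs the segment file number into the upper 32 bits and the
// byte offset within that file into the lower 32 bits, mirroring how the
// Prometheus TSDB conceptually builds chunk reference numbers.
func chunkRef(segment, offset uint32) uint64 {
	return uint64(segment)<<32 | uint64(offset)
}

// splitRef recovers the segment file number and offset from a reference.
func splitRef(ref uint64) (segment, offset uint32) {
	return uint32(ref >> 32), uint32(ref)
}

func main() {
	// A chunk stored at offset 12345 in segment file 000001.
	ref := chunkRef(1, 12345)
	seg, off := splitRef(ref)
	fmt.Printf("ref=%d segment=%06d offset=%d\n", ref, seg, off)
}
```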
Fortunately, in our production databases we have only used version 9 and later, which are compatible, so that is what we have focused on. On the other hand, we can read the entire index and generate a list of series and chunks that should be put into each block. And that is actually pretty easy and efficient to do; it just requires a single full scan of the index. This observation drove the design of the conversion tooling.

Our conversion tooling has three main components. The index scanner does the full index scan and generates plan files, where a plan file is basically a list of series and chunks that belong to a single block. The block builder takes a single plan file, downloads all the chunks, and constructs a TSDB block. And finally, the scheduler monitors available plan files and distributes them to block builders. I will shortly talk about these components now, but if you want even more details, there is a link to the design document.

The scanner scans the index tables. In Cortex, our index is divided into individual tables, and each table covers one week of data. When the scanner scans a table, it processes all index entries in some order. Due to how Cortex stores index entries, this order is basically random, because Cortex prefixes each entry with a hash for better key distribution. At least that is the case when Cortex stores the index in Bigtable, which is what we have used in our clusters. While reading all index entries, the scanner only selects entries that describe the mapping between series IDs and chunks, and stores these mappings into a file. We know the minimum and maximum time of a chunk because it's part of the chunk ID, so we know into which days the chunk belongs. So in the end, the scanner produces one file per user and day, and this file contains the complete list of series IDs and their chunks. This is called a plan file. If you have thousands of users in the index, scanning a single table will produce seven times that many plan files, one per day per user. Plan files are then uploaded to the object storage bucket, and the scanner can continue with the next table until it processes all of them. If a new table appears, which can happen because the entire conversion takes many days and customers are pushing new data in this time, the scanner can still process the new tables. Originally, the scanner supported Bigtable only, but now it also supports DynamoDB, and there was some community work on adding Cassandra support. The scanner is not horizontally scalable, by design; it's typically pretty fast already.

The BlockBuilder is the main component of the tooling, because it generates the TSDB blocks. The BlockBuilder is told which plan to work on, then it downloads the plan and fetches all the chunks from the chunk store. The BlockBuilder then builds the TSDB index and segment files, and uploads the generated TSDB block back into the bucket where Cortex can find it. The BlockBuilder takes care of many details that need to be right. It deduplicates samples from multiple chunks in the same series; such chunks may appear because of the replication used in Cortex. The builder can also fix some bugs introduced in Cortex chunks over time, for example duplicate labels. The builder, of course, takes care of sorting series before building the index, otherwise the index would not be correct, and it does so without getting OOM-killed. The builder obviously knows how to produce the meta.json file with Cortex metadata. An important point about the BlockBuilder is that once it has the plan, it doesn't need to interact with the Cortex index. It only needs to fetch chunks.
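To make plan files a bit more concrete, here is a hypothetical sketch of what one entry in a plan file could look like: one series mapped to its chunk IDs for a given user and day. The field names and JSON-lines layout are my own illustration, not necessarily the exact format used by the tooling:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// planEntry is a hypothetical representation of one line in a plan file:
// a single series (identified by its Cortex series ID) together with the
// IDs of all chunks that hold its samples for the plan's user and day.
type planEntry struct {
	SeriesID string   `json:"seriesID"`
	Chunks   []string `json:"chunks"`
}

func main() {
	entry := planEntry{
		SeriesID: "series-1234",
		Chunks:   []string{"chunk-a", "chunk-b", "chunk-c"},
	}
	// A plan file can simply be a stream of such entries, one JSON object per line.
	line, _ := json.Marshal(entry)
	fmt.Println(string(line))
}
```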
Cortex chunks already contain the full information about the series, like label names and values, and that is all that the BlockBuilder needs. This also shows one reason why TSDB is more space efficient than the Cortex chunks storage: the information about the same series is not repeated in every chunk when it is stored in TSDB. Instead, the series information is only stored once, in the TSDB index. The BlockBuilder is horizontally scalable; you can run many of them to get blocks built faster.

We have said before that the index scanner writes plans to the bucket. The scheduler is the component that finds those plans and sends them to the BlockBuilders. BlockBuilders communicate their build progress using another small file in the bucket, which is regularly updated as long as the BlockBuilder is running and working on the plan. If the scheduler finds that a BlockBuilder hasn't reported its progress recently, the scheduler will consider such a builder crashed and will abort the build. When a BlockBuilder is done with the plan, it uploads a "finished" file next to the plan file; this tells the scheduler that the given plan is finished. If the build has failed, the BlockBuilder will upload a "failed" file instead. This is a very simple orchestration using object storage. Note that the scheduler doesn't need to read the files, it only issues list commands; all information is encoded in the file names. By doing a single list, the scheduler knows which plans are available, which are in progress, and which are finished. This allows the scheduler to update its in-memory state every few minutes, so that when a BlockBuilder asks for the next plan, the scheduler can return one.

How did it work? Well, it worked. We have successfully converted all stored chunks data into TSDB blocks, and we could downscale our Bigtable instances.

What have we learned? That a simple distributed system is good. Communication using buckets by listing files works just fine, and it's easy to manipulate the state of jobs by manipulating files in the bucket. For example, if you delete the "failed" file, the scheduler will see that the plan needs to be built again and will give it to the BlockBuilders.

Were there any problems? Of course there were. The first issue we ran into was related to meta.json files. In early versions of the BlockBuilder, we did not correctly set all the meta.json details, especially the compaction section. The Cortex compactor was then confused by such blocks and deleted newly built blocks soon after they were uploaded to the bucket. We did not catch this in testing because we didn't run the compactor in the cluster where the test was done. But once fixed, we were quickly able to just redo those blocks again. Another issue we hit was related to memory. When writing the TSDB index, we were trying to sort all series in memory, but some of our customers had so many series that we could not fit all the labels in memory during the block build. After retrying the job several times with more and more memory, we eventually gave up at 30 gigabytes and stored series data on disk instead, with some sorting afterwards. That fixed the issue. And the last point I have is: horizontal scaling for the win. By using plan files, we split the whole conversion task into many small sub-tasks, allowing the BlockBuilder to be horizontally scalable. That is, we could just run more of them to proceed faster. One of our clusters contains thousands of tiny tenants, and the number of plans on this cluster was in the millions, but we got it converted in just a couple of days by running many BlockBuilders.
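As a rough sketch of this file-name-based orchestration, here is how a scheduler could derive the state of each plan from a single bucket listing. The file names and suffixes here are made up for illustration; the real tooling may use different naming:

```go
package main

import (
	"fmt"
	"strings"
)

// planStatus is derived purely from which marker files exist next to a plan,
// so a single "list" call over the bucket is enough to rebuild the state.
func planStatus(objects []string, plan string) string {
	switch {
	case contains(objects, plan+".finished"):
		return "finished"
	case contains(objects, plan+".failed"):
		return "failed"
	case contains(objects, plan+".progress"):
		return "in progress"
	default:
		return "available"
	}
}

func contains(objects []string, name string) bool {
	for _, o := range objects {
		if strings.HasSuffix(o, name) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical listing of per-user, per-day plans and their marker files.
	listing := []string{
		"user-1/2020-06-01.plan",
		"user-1/2020-06-01.plan.finished",
		"user-1/2020-06-02.plan",
		"user-1/2020-06-02.plan.progress",
	}
	fmt.Println(planStatus(listing, "user-1/2020-06-01.plan")) // finished
	fmt.Println(planStatus(listing, "user-1/2020-06-02.plan")) // in progress
}
```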
If you have your own chunks data that you would like to convert to blocks, all this tooling is part of the open source code base.

In the last part of the talk, I want to show you how you can write TSDB blocks with some simple Go code. Of course, if you want to convert some data to TSDB blocks, you can also use the new Prometheus backfilling tool, but sometimes it may be more efficient to write TSDB blocks directly, without converting the data to the OpenMetrics format first. And it's also easy and fun.

To generate a TSDB block, we need to do three steps: write chunks into segment files, write the index, and write the meta.json file. We will use the TSDB library from Prometheus to do that. You can find the full example at the provided link.

This code example shows how to prepare chunks for a single time series from a given list of samples. Each sample has a timestamp and a float64 value. We need to iterate through all samples. We convert the timestamp to a Unix timestamp in milliseconds, because that is what Prometheus expects to find in a TSDB block. We check whether we have an appender for the chunk. If not, we create a new XOR chunk, which is what Prometheus uses, and we get an appender for this chunk. The appender is used to write individual samples to the chunk. We are not only building chunks but also the chunk meta information, with minimum and maximum time. Now that we have an appender, we can write samples to the chunk: we simply append the sample to the appender and update the max time in the metadata. In this step, we assume the timestamps of the samples are increasing; that is how Prometheus expects to find samples in the chunk. After writing 120 samples into a chunk, we finish the chunk, append its metadata to our list of metadata, and set the appender to nil, which will cause creation of a new chunk in the next iteration. And that is it, this creates the chunks.

Chunks are written to disk using the chunk writer. One can get the chunk writer from the TSDB chunks package, and it currently has these two methods: WriteChunks and Close. WriteChunks can be called many times; we give it a slice of chunks to write, and after writing them all, we can close it. There are a few important details. Chunks must be sorted in the same order as the order of series in the index; Prometheus does it this way, and Prometheus may use this fact for future optimizations. Within a single series, chunks must be sorted by increasing time. Calling WriteChunks will update the reference field of each chunk meta structure. This reference field is important in the next step, writing the index.

The index writer has two important methods. First, you must call AddSymbol with all symbols, that is, all label names and values. These symbols must be sorted, and each symbol must be added exactly once. After writing all symbols to the index, we can finally start adding series to the index. To add a series, we need to pass in a reference number, the set of labels that the series uses, and the slice of chunk meta structures. These chunk meta structures must have the reference field set, and as we have said on the previous slide, it is set when chunks are written by the chunk writer. The reference number passed as the first argument is somewhat strange; it's really not needed, and the implementation of the index writer in the TSDB library only uses it to check that it increases between calls.

After writing chunks and index, we are almost finished with the block. The last piece that is missing is the meta.json file, and this file is represented by the BlockMeta type from the TSDB library.
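Before we get to the meta.json fields, here is a condensed, hedged sketch of the first two steps, building chunks and writing chunks and index, as just described. Import paths and exact signatures vary between Prometheus versions, and the metric name and output directory are made up for the example, so treat this as an illustration rather than copy-paste code:

```go
package main

import (
	"context"
	"log"
	"path/filepath"

	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/tsdb/chunkenc"
	"github.com/prometheus/prometheus/tsdb/chunks"
	"github.com/prometheus/prometheus/tsdb/index"
)

type sample struct {
	ts int64 // Unix timestamp in milliseconds, as Prometheus expects
	v  float64
}

// buildChunks turns a list of samples (sorted by increasing timestamp) into
// XOR chunks, starting a new chunk after every 120 samples.
func buildChunks(samples []sample) ([]chunks.Meta, error) {
	var (
		metas []chunks.Meta
		app   chunkenc.Appender
		cur   chunks.Meta
	)
	for i, s := range samples {
		if app == nil {
			chk := chunkenc.NewXORChunk()
			a, err := chk.Appender()
			if err != nil {
				return nil, err
			}
			app = a
			cur = chunks.Meta{Chunk: chk, MinTime: s.ts}
		}
		app.Append(s.ts, s.v)
		cur.MaxTime = s.ts

		// Finish the chunk after 120 samples or at the end of the series.
		if (i+1)%120 == 0 || i == len(samples)-1 {
			metas = append(metas, cur)
			app = nil
		}
	}
	return metas, nil
}

func main() {
	blockDir := "./block" // hypothetical output directory

	metas, err := buildChunks([]sample{{ts: 1600000000000, v: 1}, {ts: 1600000015000, v: 2}})
	if err != nil {
		log.Fatal(err)
	}

	// Step 1: write chunks into segment files; this fills in the reference field of each Meta.
	cw, err := chunks.NewWriter(filepath.Join(blockDir, "chunks"))
	if err != nil {
		log.Fatal(err)
	}
	if err := cw.WriteChunks(metas...); err != nil {
		log.Fatal(err)
	}
	if err := cw.Close(); err != nil {
		log.Fatal(err)
	}

	// Step 2: write the index. Symbols must be added first, in sorted order.
	iw, err := index.NewWriter(context.Background(), filepath.Join(blockDir, "index"))
	if err != nil {
		log.Fatal(err)
	}
	for _, sym := range []string{"__name__", "example_metric"} { // already sorted
		if err := iw.AddSymbol(sym); err != nil {
			log.Fatal(err)
		}
	}
	lbls := labels.FromStrings("__name__", "example_metric")
	if err := iw.AddSeries(1, lbls, metas...); err != nil {
		log.Fatal(err)
	}
	if err := iw.Close(); err != nil {
		log.Fatal(err)
	}
	// Step 3, writing meta.json via BlockMeta, is discussed next.
}
```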
We need to fill all the fields, and most of them are pretty self-explanatory. The version is currently 1; unfortunately, this constant is not public in the TSDB library. Compaction information must be set as well, and we simply set the block's source to itself. Failing to do this may confuse the Compactor. After writing the meta.json file to disk, we have a complete block that we can use with Prometheus. For Thanos and Cortex, we need an extra bit of metadata in the meta.json file, but nothing too complex.

You can find the fully functional example code at this address. It contains a tiny program that generates a TSDB block with a single series. If you put this generated block into Prometheus, you can query it and see a sine wave. If you want to use this approach to generate TSDB blocks, check out the Prometheus storage documentation and the section about backfilling. There are some small things to keep in mind when converting very recent data and overlapping blocks.

On the last slide, you can find some links to learn more about the blocks engine in Cortex and TSDB blocks. Thank you very much for your attention.