Thank you for the introduction. In addition to what was said there, I am also a Prometheus team member and I maintain the Prometheus TSDB upstream, which is why I am speaking about it here. Before we go into the TSDB, what is Prometheus? Prometheus is a metrics-based monitoring and alerting stack: you instrument your application or run exporters which expose the metrics that you want to monitor, and Prometheus scrapes and collects those metrics for you, stores them, and lets you alert based on conditions. So it is a whole monitoring stack. Talking about the time series itself, it has an identifier: in Prometheus it is a set of label names and label values, pairs of strings, you could say a slice of pairs of strings. http_requests_total here is an example where the metric is tracking the total number of HTTP requests done for this particular job; the metric name itself is a special label, and job is a custom label name with just a value. Along with the identifier, a time series is just a stream of samples. Here a sample is a tuple of a timestamp and a value. We use Unix timestamps, and the value can be a float; that is the basis of the time series which we are going to store in the TSDB.

Prometheus is huge, it does a lot of things, but we are only going to focus on the time series database which sits inside Prometheus in the middle: how it stores the data in its raw form and just makes it available for queries. We are not going to worry about anything else that Prometheus does. In the TSDB, the sample that we are going to talk about has a timestamp that is a 64-bit integer; in Prometheus we store milliseconds as integers because we found that is enough for most use cases, and the value is a float64. Prometheus right now supports more than float64, a custom data structure to store high-resolution histograms, but for the sake of this talk we are only going to talk about a sample having a 64-bit integer timestamp and a 64-bit float value.

Before diving into how data is stored inside the TSDB, let us look at the overview. There is a component called the head block: it keeps the index of the recent data in memory, some recent data in memory, and some things memory-mapped; we will see how that works in a moment. This is the component that first receives the data in the TSDB, and after the head has stored data for some time we create persistent blocks, immutable blocks, which you can store for some time and run queries on. We are going to go inside these blocks and see how they are created.

Let us first look at writing a sample into the head block. Here is the head block; there is nothing inside right now, and we get one sample. We create an index entry for this particular set of labels, called a time series. A chunk is a compressed set of samples; in Prometheus we use something called Gorilla compression, by Facebook, and every time you get a sample you compress it in flight and store it there, instead of storing the raw samples. But before we put the sample into the TSDB we first create the series entry in the index and then write into something called the write-ahead log; we will see why that is required in a moment. Once we have written to the write-ahead log, recording that this write has come to the TSDB, we then write it inside the head in its compressed form.
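As a rough sketch of what this data model looks like, here are illustrative Go types for the identifier and the sample stream; these are not the actual Prometheus structs, just a minimal picture of a label set, a (millisecond timestamp, float64) sample, and a series:

```go
package main

import "fmt"

// Label is one name/value pair; a sorted set of these identifies a series.
// Illustrative types only, not the real Prometheus TSDB structs.
type Label struct {
	Name, Value string
}

// Sample is a (timestamp, value) tuple: milliseconds since the Unix epoch
// as an int64, and a float64 value.
type Sample struct {
	T int64
	V float64
}

// Series is an identifier plus a stream of samples.
type Series struct {
	Labels  []Label
	Samples []Sample
}

func main() {
	s := Series{
		Labels: []Label{
			{"__name__", "http_requests_total"}, // the metric name is itself a label
			{"job", "nginx"},
		},
	}
	s.Samples = append(s.Samples, Sample{T: 1700000000000, V: 1027})
	fmt.Printf("%+v\n", s)
}
```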
Inside the write-ahead log, if it is the first sample for a series, we store the labels of the series and the series gets an ID, say ID 1, and we write another record which says the sample is T1, V1 for ID 1. That is the write-ahead log: it just logs every write that comes to the TSDB. We need the write-ahead log for durability. Just imagine you did a write request to the TSDB and the system crashed right after you told the upper layer that the write was successful; all the data is in memory, so how do you get it back? We use the write-ahead log to replay the events exactly as the writes happened and recreate the in-memory data structures that we had before the crash, and the same happens during a normal restart. When we get another sample we write it to the write-ahead log and then to the chunk, and inside the write-ahead log you see we do not write a series record again, because we have already written one and the log is replayed from start to end. We just say that for ID 1 we got another sample. That is the first step of adding samples in Prometheus.

Let us say you added more samples and the chunk got full; you have to cut the chunk somewhere, you cannot just keep growing it. What we do here, once a chunk is full, in our case we take that as 120 samples but we are currently trying to cap it at a certain size as well, is memory-map it into something called the head chunks. Head chunks are just a bunch of these compressed chunks on disk, and in memory we keep just the reference to each chunk, where it is in the file. So you are not storing the compressed chunk in memory; you are just holding an 8-byte reference, and whenever you want to fetch this series you take the reference, fetch the chunk from disk, and then query it. This saves memory during the majority of the time when you do not need to query that data. The same process repeats: you get more data, more chunks get filled. Okay, now you have a lot of data in Prometheus, how do you take care of it? Now we create something called persistent blocks out of the data that is present in the head block. We need to do that to make queries and a lot of other things efficient. This process is called head compaction, where we take the data from the memory-mapped chunks which are on disk, and also some data in memory, and based on some logic Prometheus decides, okay, we want to compact data from this time range to that time range, and we just create a block. The same process repeats and we create another block; if the earliest block is numbered n, the new block is n+1, so it is a linear set of blocks that accumulate in the TSDB.

But why do we need compaction? Time series are not always the same; after some time you may be ingesting data into a new set of time series. So you do not want to hold index entries for the old ones in memory. Once you compact and put the flushed data onto the disk, those index entries are removed from the head and the space is freed; the chunk references you were holding in memory also get cleaned up, and restarts will be faster because now you have to replay a smaller set of events to recreate the in-memory structures.
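To make the write-ahead log idea concrete, here is a sketch of the two record types described above and a replay loop; the Go shapes and names are made up for illustration, the real WAL uses a compact binary encoding:

```go
package main

import "fmt"

// A series record is written once per new series; sample records refer to it
// by ID, so the labels are never repeated in the log.
type Label struct{ Name, Value string }

type SeriesRecord struct {
	Ref    uint64 // series ID assigned when the series is first seen
	Labels []Label
}

type SampleRecord struct {
	Ref uint64 // points back to the series record
	T   int64  // timestamp in milliseconds
	V   float64
}

// A toy in-memory head, keyed by series ID.
type Head struct {
	series  map[uint64][]Label
	samples map[uint64][]SampleRecord
}

// replay rebuilds the in-memory state by reading the log from start to end,
// which is what happens after a crash or a restart.
func (h *Head) replay(wal []any) {
	for _, rec := range wal {
		switch r := rec.(type) {
		case SeriesRecord:
			h.series[r.Ref] = r.Labels
		case SampleRecord:
			h.samples[r.Ref] = append(h.samples[r.Ref], r)
		}
	}
}

func main() {
	h := &Head{series: map[uint64][]Label{}, samples: map[uint64][]SampleRecord{}}
	wal := []any{
		SeriesRecord{Ref: 1, Labels: []Label{{"__name__", "http_requests_total"}, {"job", "nginx"}}},
		SampleRecord{Ref: 1, T: 1000, V: 1},
		// second sample: no series record again, ID 1 is already in the log
		SampleRecord{Ref: 1, T: 2000, V: 2},
	}
	h.replay(wal)
	fmt.Println(h.series[1], len(h.samples[1]), "samples")
}
```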
So what about the bigger blocks? If you notice in the diagram, I have the n-2 block as a larger block; it was created from the smaller blocks we just made. Imagine you have 4 blocks, A, B, C and D. Based on some logic we choose 3 blocks at a time and merge them to create a bigger block; we will see why that is required soon, since every block has its own index and we are going to look into the index shortly. Block compaction is for efficient queries, and it also reduces disk space usage because the index is not repeated; we will see soon how.

Before we look inside a block: at some point you want to get rid of old data; you do not want to keep it in your system because sometimes you do not need the old time series data. So you can configure the TSDB to have retention based on disk space usage and also on how much of a time range your data covers. This is an example of time-based retention: if we consider this as a number line, all data beyond the red line has to be deleted. But if a block overlaps the retention boundary we cannot delete only part of it, because blocks are immutable; we have to delete a whole block together. So once we add more data and create more blocks, and a block falls entirely outside the retention range, it is simply deleted; there is no waiting, as soon as new blocks come up and an old block goes outside the retention range, that block is removed from the system.

So this is how the TSDB works at a high level; let us dive a little deeper into a block. The persistent block, which is created out of a head block, contains an index which maps all the data present in the block; a meta file with all the important information about the block, which lets you decide how and when to query it; tombstones for deletions, and we will come to why we need tombstones and why we cannot delete data from a block immediately; and the chunks themselves. We store all the chunks together in a bunch of files, and chunks are the compressed unit of samples.

Let us look at them one by one, starting with the meta file. Every block has an identifier, a unique ID (ULID), and the block stores data from a particular min time to a particular max time. So if you look at the meta you know what time range this block covers, a few health statistics like how many series and how many chunks it has, and if the block was created from other blocks we also store the parent blocks, for debugging purposes.

Now let us look at the chunk files themselves. This is pretty simple: there is a list of files, every file is capped at 512 MB, and in every file we just store the compressed chunks as-is, because every chunk is referenced from the index, and the reference is stored in a very simple manner. We have an 8-byte reference for every chunk: the first 4 bytes store the file number in which the chunk exists, and the last 4 bytes store the byte offset in that particular file where the chunk starts. So if I give you an 8-byte reference, with the first 4 bytes you know which file you want, and you take the last 4 bytes and seek to that byte offset in the file, and there you have the chunk; the chunk itself has helpful meta information to know how much to read. So this is the simple way chunks are stored on disk.
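A minimal sketch of that 8-byte reference packing, assuming exactly the layout described above (file number in the upper 4 bytes, byte offset in the lower 4 bytes); the function names here are made up for illustration:

```go
package main

import "fmt"

// packChunkRef puts the chunk file (segment) number into the upper 4 bytes
// and the byte offset within that file into the lower 4 bytes.
func packChunkRef(file, offset uint32) uint64 {
	return uint64(file)<<32 | uint64(offset)
}

// unpackChunkRef recovers the file number and byte offset from a reference.
func unpackChunkRef(ref uint64) (file, offset uint32) {
	return uint32(ref >> 32), uint32(ref & 0xFFFFFFFF)
}

func main() {
	ref := packChunkRef(3, 51200) // chunk lives in file 3 at byte offset 51200
	f, off := unpackChunkRef(ref)
	fmt.Printf("ref=%#x -> file=%d offset=%d\n", ref, f, off)
	// To read the chunk: open file 3, seek to offset 51200, and the chunk's
	// own small header says how many bytes to read.
}
```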
The most interesting part is the index. Prometheus stores the index in something called an inverted index format, and we will see both how it is stored and how it is queried. At a high level the index has 4 components. First is the symbol table: it stores all the symbols, basically the strings that were seen in the time series. The symbols are, for example: if you had the metric name http_requests_total, that is a symbol; for job equals nginx, job is a symbol and nginx is a symbol. All the symbols are stored together in a single table, because if you repeated all the strings everywhere in the index it would just take a lot of space. So you store them once, in sorted order, and just use indices: the first symbol is number 1, the second symbol is number 2, and you use those numbers everywhere in the index; whenever you need the actual string you look it up in this table. That is the purpose of the symbol table.

Now the series themselves. A series entry stores the labels that belong to that particular series, but instead of storing the strings it uses the symbol table references we just saw, followed by a slice of chunk references. We saw earlier that a chunk reference is just an 8-byte number, but with every chunk we also store the min time of the chunk, the max time of the chunk, and the encoding of the chunk, so that we know how to decode it. The series are stored in sorted order based on their labels: first by the first label name and value, then by the second label name and value, and so on. So if you take s1 and s3, we know that if we sorted again, s3 would still come after s1 and s2. This is how series are stored; it is just plain information in the index.

Now we come to the interesting part, the inverted index. I skipped one thing: every series has a reference, the s1, s2, s3 that I keep mentioning. Those are again byte offsets in the index file. So if I give you a series reference s1, you just seek to that offset in the file and there you get the series. In Prometheus we align series entries to 16 bytes, so the reference is the actual byte offset divided by 16; given a reference you multiply it by 16 to get the byte offset and go directly to where the series exists. Postings are nothing but those series IDs. So postings are series IDs, and we store a list of posting lists; we will come to why we store a list in a second, but this section stores lists of those series IDs, and the references Pl1, Pl2 are again byte offsets in the file. I will come back to this after looking at the next part.

Then there is the posting offset table, and this is the important part. You saw that every series has a set of label name and label value pairs, and in the posting offset table we store: okay, foo1 equals bar1 is present in the set of series represented by posting list one, and foo1 equals bar2 is present in this other set of series. This is how we store the inverted index: for every label name and label value pair we store which set of series corresponds to it. The table just points to a posting list via a reference, and from Pl1, Pl2, Pl3 we can go to the postings section and get the actual set of series for that label name and label value.
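To preview how this is used, here is a toy in-memory version of the inverted index idea: every label name=value pair maps to a sorted posting list of series references, and matchers are resolved by looking up and intersecting those lists, as the query walk-through that follows describes. The on-disk index uses byte offsets into the file instead of a Go map; this is only a sketch of the behaviour:

```go
package main

import (
	"fmt"
	"sort"
)

// Index maps each {label name, label value} pair to a sorted posting list
// of series references.
type Index struct {
	postings map[[2]string][]uint64
}

// add registers a series reference under every label pair it carries.
func (ix *Index) add(ref uint64, labels map[string]string) {
	for name, value := range labels {
		key := [2]string{name, value}
		pl := append(ix.postings[key], ref)
		sort.Slice(pl, func(i, j int) bool { return pl[i] < pl[j] })
		ix.postings[key] = pl
	}
}

// lookup resolves one equality matcher, e.g. foo1="bar1", to its posting list.
func (ix *Index) lookup(name, value string) []uint64 {
	return ix.postings[[2]string{name, value}]
}

// intersect merges two sorted posting lists; only series present in both
// satisfy both matchers.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	for i, j := 0, 0; i < len(a) && j < len(b); {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

func main() {
	ix := &Index{postings: map[[2]string][]uint64{}}
	ix.add(6, map[string]string{"foo1": "bar1", "foo2": "bar3"})
	ix.add(9, map[string]string{"foo1": "bar1", "foo2": "bar4"})
	ix.add(22, map[string]string{"foo1": "bar1", "foo2": "bar3"})

	got := intersect(ix.lookup("foo1", "bar1"), ix.lookup("foo2", "bar3"))
	fmt.Println(got) // [6 22]
}
```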
Now let us see how we use the index to query. Let us say I want to query, in Prometheus fashion, something that should fetch the series that have both labels foo1 equals bar1 and foo2 equals bar3. We take one matcher at a time: we look up foo1 equals bar1 and foo2 equals bar3 in the posting offset table, and now we know each one matches a particular posting list, and we have the references to where those posting lists exist. We take those references, look at the postings section, and we get the sets of series references that actually match these label values. Now that we have two sets of series references, you just intersect them, and finally you know that the series referenced by s6 and s22 match the query. This is a simple query that you can do on the TSDB. Now that you have s6 and s22 you take those references again and look at the series section, and you get the series, which tell you which chunk references you have; then you fetch the chunks and run the query on them, and depending on the time range you have queried, let us say from T1 to T2, you trim the chunks when giving the data back to the API caller. Going back to the diagram: in short, when querying we started at the posting offset table, with that information went to the postings, from there we went to the series, and we finally got the data.

This is about querying a single block; the index is specific to a single block. When you have to query multiple blocks you run individual queries against each block, and even the head block exposes the same interface of taking label names and label values and returning the matching series and samples. There is an implementation called the querier which queries each individual block and then merges the data together. This is where the bigger blocks help: with fewer, bigger blocks you look up fewer blocks to get the same data, and because series sometimes stay around for a long time, merging blocks de-duplicates those index entries. That is the use of having a bigger block. And it is not just the equality matcher: you can also do not-equals, or match a regex, or say if it does not match a regex, give me these results.

Finally we come to the tombstones. Tombstones are there to record the deletions that you make on a block, because blocks are immutable: if you actually had to delete data or series you would have to recalculate the entire index, since everything is addressed by byte offsets, and that is very inefficient. So when you get a deletion request, you see which series it affects and what time range the deletion asks for, and you record it in a file called tombstones, which says for this series reference, this time range is deleted, and so on. This file is usually small, so we do not really optimize here, and whenever you are querying, when you are looking at the chunks, you also cross-check against the tombstones and only return data that does not overlap with them.
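A minimal sketch of that cross-check, assuming tombstones are simply a map from series reference to deleted time intervals; the real file format and interval handling in Prometheus differ, this only shows the filtering idea:

```go
package main

import "fmt"

// Interval is a deleted time range for one series, in milliseconds.
type Interval struct{ Mint, Maxt int64 }

// Tombstones maps a series reference to the time ranges deleted from it.
type Tombstones map[uint64][]Interval

// deleted reports whether a sample timestamp for the given series falls
// inside any deleted interval, so the query layer can skip it.
func (t Tombstones) deleted(ref uint64, ts int64) bool {
	for _, iv := range t[ref] {
		if ts >= iv.Mint && ts <= iv.Maxt {
			return true
		}
	}
	return false
}

func main() {
	ts := Tombstones{6: {{Mint: 1000, Maxt: 2000}}} // delete [1000, 2000] for series 6
	for _, sample := range []int64{500, 1500, 2500} {
		fmt.Println(sample, "deleted:", ts.deleted(6, sample))
	}
}
```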
We have about 6 minutes left, so I will quickly cover a couple more things. We talked about the memory-mapped head chunks that we maintain for the in-memory part of the database; we have to see how we replay them on startup, and there is another artifact called a snapshot on shutdown which helps in this replay, as we will see in a moment. When we are replaying the data, you first replay all the head chunks, basically the compressed chunks that you have on disk, and once you have that you replay the write-ahead log; when replaying the write-ahead log, decompressing the log takes roughly half of the time and actually ingesting into the TSDB takes roughly the other half. With the help of the memory-mapped head chunks we can discard samples which already exist in the compressed chunks, which saves quite a bit of time, but replaying the write-ahead log is still a little slow. That is where snapshots on shutdown come in: everything that you would have to replay on startup, you snapshot when you are shutting down gracefully; you take a snapshot of all the series that exist, instead of recovering them from the write-ahead log records, and a snapshot of the in-memory chunks which have not been flushed to disk, in which case you do not have to go to the write-ahead log at all. You can simply skip the write-ahead log and just replay the head chunks and the snapshot; this way you can speed up the replay. There is also another component called the write-behind log, which is used for out-of-order ingestion of samples. In the process that I showed just now, every sample has to have a timestamp greater than the previous sample of that particular time series for the compression to work well, so for out-of-order ingestion we have another artifact, but because of the time crunch we are not discussing that. This was a very brief introduction; in 25 minutes you can only explain so much. I have a 7-part blog post which explains everything in detail. If you want the link to the slides you can scan the first code, and if you want the link to the blog post you can scan the second one. Thank you.

So we have 4 minutes for Q&A. Do we have questions for him? Wow, that was very well received. No? But I was quite intrigued; I had never actually seen a TSDB structure before, with write-ahead logs, which is a very common idea, right? I would like to add one more thing: this TSDB block format is not used only by Prometheus. There are projects like Thanos, Cortex and Mimir, which are distributed time series databases built as an extension to Prometheus for long-term storage. They use this specific block format to store their data; they literally use the TSDB code, it is open-source Go code, and they create the same kind of blocks. They have a kind of superpower in the sense that the blocks are distributed: in Prometheus everything is on a single machine, whereas in Mimir, Cortex and Thanos it is distributed, so when querying multiple blocks you can run each query on a separate machine and that way you can speed up the queries. So this TSDB block format is used by multiple popular distributed databases as well.