[Host] Thanks for coming. This is the first lecture in our seminar series on time series databases. The reason I'm hosting this series is that time series databases are the hot thing right now: all the new database startups are time series databases. So I have a big question: what's actually new? What's actually different? Why should we care? That's the goal we have this semester, learning what time series databases are about. Someone's going to make money. Someone always makes money, right? Hopefully me. The question is why. So we're excited to have our first speaker today, Paul Dix. He's the founder and current CTO of InfluxDB, a time series database company out of New York City, and they're probably the number one time series database that shows up on Hacker News. [Paul] We're actually out of San Francisco, technically; that's our headquarters. [Host] Okay. Where do you live? [Paul] I live in New York City. [Host] Okay, there you go. We're on a plane. Okay. So he's here to talk about InfluxDB, and specifically about the storage engine and the internals of their system. Okay, thanks Paul.

[Paul] Thanks. All right, cool. So yeah, today I'm talking about the storage engine that we built. There's a whole lot of stuff in Influx, but today I'm focusing mainly on a storage engine that we wrote from scratch specifically for time series data. InfluxDB is open source, MIT licensed, and written in Go. At least, it's open source for a single server: anything focused on what a single server can do is in the open source. We have a commercial product that adds high availability and scale-out clustering, but this talk is about the code in the open source, single-server stuff, so you can actually go check it out.

Because this is the first talk in this series, I figured I would set some context first. What is time series data? This is how I view time series data, so I'm not sure what the other speakers will say. Because I live in New York City, this is one that's very popular there: stock trades and quotes. That's time series data, right? The raw event stream of trades is a time series. Here we're looking at the stock price of Apple; we're not actually looking at all the raw trades, we're looking at a summarization of some underlying time series of events. It's supposed to be a graph. It's fairly washed out; I think it's the projector. Yeah, it's the projector. This is a big use case for us, which is metrics. Here we have a dashboard of server metrics and application performance metrics. That's all time series data. User analytics is time series data too, right? Those are event streams where you're counting things. This is a log from Apache. I view this as a bunch of time series: there could be 200 responses over time, 404s over time, requests to a specific page. You can basically break this up into a bunch of separate time series. And then sensor data: taking readings off physical sensors out in the world. I view server monitoring and sensor data as very, very similar, where in server monitoring your sensors are software sensors, and in IoT sensor data they're physical sensors deployed in the world. So for Influx we think about two different kinds of time series data. There are regular time series, which are basically samples taken at fixed intervals, right? One sample a minute, whatever.
Those are basically summarizations of something, right? And then there are irregular time series, which are largely event driven. If you have requests coming into an API, it could be the response time of each individual request, or trades in a stock market. The thing about irregular time series is that you can induce a regular one from an irregular one. Say you're tracking request response times and you want to say, okay, give me the min, the max, and the mean in 10 minute windows for the last 24 hours. You've created a regular time series out of an underlying raw event stream.

When I first started this project, a question I got often was: why would you want a database for time series data? You know, use a relational database, put in a time column, order by that, duh. But there are a few reasons. Scale is one of the most important, scale in terms of the amount of data you're dealing with. Taking an example from server monitoring, say you have 2,000 servers that you're monitoring, which is a moderately sized infrastructure. You take a thousand measurements per server (we've seen people taking anywhere from about 200 up to 2,000 measurements per server), and you're sampling these every 10 seconds. That works out to about 17.2 billion individual data points per day. You wouldn't want to stick that into one table. So at that scale, how you organize the data becomes important, and compression becomes very, very important.

We also want to be able to age out old data. It's common in our use case to have high precision data that you keep around for a small window of time, maybe seven days, and then lower precision data that are summaries of things. Say you have 10 second samples, then 10 minute summaries, then one hour summaries, and you want all of it online and fast. And the thing about aging out data: the naive way to do it in a database would be to just delete every record once it's aged out. But what that means is that once you hit the edge of the time window you're keeping, every write you do into the database also implies a delete, and that's a workload that really, really sucks for databases. So the way people hack around this in SQL databases is they'll create separate tables for each block of time and then just drop those tables to age data out. But ideally your database system would handle all of this for you automatically.

[Audience] If you're thinking about stock prices, for example, usually you think about a debugging sequence: things are going fine, something happens, now you want to look into the past and figure out what happened. [Paul] Yep. [Audience] However far back you go, all of that information may become pertinent. So do you fault really old information back in, or have you just re-recorded it somewhere? [Paul] In the future we will. We don't currently in our architecture. We do have people who keep all their data around for all time, and we have other people who keep, say, seven days worth of data and just drop the rest on the floor because they don't care.

So, automatic downsampling: you want to compute summaries for longer-term views of the data, say collect the min, max, mean, sum, and count automatically in different windows of time.
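To make the "induce a regular series from an irregular one" idea concrete, here is a minimal Go sketch of windowed min/max/mean downsampling. The types and names are mine, invented for illustration; this is not InfluxDB code.

```go
package main

import (
	"fmt"
	"time"
)

// Point is one raw event in an irregular series.
type Point struct {
	T time.Time
	V float64
}

// Summary is one bucket of the induced regular series.
type Summary struct {
	Start          time.Time
	Min, Max, Mean float64
	Count          int
}

// downsample buckets an irregular event stream into fixed windows and
// computes min/max/mean per window, yielding a regular series.
func downsample(points []Point, window time.Duration) map[time.Time]*Summary {
	out := make(map[time.Time]*Summary)
	for _, p := range points {
		start := p.T.Truncate(window)
		s, ok := out[start]
		if !ok {
			s = &Summary{Start: start, Min: p.V, Max: p.V}
			out[start] = s
		}
		if p.V < s.Min {
			s.Min = p.V
		}
		if p.V > s.Max {
			s.Max = p.V
		}
		s.Mean += p.V // accumulate here, divide by Count below
		s.Count++
	}
	for _, s := range out {
		s.Mean /= float64(s.Count)
	}
	return out
}

func main() {
	now := time.Now()
	pts := []Point{{now, 120}, {now.Add(3 * time.Minute), 80}, {now.Add(12 * time.Minute), 95}}
	for start, s := range downsample(pts, 10*time.Minute) {
		fmt.Println(start.Format(time.Kitchen), s.Min, s.Max, s.Mean, s.Count)
	}
}
```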
And the other thing you want to do is fast range queries, to return an entire series or to do a computation over a range of a series: get me seven days worth of data and give me one hour summaries of it.

So Influx is actually kind of like two databases in one system. The first is the raw time series store, for storing the time series data itself, and the other is an inverted index for matching metadata to the actual time series that we want to do computations on. I want to do some preliminary intro material on how Influx organizes data, or what the data looks like in Influx, because it's different from a SQL database where you have a table and all that other stuff. Everything is indexed by time and by the unique series that you're tracking. We organize the data into shards, which are basically just contiguous blocks of time. Because range queries are so common, we want that data to be organized next to each other so we can do quick range scans.

With InfluxDB, this is our line protocol: a text based protocol for writing data into the database. It's basically schemaless. You don't have to create a table and define its schema in advance. You just create a database and a retention policy, which tells the database how long you want to keep the data around, and then you start writing data in over HTTP. The line protocol looks like this. We have a measurement name, which is a string. We have tags, which are key value pairs where the values are strings. We have fields, where the keys are the field names and the values can be an int64, a float64, a boolean, or a string; we're adding support for uint64 soon. And then finally you have a timestamp. We actually represent the timestamps at nanosecond scale, and we do have people using nanosecond scale timestamps, surprisingly. Most of those are use cases like people doing quantum experiments where they're tracking time series data, and there are also high frequency trading firms tracking nanosecond timestamps on their network hardware. [Audience question] That's right, yeah, well, depending on the setup. We have one customer who has, I think, five data centers globally, and they have guaranteed less than 300 nanosecond clock drift across the five data centers; they shoot for less than a hundred nanoseconds. It's basically what Google has with the TrueTime thing in Spanner. So I'll get into this: the tags are indexed data, and I'll show some queries where we actually break that up. The values of fields aren't indexed at all. I'll talk about how it's actually organized on disk so this makes a little bit more sense.
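As an illustration of what that write path looks like from a client, here is a small Go sketch that posts two points in line protocol to a local InfluxDB over HTTP. The host, database name, and measurement/tag/field names are made up for the example; it assumes a 1.x server on the default port with a database that has already been created.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Two points in line protocol: measurement, comma-separated tags,
	// a space, comma-separated fields, a space, and a nanosecond timestamp.
	body := strings.Join([]string{
		`cpu,host=serverA,region=west usage_user=23.4,usage_system=5.1 1504365600000000000`,
		`cpu,host=serverB,region=west usage_user=11.2,usage_system=3.0 1504365600000000000`,
	}, "\n")

	// Assumes a local InfluxDB with a database named "mydb" already created
	// (e.g. via CREATE DATABASE mydb); the /write endpoint accepts raw line protocol.
	resp, err := http.Post("http://localhost:8086/write?db=mydb", "text/plain", strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // expect 204 No Content on success
}
```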
So let's look at how we might store this data in, say, a key value store. We have a series, which is just the string of the measurement name, the tag set, and the field name. We map that to some identifier, and then it's just tuples of values and timestamps, ordered by time, for the individual series. In a key value store we'd have the ID for series one, the time, and then the value; then separately series two, a time, and a value; and if we insert another point for series one, it goes in ordered by time. The important thing is that if we're using a key value store for this, having the key space be ordered is essential for this kind of organization scheme. That's what we used to do when we used other storage engines.

When we started the project, this was basically the model we were going on, and many storage engines have this model. Initially we used LevelDB, which is a log-structured merge tree out of Google. Over time we also tried RocksDB and HyperLevelDB, which are both forks of LevelDB, and LMDB and BoltDB, which are both copy-on-write memory mapped B+ trees. None of them were giving us what we really wanted, so in the fall of 2015 we decided to write our own storage engine. When I told our investors and other people that we were doing this, they thought I was completely insane, because the number I've heard is that if you want to create a new storage engine, it takes about a decade to get it to the point where it's mature and stable and you can trust it.

So the point is, first we tried LSM trees, and the problem was that deletes were too expensive. Then we created a separate LSM tree per block of time, but we ended up with way too many file handles open; people who had very, very large databases would blow up their entire system because they had too many file handles. So we thought, okay, let's try a memory mapped copy-on-write B+ tree, maybe that will be better for us. But the write throughput sucked; it wasn't even close to what LSM trees could do, obviously, because LSM trees are optimized for writes, and for the time series use case writes are very important. We also didn't get compression, which again for us was very, very important. So none of the popular storage engines actually met the requirements we had at the time. We wanted high write throughput, and we still wanted great read performance so we could query in real time; people want to query the database and get results back in ideally 100 milliseconds, because they're building visualizations on top of it and users are waiting for that data to return. We wanted better compression. Writes can't block reads in this system, and at the same time reads can't block writes; we can't have those locking each other up. We want to be able to write to multiple ranges of time simultaneously without it hurting the performance of the database. Hot backups are important; this was a big thing for us because LevelDB doesn't have hot backups. And we wanted to be able to have many databases open in a single process without blowing up the number of file handles.

So the storage engine that we created we called the Time Structured Merge tree, TSM tree for short. It's basically like an LSM tree, heavily inspired by an LSM tree. [Audience comment] Yes, exactly: trademark, but different.
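A tiny Go sketch of why the ordered key space matters for that layout: if the key is the series identifier followed by a big-endian timestamp, byte order equals (series, time) order, and a range scan over one series is just a sequential read. The encoding here is my own illustration (and assumes non-negative timestamps), not the old LevelDB-era format.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// key encodes (seriesID, timestamp) so that the byte-wise (lexicographic)
// order of keys matches series order, then time order within a series,
// which is what makes range scans on an ordered key space cheap.
func key(seriesID uint64, ts int64) []byte {
	buf := make([]byte, 16)
	binary.BigEndian.PutUint64(buf[0:8], seriesID)
	binary.BigEndian.PutUint64(buf[8:16], uint64(ts)) // assumes ts >= 0
	return buf
}

func main() {
	a := key(1, 1000)
	b := key(1, 2000)
	c := key(2, 500)
	fmt.Println(bytes.Compare(a, b) < 0) // true: same series, earlier time first
	fmt.Println(bytes.Compare(b, c) < 0) // true: series 1 sorts before series 2
}
```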
So the components of it look like this: we have a write ahead log, we have an in-memory cache, and we have index files on disk. That's very similar to an LSM tree: the WAL is the same, the in-memory cache is like the memtables, and the index files are like SSTables. And just like SSTables, our index files are written once and read only after that. Going through what a write looks like: a write comes into the system, we append it to the write ahead log and do an fsync so we're sure it's durable (it's just an append only file), and at the same time we write it into the in-memory cache. Then, periodically, we flush the in-memory cache into on-disk index files, which we call TSM files. The other thing we do is memory map all of these files, so we can access them like an array in memory.

The structure of the TSM files, the index files, looks like this. We have a five byte header that identifies what the file is, we have blocks of data, we have an index at the end, and then we have a footer which tells us where the index begins. The header is a magic four byte string and then a version byte, which we haven't bumped yet, but I suspect we're going to create version two of this thing; we just started development on it last week, so hopefully we'll have an alpha of version two by the end of the year. The blocks look like this: you have a collection of blocks, and within each block you have a CRC and then the compressed data. Then the index looks like this: we have a key length and a key, which is a string, remember, that time series key. We have the type, because we can have different value types; a lot of time series databases don't actually support all these different types, and it would have made the task a lot easier if we had, say, only supported float64s, but unfortunately no. None of the index is actually compressed; the only thing that's compressed is the block data, which is where the compression lives. I think I have slides on that in a second. The byte counts you see here are the actual on-disk byte sizes. So: the type, then the count, the min time, the max time, and the offset for the specific series you're looking at. For each series in a TSM file we know what time range of that series we have, we know how many values are in it, and we have an offset into the file, so we know where to go for the beginning of that series' data. Then finally we have the footer, which is just the offset of where the index is located in the file. This is also how you can jump around in it like an array: we can say, let's jump to this series and read data, let's jump to that series and read data.

So here's what a compressed block looks like. By default we compress up to a thousand values and timestamps into a single block. What that means is that even if you're only going to read one timed value from disk, we may have to decompress up to a thousand to get it, but in practice most of the time people are asking for a range anyway. So we have the type, we have the length, and we separate the timestamps from the values.
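Here is a rough read-along sketch of that file layout as Go types. The field widths are my approximations of what the slide describes, and the magic value is a placeholder; treat this purely as a map of the pieces (header, blocks, index entries, footer), not the actual influxdb tsm1 package.

```go
package main

import "fmt"

// Header: a 4-byte magic identifier plus a 1-byte version (5 bytes total).
type Header struct {
	Magic   [4]byte
	Version uint8
}

// Block: a checksum followed by a compressed run of up to ~1,000 timestamps
// and values, with timestamps and values compressed separately.
type Block struct {
	CRC  uint32
	Data []byte
}

// IndexEntry: records which series a block belongs to, the value type, how
// many values it holds, the time range it covers, and the byte offset of the
// block within the file.
type IndexEntry struct {
	KeyLen  uint16
	Key     []byte // measurement + tag set + field name
	Type    byte   // float64, int64, bool, or string
	Count   uint16 // number of values in the block (width is a guess)
	MinTime int64
	MaxTime int64
	Offset  int64
}

// Footer: the offset of the start of the index, so a reader of the
// memory-mapped file can seek straight to the index and jump from there.
type Footer struct {
	IndexOffset uint64
}

func main() {
	h := Header{Magic: [4]byte{'T', 'S', 'M', '?'}, Version: 1} // placeholder magic
	fmt.Printf("magic=%q version=%d\n", h.Magic[:], h.Version)
}
```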
In the summer of 2015, Facebook put out a paper called Gorilla, about their metrics system, and they talked about their compression scheme. It's floats only, and they use a delta-based compression scheme that interleaves the timestamps and the floats together. One of the reasons they do that is that it was designed to be in-memory and append only. Technically we can do append only, but we also can do inserts into previous blocks of time, so we had to design our system differently. And because we were putting up to a thousand values in a block, we wanted to separate the timestamps from the values, because for many of our use cases we can achieve really, really good compression on the timestamps. If we know we're taking a value every 10 seconds, we can use run length encoding to compress a thousand timestamps: you just need the start time and whatever the delta between values is.

Then, for the values, we use different compression depending on the data type. Timestamps are encoded based on the precision and the deltas. Like I said, we store timestamps down to the nanosecond scale, so if people actually have nanosecond scale timestamps that are far apart, we can't really compress them; we pick the compression for each block based on the shape of the timestamps we see. The best case, like I said, is run length encoding, where the deltas are all the same within a block. The good case is simple8b compression; there's a paper by Anh and Moffat on index compression using 64-bit words that describes it. And the worst case is falling back to the raw values, where we have to store the full 8 byte timestamps. For floats, as I mentioned, we use compression very similar to what Facebook's Gorilla paper uses; since our stuff is written in Go, we have a fork of a library created by Damian Gryski that does this. Booleans are just bits, those are easy. Int64s use double delta first, and if we can't do that, zig-zag encoding, same as protobufs. And for strings we just use Snappy. We've been thinking about adding dictionary compression to the database, but we'd have to test whether it's actually a win; if people are writing in strings that are basically an enum of different states or whatever, we might get something better there, but we haven't tried it yet.

So we're optimized for an insert only, append only workload, but you can update a record. For a value, you can view the unique key, the identifier, as the series key (the string of measurement, tag set, and field name) plus the timestamp at nanosecond scale. Only one value can exist at that specific combination, so if you write a value in with the same key at the exact same timestamp, that's basically an update. We write those and resolve them at query time on the fly, so updates can be expensive to resolve, and later that gets fixed up when we do compactions, which I'll talk about in a second. Deletes are very similar to LSM trees: we write a tombstone, we resolve the tombstone at query time, and then compactions later go through and clean out all the tombstones when they rewrite the data.

So, compactions. Our goal is to combine multiple TSM files into larger TSM files, and to put all of a series' points into the same file.
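As a rough sketch of the per-block timestamp encoding choice described a moment ago: look at the deltas within a block, use run length encoding when they are all identical, a packed encoding (simple8b in InfluxDB's case) when they are small, and raw 8 byte timestamps otherwise. The thresholds and names here are mine, not the actual tsm1 encoder.

```go
package main

import "fmt"

// chooseTimestampEncoding mirrors the decision described in the talk: inspect
// the deltas in a block and pick the cheapest encoding that fits their shape.
func chooseTimestampEncoding(ts []int64) string {
	if len(ts) < 2 {
		return "raw"
	}
	first := ts[1] - ts[0]
	allSame, allSmall := true, true
	for i := 1; i < len(ts); i++ {
		d := ts[i] - ts[i-1]
		if d != first {
			allSame = false
		}
		if d < 0 || d > (1<<60)-1 {
			allSmall = false
		}
	}
	switch {
	case allSame:
		return "run-length" // store first timestamp, the delta, and a count
	case allSmall:
		return "simple8b" // pack many small deltas into 64-bit words
	default:
		return "raw" // fall back to full 8-byte timestamps
	}
}

func main() {
	regular := []int64{0, 10, 20, 30, 40} // perfectly regular samples
	jittery := []int64{0, 9, 21, 30, 42}  // close, but deltas differ
	fmt.Println(chooseTimestampEncoding(regular)) // run-length
	fmt.Println(chooseTimestampEncoding(jittery)) // simple8b
}
```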
It's very, very common for our users to have, say, a million series that they sample once a minute. So you're only getting one data point a minute in each of those series, and obviously, if we didn't reorganize that data, we wouldn't be able to get good compression out of it. The point of the compactions is to take all those little bits of data, those individual points or small runs of points, and combine them together so we can get good compression and long runs organized together on disk. [Audience comment] Yeah, file systems; I call those dribble writes: the data comes in slowly enough that you can't keep enough of it in memory to write out big batches. Yeah, yeah, exactly.

I mentioned the one thousand points in a block. There are multiple compaction levels, like in LevelDB, so we can tell where the compactions are and how far along they've gotten. The other thing I mentioned is that the data is organized into what we call shards, contiguous blocks of time. If we have a shard for today, when it becomes tomorrow we create a new shard for tomorrow, and then we perform full compactions on the old data. Even though you can insert older, historical data, the most common case is that people are inserting data from now or very recently, so essentially four hours after a shard goes cold for writes, we do a full compaction to try to get as much of it together at once as we can.

The query language we designed for the database kind of looks like SQL, but it's not SQL; it's like a mutant of SQL. We're going to be changing this fairly soon. Well, we'll still support it, but we're moving to a functional query language, which I believe is actually better for working with time series data; but let's talk about this one. Here we're looking at the 90th percentile of a CPU measurement for the last 12 hours, from the western region, in 10 minute windows of time, and we're going to look at that for each host. So we're going to get a separate time series back for every single host we have in the western region, with the summarized data.

As I mentioned, the series key is just a string of the measurement, the tag set, and the field name. The question is how we map these little bits of metadata to the actual time series under the hood. This is where the other part of the database comes in, which is essentially an inverted index. Most people are familiar with inverted indexes from full text search: generally you're indexing a bunch of documents, and you match terms that appear in the documents to the IDs of the documents themselves. In our case, we match metadata about a time series to the actual series. Here's what that looks like. We have the series keys, which get mapped to some sort of identifier, so there's a lookup from series key to identifier. We have a lookup from a measurement name to the fields in that measurement. We have a lookup from a tag key to its different tag values, and another tag key to its different tag values. And then essentially you just create posting lists: a posting list for the measurement cpu, posting lists for host=a, host=b, region=west. If you have that, then when queries go across different tags and so on, you can do things like intersections of the posting lists, or unions, that kind of stuff.
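For reference, the 90th percentile query described above looks roughly like this in InfluxQL. The field name, tag names, and region value are my guesses from the description, not the slide itself:

```sql
SELECT PERCENTILE("value", 90)
FROM "cpu"
WHERE "region" = 'west' AND time > now() - 12h
GROUP BY time(10m), "host"
```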
[Audience] To clarify, this arrow: is that saying put it in this index and put it in that index? [Paul] No, no, that's just me calling out what that thing is. This does not say that host=a goes in index one. This is an array of series IDs. Say you're tracking memory from host=a as well; there are actually many memory measurements you'd track, but say you have one, and that series is series 23. Then host=a would have 1, 23: it holds the IDs of the series that host=a appears in. [Audience] So that's effectively getting you a pointer to a data structure on storage? [Paul] Yes, it gets us to the point where we know. So to do this query, what we actually have to figure out is: what are all the series keys, the actual underlying time series, that match it? Say we had just two hosts in this region, a and b. The first series key would be cpu,host=a,region=west with value as the field name, and the second would be cpu,host=b,region=west and value. We need to look up both of those time series to compute this. That's where the inverted index comes in: instead of document IDs, they're series IDs, so we can then jump over to the TSM files to look up that data.

The first version of this index, which is what's in the production system right now, the 1.3.5 release of InfluxDB, lives entirely in memory. It's loaded on boot from the raw TSM files: we look at those files, see what series keys exist, and build the index on the fly when the system boots up. The problem is that it's memory constrained; the more time series you have, the more memory you need to track all of this. The other thing is that when you have high cardinality, or a lot of data over time, it slows the boot time of the database, which we've seen. So the index we're building right now, and there's a preview release of it in the current version that you can turn on as a feature (we don't recommend people use it in production yet), takes this and converts it into something that's disk based. It's disk based and memory mapped, so we let the operating system handle what's in memory and what isn't, and we can jump around the files to do these lookups. Here's how it works: time series metadata comes in, and we check whether we already have it in the on-disk index, which as I said is memory mapped. If we don't, we write the new series metadata to a write ahead log and also put it in an in-memory index, and then there are periodic flushes of the in-memory index to files we're calling TSI files, time series index files. So again, it looks very similar to the structure of our TSM engine, and later there are compactions that combine these smaller index files into larger and larger index files.

[Audience] Why mmap? [Paul] Why mmap? Because we wanted to be lazy and let the operating system handle the paging for us. I don't know, actually; if anybody wants to help us figure it out... I've heard this one, yes. [Host] That's going on my gravestone: I want "don't use mmap in your database" on my gravestone. [Paul] Okay. Yes. So these files: this is what a log entry in the write ahead log looks like. The index files are also write once; after that you can only read them, you can't update them.
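Stepping back to the posting lists for a second, here is a toy Go version of the lookup machinery just described: each measurement and each tag value maps to a sorted list of series IDs, and resolving a query's predicates comes down to intersecting (or unioning) those lists. The IDs and names are invented for the example.

```go
package main

import "fmt"

// intersect merges two sorted posting lists of series IDs, the basic operation
// behind answering something like "measurement = cpu AND region = west": each
// predicate yields a posting list, and the intersection gives the series keys
// to go read from the TSM files.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

func main() {
	cpu := []uint64{1, 2, 7, 23}      // series whose measurement is "cpu"
	west := []uint64{1, 2, 5, 23, 40} // series tagged region=west
	fmt.Println(intersect(cpu, west)) // [1 2 23]
}
```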
[Audience] I'm having trouble understanding something, which is: when you do this query, do you always get back a pointer to an entire time series? Is there a concept of a join, like if I just want to find the corresponding data between two different series? What does the client have to do, and what does the database do for you? [Host] Repeat the questions for the recording. [Paul] Yeah, so the question is: do I always have to get back an entire time series, or is there a concept of a join in the database? Hold on, go back to this query. Here we go. Here we have this query. If I didn't have GROUP BY host here, and say we had two hosts, the underlying engine would look at those two time series, merge them together based on the times of the values, and give you the 90th percentile of the combined time series. [Audience] So if you didn't have GROUP BY host, it would return one time series? [Paul] Exactly, yeah. That's in the database engine. The query engine is a totally separate thing, and we're literally building a new query engine now for the new query language, which is functional, and in that one there will be other kinds of joins.

[Audience] And why would you say SQL is not functional? [Paul] Well, that's kind of outside the scope of this talk, but here's the way the new language is designed. I wrote a very long doc on GitHub about it; if you go to the influxdb repo and search for InfluxQL 2.0 you'll find it, although we're actually not calling it InfluxQL 2.0, because we have to keep supporting InfluxQL as it exists today, so this will be a new language. The way it looks is basically chained functions. It could look like Lisp if we wanted, but Paul Graham and Rich Hickey couldn't make Lisp popular, so I'm not going to be able to either. So it looks more like a set of chained function calls, kind of like you'd see in D3 or jQuery. You have a function, and conceptually it takes a data frame in and does something to it: the data frame is all the time series data, where the rows are the actual time series, the columns are the times, and you have the values along those axes. Then you call another function, which does a transformation on that data frame and returns another one. [Audience] So the way to think about the construction of your answer is imperative: it's not describing what the answer is. I've always thought of SQL as describing what the answer is, right? [Paul] Yeah. I'll show an example of what I'm talking about with the new language at the end of this talk; maybe it'll make a bit more sense. Okay, hold on, I went too far. Okay.

So, the index file layout. We have a series block, we have a number of tag blocks, and at the end of the file we have the measurement block and the offsets. This is what a series block looks like: you have a bunch of series keys, remember those long strings; you have a hash index, which I'll talk about in a second; and then you have file offsets telling you where each series key lives.
For each entry you see the length of the key, the key itself, and then the length of the next key; basically, offsets for where each key is stored in the file. Now, for the hash index we use Robin Hood hashing, which I had never heard of until earlier this year. There are some nice properties of Robin Hood hashing that make it really, really good for an index that is only going to be read from after you've constructed it: you can fully load the table, and you don't need linked lists or anything like that for a lookup. Like I said, it's great for read only hash tables.

It works kind of like this; hopefully I don't butcher the explanation. What I have here are three arrays: the positions, the keys, and what are called probe lengths, which I'll talk about in a second. In our example, say we're going to insert a into our lookup table. When we hash a we get an index of zero; whatever the hashing algorithm is doesn't really matter, we just want to know where in the table a is supposed to live. So we insert a at position zero. Next we insert b, and luckily b hashes to position one, so we insert it there; there's b. Now we insert c, and in this case c also hashes to one, so we have a collision. What we do is write c into the next slot over, and we mark c's probe length as one. Going forward, say we want to insert d, and d hashes to position zero. Okay, we run into a, which is there, so we go to the next slot over; nope, b is there, and we're at probe one. If we were to insert d here, it would have a probe value of one, but we look at b and it has a probe value of zero. Since b has a lower probe value, we take out b, insert d, and mark d's probe value; now we look for a new spot for b. We go to the next slot over and see that c is there; c has a probe value of one, which matches b's probe value, so b can't take c's place. So we insert b in the slot after that and give it a probe value of two. I think the reason it's called Robin Hood hashing is that the concept is essentially that you rob from the probe rich, where rich means close to zero, and give to the probe poor. What that gives you is a hash table where you can have an entry at every single position, and you get certain guarantees about lookups: how far down the table you have to go to find a match for the thing you're looking up.

One refinement you can do is record the average probe length. If you have a fully loaded hash table and you see that the average probe length is two, then whenever you hash a key to look it up, you add two to the position, and when you search, instead of just scanning forward, you fan out: forward, back, forward, back. So let's do a lookup. We're going to look up d. We said the average is one, so d hashes to zero, but we start at one: whatever the hash is, plus one, for the position.
The one other thing I didn't know is that you also keep track of the max probe length, which is super important: you search until you hit the max probe length, and if you hit the max probe length, you know you have a miss. The other thing we do now, which we didn't when I first made these slides, is keep a bloom filter to try to rule out misses as quickly as possible. So, d hashes to zero, plus one, and we find it right there. Now we look up z: it hashes to zero, that's not z, so we move the probe over; it's not z, we move again; that's not it either, and now we've hit the max probe value, so we know z isn't present in the table.
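Here is a toy Go version of the insert and lookup rules just walked through: on a collision, the entry that is further from its ideal slot keeps it, the displaced entry keeps probing, and a lookup gives up once it passes the max probe length. This is an illustration of the idea (fixed size, no resizing, no bloom filter, no average-probe refinement), not the actual TSI index code.

```go
package main

import "fmt"

// slot holds one key and its probe length (distance from its ideal position).
type slot struct {
	key  string
	dist int
	used bool
}

type table struct {
	slots    []slot
	maxProbe int
}

func newTable(n int) *table { return &table{slots: make([]slot, n)} }

// hash is deliberately trivial; any hash function works for the sketch.
func (t *table) hash(key string) int {
	h := 0
	for _, c := range key {
		h = h*31 + int(c)
	}
	return h % len(t.slots)
}

func (t *table) insert(key string) {
	i, d := t.hash(key), 0
	for {
		s := &t.slots[i]
		if !s.used {
			*s = slot{key: key, dist: d, used: true}
			if d > t.maxProbe {
				t.maxProbe = d
			}
			return
		}
		if s.dist < d { // occupant is "richer": steal its slot, keep probing with it
			key, s.key = s.key, key
			d, s.dist = s.dist, d
		}
		i = (i + 1) % len(t.slots)
		d++
		if d > t.maxProbe {
			t.maxProbe = d
		}
	}
}

func (t *table) lookup(key string) bool {
	i, d := t.hash(key), 0
	for d <= t.maxProbe {
		if s := t.slots[i]; s.used && s.key == key {
			return true
		}
		i = (i + 1) % len(t.slots)
		d++
	}
	return false // probed past the max probe length: definite miss
}

func main() {
	t := newTable(8)
	for _, k := range []string{"a", "b", "c", "d"} {
		t.insert(k)
	}
	fmt.Println(t.lookup("d"), t.lookup("z")) // true false
}
```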
The other piece we have in this indexing stuff is cardinality estimation. People frequently want to know: how many measurements do I have, or how many unique values do I have for this particular tag key? If host is a tag key, a great thing to know is how many hosts you have in your infrastructure. So we keep sketches for cardinality estimation, and for that we use HyperLogLog++, which is pretty cool. That's all I have. [Audience] What's the plus plus on HyperLogLog? [Paul] It's just refinements on HyperLogLog; the plus plus is the newer one. If you look up HyperLogLog++, there's a newer paper that refines the previous approach; I don't remember the paper's name. [Audience] What is HyperLogLog? [Paul] HyperLogLog is a way to do approximate counting over a bunch of strings; it's a sketch, an approximation. Depending on how many bytes you assign to store the sketch and how big your actual key space is, you get different levels of accuracy out of it, and HyperLogLog++ gives you some nice properties on that accuracy. And actually, at low cardinalities we don't use a sketch at all, we keep a precise count. So that's all I have on the storage engine stuff. [Host] You actually have a few minutes; do you want to show them the new language quickly? [Paul] Yeah, yeah, let me.

[Audience] How often do your users delete or update data? [Paul] Update, almost never. Delete, they sometimes do, but for the most part they use the retention policy stuff to age out the data automatically; deletes aren't very common. [Audience] So aging out automatically means that when you look at the page and realize it's been deleted, there's no entry put in there? [Paul] No, aging out automatically is basically where we just drop the files. [Audience] So there aren't tombstones for all the items in the shard? [Paul] No, no, that's why it's arranged in shards. Essentially, if we have a shard per day and we're only keeping seven days worth of data, once a shard is older than seven days we just drop all its files and update the in-memory structure to say that shard's gone.

Let me mirror my screen, because it'll probably be easier. This is going to be difficult; oh, there's a bunch of reading there. I probably have a lot of tabs open anyway, but good luck figuring out what I'm reading about. Okay, let me see, go here; this is going to be painful; here we go. So actually, in the new language, our new data model is essentially this: we're eliminating the idea of a measurement name and a field name, and we're just going to have tags, which identify a series, and then values and timestamps. So here, let me go down to some examples. This is all on GitHub already. We already have something kind of functional, and hopefully we'll release an early alpha to get feedback from the community pretty soon. Oh wait, hold on, I forgot, I intermixed things. [Audience] It looks like the Spark query language. [Paul] Yeah, basically. So here we go. Say we're doing this: we're selecting from database foo. We have a where clause, which is basically just this string here; it's an expression where we can do matches: equals, not equals, regex match, ANDs, ORs, parens, all that stuff. We have a range of data, so we're looking at the last 30 minutes. We window that data into 10 minute windows, and then for each of the windows in the series we compute a sum, and then interpolate basically inserts values into the missing windows. And in this case, this is an example where I'm doing a join to calculate an error rate: I'm joining on the host key, and the expression I pass is that I want to take the errors metric, divide it by the requests metric, times a thousand, and return that as a metric.

People frequently want to do these kinds of transformations on time series, and every time I tried to shoehorn some of this logic into our SQL style language, it didn't really feel consistent. We added limited subquery support earlier this year, and I was literally at a customer yesterday where they were complaining about how hard it was to figure out what the subqueries were doing and what the different syntax meant. Whereas here, each function can be represented on its own, as its own unit, where you have an input and an output, and for function chains like this it's easy for users to see what kind of data is getting returned at each point along the chain. When users are writing these queries, most of the time they're visualizing the data in a dashboard, so they kind of fumble around: they can conceptually picture the thing they want to see on the screen, but then they have to craft the query to get the data, eyeball it, and go, wait, that doesn't look right. So part of this is that I think it's a more elegant way to represent the things people are trying to do with this data, but I also think it's going to be easier for users to debug their queries and see what's going on at each step.

Let me see if I can show one more thing. This is actually going to change, but say I wanted to get the 90th percentile, the max, and the mean of the CPU readings for host a. It would look like this. What we're actually going to do is have a function called fork, where essentially you can view the query as a DAG: each function is a node in the DAG, and the data flows down through it. Fork splits the DAG into n branches, so you can say, this is the source data here, and on that source data we want the percentile, the max, and the mean. That kind of pattern is very, very common for the dashboarding stuff, because people want to see the band of performance for something: the high, the low, and the mean, that kind of thing.
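To give a feel for the chained-function style being described, here is an invented sketch of what such queries might look like. This is illustrative syntax only, pieced together from the description above; it is not the actual proposal in the influxdb repo.

```
// Hypothetical syntax, for illustration only.
select(db: "foo")
  .where(exp: {"host" == "a" AND "_measurement" == "requests"})
  .range(start: -30m)
  .window(every: 10m)
  .sum()
  .interpolate()

// The fork pattern described for dashboards: one source, several summaries.
select(db: "foo")
  .where(exp: {"host" == "a" AND "_measurement" == "cpu"})
  .range(start: -12h)
  .window(every: 10m)
  .fork(percentile(p: 90), max(), mean())
```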
Cool, awesome. Thank you. [Audience question about write throughput] Yes is the answer, but it depends on the shape of your data. If you have a very large instance on AWS, say 10,000 values per batch, and you only have a couple hundred thousand time series, we can get close to a million values per second going in. That starts to drop as you increase the cardinality, because of the way the storage engine is designed right now: we sort those keys as we do compactions and things like that. So what happens is, if your cardinality blows up super high (right now around 20 million unique time series is probably the upper bound for a single server, and I'd say closer to 15 million; you can certainly have more, but then it affects your ability to keep up with ingest), and your ingest rate is, say, a million values per second across 20 million series, the flushes of the in-memory cache to disk start falling behind, because we block while sorting those keys. There's a setting in the database for the max WAL cache size, and when you hit that it starts rejecting writes. We're doing some work to improve that, which I think we can, because we're sorting the series keys where I don't think we need to. The other thing we're going to do is that we don't currently assign series keys a unique ID at this stage, we store the full series keys on disk in those blocks, and we're going to update it to assign a unique ID to each series key.

[Audience] For a single series, do you accept data from different sources? [Paul] Yes, but that's fairly uncommon as far as I know. If you have a sensor out there in the world, the combination of the measurement name and the tag set should uniquely identify all the measurements coming off that sensor, and usually only one thing is sending that data. Even if you're tracking requests to an API, you could have 100 app servers all serving that same API, but as long as you put the host in the tags, those would all be separate series. Then at query time you can merge them all together on the fly to get the performance of your API across n nodes, or you can look at individual nodes as well. [Audience question about ordering] Timestamps can be out of order. I guess it's optimized for things that are append only and in order. The most common thing we see is timestamps for different series coming in out of order: for the sensor data use case, people are frequently transmitting data over GSM or something like that, so they'll collect a bunch of samples and then transmit once an hour or once every four hours. Those time series may be collected at different intervals than other time series, and that's fine, the database works really well with that; there's no requirement that the points are actually ordered as they come in. [Host] Thanks. Thanks again. [Paul] Thank you.
[Host] So next week we'll have Karthik from Streamlio, who is a creator of Heron, the streaming system at Twitter. He's going to give a talk, same time, same location. I'll see everyone in a week.