Hello, my name is Jeff Tao. Thanks, everybody, for coming to my talk. I'm the CEO and founder of TDengine. TDengine is an open-source project I started more than four years ago. Today I want to talk about the new data model I used for this time series database. First of all, I hope everybody sitting here knows time series data, right? May I ask you a question? Have you ever used or heard about a time series database? OK, quite a lot of you. That's good.

So there are so many scenarios for time series data. For example, energy: wind turbines or solar plants generate lots and lots of data. Also transportation, logistics, even smart manufacturing: PLC or SCADA systems generate lots and lots of data from the automation side. And connected vehicles, even connected bicycles, generate tons and tons of time series data. IT monitoring, finance: that's all time series data.

So why do we collect time series data? The first purpose is to monitor change over time, right? You always look at the trend, just like the financial market: it looks good, or it doesn't. Also, you want to predict. For example, in smart manufacturing they always want to do predictive maintenance: at what time do I need to maintain this equipment? For example, an elevator: when should I schedule maintenance for the elevator? When should I schedule maintenance for my car? Another big use case is anomaly detection. And of course, sometimes we want to analyze the correlation between measurements. For the first four points, everything we do with time series data is about providing insight into operations, like IT operations, or the operations of a smart factory.

Why did I want to start a new time series database? Almost seven years ago, at my startup, we were working on smart devices, and smart devices generate tons and tons of data. I found it very hard to process such a huge amount of data. Then I found InfluxDB, and I found problems and issues with it. There were already a few time series databases on the market, OK? But I found they were not good enough. Why not? Because they don't take full advantage of the characteristics of time series data. Let me explain what I found about time series data.

Number one, of course, the data is always timestamped. Number two, all the data generated by sensors and devices is structured data. Most of the time it's just a floating-point number, or some integers, or a state. That's the case for smart metering and manufacturing, in many, many cases, OK? Number three, every device, every sensor, is a data stream. They generate data at fixed intervals, like every one second or every ten seconds they produce a data point, right? And each stream is independent, OK? You can have one million data streams. For example, for smart meters, in the whole of Spain there are more than ten million smart meters. Each smart meter is a data stream, and each smart meter is independent of the other smart meters. Number four, the data rate is very stable. Once you know the sampling rate and the number of devices, you know exactly how much traffic they will generate, right?
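Just as a back-of-the-envelope illustration (the numbers here are mine, purely for illustration): ten million smart meters, each reporting one data point every 15 minutes, produce a steady stream of roughly 11,000 rows per second, around the clock, with almost no peaks or valleys.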
So unlike the Christmas holiday season, when the traffic to Amazon may be ten times higher, right? Or a popular game, where the traffic can spike ten times. For IoT, for smart manufacturing, for most time series scenarios, the data rate is very, very stable.

Number five, compared with a standard database, transactions are not required, OK? Because there are no changes, no updates, you don't need rollback. That's a big, big advantage, OK? Number six, there are far more write operations than read operations. With time series data, people seldom check the raw data by eye, unlike social media. When you post something on LinkedIn or Twitter, many people view your tweets or posts, right? But for time series data, like the data generated by smart meters, few people will ever look at the raw data. It gets checked only by analytical tools, OK? So the access pattern is different. Number seven, as I said, the data is rarely deleted or updated. Number eight, there is always a retention policy, because you don't want to keep the data for too long. Maybe you just want to keep it for one month, three months, or just one week, OK? Then you want it deleted automatically. Number nine, a big characteristic of time series data: real-time computing is required. Without real-time analytics, IoT or smart manufacturing is almost useless, because with real-time analytics they want to raise an alert, to tell you something is happening, right? And the last characteristic: the query is always over a time range and a space range. When you query the data, you always have a starting time and an ending time, right? And you always want to restrict the location. For example, I just want to see the smart meters in the whole of Spain, or in Bilbao, right? So you always have some range, OK?

There are already many time series databases on the market, many of them open source. The most popular one is InfluxDB. Another one is TimescaleDB, which is built on PostgreSQL, OK? And there is our TDengine, and also QuestDB, Prometheus, OpenTSDB, VictoriaMetrics, many of them, OK? So what's special? Why did I want to build a new time series database? Because when I checked the data models, I found there are a few data models used by these time series databases.

The first model is a very popular one called the tag-set data model. Every time series is uniquely identified by a metric name and a set of tags, OK? It's used by InfluxDB, OpenTSDB, and Prometheus. Let me give you an example here, for a vehicle, for a car. In Prometheus or OpenTSDB, a time series for a connected vehicle could be written with the vehicle identification number and then some tags, like brand and model, right? In InfluxDB, a time series for a connected vehicle can be written in a similar way: the metric, the vehicle identification number, the brand, the tags, right? So every time series is uniquely identified by the metric name and the set of tags, OK?

But there is another model, the relational data model, just like MySQL or Oracle, a regular database. For this model, the schema is always defined first, OK? There is always a timestamp column, and each metric has a dedicated column with a data type. For example, TimescaleDB, because it's based on PostgreSQL, uses one table to store the tags and another table to store the time series data.
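A minimal sketch of what such a two-table layout might look like, following the connected-vehicle example (the table and column names here are my own illustration, not TimescaleDB's actual schema):

```sql
-- Tag (dimension) table: one row per vehicle, holding the static attributes.
CREATE TABLE vehicle_tags (
    vin   TEXT PRIMARY KEY,  -- vehicle identification number
    brand TEXT,
    model TEXT
);

-- Time series table: one row per data point, linked to the tags by vin.
CREATE TABLE vehicle_metrics (
    ts    TIMESTAMPTZ NOT NULL,
    vin   TEXT REFERENCES vehicle_tags (vin),
    speed DOUBLE PRECISION,
    soc   DOUBLE PRECISION   -- state of charge
);
```

The static attributes live once per vehicle in the first table, while every data point lands in the second.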
So when you want to query all the data for the Tesla Model 3, the SQL statement needs a join, right? You need to join the table that stores the tags with the table that stores the metrics. You need to join two tables together, OK? QuestDB also uses this model. So TimescaleDB and QuestDB both use the relational data model.

When I examined all those databases, I decided to propose a new data model. It's still based on the relational model, but mine is a little bit different. I gave it a name: one table per data collection point, or just one table per sensor. If you have one million smart meters, in my data model you need to create one million tables. If you have one billion smart meters, and in China, for example, there are over one billion smart meters, then you need to create one billion tables with my data model, OK?

So what's the benefit of creating one table per smart meter, one table per sensor? Let me give you a very simple example; just look at smart power meters. You may have multiple smart meters, like devices 1001, 1002, 1003, right? Each smart meter collects metrics like current, voltage, and phase, and each smart meter has some tags, like location and type. If you design your data model with MySQL or any ordinary relational database, most developers will create a schema like this: the first column is the device ID, the second column is the timestamp, then the metrics, right? And you will probably create an index on the device ID and an index on the timestamp. It's very straightforward, right?

OK, but if you look at this design, there are some problems, because every device has a different network latency, OK? Because of the network latency, the timestamps cannot be guaranteed to arrive in order; they may be out of order, OK? And each smart meter has a different data pattern. The power usage in your home is totally different from the power usage in my home, right? So the data patterns get mixed together.

If instead I create one table per data collection point, things become very straightforward. Just look at one table for one device: the timestamps will be in order, because although there is some network latency for each device, the relative order within one device can be guaranteed. So when you write a new data record, it becomes an append operation. The data ingestion rate can be much higher, right? You just append the data to each table, so it's much faster. Also, you don't need to save the tags in each row. The tags can be saved just once, so you save storage. And another thing I'll talk about is data compression: it's much easier to compress the data, because the data pattern within each device is very similar, OK?

So let me summarize the benefits of this design: records are automatically sorted by time; writing new data is a simple append operation; there is less fluctuation in column values; and the tags are stored only once per device instead of once per row, OK?

How do I store the data for each table, for a single table, a single device? I store the data block by block, OK? Maybe each block contains around 1,000 data points. On the slide here I only show about six data points, OK? But inside each block, the timestamps are already in order. Of course, every table will have many blocks, so I create a block index.
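To make the next two points concrete, here are the two kinds of per-device queries I will refer to (illustrative SQL; t1001 stands for the table of one smart meter, and the dates are arbitrary):

```sql
-- 1. Raw-data scan over a time range for one device.
SELECT ts, current, voltage
FROM t1001
WHERE ts >= '2023-01-01 00:00:00' AND ts < '2023-01-02 00:00:00';

-- 2. Aggregate over the same time range.
SELECT AVG(current), MAX(voltage)
FROM t1001
WHERE ts >= '2023-01-01 00:00:00' AND ts < '2023-01-02 00:00:00';
```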
So once you tell me the starting time and the ending time of a query like the first one, I know how to locate the blocks right away, right? And there is another beautiful thing for time series data: for each data block, I already have pre-computed aggregates. For example, the count (how many data points are in the block), the sum, the max, the min for each block; they are already stored there. So for a query like the second one, when you just want the average or the max of a time series, I don't need to scan the raw data. It's very, very fast, OK?

Another good thing: each block also has a schema defined, because sometimes your schema can change, OK? In front of each block, I have the schema. If your schema is changed, I just create a new block, but within each block the schema is always the same, OK? For IoT, for smart manufacturing, even for logistics, in many cases your schema can change, but it won't change every second, right? Your schema may stay constant for days. It is subject to change, but not that often. So each block has a single schema, OK?

Now let me look at how we store the data. It's column-based storage. A standard database stores data row by row, right? Row by row, OK? I think everybody knows this. But for our time series database, we always store column by column. So I store the timestamps first, then the current, then the voltage, then the phase, right? Every time series database, and also ClickHouse and many OLAP databases, uses column-based storage. But TDengine can still achieve a much higher data compression ratio than other column-based databases. Why? Because with one table per device, the data is almost the same, you know? For example, the power usage at your home: of course there are some fluctuations, sometimes you turn your air conditioner on, sometimes you turn it off, but over a period it's almost the same, right? If you mix your home's power usage with the data from my home, my home's data is totally different from yours, and even with column-based storage it's harder to compress, OK? So before compressing, we always separate the data by device first, OK? With one table per device, the data doesn't fluctuate that much, so it's much easier to compress. That's how we achieve a very high data compression ratio: by using one table per device, OK?

OK, so let me summarize. By using one table per device, we can ensure that the read and write efficiency for a single data collection point is the best. I don't think you can find a better way to read or query the data for one device than one single table. One table, one device is the best way to ensure the best performance for a single device, OK?

But there is a big, big challenge with this data model. As I just mentioned, if you have one million smart meters, you need to create one million tables. If you have one billion smart meters, you need to create one billion tables. And each table has some tags, right? Sometimes you want to aggregate the data together. For example, I want to aggregate all the smart meters in Bilbao, right? Maybe in Bilbao there are one million smart meters. How do you do the aggregation? It becomes complicated, OK? How do you solve this problem?
That's a big, big challenge for my data model. Then, luckily, I had another idea. It's called the supertable. The supertable is designed for efficient aggregation across tables, OK?

So what is the concept of a supertable? A supertable represents one type of data collection point. It's a template, a kind of category. For example, smart meters are one category, OK? Teslas are another category. Or maybe elevators are another category of devices. For each category of devices, you create one supertable, because they all share the same data schema, you know?

OK, now let me give you an example. I create a supertable for the smart power meters, OK? You just create a table, meters, with the schema: timestamp, current, voltage, phase. But compared with a standard database, I have an extension called tags. Tags are static attributes, like the location, OK? And the type; here I have just those two tags. Now I use the supertable meters as the template to create the tables for, say, six smart meters. Just look at the syntax below: create table t1 using meters, and for this smart meter I specify the tags, San Jose, California, type one. The second one is Palo Alto, California, type two. So I use the supertable as the template but specify the tags per table. Compared with a standard database, you can associate a set of tags with each table, OK? The schema of each table is defined by the supertable, but each table carries its own tag values.

Now, when you want to aggregate the data, you don't look at the individual tables; you just query the supertable. For example, to compute the average voltage and the maximum current of all smart meters in San Jose, California, you just say: select avg, max from the meters supertable, and specify a tag filter condition. That's it, right? So whenever you want to do aggregation, you query the supertable, the category of devices, OK? Then everything becomes much simpler from the user-experience point of view, OK?

Also, a tag can have a tree-like structure, and you can associate multiple tags. TDengine can support up to 128 tags, and each tag represents a different dimension. For example, you can have a tag for the location, a tag for the model, a tag for the kind of business: each smart meter may serve a home, a small business, or a big business, which are different, right? You can associate many, many tags, and then you can do multi-dimensional analysis, OK?

Now, how do we keep it efficient? From the user's point of view the design is very simple: for aggregation you just query the supertable. But how do we make it fast internally? Look at this diagram. I always store the tags in a separate store, the tag data, and I store the time series data in a different store. I always separate the tag data from the time series data, unlike a NoSQL database. For example, HBase is a typical KV store, right? You always have a key, and, as with InfluxDB or Prometheus, the key is just a metric name plus some tags, right? So they mix them together. But in our design, we want to separate the tag data from the metric data.
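Before I explain how we execute this efficiently, let me put the whole smart-meter example together as SQL. This follows TDengine's supertable syntax; the tag values follow the talk, and I rename the type tag to avoid a possible keyword collision:

```sql
-- One supertable per category of device: the schema plus the tag definitions.
CREATE STABLE meters (ts TIMESTAMP, current FLOAT, voltage INT, phase FLOAT)
    TAGS (location BINARY(64), mtype INT);  -- mtype: the meter's type tag

-- One sub-table per smart meter, created from the template with its own tag values.
CREATE TABLE t1 USING meters TAGS ('California.SanJose', 1);
CREATE TABLE t2 USING meters TAGS ('California.PaloAlto', 2);

-- Aggregation across a whole region: query the supertable, filter on the tags.
SELECT AVG(voltage), MAX(current)
FROM meters
WHERE location = 'California.SanJose';
```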
Now, to execute a query like that last one, the smart meters located in San Jose, California, I go to the tag data first to find out which devices, which tables, I need to search. The tag data volume is small, because each smart meter has only one row there, right? If you have one million smart meters, the tag data has only one million rows. But in the time series data, each smart meter may have one million data points; with one million smart meters, you have one trillion data points. So I always go to the tag data first, find out from your tag filter condition which tables you want to look at, and then I go to the time series data. So the design looks like a dimension table and a fact table: the tag data is like a dimension table, and the time series data is like the fact table, OK? That makes it efficient. So we solved the problem with the supertable concept; the aggregation becomes very efficient, OK?

Now, what do you get from this new data model? Let me show you the benchmark, OK? For the benchmark report, I compared TDengine with InfluxDB and TimescaleDB, OK? We used the open-source TSBS benchmark suite. TSBS was put forward by InfluxDB and TimescaleDB; they both use it for their benchmark reports. Look at the data ingestion rate: TDengine is at least 1.5 times faster, and in some cases even 10 times faster, with the same data set. For query response, it's at least 1.2 times faster, and in some cases 40 times faster, OK? So the ingestion rate is higher and the query response is faster. Look at the disk usage: because we compress the data very well, in the worst case we are still about 1.2 times better than InfluxDB, and we are much, much better than TimescaleDB. In some cases our compression is even 10 times better than InfluxDB or TimescaleDB, OK? Look at the server CPU usage: ours is much, much lower; we don't consume much CPU, OK? You can download the benchmark report from our website. We also provide the test scripts, so you can run the tests yourself, OK? To verify our report. So it means our new data model really works, OK? Our new data model is more efficient, OK?

I also want to take some time to talk about scalability. Ours is a distributed design, just like many other NoSQL databases, like Cassandra. There was a Cassandra session before mine, so I won't say too much about that part; I don't have much time left. We divide the data into shards, and each shard is called a vnode. There are multiple data nodes, and each data node can host vnodes. A vnode contains time series data, and each vnode can have three replicas to provide high availability. Besides the vnodes, we also have a management node, and every node reports its status to the management node. Our design also separates compute from storage: we have query nodes just for computing, so the computing power can be adjusted dynamically.

And we divide the data along two dimensions, OK? You have a big, big pie of data, and we cut it in two dimensions. One dimension is time, OK? Each file contains data for maybe only one week, or one day, or even one month, right? The other dimension divides the data by devices, by sensors.
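You can see the same two dimensions at query time. Here is a sketch of a typical downsampling query in TDengine syntax (the one-minute window and the time range are just for illustration):

```sql
-- Per-device, per-minute averages: PARTITION BY tbname splits the data by device,
-- INTERVAL(1m) splits it by time, the same two dimensions the storage uses.
SELECT _wstart, tbname, AVG(current)
FROM meters
WHERE ts >= '2023-01-01 00:00:00'
PARTITION BY tbname
INTERVAL(1m);
```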
Each shard only contains the data from a subset of sensors, you know? So it's very natural to divide the data along these two dimensions, and then it's easy to conquer the big-data problem, OK? Our design also solves the high-cardinality issue. In our own testing we can prove that we support over one billion tables. One billion tables, without any performance issue, OK? And the whole system can restart within one minute. Yeah, the high-cardinality issue.

Another thing I would like to address: when I started the TDengine project, I wanted to differentiate our database from InfluxDB, TimescaleDB, QuestDB, the other time series databases. I added more features into our database, OK? It's not just a time series database anymore, because for time series data, besides the database, you always need a message queue. In smart manufacturing, in IoT, once you collect the data, many applications want to consume it, right? So you put Kafka there. You also always need Redis for caching, because, for example, I want to know the current reading, the current location of each Uber car, right? You always put the latest data in Redis, OK? And you need stream processing for real-time analytics, so you put Spark or Flink over there. So even for a very simple time series workload, the whole system becomes very complicated. My idea was to combine those four pieces into one single piece, into TDengine. It reduces the complexity of the whole system, and it reduces the operating cost, OK?

Another good thing: I open-sourced the whole code, OK? In 2019, I open-sourced the standalone edition. In 2020, I open-sourced the cluster edition. InfluxDB doesn't open-source its cluster edition, but we open-sourced ours. Last year, just one year ago, we even open-sourced our cloud-native edition, OK? Nobody else has open-sourced a cloud-native edition. Up until now, we already have over 21,000 stars and about 4,700 forks, a very active project. I'm very proud of this, because every day more than 500 new instances start running, all over the world. Every day. Every day, more than 1,000 clones of our open-source project on GitHub. I hope you guys can check our GitHub page.

Also, because we are backed by venture capital, we still need to survive. How do we survive? Through our cloud service. We provide the cloud service on AWS, on Azure, on Google Cloud. After this talk, you can try our cloud service. It's free. And you can compare TDengine with InfluxDB or TimescaleDB or whatever, OK?

Yeah, so that's my sharing, OK? That's my talk. That's the QR code for my LinkedIn; I hope you will connect with me. And if you like, you can follow me on Twitter. I'm a developer, actually. Although I'm the CEO of this company, I'm a developer. For the first prototype, I spent two months at home writing the first prototype to prove my data model works. I wrote almost 20,000 lines of code to prove my data model works. Then I started to raise money from venture capital. Now I have a team of almost 100 employees, OK? So, any questions? I'm a C programmer, OK? I'm not a Java programmer; I write C code, OK? You guys, OK, please. Yes. Oh, no, no. We have built-in functionality: built-in cache, built-in stream processing, built-in data subscription.
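For example, with the built-in cache, the latest reading is always one query away. Schematically, with the meters supertable from before (illustrative TDengine SQL):

```sql
-- The most recent record across the meters; TDengine can answer this from
-- its in-memory last-row cache instead of scanning the disk.
SELECT last_row(*) FROM meters;
```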
We don't want to replace Kafka, and we don't want to replace Flink, because they are very powerful. But for time series data, we have a much simpler solution to make it work. Take caching: Redis is a generic caching tool, right? Which data do you want to cache? That depends on your read pattern. But for IoT data, for time series data, you almost always just want to cache the last record, the latest data. So it's much easier to implement, you know? And for data subscription, if I have time, maybe I can share more in another meeting. The whole design, everything, is based on the binlog, on the WAL, the write-ahead log. On top of the WAL we provide data subscription very easily, and on top of the WAL we provide stream processing for time series data. So it's a very simple solution for time series data. Yeah.

Oh, Prometheus is a very, very good product for DevOps. But Prometheus doesn't have good scalability, OK? And Prometheus only handles floating-point numbers; it doesn't handle strings and that kind of stuff. And Prometheus doesn't handle out-of-order data. For IoT, for smart manufacturing, you have to handle out-of-order data, but for DevOps there is not a lot of out-of-order data. Also, for DevOps it's always metrics; they don't need to handle strings, OK? So Prometheus is very good for DevOps, OK? But we focus more on the industrial internet and on IoT. Yeah, OK. Any more questions? I really like your question. Yeah, a benchmark against what? Oh, no, I don't. I cannot get their executables. Yeah, if they provide me a free one, I can. Yeah, oh, OK, that's good.

So, we provide very good high availability, OK? Each shard, each vnode, has three replicas. Even if one node is down, as long as two nodes are working, the whole system is still working. Of course, we also have a very good monitoring platform. We use Grafana, and it will give you an alert saying one node is down, you'd better bring it back, right? But there are still two nodes working, so the whole system is still working. Yeah, so if one node is down, there are still two nodes working, and the whole system keeps working. Then, when that node comes back, its data will be synced first. Yeah, we use Raft; we just use standard Raft for data replication. Raft, yeah.

OK, anyway, I've almost run out of time. You guys are welcome to connect with me on LinkedIn. I spend most of my time in California, OK? My office is in San Jose, California, OK? OK, thanks a lot, OK? Yeah.