Welcome to today's PostgreSQL Conference webinar, "Store and Analyze Petabyte-Scale Time-Series Data with PostgreSQL and TimescaleDB." We're here with Ryan Booz, developer advocate at Timescale, who will discuss what time-series data is, why you probably have a lot more of it than you realize, and the insights it can help you uncover about your projects, apps, users, organization, and beyond. My name is Lindsay Hooper. I'm one of the PostgreSQL Conference organizers, and I'll be your moderator for this webinar.

A little bit about your speaker: Ryan is a developer advocate at Timescale, prior to which he worked for more than 17 years as a developer, DBA, and product manager at multiple ISVs delivering SaaS products based on time-series data. Ryan is a long-time DBA, starting with MySQL and PostgreSQL in the late 90s, and he spent more than 15 years working with SQL Server; it's that SQL Server community spirit he hopes to bring into the growing PostgreSQL conference world. So, welcome, Ryan. With that, I'm going to hand it off. Take it away, Ryan.

Good afternoon and good evening, everybody. Thank you again, Lindsay, for the introduction; it's exciting to be here. It's been fun, particularly as I've come back into the PostgreSQL community over the last several years, to see so many community conferences and so much involvement continue to grow. That's just fun to be a part of, and it's been a great time to be at Timescale and see how we can help with some of that.

I was telling Lindsay before we came on air that so much has happened at Timescale over the last couple of years that when I went looking for slide decks about various features, a lot of them were two and a half or three years old. We've done lots of talks about individual features, but the "here's Timescale, here's what we're doing" deck hadn't been refreshed, and that's really a testament to what's happened through the pandemic and, honestly, to the amount of development we've done over the last two or three years. A lot of the things we said were coming have now been here for two, two and a half, three years. So it's fun to talk to you about them today and see how they might be of use to you.

In case you don't know, it's all over the airwaves today: this is February 1, the beginning of the Chinese New Year, and it's the Year of the Tiger, which we couldn't be happier about, because our mascot is a tiger. We're excited at Timescale to see what happens this coming year.

Here's the high-level roadmap of what we're going to talk about: What is time-series data? What is TimescaleDB, and how does it help? What are the main features that can really help you accomplish more with the massive amounts of time-series data all of us are accumulating month by month and year by year? And then I'll attempt a demo of a smattering of features in two different ways, which I'll show you as we go, just to give you an idea of how this works in real life, and how it might help you even when it's not strictly about time-series data.

We talk about time-series data a lot at Timescale, as you'd expect, and we describe it as recording every data point as a new entry. That's one of the things that really transformed near the end of the 2000s as we entered 2010 and 2011.
Companies realized this great need to store more and more of their data, not just the latest observations, and that's when the need for database systems that could store all of that became a new subset of what we were doing, whether we were in relational databases or not. Because we're recording every new data point, we're really recording the change in whatever's going on around you, whatever your application or the data you care about is maintaining. And time-series data really is everywhere. One thing I've recognized over the last 15-plus years in application development is that even for the things that aren't constantly changing, we always needed to know at least when they happened. Even if it was just a user joining, that timestamp was valuable. So even data that doesn't change often is still time-series data; there's value in knowing the point in time that it occurred.

We're used to all kinds of time-series data, whether it's stock charts or things like air quality; we see these kinds of things all over the place. This slide is Prometheus data: observational data in our space has exploded. In every community, every company and network, people are worrying more and more about observations and how they can track and use that data. And all of us know these charts: if you've been alive the last couple of years and can see a picture, you have seen these pictures. We know what these graphs show us and tell us. So it's part of everything we do.

We've also come to realize there's time-series data we don't even think about as time-series data. We did a fun little investigation last summer: the NFL made time-series data available from all of its games; this is from, I think, the 2018 season. For every single play, every player and the football has a sensor, and it tracks something like six times a second. Even that is time-series data, and even though it was inconsistent and not what you might expect, we were able to identify some fun things because of it, which is really fun and cool.

If you read some of our blog posts, you'll hear this often from Timescale: the more we see and the more we observe, all data is time-series data. Whether or not you specifically agree with that, I hope you'll at least agree that all the data you're currently working with is probably related in some way to time-series data. And that's the part that has become the need. To be honest, it's why I'm at Timescale: I was working on an application where we were using a NoSQL database, and we felt that data silo pain, which we'll talk about in just a minute.

As we entered 2022, the term being coined more and more is the "data-intensive application": there's time-series data, there's machine learning, there's all kinds of need for lots and lots of data. But what does that look like? One of the definitions I found recently, which I think really fits how we can help in the time-series and relational database realm, comes down to three main points. You want a data-intensive system to be reliable. It absolutely needs to be scalable; that's a no-brainer, we all know that's a need.
And then it's got to be maintainable, and I think that's where many people have struggled to find the right database for the work they're doing with their time-series data.

Take the existing databases; again, I've been working with relational databases for more than 20 years, I love SQL, and I love relational data. But even at one of my previous companies, before we really had an easy way to do some of these things, thinking about storing billions of rows of data in (at the time) SQL Server, really any of these options felt like: how am I going to manage that? It's just bigger than I know what to do with; there's something cumbersome about it. So even existing databases that we know and love, like Postgres, weren't necessarily built to handle this kind of data at scale without a lot of additional knowledge and effort to maintain and manage it; it requires a higher skill level.

Out of that need to have something that could scale, a lot of NoSQL databases have been the choice for the last 10 or more years, and they do some things well, but most of the time we find we have to sacrifice something else, and often it's query complexity. My quick story: I've been at two time-series companies, one handling utility data and one doing predictive-maintenance vibration data. We would have terabytes and terabytes of data in these databases, but getting business value out of it was really, really difficult; we had to write programs to merge all of those things together, and all I wanted to do was write a SQL statement over this stuff. So that's one of the things we've sacrificed.

There are certainly other databases coming online now with a lot of really excellent features that serve a need well, but depending on your application or need, you're still going to sacrifice something. Some of these new hybrid SQL databases are really excellent OLAP-style databases that do some things really well, but you don't get things like transactional integrity and relational integrity; they're not designed, at this point, to be really excellent at relational data in the way we're used to with transactional relationships.

That's where Timescale came into the picture five or so years ago; the founders of the company saw exactly the need I just described. They were building an IoT platform and realized: why are we putting our IoT data in one database and all the business data in another? Wouldn't it be great if we could put the two together? Why has no one ever done this? How could we help that be more manageable and reliable? And so Timescale was born.

Let me talk through those three points, and then we'll get to the features themselves. First, reliability. We make a pretty big point, very consistently, that we're built on Postgres, because we really love what Postgres has done over the last 20 years. We get to inherit, as so many other applications and extensions within the Postgres ecosystem do, from all that great work, and we get to contribute back to that mission. Talk about reliability: having that kind of foundation underneath you is really exciting.
We then take on the schema management layer and do some things transparently that you would otherwise have to know how to do on your own; quite honestly, you would most likely need a number of full-time people to do it, particularly as the database grows. And because of what Postgres provides, we can actually enhance the way queries are executed on this large-scale data and provide other features, which we'll talk about in just a minute. We can inject into the query planner when it's a time-series query on the schema we help you maintain, and we can query those partitions, those parts of your data, better than Postgres by itself, because we can optimize for certain circumstances. Really awesome.

All of that culminates in being able to say we really are Postgres with enhanced functionality. We're not doing anything that Postgres doesn't allow us to do; we haven't co-opted anything uniquely special in how it functions. We've provided new functionality for this specific time-series problem set, but everything you know about SQL you can use with TimescaleDB: all the tools, all the ways you know how to write SQL will work. Then we provide time-series-specific functions and procedures, and automation under the covers, to help you maintain and get the most out of this data.

Another way I think we're reliable is that we're an open-source code base. We're on GitHub; you can see everything that's happening, including upcoming releases. You can see the code, you can see what's going into it, and you're welcome to open a pull request. I took this snapshot a few days ago, and there are a few pull requests there; certainly some of them are ours, and some are from users. It's just been fun to see this done in the open and to get that feedback from users in an iterative, agile way.

It's also reliable because it really works with all things Postgres. You have that knowledge and ability to say, "I've been doing this with Postgres, I'm using this toolset"; there's a 99.9% chance it's going to be extremely easy for you to transition to TimescaleDB if you need it. There's always going to be one outlier somewhere that does something unique, and we can help you through that, but it's really an easy transition, and that's always exciting.

The last thing I've really come to appreciate over the last couple of years, when it comes to Postgres and the reliability we're built on top of, is the community. As Lindsay said, I was part of the SQL Server community for many years, and I think they do an excellent job as well, but the Postgres community is different in how it functions. There's so much happening worldwide, and there are so many companies; I just grabbed a few logos that came to mind, and there could be hundreds of others on this slide. People giving back to the community, finding ways to make Postgres better, providing content to help you learn Postgres. I think that's an excellent part of what makes Postgres reliable. We love Postgres. This is Eon, by the way; Eon is our mascot, and you'll see more of him.

So, why TimescaleDB over vanilla Postgres? I'm going to get into the features here in just a second, and I actually put this slide in because there was a question recently in the Postgres community Slack.
"Why would anyone use Timescale over Postgres for this kind of work?" I think what we're going to get into here hopefully answers that question. My response at the time was that the person asking is probably one of the five most well-known Postgres people in the world. He's very talented and has been using Postgres for 20 years, so he might never have a specific need for this, because he just knows what he needs to do. But coming from my background, I needed to specifically think about scalability, and maintaining a system at this scale, with these features, and knowing the queries will run the way they should, is worth many full-time-equivalent people unless you have some kind of technology to lower that bar of entry.

So the first real bit of magic here is automatic time partitioning, and you can do space partitioning as well. There are reasons you'd do one over the other. We've generally found over the years that if you're running TimescaleDB on a single node (one server), space partitioning doesn't tend to help that much; you tend to create more partitions than you really need, and that can have ramifications you didn't expect. Where space partitioning is essential and really becomes necessary is in a multi-node database cluster, which we'll talk about at the end. That means you can spread your data across three, five, 10, 25, 30 servers, and we have users doing that. We have a couple of users in the networking space doing network monitoring, and I know one is up to something like 35 data nodes at this point, so it's really fun to see that happen.

Because we do this automatic partitioning, we can track the partitions being created, and since we're doing some of that work for you, we can do things Postgres can't do on its own. For example, we can exclude chunks based on constraints, like a time constraint using now(), that normally couldn't be eliminated ahead of execution. now() isn't a stable predicate at planning time (it changes every time you run the query), but we can step in and exclude the chunks you're actually going to query based on time, saving work that would otherwise not be possible without additional setup.

One of the main benefits of doing this automatic partitioning for you, in both time and space depending on the need, is that there's a lot we've done within the query planner itself to help execute your queries more efficiently. That does a couple of things. Number one, if you have years of data and hundreds of partitions (what we call chunks), we can exclude many of them very quickly, so the planner never even has to look at them. You save time by not even touching that data; if it were all in one table, all of it would potentially have to get pulled into memory, depending on what's happening with the query.
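To make the hypertable idea concrete, here's a minimal sketch of what that setup looks like in SQL. The table and column names (a hypothetical `conditions` table keyed by `time` and `device_id`) are made up for illustration; the calls are the standard TimescaleDB API of this era.

```sql
-- A regular Postgres table to hold raw readings (hypothetical schema).
CREATE TABLE conditions (
    time        TIMESTAMPTZ       NOT NULL,
    device_id   TEXT              NOT NULL,
    temperature DOUBLE PRECISION,
    humidity    DOUBLE PRECISION
);

-- Turn it into a hypertable, automatically partitioned by time into
-- one-week chunks. Inserts, indexes, and queries still target the
-- "conditions" table as if it were a normal table.
SELECT create_hypertable('conditions', 'time',
                         chunk_time_interval => INTERVAL '7 days');

-- On a multi-node cluster you would typically add space partitioning too,
-- e.g. hashing device_id across a number of partitions:
-- SELECT create_hypertable('conditions', 'time', 'device_id', 4,
--                          chunk_time_interval => INTERVAL '7 days');
```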
So chunk exclusion saves you that work, and then, because the indexes are stored at the chunk level, you get a second benefit: you're querying less data with indexes that are actually smaller and can be pulled into memory more quickly than one large, fragmented index. So there are a lot of ways that helps on large sets of data.

I have to tell you, this actually just came to me last night: I remembered we had an Iron Man Eon, and I was super excited. When I tell people about TimescaleDB, the hypertable (that automatic partitioning, the abstraction over the partitions we call chunks) really is the foundation, the arc reactor of TimescaleDB. It's what powers so many of our other features. There's much more beyond it, but a lot of the features we're going to talk about today are possible because we let you create hypertables and manage the data that way. That foundation of the hypertable really is the underpinning of scale; there's more we'll say about scale, but that's the base.

On maintainability, honestly, a lot of this comes from being part of Postgres. We provide automated policies for many of the things we're going to talk about over the next few minutes. Once you create a data retention policy, for instance, it just runs on its own; you don't have to think about dropping data at certain times to make sure you have enough space on your disks. We provide automation, where it makes sense, in a very easy fashion, and we give you visibility into what's happening with those policies. Data retention and archiving: if you know how to do it with Postgres, you can do it with TimescaleDB. Proven backup and restore: we work with all of the typical Postgres methods for those things, and even replication; you can set up replicated servers because, again, TimescaleDB is Postgres, so you're able to use the things you know and love to make that work.

I'm going to get into the features in just a second, but I'll pause here for a minute to see if there are any questions about what I just talked about. No questions yet, but this is a great reminder, everyone: get those questions into your chat. Great.

So what are those TimescaleDB superpowers? We like to say we are Postgres with superpowers, specifically for time-series data, but more and more for data in general, and you'll see what that means. I'm going to talk about six specific things; there are more than this, but we only have an hour, so I'll go quickly to give you an idea of where we are and how you might begin to explore them.

The first is that, because we have hypertables and we're managing that partitioning and chunking for you, we can keep track of and operate on those partitions of data. One of those operations is compression. To the best of my knowledge, we're the only Postgres extension that provides the kind of compression we do; there is one other that has enabled compression in the last year, but they do it a little differently. We actually compress based on the actual data type, so we try to make sure we're doing the best compression possible for your data.
And there are a few other things we do there. Continuous aggregates: that's so essential for time-series data. Once you have billions of rows, it's really hard, even with the best servers, to get instant responses over relational data at that scale, and an aggregate is usually what we want anyway: give me the usage per hour over some period of time. If you're running that query over and over again, there's no reason we can't pre-compute it for you and store it, in a way that's different from materialized views, which we'll talk about in a minute. Data retention gives you policies to drop data and save space when it's appropriate. User-defined actions: I'll show you what those are; it's a built-in job scheduler, which Postgres still doesn't have on its own. There are a couple of other extensions that let you do something similar, but we actually rely on this job scheduler ourselves, and we've opened it up to you as a user to run your own jobs. And then, of course, multi-node: for a slide deck that talks about petabyte scale, it's not going to happen on a single server, so we'll talk briefly about what multi-node is and how you could start to use it. Last, added somewhat at the last minute, hyperfunctions: I'll review what they are and how we're really expanding the catalog of hyperfunctions, which is something you can install in your own databases. Some of them work best with the way we do time-series querying, but they're not all tied to hypertables, so some of them you could play with in any database if you wanted.

All right, let's roll through these, starting with compression. The goals were really to reduce storage size (that makes sense), but to do it in a way that you can still maintain ingest rates: if you had to transform the data into a new form on the way in, that would slow down ingestion, so we did it in a way that gives you the best of both worlds. We wanted to do it by reusing Postgres: we didn't want to put a new storage system in place, because what Postgres already does is excellent, so why not take advantage of it? We've already talked about the automated policies, and we wanted it to be transparent to you as a user: when you run a query, you shouldn't have to worry, most of the time, about whether the data is compressed or not. The way the query has to be executed is different, but you shouldn't have to worry about that; we take care of the implementation.

At a really high level (there are other videos and talks we've given on this, and lots of great documentation, which I'll show you at the end), what we essentially do is take your uncompressed data, which is in row form (a row store: row by row, all of the columns of a row together), and for each chunk of the hypertable you're compressing, we take up to 1,000 rows at a time and transform them into arrays of column-store data. Rather than reading a row of data, you can now individually select 1,000 values of a column (or millions of values of a column) without having to retrieve entire rows anymore.
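As a rough sketch, enabling that on a hypertable looks something like the following; the table, the segment-by column, and the 30-day cutoff are hypothetical choices for illustration, not recommendations.

```sql
-- Enable compression on the hypertable and describe how rows should be
-- grouped (segmented) and ordered inside the compressed column arrays.
ALTER TABLE conditions SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby   = 'time DESC'
);

-- Add a policy so chunks older than 30 days are compressed automatically
-- by the background job scheduler.
SELECT add_compression_policy('conditions', INTERVAL '30 days');
```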
Storing the data as column arrays lets us use compression algorithms like delta-delta and similar techniques to get really, really good compression ratios without sacrificing much query latency. It's transparent: generally you set up a policy, you tell us how you want the data compressed and at what age it makes sense to compress it, and from then on that data is queried in a slightly different way. That does have some ramifications. Honestly, it will often be faster for the kinds of queries you typically run on older data; it will not necessarily be faster for SELECT *, because you're asking us to take all of the columns, recompose them back into rows, and hand them back to you. So there are nuances, but usually you're looking at older data more narrowly; we like to talk about narrow-but-deep queries. If you look at this graphic, at any given time you have uncompressed chunks (usually your newest data) and compressed chunks (usually your older data), and that gives you both kinds of data storage in one table.

The row store is really good for "select everything": you want to index lots of columns, you want updates and deletes. The column store is really good when the data isn't going to change anymore and you aren't doing lots of inserts. You can insert into compressed chunks, but it is slower because we have to transform the data a little, so setting the compression boundary a little way back into your history is usually best. It's great because the data ends up being much smaller, and you'll see some of that in just a second.

What does that mean in practice? The row store is usually really good for newer data profiles: I need to see all of the columns for the most recent timeframe because I'm monitoring many things at once. Column-store data is really good when you do lots of aggregates over specific columns: what's the total fare and passenger count over the last six months for a specific kind of ride? For that, I can pull millions of values out of just those two columns, it's much faster to do that work, and since it's array data we can actually do some of those calculations a little faster too. That's compression, and it's available on any TimescaleDB hypertable.

Continuous aggregates are really one of the flagship features many people come to Timescale for; I'll be honest, it's the one I originally came to Timescale for. We had, as I said, terabytes of data, and we really needed those 20-minute and one-hour views, because querying all of that raw data over a long period of time, even from our NoSQL database, was just arduous. That's what continuous aggregates let you do: take all of this raw observation data and, on a schedule you define, pre-aggregate it into a new view that we keep up to date. What's different about these (something Postgres does not have yet) is that if data changes at some point back in history, and you've set up the policy appropriately, we will refresh the materialized view, but only the portion that changed. In Postgres, if you set up a materialized view you can refresh it, but there's no incremental refresh; you are always refreshing the entire view at once.
That means two things: it takes longer (even though Postgres has introduced parallel refresh), and it also means you can't keep the history, because every time you refresh that view you lose what was there before. With continuous aggregates, the view becomes a historical data store for your aggregate data, as long as you don't specifically tell us to get rid of it. You can have an aggregated view and drop the raw data at some point in the future, when it no longer makes sense to keep it, and now you have another layer of history inside your database. It's a really great feature, and one that's just going to keep getting better, quite honestly.

One thing I'll point out is that a continuous aggregate actually uses the hypertable, chunk-based architecture; it's a different form of a hypertable. So you get some of those same benefits: because it's chunked, you can set up a data retention policy that will eventually drop even the materialized data. For a lot of people it makes sense to work that way: you might say, "I want to keep my raw data for six months, I'm going to aggregate everything to one hour, and maybe after a year I don't even need that aggregate anymore," so you can drop the aggregate data off the end too, perhaps after backing it up. And as I already said, it only refreshes the regions that have changed, so you don't have to refresh the entire thing.

One more thing to add here, since we won't have much time for it at the end: with our new hyperfunctions, some of them are aimed at letting you take an aggregate, like a one-hour aggregate, and re-aggregate it into new forms that still maintain correct values. That doesn't work for something like a plain average: if you average to one hour and then later average those one-hour values into one day, you get a different value; it just doesn't work that way. So we're providing functions that let you do two-step aggregation, where you can aggregate to 15 minutes and then use that same view to aggregate to one hour, or one day, or what have you. Some really fun stuff happening there.

Creating a continuous aggregate looks just like creating a materialized view, except for one small piece. It does require the time_bucket function, which we'll talk about in a minute; that's our function for interval-based bucketing of time values. You then set up a refresh policy, and it runs in the background on the schedule you give us; in this case it runs every hour, checks whether any data has changed within the period you told it to watch, and if so, refreshes that part of the continuous aggregate. And finally, since it's a view, you can create other views on top of it, just like any other view in Postgres; I simply took that view and made it my new source. You'll see that in just a few minutes, in real life.
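Here's a minimal sketch of that pattern, again using the hypothetical conditions table from earlier; the one-hour bucket, the schedule, and the retention interval are illustrative choices, not recommendations.

```sql
-- A continuous aggregate: looks like a materialized view, plus the
-- timescaledb.continuous option and a time_bucket() in the SELECT.
CREATE MATERIALIZED VIEW conditions_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket(INTERVAL '1 hour', time) AS bucket,
       device_id,
       avg(temperature) AS avg_temp,
       max(temperature) AS max_temp
FROM conditions
GROUP BY bucket, device_id;

-- Refresh policy: every hour, re-materialize anything that changed
-- between 3 days ago and 1 hour ago.
SELECT add_continuous_aggregate_policy('conditions_hourly',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');

-- Because the aggregate is itself chunked, it can have its own
-- retention policy, independent of the raw hypertable.
SELECT add_retention_policy('conditions_hourly', INTERVAL '1 year');
```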
Now, data retention. As I've said, the architecture is chunk-based: a chunk is created for a time interval, and you can always add more chunks in TimescaleDB. Some databases can't do that; I don't mention any out of ill will, I promise, but for instance one of the issues with Amazon Timestream when it was released two years ago was an actual limit: you could not insert data older than a year. I believe that has since changed, but there's no such limitation in TimescaleDB. When you insert data, we look at its date (the column you've told us to partition on), and if there's no chunk for that range, we create a new chunk for you; you don't have to worry about any of that. There's another Postgres extension starting to provide some partitioning features, but it's not automated in that way. So this is just really transparent.

The nice thing is that because data retention works by dropping chunks, there's no table bloat: you're just dropping what is really a child table. It can be done on hypertables and on continuous aggregates, and you can use it as your data ages: maybe you start with raw hypertable data, at some point you compress it and maintain that for a while, and eventually you back it up into cold storage like S3 and say, "I just don't need to query this raw data any longer; it's too old to be that valuable." All of this happens transparently, with policies we let you set up very easily.

Then multi-node. This has been one of the big pushes we've been on for the last three-plus years, quite honestly. The way multi-node works, at a high level, is that it's many TimescaleDB server instances connected through commands we give you. They're individual servers; one becomes the access node, and that's where you do all of your data access, and you can have many, many data nodes behind it. You get what's called a distributed hypertable. It works and acts generally like a hypertable: it has chunks, it has data retention, it has compression, it has all of those features; it's just spread out across many servers. And when there are query forms that allow us to push predicates down to the data nodes, we do that, so you get even faster queries at the scale of data you're normally dealing with when you get to multi-node. If you're querying billions of rows, it's usually going to be faster to push that query down to many nodes, let them work on their own partitions, and send the results back to be assembled and returned to you.

Multi-node gives you efficient time and space partitioning, and it significantly increases ingest performance. I mentioned one of our users has at least 30 data nodes (the last I heard, about three months ago, it was around 30), and the last time I talked to them they were ingesting about 1.6 billion rows of data a day. They're just having a grand old time with it. It really does help, because you're spreading all of that ingest across many servers, which is great.

Continuous aggregates on multi-node are coming soon. And we're just about to release, on our hosted environment (the functions will be there for you to do it on your own as well), cluster rebalancing: you can add and remove data nodes, so if you need to add a new data node to keep expanding that distributed hypertable, there are functions that let you move chunks between servers to spread the load back out and keep the balancing consistent. That's coming in 2022; we've been hard at work on it.
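For a sense of what that looks like from the access node, here's a rough sketch; the node names and hosts are placeholders, and this reflects the multi-node interface as it existed around this time.

```sql
-- Run on the access node: register the data nodes that make up the cluster.
SELECT add_data_node('dn1', host => 'dn1.example.internal');
SELECT add_data_node('dn2', host => 'dn2.example.internal');
SELECT add_data_node('dn3', host => 'dn3.example.internal');

-- Create a distributed hypertable: partitioned by time, and hash-partitioned
-- by device_id across the data nodes (this is where space partitioning matters).
SELECT create_distributed_hypertable('conditions', 'time', 'device_id');

-- Queries and inserts still go through the access node exactly as before;
-- predicates that can be pushed down are executed on the data nodes.
SELECT device_id, avg(temperature)
FROM conditions
WHERE time > now() - INTERVAL '1 day'
GROUP BY device_id;
```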
One of the last things before I get into a couple of quick demos is user-defined actions. This may sound silly, but before I came to Timescale this was actually one of the things I was most excited about. In SQL Server, I had become accustomed to creating server-side jobs, whether for maintenance or for small ETL-type things I needed to do on a regular basis, and I was always frustrated that this was kind of hard to do in Postgres. We were already looking at Timescale, and when I heard they were opening up their job-scheduling capabilities, that was one of those geeky things I got genuinely excited about. That's really what this is: it's the engine we use behind the scenes, and as soon as you install the TimescaleDB extension you have access to the jobs engine. You can create your own stored procedures to do anything you want, within reason, in whatever language; if it runs on the server, it should be able to run from the job engine, so you could call a Python script or an R script if you needed to. You can set it to run on some interval and pass it a custom configuration. There's a lot you can do with it, and I'll show you how I'm actually using it today for this demo. Some of the examples we talk about: data archiving; small ETL-like jobs (maybe you ingest something once or twice a day into a staging table and would normally have an external script running in Lambda or somewhere like that to process it; you might be able to do that right in the database now); custom downsampling, ways to further downsample data on your own; things like that. Lots of fun opportunities.

The last thing to mention briefly is hyperfunctions. I really don't have time to cover them today; I'm going to show you one. I'd simply say we put out a number of posts in the second half of 2021, and some really exciting things are happening. If you have time-series or analytical data and you use Postgres, you should go check out the project; we call it the TimescaleDB Toolkit. It's installed by default in most of our Docker images, and if you host your database with us it's just there by default. It is a separate extension, so if you install TimescaleDB in your own database you'll have to install the toolkit separately, but it gives you so many fun new ways to query your Postgres data: percentile approximations, statistical aggregates, and the counter aggregates are super cool if you work with a lot of counter data. For me, back when I was doing utility work, utility meters roll over all the time, and we always needed to do counter-style queries and aggregations. And honestly, one of the really cool things we're developing more and more (you may have seen a blog post about it) is the ability to do functional, pipeline-style queries: piping the output of one function into the next. If that's the way you're used to working with data, it's fun and neat to see how you can build some of those same patterns using the toolkit. Check it out; I'll show you one or two of these very quickly today as we go to the demo.
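As a sketch of the user-defined action idea described above: the procedure name, its body, and the schedule here are all made up for illustration, while add_job is the actual scheduling call.

```sql
-- A user-defined action is just a stored procedure with this signature;
-- the jobs engine passes in the job id and an optional JSONB config.
CREATE OR REPLACE PROCEDURE archive_old_rows(job_id INT, config JSONB)
LANGUAGE plpgsql AS $$
BEGIN
    -- Hypothetical ETL step: copy rows older than the configured age
    -- from a staging table into an archive table, then remove them.
    INSERT INTO readings_archive
        SELECT * FROM readings_staging
        WHERE time < now() - (config->>'older_than')::interval;

    DELETE FROM readings_staging
        WHERE time < now() - (config->>'older_than')::interval;
END
$$;

-- Register it with the built-in job scheduler to run every hour.
SELECT add_job('archive_old_rows', '1 hour',
               config => '{"older_than": "1 month"}');
```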
So, I'm going to switch over to the demo. If there are any questions, I'm open to them at this moment.

Yep, we've gotten one, great: can PostGIS and TimescaleDB be used concurrently to manage spatial queries of time-series data?

Absolutely, yeah, it can; we have a number of users doing that. That's actually funny, because the first database I'm going to show you is the quintessential taxi-cab dataset a lot of people use to demonstrate querying at scale; it's almost two billion rows of data, and it has the geospatial markers in there for the cabs, pickup locations, and what have you. Honestly, I knew we'd be short on time, so I didn't put those queries in here, but we do have a tutorial using this data (an older form of it). So absolutely, you can use PostGIS; you can do lots of stuff. I know Ryan Lambert is on this call: if you've never read Ryan's material, go look up RustProof Labs. He does excellent posts on how to use PostGIS, and he's done some things with TimescaleDB as well. So the answer is yes, and depending on what you're doing, certainly reach out and we can take it further.

Wonderful, and we actually got one other question that just came in, going back to an earlier slide: if we can define chunk-local indexes, will TimescaleDB take the indexes of the main table, or do we have to specify the chunk indexes manually?

Awesome question. Everything you do through the hypertable is transparent, so when you create an index on the hypertable, we transparently apply it to all the chunks. We do provide functions to do specific things to individual chunks if you need to, and you can work at the chunk level, but everything is transparent: you work with the hypertable as if it's a regular Postgres table (because it is one), and we make sure everything else gets pushed down appropriately.

Wonderful, thank you.

So, we have about 20 minutes. I always say this: I don't think this will take 20 minutes, so hopefully it spurs a question or two. Here's what I'm going to do. I have two datasets here. One is the taxi-cab dataset; if you've never looked it up, New York City makes this data available, and a number of years ago a gentleman did an analysis on what was, at the time, about a billion rows of data, and about twice a year he updates that project. This copy currently runs from around 2010 until July of last year.

One of the first things I want to point out, and I think it's something people don't know about TimescaleDB, is that not all of the functions we provide are hypertable-specific. Some are, like drop_chunks, which you can't use without hypertables, but we also provide functions like approximate_row_count, which fills a gap in Postgres: the only other way to get a count is SELECT count(*) FROM the table, and that's very slow in Postgres. If I did that here (and this isn't even a big server; it's a 4-CPU, 16 GB machine), it would take the rest of this call, churning through the entire table to count all the rows. approximate_row_count uses statistics instead, and there's nothing special you need to do to run it on your tables. I ran it in 72 milliseconds, and that's probably more like three milliseconds of real work; I'm using DBeaver, and it adds a lot of graphical time to draw the results. There are 2.7 billion rows of data in this table.
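For reference, the call is just the following; the rides table name follows the taxi demo, and it works on any table with reasonably fresh statistics, hypertable or not.

```sql
-- Fast, statistics-based row count; avoids a full scan of the table.
SELECT approximate_row_count('rides');

-- Compare with the exact (and, on billions of rows, very slow) version:
-- SELECT count(*) FROM rides;

-- Keeping statistics fresh (autovacuum usually handles this) keeps the
-- approximation accurate:
ANALYZE rides;
```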
So that's just an example of the functions that come as part of the extension and can help you do your work. I forget that sometimes, and then I see someone using one and think, "I never thought about that." Even time_bucket isn't a hypertable-specific function; I just never think of it that way because I've only ever used it in the context of hypertables, but it's a function that takes a timestamp in, so you can use many of these functions without a hypertable at all. Many of them are hypertable-specific, for sure, because we're doing all that management for you, but a lot of people don't know about approximate_row_count: you can install the extension, never create a hypertable, and still use it to get counts on your tables; it uses statistics to make it fast.

This is a little query I wrote to see the compression, because I did compress this table, and it's an example of TimescaleDB compression: there are 1.7 billion rows of data (with a B) that took 372 GB before compression. Those chunks are now compressed down to 94 GB, an overall difference of almost 75%. We saved about 75% of the disk space by compressing this data. You can see I'm following best practices here: most of the chunks are compressed, specifically the older data, and the more recent chunks are not. Generally those recent chunks are the ones close to now, so I want them uncompressed so I can quickly run wider queries against them; that's how this is currently set up.

So we can query through this. If you've never seen the data, it has pickup date and time, fare amount, number of passengers, things like that. Here's one of the first things I thought was kind of cool: notice I'm going to query a range that's compressed, so this data is in a compressed chunk. This query touches about eight million rows, and with caching and DBeaver overhead it takes a second or two. The interesting part is the result: that's the total number of rides for that one month. For the next month, does anyone know what was going on around May of 2020? There was a pandemic. We're down to about 400,000 rides in that month, a huge drop, and we can see it right in the data, which I think is just really fun. And again, that's also a compressed chunk. Now here we're in an uncompressed chunk, and it works just the same; I didn't change anything, Postgres is just doing its thing, and we can see ridership starting to come back last year: almost 2.6 million rides.

So we can just run queries like we normally would. In this case I'm looking across eight or nine million rows, and in about half a second we get the sums and counts by rate code ID and so forth. And again, this is a really small server; the table is around 100 GB even compressed, and the server only has 16 GB of memory, so all of this happens by swapping data in and out from disk. Because most of it is compressed, it's really efficient to go get that data off disk and pull many, many rows at once out of just the columns we need.
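A rough sketch of the two kinds of queries just shown (checking compression savings and counting rides per month) might look like this; the rides table and pickup_datetime column mirror the demo dataset, and exact column names in the public dataset may differ.

```sql
-- Built-in stats on how much space compression is saving on a hypertable.
SELECT pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes)  AS after
FROM hypertable_compression_stats('rides');

-- Rides per month across a time range; time_bucket does the
-- interval-based grouping, and chunk exclusion limits what gets scanned.
SELECT time_bucket(INTERVAL '1 month', pickup_datetime) AS month,
       count(*) AS total_rides
FROM rides
WHERE pickup_datetime >= '2020-01-01'
  AND pickup_datetime <  '2021-01-01'
GROUP BY month
ORDER BY month;
```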
One of the other things we provide to anyone who has the TimescaleDB extension installed is something called SkipScan; we blogged about this last May. It's a specific way of executing DISTINCT ON that actually uses the index. Postgres is the only SQL database I know of that has DISTINCT ON; it's a way of getting one row per group by some ordering, but it doesn't use the index the way you would expect. So in this case, with an index on rate code ID, if I ran this without TimescaleDB (I was going to, but it takes too long) over these eight or nine million rows, it would take minutes, because it literally has to walk every row and group them just to return one per group. With SkipScan it takes a couple of milliseconds, because we're able to skip through the index. This is only nine or ten rate codes, but it found them over millions of rows. So if you have TimescaleDB installed and the right index in place, this works very quickly on very large sets of data. Very fun to use.

A couple of other quick things to show you before I move to the other dataset, specifically to show you what happens behind the scenes. This is a compressed chunk of data; there are probably two or three chunks involved here, honestly, since I think my chunk interval was 14 days. Because it's compressed, the columns are stored individually, so Postgres (TimescaleDB) only has to go get those columns off disk; it doesn't have to fetch the entire row. We see a couple of things because of that. Number one, this data probably wasn't in memory; again, we have 16 GB, though I have been querying it a bit, so it might be. I shouldn't guess, because I actually don't know. Oh, there is a shared hit listed, so some of it was in memory. Roughly 80,000 pages of data were read to do this, and you can see that, transparently, we just do the decompression for you once we find the rows and columns you need, so it happens pretty quickly.

Then the exact same query runs against uncompressed data. The difference is it's actually a smaller set of data, because it's a slightly incomplete month, but you'll see the query otherwise works generally the same. Let me move forward a little; that's what I expected to see: notice that it needs to read a lot more data. The compressed version read about 80,000 pages; because this one is reading uncompressed data, it has to fetch the entire row off disk just to get that column. Yes, with other composite indexes it might be a bit faster, but this is one of the other ways compression can really help: you're physically reading fewer pages from disk.
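Here's a sketch of the kind of DISTINCT ON query SkipScan accelerates; the column names follow the taxi demo, and the key requirement is an index that matches the DISTINCT ON ordering.

```sql
-- An index whose leading columns match the DISTINCT ON / ORDER BY clause.
CREATE INDEX IF NOT EXISTS rides_rate_code_time_idx
    ON rides (rate_code_id, pickup_datetime DESC);

-- Latest ride per rate code. With TimescaleDB's SkipScan, this can jump
-- through the index rather than scanning and grouping every row.
SELECT DISTINCT ON (rate_code_id)
       rate_code_id, pickup_datetime, total_amount
FROM rides
ORDER BY rate_code_id, pickup_datetime DESC;
```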
The last thing I'll show you on this dataset is one of our hyperfunctions, the percentile aggregate. The question: at what percentile do rides with two passengers fall? In other words, how many rides had two passengers or fewer during some period of time? We'll go back to pre-pandemic data. This is a CTE that builds a percentile aggregate (a new form of data these functions provide) in buckets of one hour over that month, and then I ask at what rank, what percentile, two riders fall. This is going to take 15 or 20 seconds, because it has to pull all that raw data out and build the percentile aggregate over the month; there's a lot of work going on. And you'll see it comes out to about 80%: before the pandemic, roughly 80% of rides were two passengers or fewer, which means 20% had more than two passengers; okay, people were willing to pack into cars.

Now, ahead of time I created a continuous aggregate, which I'll show you in a second: I basically took this query and stored it in an aggregated view, so the data is already aggregated to one hour. Now I can run the same query against that, and the exact same question is answered in less than a second, with the same values, because we pre-aggregated it. That means you can iterate over your data a lot more quickly, which is super fun. We can see that by the time the pandemic was in full swing, only about 10% of rides had more than two passengers, so people were definitely not packing in (and we already saw the number of rides was way lower). As we started to come out of the lockdowns, that creeps back up toward the mid-teens. And honestly, we're talking about billions of rows of data here; I could just keep flipping those dates around, and because we're pulling compressed columns off disk, it stays fast and easy.
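For reference, the shape of that percentile query, using the timescaledb_toolkit extension's two-step aggregates, is roughly the following; table and column names follow the taxi demo, and the exact demo query may differ.

```sql
-- Step 1: build a percentile sketch of passenger counts for the month.
-- Step 2: ask where "2 passengers" ranks within that distribution.
WITH sketch AS (
    SELECT percentile_agg(passenger_count) AS pct_agg
    FROM rides
    WHERE pickup_datetime >= '2020-01-01'
      AND pickup_datetime <  '2020-02-01'
)
SELECT approx_percentile_rank(2, pct_agg) AS rank_of_two_passengers,
       approx_percentile(0.5, pct_agg)    AS median_passengers
FROM sketch;
```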
So what I'm going to do now is quickly switch over and run through how most people would actually use this; we have just enough time for me to show you. The one problem with most of these datasets (and I'd say my biggest challenge as an advocate for a time-series database) is that time keeps moving on, so no matter what dataset I use, it's very quickly out of date. So I've set up a database that is updating right now, on its own, and I'll show you how I did that in a second. I see a question coming in; I'll answer that in just a minute.

Using approximate_row_count again: this database is not nearly as big, only 37 million rows, because I created it last night using functions in the database. I did a little blog series recently about creating sample fake data, and I used some of those principles here. The data represents a fake trucking company tracking 500 trucks, their fuel level and fuel consumption, every five minutes or so. Then I compressed the table, with a policy that says any data older than one month should be compressed, which is pretty typical.

Now, one thing we would say: if you're going to be deleting or updating raw data, it can't be compressed right now. That's still a feature we're working towards, and we know it's something a lot of people would like; hopefully you understand why, because we literally transform the data into something that is no longer row-shaped, and updating one value in the middle of an array of 1,000 values is a challenge we're still working on. So if you need to update compressed data today, you would need to decompress it, update it, and recompress it. We usually say: figure out how old your data typically is when it changes, and if data never arrives or changes after a month or two, start compressing at that point. The current stats on this fake data show it's about 85% compressed; it's integers and decimals and maybe some text in there.

So let's play with a little query. time_bucket is our most visible function; it lets you aggregate time on any interval you choose: one hour, one hour and three seconds, whatever you want. We create buckets of that interval for you, so you don't have to wrestle with date_trunc to make it happen. I'm just going to ask for total miles and miles per gallon. Now, I coded this last night to have some trucks park for a few hours a day, and I forgot to tell them to stop using fuel while parked, so they're getting zero miles per gallon; that's my bug, not the database's. That query, by the way, hits an uncompressed part of the database, and as I said, I could go across a larger range: the query planner knows which chunks are compressed and which aren't, and it gets the data out of each appropriately.

But let's say I have to run this query over and over again. I take the exact same query (the SELECT without the WHERE-clause predicates) and create a continuous aggregate from it. Same thing, over 36 million rows, which is about eight or nine months of data I created last night. I did that ahead of time, and I set up a policy that says: every hour, go materialize anything that has changed, whether I updated data in the past or added new data. And because it's a continuous aggregate, by default it's also real time. So now I can run the exact same query off the continuous aggregate, and in less than 100 milliseconds I get the same data that took three or four seconds before.

And it's real time: I have data being inserted once a minute, and when you query the aggregate we pull everything out of the materialized portion and then append anything that has arrived since the last time we materialized. So you're getting current data; it's currently the 18:00 hour where I am. If I run this again right now, it might change... there, I don't know if you noticed, but that number changed ever so slightly, because a row of data was inserted while I was talking, and for the current hour we appended it on the fly. At the end of the hour we'll actually materialize that portion to disk, and it becomes faster too. You get real-time aggregation out of the box; you can turn it off if you don't want it, but by default it's real time.

And then the last thing I'll show you is data retention. We have about 735,000 rows for this specific timeframe, which is more than six months old. I did not do this ahead of time, so let's see what happens: I set up a retention policy.
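Setting up that kind of policy (or dropping old chunks on demand) looks roughly like this; the table name and the six-month cutoff are illustrative.

```sql
-- Automatic retention: a background job drops chunks once all of their
-- data is older than six months.
SELECT add_retention_policy('truck_readings', INTERVAL '6 months');

-- Or do it manually / on demand:
SELECT drop_chunks('truck_readings', older_than => INTERVAL '6 months');
```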
Give it a second; it might take a moment depending on what's going on, no guarantees... and now that data is gone. However, if I query that same time range from the continuous aggregate, the data is still there. I dropped the data out of the hypertable; I did not drop it out of the continuous aggregate. So this is a way to keep a historical record of your data at some aggregation level, and I can also put a retention policy on the continuous aggregate, as I said earlier.

Lastly, I just want to show you the user-defined actions. You can see everything TimescaleDB is running in here: my compression policy, my continuous aggregate policy, and so forth. And I have this user-defined action, a stored procedure running every minute to insert a new row of data for all of these trucks, based on a little algorithm I chose. We show you the last time it ran successfully, how many times it's been successful, and so on.

Those are the main features of TimescaleDB; hopefully there were some things in there you didn't know. Now, the title said petabyte scale, and I didn't show you anything petabyte scale, so why would we say that? The reason is multi-node, and it's something we're seeing in users: we know that scale is necessary, and we know people are getting there. We have tutorials on how to set it up on your own, and honestly, one of the easiest ways to do it is Timescale Cloud. That's our hosted solution; just like any other database-as-a-service, you can set it up and try it for 30 days, 100% free, no credit card, no nothing, we just need your email to start an account. We can currently do up to 16-terabyte disks per node, and you can currently run a five-node system on Timescale Cloud (we're about to allow more than that). With compression at about 93%, that means you can store over 200 terabytes of data per node, and with a five-node system that very quickly becomes a petabyte of data. So it really is very possible; we have a number of users starting to do this, and we have our eyes on growing the ability to use Postgres across really large swaths of data and making it usable for everybody. I love what we're doing; I love that we can use SQL for it; it's really been a dream.

Here's who I am, if you want to reach out to me personally; I always make sure you know where to find me. I tweet about TimescaleDB often, there's my email, and you can find me in our community Slack as well. There's our documentation site; lots of good stuff there, and a lot more is coming, including tutorials with data that updates every day. We're about to release a stock tutorial, for instance, with real-time stock trades that update every night, so roughly three months of rolling real data. You can really play with this exact same stuff on your own and follow our tutorials. We have a community Slack that's 8,000 strong; I'm in there, and you're welcome to message me; lots of people from Timescale are in there, so please come find us. And last but not least, honestly, the easiest way to try this is to sign up for Timescale Cloud: it literally is 30 days for free, no strings, use whatever you want.
There's been a lot of discussion lately on Twitter, people asking about pricing, and we're very, very transparent about it: there are no ingress or egress charges, none of those kinds of things. There are a lot of features already, and we're aiming to be one of the best places to host Postgres on the internet. There are some really fun features planned for the first quarter and first half of this year, so at least keep up with what we're doing and see if at some point it could be of help to you. Thank you so much for listening; I hope this was valuable, and if you have questions I'd love to answer them.

Alright, great job, thank you so much. We've really only gotten one other question, given that you answered one of them already. I know you mentioned the tutorials, but Sarah asks: she looked on your site and would love to know if there's a "beginners" class or video you can recommend, or any half-day virtual class; she'd be really interested in knowing.

Yeah, that's a great question; thanks, Sarah. I will say this: there are a number of tutorials there, and if you're comfortable, feel free to reach out to me. My developer advocate team is small, but we're growing, and we are currently working on a complete redo of the Getting Started section of our documentation website, which should hopefully be up in about three weeks. We're working on videos to go with it as well, so you can watch the videos as you work through it. And there are a number of other really good tutorials in there; we give you everything we can to get you started: the schema, access to the data where we can, or the scripts to get the data.

Now, to the other question that came in: yeah, that's a really excellent question as well, and you're right, when you start to get to petabyte-scale data, that presents a whole bunch of other challenges, particularly around restores and so forth. The answer is: absolutely, it's something we've been working on, finding ways to make it more performant, because honestly, hosting is the only way we make money; this is an open-source database, you can install it on your own servers and do whatever you want with it, except compete with us as a hosting provider, and that's a whole different discussion. But lots and lots of people are very successful with what they do on their own. So yes, it's something we're looking toward; I can't speak to anything at this exact moment, but keep your eyes open in that space, probably through this year. One of the first things that impressed me about Timescale: if you've been around Postgres for a while, you'll know a tool like Patroni. We have a lot of really excellent engineers on our team (I might be biased), and a few of the original or near-original members of the team that built Patroni are here. They're very knowledgeable about what HA looks like, how we can do this more effectively, and what ways we can make it even better, so we're having those conversations.

Wonderful. Hey, Ryan, that was fantastic, thank you so much. You've been up to quite a lot over the last few years. We have. And for all of our attendees: thank you for spending a little bit of your day with us. I hope this was helpful, and I also hope to see you on future Postgres Conference webinars. With that, everyone have a great rest of your morning, afternoon, or evening, depending on where you are, and we'll see you soon. Thanks, everybody. Take care.