All right. Hi, everyone. Thanks for coming. I'm Craig Kerstiens. Hopefully you're here for the right talk. We're going to chat a little bit about Postgres. Really quick background on myself, and I'll apologize up front: I'm going to go really fast, so if there are questions afterwards, feel free to catch up with me. We've got about 60 slides to get through in 25 minutes, so we'll see if we can make it. I head up Citus Cloud, working at Citus Data. We work on distributing Postgres to be a scale-out database. Before that, I was at Heroku for a little over five years, launched our Python support there a long time ago, then hired Kenneth Reitz to actually do it well. So a little bit of experience in the Python world, mostly around Postgres these days.

A really quick anecdote. This is a fun email from Tom Lane. He is probably 50% of every new release of Postgres; he'd been contributing to it for quite a while before that, and helped create libjpeg and libpng, among a number of prolific things. This is an email from 10 years ago saying naming it PostgreSQL was probably the worst mistake the project ever made. So to simplify, you can call it Postgres or PostgreSQL. Just a fun historical bit.

Is anyone here not using Postgres today? Okay, one person. If you're not, and you're looking for a reason why you should, there are some previous talks I've given with a great summary of why it's a great database. It has a mix of NoSQL data types like JSONB, a lot of really powerful reporting pieces, a lot of extensions — basically a ton of things packed into it. If you're in the Django world, GeoDjango is probably one big reason a lot of people look at it.

This is a quote from a friend, the gist being that Postgres is becoming the Emacs of databases. I think it actually summarizes some of the direction of this talk, and you'll see it again at the end: Postgres is becoming more and more of a platform for building things with data than just a traditional database. It's shedding some of the relational-database world for where it's headed. Although personally, I'm a Vim user myself.

A really rough outline: I'm going to hit a little bit on Postgres 9.5, which came out a few months ago, and Postgres 9.6, where we basically already know what's in it, or at most what will be in it. Postgres has about a one-year release cycle, and from the time it's feature frozen, there are still about six months until it's actually out in the world. So it moves at a pretty slow cadence, which for a database I'm pretty okay with — I don't want my database changing every few months and breaking things. And then we'll look at a few extensions.

Here's the summary on Postgres 9.5. I'll go ahead and dig right in. This was the highlight feature. I think every Postgres release has one big highlight that everyone gets excited about, and if you're not sure what this one is — because it was horribly named and horribly marketed — this is upsert. This has been asked for forever, I think, in Postgres; it was one of the remaining things that MySQL had over Postgres. If you're unfamiliar, this is the case where you try to insert a record, and if a record with the same key already exists — some identifier or primary key like email — it's going to error out. But what you often actually want is a create-or-update, which you can do in something like Django, and this is how you want to do it. I actually struggled for a little bit thinking of a good example. Pinned tweets are a good one. If I'm on Twitter and say, I want to save this pinned tweet, it doesn't matter if I had one before — just save this one for me now.

So what was possible before upsert? I actually didn't look into how Django does it under the hood — there might be a few in the audience that might know — but in short, whatever it did was not transactionally safe. Before upsert, you could do something like the first query in the sketch below. This is the most common way to do it, with a common table expression: you try the update, check whether the returning value comes back, and if it doesn't, you insert. The problem with this: if you're working on something like a pinned tweet, you probably don't care, because the same person isn't trying to update their pinned tweet at the same time from two places. But in general, in any kind of high-velocity application, you will run into race conditions here. You will run into errors, and it's up to you how you recover from them.

The new way is pretty straightforward overall. The name of the syntax is kind of awful — no one in the Postgres community likes it, but they could all at least agree on it enough to get it committed, so this is what it's going to be going forward. You start with a basic insert, and from there you say: on conflict, do this. You can actually do the update and commit it, or you can say do nothing, so it fails silently. It's up to you what you want to do with it. I should find out, for the next time I give this talk, whether Django is actually going to convert to this — I'm looking at someone in the audience that might know — but ideally Django just moves onto it so that it's a transactionally safe experience and you don't even have to worry about it.
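To make that concrete, here's a minimal sketch of both approaches. The pinned_tweets table and its columns are illustrative, not from the slides:

```sql
-- Hypothetical table for the pinned-tweet example.
CREATE TABLE pinned_tweets (
    user_id  bigint PRIMARY KEY,
    tweet_id bigint NOT NULL
);

-- Pre-9.5 approach with a common table expression: try the update,
-- insert if nothing came back. Not safe under concurrency -- two
-- sessions can both see no existing row and both try to insert.
WITH upsert AS (
    UPDATE pinned_tweets SET tweet_id = 123
    WHERE user_id = 42
    RETURNING *
)
INSERT INTO pinned_tweets (user_id, tweet_id)
SELECT 42, 123
WHERE NOT EXISTS (SELECT 1 FROM upsert);

-- 9.5: one statement, transactionally safe.
-- EXCLUDED refers to the row that failed to insert.
INSERT INTO pinned_tweets (user_id, tweet_id)
VALUES (42, 123)
ON CONFLICT (user_id)
DO UPDATE SET tweet_id = EXCLUDED.tweet_id;

-- Or, to fail silently instead:
-- ON CONFLICT (user_id) DO NOTHING;
```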
So, indexes. Postgres about every year seems to pick up a new index type. If you just do CREATE INDEX, you get the standard B-tree index. But when you see the full list, and even read through the docs — this is my experience — it's unclear when to use which. So, a really gross overgeneralization: B-tree is usually what you want. It's the one you learn very first in CS class. If your data has an ordering, CREATE INDEX works fine; 99% of the time this is what you want.

If you're using some of the more unique data types, or you have really large data, you may start to look at the other ones. GIN indexes are for when you're using hstore, JSONB, or an array data type. Is anyone here using JSONB? A few hands, okay. JSONB, if you're not familiar, is a binary representation of JSON directly inside your database. Think of it as the same thing as what Mongo does — it just doesn't lose your data. I shouldn't take shots at Mongo. So if you have multiple values in the same column — an array, a range type with a start and an end, or JSONB with a bunch of different keys — a GIN index is what you want. It indexes every key and value inside that column, so searches on anything in there in a where condition can be extremely fast.

GiST indexes are most common in geospatial stuff, and also full-text search. Think of it as: you've got some value in a column, but it can overlap with other values. So if you're doing text search and you've got three words in there but you want to match on just one of them, this is where GiST indexes are extremely efficient. There's a little nuance — sometimes GiST can be better than GIN and vice versa for the same use cases — but this generalization usually works pretty well.

KNN is used for nearest-neighbor lookups — finding the things most similar to a given value. SP-GiST is space-partitioned GiST, and I say this every single time I talk about it — I've said it to the Postgres community before — what's it good for? The only thing I know is phone numbers. I haven't actually heard another use case for it, but basically, if you've got a lot of data that naturally clusters together, like phone numbers, that's the rule of thumb for it.

And then block range indexes, a similar use case where data naturally clusters together. Block range indexes are new in 9.5. Actually, every single one of these since B-tree has come from the group affectionately known as the Russians within the Postgres community. They tend to propose a crazy new indexing idea for some unique case that'll speed things up unbelievably, everyone says there's no way it'll work, they go into a cave for six months, come back out, and it just gets committed to Postgres. So block range was the newest one from them. If you're over a billion rows and your data naturally clusters together — often time series, or again possibly phone numbers — it's the one you want.
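As a quick sketch of the GIN-on-JSONB case from above — the events table and the payload key are illustrative:

```sql
-- Hypothetical table with one JSONB document per row.
CREATE TABLE events (
    id      bigserial PRIMARY KEY,
    payload jsonb
);

-- One GIN index covers every key and value in the document.
CREATE INDEX events_payload_idx ON events USING GIN (payload);

-- Containment queries like this can then use the index.
SELECT count(*)
FROM events
WHERE payload @> '{"type": "signup"}';
```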
All right, shifting a little bit: import foreign schema. Postgres has this concept of foreign data wrappers, which are wonderful and powerful. If you've ever had to do ETL across two systems, it's usually slow and painful, and you find yourself rebuilding a data warehouse. Foreign data wrappers allow you to query some other system directly from inside Postgres — another Postgres database, a Redis database, a Mongo database. There are, I want to say, about 90 foreign data wrappers in existence. Some I don't even know what their purpose is, like the Twitter or LDAP ones. But for the most part it's going from one database to another: you can query it directly in Postgres and it just looks like it's there. So instead of a big, complex ETL process, it can be really fast when you want to pull together and aggregate disparate systems.

Before Postgres 9.5, if I created one, I had to come in and basically create the mapping of each foreign table myself: here's my foreign database, here's the connection string to it, here's my schema. That isn't too bad, but it's a small pet peeve of mine, having used them a number of times: I've got this other schema somewhere else — even if it's a Postgres database — and for every table I have to go pull out that schema and recreate it in my database. For something like Mongo, which is just a bunch of keys and values, maybe that's not bad, but when I've got a Postgres database with 20 tables and 20 columns in each, this becomes a huge pain. Postgres 9.5 adds the ability to just say: import that schema over there locally. Then I have a mapping of the remote database directly in my local one and can query it.
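Here's roughly what both steps look like with the postgres_fdw wrapper; the host, credentials, and schema names are all placeholders:

```sql
-- Connect this database to a remote Postgres database.
CREATE EXTENSION postgres_fdw;

CREATE SERVER reports_db
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'reports.example.com', dbname 'reports');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER reports_db
    OPTIONS (user 'reporter', password 'secret');

-- Before 9.5 you'd hand-write a CREATE FOREIGN TABLE for every
-- remote table. 9.5 pulls the whole remote schema in one statement:
CREATE SCHEMA reports;
IMPORT FOREIGN SCHEMA public
    FROM SERVER reports_db
    INTO reports;

-- Remote tables are now queryable as if they were local.
SELECT count(*) FROM reports.orders;
```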
Grouping sets. A common theme with Postgres 9.5 — and we'll see it with 9.6 — is a lot of small usability improvements. We're probably past a lot of the big exciting things, upsert aside: JSONB, window functions, a lot of the really powerful things that people get excited about. Grouping sets is one of those features that's really handy when you need it, but you'll probably only use it a few times a year unless you're a data analyst. It's really handy because instead of having to write a bunch of case statements to do all of your rollups and groupings, you can now say GROUP BY and then a grouping set. In this case I've got an example with department, role, and age, and what I want is different combinations: how many people are in which department, how many people are what age, et cetera.

So here's a really quick example. If I say GROUP BY GROUPING SETS with all of those in there, I get a rollup for each individual thing I'm projecting: for department, for role, for age. I could also CUBE it and say: give me every single combination of those columns. Depending on what you're looking to do for a report, this can be really handy instead of having to pull out all the raw data and roll up the report yourself.
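In rough form — the employees table here is illustrative:

```sql
-- Hypothetical table for the department / role / age example.
CREATE TABLE employees (
    name       text,
    department text,
    role       text,
    age        int
);

-- One rollup per listed column: counts by department, by role,
-- by age. Columns outside the current grouping set come back NULL.
SELECT department, role, age, count(*)
FROM employees
GROUP BY GROUPING SETS (department, role, age);

-- CUBE instead produces every combination of the listed columns.
SELECT department, role, age, count(*)
FROM employees
GROUP BY CUBE (department, role, age);
```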
So, JSONB. I already talked about it a little bit. It came in 9.4. If you're using the plain JSON data type in Postgres, you probably don't want it — you probably want JSONB, because of the indexing and the ability to query it. JSONB is binary JSON; it's Mongo, inside Postgres. And you can index it: you create one GIN index on the column and it indexes the entire document for you. That's extremely fast, whereas before you had to do a functional index on some specific key — much less flexible.

There's not too much on JSONB in 9.5 that's new and really exciting except usability, but there are a bunch of small things. If you wanted to concatenate two JSON documents, or add a new key, you previously had to pull the document into your application and then update the entire document at once. Now you can concatenate in Postgres — the operator is bolded on the screen, but I don't think you can see it at all; it's basically two vertical bars — and it concatenates one JSON document with another, which you can see there at the bottom. Similarly for removing specific keys: before, you had to pull the document down locally, remove the key, and save the whole thing back; now you can just tell Postgres to remove the key. Pretty straightforward.

This one is probably one of my favorites. If you really care about the JSON looking like JSON: the plain JSON type in Postgres will actually preserve formatting, but because JSONB is a binary representation, it comes back as one giant long string. If we go back here, you can see right there — that's one long string all run together. Not so bad when I only have two or three keys, but when I've got 20, it's horribly unreadable. jsonb_pretty does exactly what you'd expect and makes your JSON nicely formatted for reading. Whether you're in psql or anywhere else, it'll format it for you — really handy when you're writing reports and introspecting things with JSON.
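Since the operators are hard to see on the slide, here's a small sketch of all three as they work in 9.5:

```sql
-- || concatenates / merges two JSONB documents (later keys win).
SELECT '{"name": "postgres", "version": 9.4}'::jsonb
       || '{"version": 9.5}'::jsonb;
-- => {"name": "postgres", "version": 9.5}

-- The - operator removes a key directly in the database.
SELECT '{"name": "postgres", "beta": true}'::jsonb - 'beta';
-- => {"name": "postgres"}

-- jsonb_pretty formats the otherwise run-together output.
SELECT jsonb_pretty('{"name": "postgres", "tags": ["sql", "json"]}'::jsonb);
```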
All right, so that's a really quick fly-through of 9.5. 9.6 was feature frozen about five months ago now — or no, I think about three. There's one big highlight feature here, which is parallelism. This is going to be the thing everyone's excited about in 9.6, and I would encourage people not to expect too much from it yet. In any new Postgres release, when a feature appears for the first time, it's the foundation for what comes next. But here we can get an idea of where it's headed.

Postgres today, when you run a query, runs it against a single core. For the most part, because a lot of workloads are memory bound, this isn't a problem, but when you're scanning a bunch of data sequentially, it's really slow. So today, if you do a select count, it's going to scan all of that data and do the aggregation on one core. Now we have the ability on sequential scans — so not everywhere, but on sequential scans — to do that in parallel. If you set the number of parallel workers to some new value, Postgres will coordinate across them and scan the data in parallel. You can roughly expect the speedup to scale with the workers. There is some coordination overhead, but rough benchmarking shows something like 750 milliseconds going down to about 200. So on specific queries — in this case, parallel scans — it can be extremely, extremely efficient, and it will only get better in future releases.

Bloom filters. A bloom filter, if you're not familiar, is a space-efficient probabilistic data structure, and for certain kinds of indexing workloads it can be extremely efficient. Let me give an example schema. Say I've got columns A, B, C, D, and E, and I'm often going to be querying against them. In one case I want to say where A is some value and B is some value. If I wanted that to be really fast, I would probably index on A and B. But what if I've also got a workload that's A and C, or A and D — basically, I let my users pick which of these five things to filter on? What I would probably do is go create five different indexes, one on each. But even then you're going to miss a lot — the optimizer isn't always going to pick them. With a bloom filter instead, I can come in and index on all of these columns at once. It builds basically a heuristic model. It may return false positives, but it's not going to miss data: the false positives go up to the planner, which then double-checks that all the returned data actually matched. So if you've got a reporting application filtering on a bunch of different columns, but not always the same ones, this can be really efficient for you.
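A sketch of both 9.6 pieces. The table and column names are illustrative, and note the parallel-worker setting was still being renamed during the betas; in the final release it's max_parallel_workers_per_gather:

```sql
-- Parallel sequential scan: allow a few workers per query.
SET max_parallel_workers_per_gather = 4;
SELECT count(*) FROM events;  -- an aggregate over a big sequential scan

-- Bloom filter indexing ships as an extension in 9.6.
CREATE EXTENSION bloom;
CREATE TABLE readings (a int, b int, c int, d int, e int);

-- One bloom index covers any combination of equality filters;
-- the planner rechecks rows to weed out false positives.
CREATE INDEX readings_bloom_idx ON readings USING bloom (a, b, c, d, e);

SELECT * FROM readings WHERE a = 7 AND d = 12;
```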
So those are the highlights in Postgres 9.6. There are a couple of others in here. Full-table vacuums — I have to talk about this one because it got a lot of attention on Hacker News, and it probably has no positive impact for almost anyone. If you have a table over a billion rows, Postgres in the background periodically has to do a full-table vacuum to freeze old transaction IDs; the 9.6 change lets it skip pages that are already fully frozen. If you're not doing heavy transactional work on a table with over a billion rows, this doesn't mean anything for you. Postgres under the covers is really a giant append-only log — even when you delete data, you're writing to it — and in the background a vacuum process comes through and says: okay, this is actually gone now, I've cleaned up the space. People tend to hate vacuum. It's good and bad; it has to exist for things to actually work. People got really excited about this change, but unfortunately it's not going to change anything about vacuum for 99% of people. So that was a really quick run-through of 9.5 and 9.6.

Going forward — I suspect this is my opinion, but I think we can see it within the community pretty clearly — a lot more is going to happen outside of the core of Postgres, and this comes back to that Emacs quote: more and more as extensions. Postgres has a nice extension API that's getting more and more powerful. PostGIS is actually an extension. hstore is an extension, so you can have custom data types. You can have custom indexing, as we saw with the bloom filter. You can have just about everything you could want in your database as an extension that doesn't have to get committed to the Postgres core code base, which means there's a slightly lower barrier to getting it shipped. So I'm going to walk through a few here.

One's obviously a bit of a shameless plug, but I think it's an interesting one. I work at Citus. Postgres in general works best when data is in memory; as soon as you start hitting disk, it's going to be 100 to 1,000 times slower. For most applications, the easiest way to improve performance is to scale up: when your cache hit rate drops below 99%, adding more memory and moving to a bigger box is the easiest way to do things. But at some point you might need to scale out. Citus allows you to shard data from one Postgres instance across multiple Postgres instances and basically turn Postgres into a distributed database. The cases where I've seen this needed are tables like events, logs, messages — things that hit north of a terabyte of data. You can think of Citus as the same idea as the parallelism going on in 9.6, but across multiple machines.

Citus is an extension to Postgres. When it sees a query against one of these sharded tables, it automatically distributes the workload. To set it up, you say create extension citus, create your table, and then turn it into a distributed table based on some hash key. For example, if I'm Twitter, I want to hash on the ID of the tweet, and this automatically distributes the data to other Postgres databases. To me as a developer it still looks like the one main database, but under the covers there are completely separate instances set up behind the scenes. You set up your logical shards — say, a shard count of 16 — and a shard is just a Postgres table. That means you get all the benefits of Postgres — all the indexes, all the data types — and it's transparent to your application. Then you just insert into it and it just works. So overall, a pretty interesting example of what you can do with an extension.

Another one, which is one of my favorites — I haven't gotten to put it into production yet, but it's just really fun — is HyperLogLog. This is based on a paper; I don't think it was by Google, but I think Google did some contribution to it. If you read the paper, you see things I can't explain — this is kind of my reaction. But I've followed it enough to understand when I would use it. In layman's terms, it's probabilistic uniques with a small footprint, meaning I don't have to store all the raw data; I can get close enough to any unique count. In even more layman's terms: close-enough counts. Think of something like Google Analytics showing you unique visitors per day. There's no way they're actually counting the raw data when they show you that.

HyperLogLog in Postgres is actually a data type. You say create extension hll, and then you have this hll column — in my example, users. What I do is parse over the raw data, or in real time, every time I see a user, insert them into the HLL. It stores an aggregation by day: one HLL, which has a pretty small footprint, plus a date, so for each day I've got a set of unique users. Now, what's interesting is that I can come back through, extract the month, and ask for the rollup of all the uniques over that month. So it's not just the aggregation of uniques per day, but uniques per month — it knows how to combine the data structure under the covers as you do rollups, unions, intersections, all that sort of thing. A really common use case is web analytics traffic, ad-network stuff — anything where you want uniques across a very large data set and couldn't otherwise get them.
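A sketch using the postgresql-hll extension's documented functions; the raw events source table is illustrative:

```sql
CREATE EXTENSION hll;

-- One small HLL per day, holding that day's unique users.
CREATE TABLE daily_uniques (
    date  date UNIQUE,
    users hll
);

-- Roll raw events up into the per-day HLLs.
INSERT INTO daily_uniques (date, users)
SELECT date, hll_add_agg(hll_hash_integer(user_id))
FROM events
GROUP BY date;

-- Close-enough unique counts per day...
SELECT date, hll_cardinality(users) FROM daily_uniques;

-- ...or combined across a whole month, without touching raw data.
SELECT extract(month FROM date) AS month,
       hll_cardinality(hll_union_agg(users))
FROM daily_uniques
GROUP BY 1;
```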
In terms of extensions, there's an absolute ton out there. There are almost 100 FDWs alone, plus a ton of other extensions. A few interesting ones: Multicorn is actually a Python library for writing foreign data wrappers, so if you want to connect to some other data system that doesn't have a wrapper yet — even an internal one — you can use Multicorn instead of writing C. HypoPG is for hypothetical indexes: if I were to add this index to my database, how would it perform? And a number of others — orafce is an Oracle compatibility extension that gives you functions and tools from Oracle directly in Postgres. If you're looking for extensions, I'd say check out pgxn.org, or Google; both are great resources for searching.

So, a really rough recap. Postgres 9.5 came out just a few months ago. If you need upsert, I would encourage upgrading; otherwise, I don't know that there's a ton of highlight features. It's always nice to move up, but I encourage people to do it every two or three years instead of every year, because it is a big migration. My slides are slightly old here: 9.6 was feature frozen a few months ago, and I'd say we should expect it around the end of the year, four to six months from now. In terms of the future roadmap, extensions are really where we're going to see a lot of traction — custom indexes, types, capabilities — and the plus there is that we don't have to wait for a major Postgres release, so I think it'll actually increase the momentum of what we're able to see out of Postgres. Cool, and I think that's the TL;DR. I flew through that even faster than expected — sorry if I lost anyone — but that means I have a few minutes for questions.

[Audience question about what Citus is for]

Correct, it is for sharding. It is very common, when you have a large multi-tenant application, that you do need to shard, but yeah, it's essentially for sharding. Unless you're at, on the lowest end, 100 gigs of data, you don't need it — below that, Postgres works great by itself.

[Audience question about Citus and redundancy]

It depends. You can run it in a replicated mode — happy to chat afterward about it. There are some redundancy abilities, and it depends on whether you're using the open source, cloud, or enterprise version. It is entirely open source, so you can just go grab it and run it today. So it kind of depends, but there are redundancy abilities in the core of Citus: if a shard fails, it can be replicated to others, and Citus automatically knows how to reroute around that.

[Audience question]

So for everyone, the question was: what's going on with native GIS support, as opposed to PostGIS? It's a good question. I don't know that I can say definitively. PostGIS has a great community and collaboration with the Postgres community, but it's also a very separate, large community of its own. They tend to work together, but I don't see a world where PostGIS comes out of the box in Postgres. I think the more likely route is that the extension APIs continue to evolve and become more powerful. Most people don't need everything that comes with PostGIS if you're doing a standard web app, right? It's a matter of: if you need GIS capabilities, then you need it. So I don't see there being a big change, though the communities do tend to work fairly collaboratively overall. Great, thank you.