My name is Corey Huinker — my family bought a vowel at one point. This presentation is on some work I've been doing at moat.com, an advertising analytics company here in New York City, and it's a rework of a meetup presentation I did for them in New York a couple of months back.

Officially, what is this talk about? Parallel processing in Postgres. But what it's really about is how you get to parallel processing in Postgres, and the hacks you can do along the way before it becomes officially available.

As you've probably heard already, one of the big things people need out of Postgres right now is the ability to do parallel queries. It is very easy to become CPU bound on your queries: a big sort, a big hash, or big joins can easily swamp one CPU, and on a 32-core box you could have been doing something with those other cores. Oracle has parallel, SQL Server has parallel — all of the big vendors have parallel. We don't.

And there are ways around that. Are there commercially available offerings? Yes, there are; as a matter of fact, the talk following mine is one, so that's awkward. Are there open source offerings? Yes — PL/Proxy and things like that do exist. The problem with them is that you essentially have to build your database from the ground up around PL/Proxy. You don't have the option of having the tables you're querying live on just one node some of the time; they have to be on all of the nodes, and you have to use that distribution every time. That leads to a school-bus kind of problem: you want to move one child a few blocks, but you still have to get a driver, get the child onto the bus, get the driver onto the bus, start up the bus, wait for the diesel to be ready, put it in gear, and go. Big overhead for moving just one child. So we need something where the parallelism can be optional, the way it is in, say, Oracle.

Now, in the olden days, you would do this sort of parallelism at the application layer. You would decide you could split up your data alphabetically, give each CPU you had a section of the alphabet you thought was equal to the others, and let them do that work. Other databases, like Oracle itself, do it differently: when you ask it to select from a table in parallel, what it actually does is figure out the block ranges on disk, divide them up into 16 equal parts, and send each process after a different block range within that table. That can be a very efficient way to do it. It can also be really horrible if you have time-series data and the first half of your table is expired rows — half of your CPUs are doing nothing while the other half are swamped. You run into these problems in other ways too; your data gets lumpy, basically.

One way around that is hashing: you take a synthetic key, apply modulus arithmetic, and distribute the rows out to individual processes. That has the advantage of being fairly evenly distributed, as long as your artificial key didn't have large gaps in it. The problem is that if you have to interrelate that data with data sitting on another CPU, now CPU 1 needs to talk to CPUs 2 through 32 to accomplish its sort — or you need to merge up in stages: one talks to two, three talks to four, and so on. These all lead into why we have not gotten this in Postgres.
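As a hypothetical illustration of that modulus approach — the table and column names below are made up, not something from the talk — the application-level splitting amounts to generating one query per worker along these lines:

    -- Hypothetical: split a count over 8 workers by modulus on a synthetic key.
    -- Each generated query would be handed to a separate process or connection
    -- by the application; 'events' and 'id' are illustrative names only.
    SELECT array_agg(
             format('SELECT count(*) FROM events WHERE id %% 8 = %s', n)
           ) AS worker_queries
    FROM generate_series(0, 7) AS n;

Each worker gets the rows whose key falls in its bucket, which spreads the data evenly — but, as described above, the moment those rows have to be joined or merged with anything outside their own bucket, the cross-worker chatter starts.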
So what we want to get to is something that doesn't have that level of interprocess chatter. We want something where we don't have to leave the application; where all of our results can be coalesced and then used inside the query we're already in. We would like to be able to use the CPUs on this machine, and if we know our data is available on another machine, it would be nice to connect to that too. We would like the results collected in a table function, and we want to do it all in one single query.

The hash distribution I was talking about has a big problem: once each individual worker has done the part of its job it can do independently, it needs to talk to all the remaining CPUs to find out which parts of their work it needs to join to. That creates a lot of interprocess chatter. So you're better off if you can break the work up logically rather than relying on a pure hash join. We don't touch any of these things — PMPP does not address any of them.

You may wonder why this isn't in Postgres already. Oracle tried really hard with their block-distribution method of parallelizing. It also meant that when you did parallel in Oracle, you absolutely could not use indexes on that table; you were doing a table scan, and that's bad. The problem is that they baked themselves into that solution, and it's very hard for them to get out of it. The Postgres people have always taken the approach of "let's do this right, even if we can't do it now." They are trying to learn from all the mistakes that other vendors have made. That also means progress is slow. To me, that's perfect being the enemy of good — I need something right now. So PMPP has given up on a lot of those issues of good design and safety in the interest of getting something out now.

OK, everybody likes looking at code, but this is a little confusing, so I annotated it. There are two basic functions to PMPP: the single-node one and the multi-node one. The basic function is a set-returning function, and it is polymorphic — I'll get to that in a second. You pass in connection strings. Actually, let's jump to that. Show of hands: how many people are familiar with polymorphic functions? About half, OK. Real quick: if you have a set-returning function where the shape of the set itself is not defined, you must pass in an anyelement parameter — here, a null cast to your row type — and the function then takes on that result-set shape. That essentially creates a contract for all of the subqueries we're going to be launching, and it's really, really powerful. Somebody asked once, why does it have to be null? It doesn't, but you're not using that parameter for anything other than giving the shape, so its actual value doesn't matter.

So, yes: a polymorphic function. You have the DSN string you're going to be connecting with. Notice that this is a string in the clear: if the connection string has passwords in it, this query will be visible through pg_stat_activity. That is a security risk. I'm sorry. Then there's an array of SQL statements to be passed in, and then there is the CPU multiplier, where we specify how many workers we want on that machine relative to the number of CPUs present. That's fun to play with. There is also a much more involved form that allows you to query multiple hosts at a time; there was no good way to parameterize that, so we made it a JSON document, and I'll get to that a little bit later.

So, a very simple example. You need to create a result-set type. Practically speaking, there's probably already a table on your system that matches the shape of the results you want, so you wouldn't need to do that. But for the sake of example, here it is — it must be a composite type, something that is a row.
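Roughly, the pattern looks like this. The type name, the DSN, and the exact parameter order are illustrative — check the extension itself for the real signature of distribute:

    -- A composite type defines the shape that every subordinate query must return.
    CREATE TYPE t_int AS (val integer);

    -- Fan three trivial queries out over loopback connections and read the results
    -- back as if distribute() were a table.
    SELECT *
    FROM pmpp.distribute(
           null::t_int,                    -- null of the row type: sets the result shape
           'host=localhost dbname=mydb',   -- DSN, sent in the clear
           ARRAY['SELECT 1',
                 'SELECT 2',
                 'SELECT 3'],              -- each query must match t_int's shape
           1.0                             -- CPU multiplier: up to one worker per CPU
         );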
And then you see we make the call to pmpp.distribute as if it were a table, with the row-shape parameter I mentioned before and the loopback connection — in this case, to avoid showing passwords, I'm using a loopback connection to the local machine. The thing to watch out for here is that there's nothing to guarantee the session that called this function has the same credentials as the connection being specified there. Your permissions might not line up, and that could be a big problem: the queries you're sending out could fail. Again, we're sort of running with scissors here. Those are the sorts of things we have to live with until this stuff can be handled more elegantly inside Postgres itself. And then beneath that is a trivial array of SQL statements that all return a result set exactly matching that composite type. That's important — if one of them doesn't, the whole thing fails and you're not going to get a result out of it.

But knowing those queries ahead of time isn't really how this gets used. More often you do some metaprogramming, or meta-SQL: you write a SELECT against a driving table, use it to generate the actual queries you want to run, aggregate those into an array, and pass that in. So if you had a partitioned table, you could look in pg_class and friends to come up with the partition list — but more likely you want better clarity, so you create a view of some sort that lists the partitions you care about for this particular query. Then, for every row in that partition list, we generate a count query against that partition. Each query is shaped the same as the row type. Those queries all get executed individually, in parallel, and the results are coalesced back up to the calling query, which sums the partial sums — counts, in this case. That final step, unfortunately, has to happen single-threaded, so you're getting parallelism for part of the work, not the whole thing.

If you want to use multiple machines, you need a JSON document, and that's not the clearest thing, so I decided to break it out. The JSON in this case is an array; each element of the array is an object holding the connection string (the DSN), an array of queries to run on that connection, and the CPU multiplier to use there. You have to decide how hard you want to load down that machine — we're not going to tell you. In this case, 0.5 means we want to use exactly half — or at most half — of the CPUs on that machine. The second element is essentially the same thing. I only put one query in each just for screen room, and the queries all have to match each other in shape, because it helps to think of them as being UNION ALL'd together. You'll also notice the second one doesn't use a multiplier; it uses a specific number of workers. That's because when you use the multiplier remotely, we ask that machine, through a PMPP function, how many CPUs it has — and PMPP might not be installed over there. That database might not have stored procedures on it at all. That database might not even be Postgres: we have this working against Vertica and Redshift, and those are massively parallel machines where you don't want to be firing off more than one worker anyway.
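A sketch of that document — the key names are approximate and the host names are made up:

    [
      { "connection":     "host=pgbox1 dbname=warehouse",
        "queries":        ["SELECT count(*) AS val FROM part_2015_01"],
        "cpu_multiplier": 0.5 },

      { "connection":  "host=redshift-cluster dbname=archive",
        "queries":     ["SELECT count(*) AS val FROM archive_events"],
        "num_workers": 1 }
    ]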
So we settled on polymorphic functions after trying a few other things. The first obvious one was JSON. This was back in 9.3, so JSONB wasn't around. The advantage was that you didn't have to have a type, so nothing polymorphic. You'll also notice there was no connect string back then, because it was hard-coded; that version never made it outside the company. But it was essentially the same in the metaprogramming aspect: you have a query that generates SQL statements, those are put into an array, and that's fed into a function. The problem was that the JSON version of the function returned a set of JSON. Every row that came out of these arbitrary queries had to be encoded into JSON and then immediately decoded at the next level up, so all of that serialization and deserialization was bound to have a performance impact, and it made the queries hard to read. It also required you to know just as much about the result type as before, except now you had to spell it out explicitly for every column you referenced, every time you referenced it.

Same problem with hstore. We didn't spend much time on that one because it had exactly the same problems as JSON — it might have been slightly faster. I did not try this with JSONB; it wasn't out yet. But it had the same serialization and deserialization issue, it wasn't making for very clear queries, and this thing was frankly messy enough already, so I didn't want to add to that.

So what's under the hood? dblink — a whole lot of dblink, used in async mode. You send off the queries, then you poll the individual connections to see if they're done yet. When one is done, you use dblink_get_result — which has its own challenges — to fetch the rows off in a RETURN QUERY statement, and then you feed another query to that worker if one is available; if not, you close down the connection. Very, very simple from a worker-monitoring standpoint.

One problem I ran into is that dblink doesn't do polymorphic result sets: it returns SETOF record, so you have to specify a column definition list every time you fetch from it inside the function. That's a problem, because that code doesn't know what the type is — we have to figure it out. And once I've generated that column spec, it's a string, so I have to execute it dynamically each time within the function. It can't be prepared; there is no PREPARE statement within PL/pgSQL. That got me thinking about rewriting this in PL/V8 or some other language. But the more fun option for me was making a modification to dblink so that it could do polymorphic result sets — you would just pass the row type down to dblink_get_result and not have to worry about it. Don't worry, these slides will be available online.

This next bit was just a fun little thing, more metaprogramming: if you know the type of your result set, can you build the column spec for it? This is one thing I really like about Postgres — the data dictionary is right there, and you can find out anything about your data that you want. In this case, I query it and use the results to generate a SQL statement that gets called basically once per query completion inside PMPP. And that's how we generate that.
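A sketch of that catalog query, using the composite type from the earlier example (the real code in PMPP differs in its details):

    -- Build a column definition list such as 'val integer' for a known composite
    -- type, ready to splice into ... FROM dblink_get_result(conn) AS t(<spec>) ...
    SELECT string_agg(
             format('%I %s', a.attname, format_type(a.atttypid, a.atttypmod)),
             ', ' ORDER BY a.attnum
           ) AS column_spec
    FROM pg_attribute a
    WHERE a.attrelid = 't_int'::regclass   -- composite types have a pg_class entry too
      AND a.attnum > 0
      AND NOT a.attisdropped;

The resulting string then has to be executed dynamically inside the function since, as mentioned, PL/pgSQL has no PREPARE.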
What else is under the hood? Not a lot. I had written this once trying to do linked lists in PL/pgSQL, or some sort of array management, and those things worked fine, but in the end I just threw it into a bunch of temp tables and used those as the work queues, deleting rows as it went.

Because we are polling the connections to see when they have a result set to give, it makes sense not to hammer them constantly, asking "are you done yet? are you done yet?" So we added a little exponential back-off sleep algorithm. It tops out at a granularity of five seconds, so if you have long-running queries, the most a finished worker sits waiting to be noticed is just under five seconds.

The question came up: how do you know how many CPUs a box has? In Postgres, you don't. I thought about writing a C function that would read /proc for that, and that seemed like a lot of work. I thought about writing it in PL/sh, and that seemed like a lot of overhead and an additional extension I didn't want to add. So I needed a way to hijack psql into giving me the output of a shell command — in this case, nproc. The way you do that is by a slight abuse of the COPY command: COPY will take a program and treat that program's standard output as its input. So you create a temporary table that is exactly the shape of nproc's output — one row, one column, a text string that happens to be an integer — and copy it in. Great, now I know how many CPUs my box has.

But I can't be doing this every time I want to run one of these queries; the overhead would be terrible. So I need to get it into, say, an immutable function, so that it's effectively a constant. One problem I ran into is that you can't reference a psql variable inside the dollar-quoted portion of a function definition — psql variables just aren't interpolated there. The way around that was, again, metaprogramming: in this case, building the entire function body as the result of a SQL statement. The resulting function basically says: whatever multiplier was passed in, multiply it by the number of CPUs we've got, and if that comes out to less than one, just return one. So now I've got that string, and I need to use it. The way you do that is with another psql trick called \gset. Whatever one-row query you just ran, \gset creates a psql variable for every column name in that query — in this case, the variable holding the SQL for the function body — and, as you see, I turn around and use that as the entire function body. This runs at deploy and install time, and after that it's an immutable function, so it's essentially zero overhead. The beauty of it is that the only time you would need to change it is if you change your hardware, and then you just reinstall the extension.
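A sketch of that install-time trick, run from psql — the table and function names here are placeholders, not the ones in PMPP:

    -- Capture the server's CPU count by abusing COPY ... FROM PROGRAM.
    CREATE TEMPORARY TABLE nproc_out(ncpus text);
    COPY nproc_out FROM PROGRAM 'nproc';

    -- Build the whole function body as a string (psql variables are not
    -- interpolated inside $$ ... $$), then capture it with \gset.
    SELECT format('SELECT greatest(1, floor(%s * $1))::integer', ncpus) AS cpus_body
    FROM nproc_out \gset

    -- The variable becomes the entire body of an immutable function, so the CPU
    -- count is effectively a constant until the extension is reinstalled.
    CREATE OR REPLACE FUNCTION num_workers(multiplier float)
    RETURNS integer LANGUAGE sql IMMUTABLE
    AS :'cpus_body';

After that, a call like num_workers(0.5) is just arithmetic on a literal — no shelling out, no catalog lookups.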
How are we using this? Well, the easiest thing to use it for is index rebuilds and partitioning. Previously, we had done this with Python and the multiprocessing module: figure out the list of partitions that need to be built or rebuilt, figure out the list of indexes that need to be rebuilt, fire off a bunch of connections, and monitor them. The problem with that is that you now have to build that particular part of the ETL around Python, or whatever multiprocessing framework you were using; it can't just be one big long SQL script. With PMPP, it can. It will fire off all the connections — all you need is a queryable way of knowing which partitions need to be rebuilt and a queryable way of knowing which indexes need to be rebuilt. We also use it in deployment scripts for partition creation, which is very simple. Our data is very much time-series data: we have these machines we call caches, and when we want to bring up a new cache, there's a set of partitions we need to create going back two or three years in time. We can have those created automatically in the deployment script with one SQL statement.

The other thing it's good for is big aggregation queries, with a big caveat. If you have a heavily partitioned table, maybe even a sharded table, you can write partial-sum queries that go out to all the CPUs you need; the partial sums coalesce back at the initiating node, and then you do the rest of the summing in your outer SQL statement. The big caveat is that there is no transactional consistency across any of that. If the data is being modified while this is going on, you have no guarantee of when any of those queries ran, and you could get inconsistent results. We only use it in a data-warehouse context where there is a nightly ETL, and obviously we wouldn't run these during that window, so that's not a problem for us. But that is one of the big challenges real parallel query is going to have: making sure all of these multiple connections are treated as the same transaction, so they have the same view of the data.

And then there's the holy grail we're working on right now. We have a data warehouse that I've sized: if we put it in Postgres with regular tables, it would be approximately 31 terabytes, which isn't presently viable. In addition, we have an in-memory data store for current-day data, and we've written a foreign data wrapper for that. What we would like to do is keep the far-archival data in Redshift, where storage is cheap, keep using Vertica for the more recent rolling two months of data, and be able to query all three of these at once. With PMPP you can do that, in a trivial way: one query is sent off to the in-memory cache, one query is sent off to Vertica for the recent data, and, if necessary, one is sent off to Redshift for the far-archival data. Postgres is still the one doing the final summation, which is then passed back to the client — and the client front end never needed anything more than a single Postgres connection.
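A sketch of what that could look like in one statement. The host names, table names, and composite type are invented, and the call assumes the multi-host form of distribute accepts the JSON manifest described earlier — treat the exact signature as a placeholder, not the extension's documented API:

    -- Hypothetical rollup type shared by all three stores.
    CREATE TYPE t_rollup AS (client_id integer, impressions bigint);

    SELECT client_id, sum(impressions) AS impressions   -- final summation happens in Postgres
    FROM pmpp.distribute(
           null::t_rollup,
           $manifest$[
             { "connection":  "host=memcache-proxy dbname=today",
               "queries":     ["SELECT client_id, impressions FROM today_rollup"],
               "num_workers": 1 },
             { "connection":  "host=vertica-node dbname=recent",
               "queries":     ["SELECT client_id, impressions FROM recent_rollup"],
               "num_workers": 1 },
             { "connection":  "host=redshift-cluster dbname=archive",
               "queries":     ["SELECT client_id, impressions FROM archive_rollup"],
               "num_workers": 1 }
           ]$manifest$::json
         )
    GROUP BY client_id;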
Now, some questions I thought you would have and that I probably went over a little too fast. Is there password danger here? Yes, there is. Anybody with pg_stat_activity visibility on the calling node will be able to see the passwords passed in through the connection strings. That's dangerous, that's bad, and you'd better have security built around it — either so that it doesn't happen, or so that the people who shouldn't see those passwords don't have pg_stat_activity visibility. Actually, that's kind of a problem for foreign data wrappers too, because the passwords are stored right in the data dictionary, so it's not really unique to my project. Other things you can run into: you don't know how many connections are actually available on that machine, and we're not going to tell you. You could run out of connections, and if you do, your query is going to die. Your connection string could also be pointing at a pooler, which will give you the connections, but that's not actually what you wanted — you're just creating extra overhead for yourself, because if you ask for 30 workers and it only really gives you five, all of those queries will be out there waiting. It will not work out well for you, although you would probably get an answer. Let's see.

If the other machine doesn't have PMPP installed, that's easy: you just can't use the multiplier. You have to use the JSON form, where you specify the number of workers. Practical considerations: in Oracle and SQL Server, it's been my experience that the number of workers you fire off is some multiple of the number of CPUs on the box — 2x or 4x for Oracle, which I have more familiarity with. Now, that could have been the version of Oracle I was on, the SAN we were working on, or the type of machine, but it was pretty consistent that you wanted some multiple of the CPU count as your number of workers. Not the case in Postgres. Nearly everything we do is CPU bound. That could be because we're on EC2 boxes using SSDs; it could be that our data is clustered in some convenient way — I don't know, but we get CPU bound very early. So I don't go higher than 1.0: just get as many of the CPUs working as possible.

What am I hoping to do with this in the future? Well, I had hoped to have it on PGXN by now, but I haven't had time to work through the licensing issues; that will happen eventually. I would like to have a CPU-detection extension to couple with it so we don't have to use nproc anymore — nproc is available on Ubuntu machines, but I'm not so sure about other Linuxes. There is a patch to dblink currently in the 2015-06 commitfest, which may not even be for 9.5, as I learned last night. I hope to get that one accepted, and I'm learning the ins and outs of the patch process through it. And most of all, the goal of this project is to become obsolete. Everything I've done here is a half measure to carry us through to a time when Postgres can actually do a lot of this stuff itself, and then this whole talk will seem very, very quaint. I now open up the floor for questions. Sir?

[Audience] A couple of observations and one question to start with. The 2015-06 commitfest is not for 9.5, but 2015-02 is.

[Corey] Yeah, I've got some waiting to do.

[Audience] If it's a small enough patch, you might be able to work that in. And as far as not being able to PREPARE in PL/pgSQL — that's true as far as it goes, but you should still be getting pretty good plan caching there.

[Corey] When I was benchmarking it, there seemed to be additional overhead. I assumed it was reparsing the string all the time, so I don't actually know what was going on there.

[Audience] Because of the call to dblink — there are conditions under which plans can be cached and conditions under which they can't. I honestly don't understand all the rules, so you might be running afoul of one of them and it's not actually caching the plan.

[Corey] Yeah, that may be it. Even if there were no performance benefit, I would still like to have the polymorphic dblink option, just because it takes probably a third of the code away and makes it so much cleaner — the polymorphism gets pushed all the way down, and that's powerful. So I'm keeping a careful eye on it.

[Audience] Just one other thing I was going to say: when they did parallel pg_dump, they created a way to connect multiple backends to the same snapshot. I'm 99% sure that's not available from PL/pgSQL, but if you're looking at a C module, you may be able to do it from there. Actually — multiple backends on the same snapshot, because they do that for pg_dump — it is available at the SQL level. It definitely is.

[Corey] OK, I'm not aware of that.
[Audience] When you do it, the first statement after a BEGIN is something like an export-snapshot call, and then you can do BEGINs on other connections and there's an import-snapshot command, something like that. So it's available at the SQL level. Anyway, you might want to look into it, because it would address the consistency issue you were having with your sum-of-sums operation.

[Corey] Yeah. That was actually not a serious concern for us given our workload, but it was one thing I realized, presenting this out in the wild, that I would definitely want to flag as a serious concern. There's a lot of that here — and that was really more what this talk was about, not so much the parallelism. That was the goal, but I got the chance to use and abuse four or five not-so-commonly-used Postgres features here, and this was fun. That was the best part of it. Any other questions? Sorry.

[Audience] One quick note: you can keep passwords out of pg_stat_activity by creating a foreign server definition, which you can then point dblink at. The user can still see the password, but it doesn't show up in pg_stat_activity.

[Corey] OK, I'd be interested in hearing more about that.

[Audience] I can show it to you if you want.

[Corey] Please.

[Audience] I haven't asked my actual question yet, though, which is: you're a Redshift user, and this is cool, but why do you need this if you're a Redshift user?

[Corey] Because the data we have isn't all in Redshift. Some of it is live in an in-memory database, being collected right now — today's data lives in memory and gets dumped out once a night. So even if Redshift had the correlation functions we wanted, which it doesn't; even if Redshift could deal with the fact that some of our clients are big websites — BuzzFeed, Forbes, AOL, very large clients — while we have other, smaller clients that are a hundredth the size. It doesn't make sense to run parallel queries for those small clients: they have so much less data that all of this overhead is something you want to avoid for them. And in the case of Redshift, you've got to load everybody on the bus even for the small queries. There are times when you know a query is going to bring back 1,000 rows, and you can't justify firing up 256 cores for 1,000 rows. But even setting that aside, some of the data just isn't in Redshift, and we wanted the ability to transparently query both.

[Corey] In front of him? Yes.

[Audience] I'm kind of confused about how this parallelizes. Is it parallelizing one query to run 32 times, or however many times?

[Corey] No. OK, back to the super, super trivial example: there are three distinct queries there — extremely trivial ones — and this will generate three workers, each with that connection string. It gives the first worker the first query, the second worker the second, and so on. Now, if your box only had two CPUs, it would have fired up two workers, handed out the first and second queries, polled them until they completed, and then fed the third query to one of those workers. So it doesn't parallelize a single long-running query. This is more of a map-reduce kind of thing: you have to delineate the units of work that are themselves logically distinct. PL/Proxy, for instance, is a good way to do something that is transparent, but it comes with the limitation that now all of your data is proxied. You don't have the choice; you can't opt out when you know better — when the query is small enough.
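For reference, the snapshot sharing described above is the standard pg_export_snapshot / SET TRANSACTION SNAPSHOT mechanism. A minimal sketch, with placeholder names and with the snapshot identifier standing in for whatever the first session actually returns:

    -- Coordinating session: open a transaction and export its snapshot.
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT pg_export_snapshot();    -- returns an identifier, e.g. '00000003-0000001B-1'

    -- Each worker session: adopt that snapshot before running its partial query.
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SET TRANSACTION SNAPSHOT '00000003-0000001B-1';   -- value copied from the first session
    SELECT count(*) FROM some_partition;               -- sees the same data as the coordinator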
[Audience] So if you wanted to parallelize a long-running query — say, a SELECT count(*) from a trillion-row table — you would have to create buckets of ID ranges or something yourself?

[Corey] ID ranges would be one way to do it, yes. That's the old way, and it has its problems, but yeah, that's kind of how you'd have to do it. In our case, the data in the pure-Postgres database — a different product, not the one that's in Redshift — is heavily date-partitioned, so it very much lends itself to that, and the partition exclusion there is very powerful. So this became much more handy for queries on that. Anyone else? All right, thank you very much.