The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Stephen Moy Foundation for Keeping It Real; find out how best to keep it real at StephenMoyFoundation.org.

Thanks everyone for coming. It's another session of the Vaccination Database Seminar series. Today we're excited to have Marco Slot from Citus, which was acquired by Microsoft in 2019. Marco lives in the Netherlands, so right now it is, what, 10, 10:30 p.m. for you? So we appreciate you staying up late with us. He is a principal software engineer at Microsoft, where he leads the development of Citus as part of the Azure Database for PostgreSQL group. Prior to that, he joined Citus in the early stages of the startup in 2014. He also has a PhD in self-driving cars, which this talk will not be about; that's a whole other life of his. And he has a master's degree and an undergraduate degree in computer systems from the VU in Amsterdam. So again, we thank Marco for being here. As always, the way we'll do this: if you have a question for Marco, please interrupt him at any time during the talk. Unmute yourself, say where you're coming from, and ask your question. We want this to be interactive, so feel free to stop him at any time. Marco, the floor is yours. Thank you for doing this. Go ahead.

All right, yeah, thank you very much. Very excited to talk to you. I mean, the lineup of this series of database talks is super impressive, so I feel privileged to be here talking to you.

So I'm going to talk about the project I've been working on for the last seven years, which is Citus, an extension that aims to turn Postgres into a distributed database. An extension is, like, Postgres terminology for a plug-in: you use the regular vanilla Postgres binaries, and then you can install the extension on top. The goal of an extension is to add new features to Postgres without actually changing the underlying database code. The two features that Citus adds are distributed tables and reference tables, and everything else we do is pretty much about making existing and new Postgres features work seamlessly with these two types of tables. The goal there is: Postgres has a lot of versatility, and that actually also makes it simple to use, because pretty much any SQL query you can come up with just works. We want to provide that versatility at scale, such that you can use the extensions, things like PostGIS, and any type of index that Postgres has, on more than a single server.

That also means we're trying to make Citus multi-purpose. We're both trying to scale transactional workloads, by routing queries to the right worker node, and to scale analytical workloads, by adding parallelism and also columnar storage. So far we've done columnar storage via a separate extension, but we're now integrating that into Citus itself.

So why be an extension rather than a fork? Traditionally, Postgres has been forked quite a lot; a lot of projects have started from the Postgres code base, because it has a very liberal license, it's quite easy to read, and it's very well structured. But the problem is, Postgres just keeps going. Every year there's going to be a new release packed with features, performance improvements, new ways to extend the database, and so on.
And that means that not only do you have to keep up with that, either by integrating those changes or by just saying, okay, we're going to be an older Postgres, but there's also this vast ecosystem of tools and other extensions that ultimately always converges on the new version of Postgres. No one is going to maintain their extension for your fork. So if you want to continue taking advantage of that ecosystem, you have to be an extension as well. It's been very beneficial for us to make Citus an extension.

So why Citus? Why scale out Postgres, or make Postgres distributed, at all? Well, obviously Postgres is limited to a single server, and that leads to a lot of different capacity and execution-time issues. You can have a pretty large database just by attaching a lot of disks, but then your working set doesn't fit in memory. What we see a lot nowadays in cloud environments is that everyone uses network-attached storage, because it's replicated and encrypted, but then you have a limited number of IOPS that you can use on your server. A lot of operations in Postgres are either single-threaded, or do have parallelism, but it's not always as efficient or versatile as you'd like it to be. And there's a common, sort of Postgres-specific problem: because it uses MVCC, an update inserts a new row and the old row is left in place. That needs to be cleaned up, but cleanup is a single-threaded process, at least per table; it can do multiple tables in parallel, but per table it's single-threaded. So sometimes it cannot keep up with your transactional workload, everything gets bloated, and your queries get slower. So these are some typical problems you might experience if you put too much pressure on a single server.

But that's the Postgres-centric view of why you might want to scale out. If you look at it from a different perspective, from other databases looking at Postgres, there are really a lot of capabilities in Postgres that are hard to find, or at least hard to find all together, in other systems. Postgres has a very versatile set of indexes: expression indexes, covering indexes, partial indexes. There are geospatial extensions like PostGIS. There are, of course, joins, functions, constraints, good JSON support, and also updates. There are a lot of data processing systems that have some of these, but not all the ones you might want. And sometimes a particular extension, like a particular sketch algorithm, or even something you built yourself, can really make a huge performance difference for a lot of workloads. The only thing missing from this list is really scale: the ability to go beyond a single server. So that's what we're trying to add.

Of course, not every Postgres workload can necessarily benefit from the ability to scale. There are a lot of, especially smaller-scale, OLTP workloads which are maybe bottlenecked on lock contention, or maybe have a lot of super complex stored procedures; in any case, they don't really benefit from scaling out. And similarly, there are a lot of workloads that benefit from scale-out but don't really have any business with Postgres; machine learning is not something Postgres currently shines at. So in this intersection, what we found over the years of developing Citus is that there are these four patterns that we see a lot.

So, I agree with you.
Is that the sort of thing you may not always want to scale out? But the thing you said about stored procedures, why would that be an issue?

Well, it's often not specifically about stored procedures, but they often create a lot of dependencies between, let's say, different tuples in the database, such that every stored procedure essentially has to touch every single server in the cluster.

Oh, it's just like the TT problem, yeah.

Right, right. Like, yeah, you get distributed queries, distributed transactions. And it's also very common for them to grow over time, to just get more and more complex.

OK, that makes sense.

So what we've seen in this intersection of Postgres and scale-out is that there are certain recurring workload patterns; we've talked to literally hundreds of companies, probably many more. One pattern we see a lot is software as a service: you have a web application, maybe a mobile app as well, and you have a lot of independent customers. With this kind of setup it's very easy to add customers, so those businesses can grow quite quickly. And because you have a lot of independent customers, the working set can get quite large, since there's no real correlation between what different users are doing. These applications also tend to have pretty complex data models, especially by the time they start thinking about scaling out. But fortunately, because they're multi-tenant, we can shard them by the tenant dimension. So they need scale-out, but they also can scale out.

Another common workload pattern is real-time analytics: user-facing dashboards that maybe do thousands of analytical queries per second, touching a fairly large amount of data. That's not something OLAP stores are generally good at: they're good at scanning lots of data quickly, but doing thousands of queries per second, usually not so much. And OLTP stores usually don't have the parallelism and the scale. So there's this intersection where, if you can do parallelism plus also have indexes, that helps you serve these real-time analytics use cases. I actually also find things like updates very important here: you often ingest bad data that you might want to clean up later, or you have to delete specific data for compliance reasons. That also makes Postgres quite useful there.

For the other workloads: with Citus we focused a bit more on the first two, because while we were doing this they were a bit more niche. The other two we've also just seen a lot, but focused on a bit less; we can still reasonably support them. One is the traditional NoSQL use case, key-value storage. Postgres is quite popular for that type of workload as well, particularly because it has quite good JSON support. But one of the downsides is that if you start updating your JSON objects, you get an enormous amount of table bloat. This is where the autovacuum problem comes in, and sharding helps a lot with that; there's a small sketch of this pattern below. And finally, there are more traditional OLAP reporting use cases, where you have a large amount of data and fairly complex queries that you want to answer, ideally as quickly as possible; it can easily take minutes or sometimes hours, but preferably not days.
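To make that key-value pattern concrete, here is a tiny sketch on plain Postgres; the table and column names are invented for illustration:

```sql
-- A schemaless document store on vanilla Postgres:
CREATE TABLE objects (
    key   text PRIMARY KEY,
    value jsonb
);

-- A GIN index makes containment queries over the documents fast:
CREATE INDEX objects_value_idx ON objects USING GIN (value);

-- Every UPDATE writes a whole new row version (MVCC), so a heavy stream of
-- JSONB updates bloats the table until vacuum, which is single-threaded per
-- table, catches up; sharding spreads that vacuum work across many shards:
UPDATE objects
SET    value = value || '{"last_seen": "2021-04-12"}'::jsonb
WHERE  key = 'user-123';
```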
So what's interesting about these different workload patterns is that they have very different notions of high performance and of what it means to scale. In analytical workloads, things like parallelizing queries are very important; for a multi-tenant workload that's useful, but not really essential. For a multi-tenant workload, on the other hand, it's very important that we can route queries with very minimal overhead, and also support very complex queries with very minimal overhead. So they actually require quite different capabilities from the database. And you can make a much longer list, also looking at existing Postgres capabilities, including foreign keys. So we need to somehow implement all these different things in Postgres, and the question is, how can we do that as an extension, without changing any Postgres code?

I guess Citus's history sort of starts in 1986, with the second design goal of the Postgres database: provide extendability for data types, operators and access methods. So from the very beginning, Postgres was designed to be extensible. Over time, as the Postgres, later PostgreSQL, project developed, many more things were made extensible, including the planner, the executor, and parts of the transaction manager. There are foreign data wrappers for querying other databases. An extension can create a background worker, a background process that can do some maintenance. Sometimes this was just a good compromise within the community, because Postgres is very much a community project: if one company wants to go left and another wants to go right, a simple solution is to put a hook there, and then they can both go in the direction they want by building an extension. And people have kept making Postgres more extensible; there are now even more things than on this list that you can alter as an extension.

So what is an extension? It's actually a native SQL object in Postgres; an extension is a thing you can create. The extension consists of, essentially, a SQL file that can create tables, functions, types, whatever you can create in the database. And then, optionally, there's a shared library, usually written in C, that has the implementations of the functions. Because it's a C binary that's loaded into Postgres, it can also modify global variables and call any global function, and that's how a lot of the changing of the behavior is done. So an extension is basically a bit like a massive, packaged hack, but one that gets loaded at runtime.

So to set up a Citus cluster, you set up a bunch of Postgres servers and then you do CREATE EXTENSION citus, which creates the metadata tables and the functions. After that, we can use those functions to configure it: we add the nodes and their IP addresses, and set up the authentication. Usually we start from one node, the coordinator node, and then add the worker nodes. There are other setups too; we're not strictly tied to having a single coordinator. You can have the workers act as coordinators, so they'll be able to respond to distributed queries as well. You can also have multiple coordinators.
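As a minimal sketch of that setup flow, using the function names from recent Citus releases (older versions call these master_add_node and friends); the hostnames are placeholders:

```sql
-- On the coordinator, after installing the Citus packages on every node
-- and adding citus to shared_preload_libraries:
CREATE EXTENSION citus;

-- Register the worker nodes in the Citus metadata:
SELECT citus_add_node('worker-1', 5432);
SELECT citus_add_node('worker-2', 5432);

-- The metadata lives in ordinary tables created by the extension:
SELECT nodename, nodeport FROM pg_dist_node;
```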
So there's some flexibility around the exact setup, but very often we find that for many workloads a single coordinator is actually okay, because it only really does query planning and some data transfer, so you can scale up to pretty large sizes on a single coordinator. A typical production deployment then looks something like this, where each node typically has a hot standby. For things like high availability we just rely on existing Postgres tooling, the same thing we would do when building a managed service for single-server Postgres: auto-failover, where if the primary fails, we fail over to the hot standby within 30 seconds or so. Same for backups, at the server level: we archive the write-ahead logs into blob storage and make snapshots of the disks. So Citus is mostly concerned with sharding, and less with high availability, because there we have other tools. And in the single-coordinator model you just connect to one IP, which also works a little better with most Postgres tools. When you have a load balancer balancing across multiple worker nodes, you might connect to one node to create a table, then reconnect, and the table is gone, because now you're actually on a different server. So also for simplicity, we usually recommend this setup.

Now, once you've set up Citus, to actually use it, the first thing you do is create a table. We don't change the default behavior of Postgres, so creating a table just means creating a table on the coordinator, on that Postgres server. Then you call SELECT create_distributed_table(...), one of the Citus functions, and there you specify the distribution column. This function goes and connects to the worker nodes and creates the shards. The shards are just regular Postgres tables on the worker nodes, so they can have indexes and constraints; we don't seamlessly support triggers, but they can have triggers as well. Whatever Postgres supports, you can have on those tables. We use hash partitioning, so each shard contains a different range of hash values, and we put multiple shards per worker node. We do this for two reasons. One is that we can then parallelize operations across multiple cores by querying multiple shards in parallel, and that also works for operations like DELETE and UPDATE and CREATE INDEX, which works a little better than the built-in parallelism in Postgres. The other reason is that as the cluster grows, as worker nodes are added, we can move shards using logical replication, without downtime. That's the advantage of having multiple shards per worker node: we can later move some data away without downtime. And if at some point you run out of shards, it is possible to reshard, but that does incur some downtime right now; we'll probably resolve that in the future.

So there are actually two table types, distributed tables and reference tables, and distributed tables can be co-located with each other. If you don't specify anything, we'll actually try to auto-co-locate based on the type of the distribution column. Co-location means that the same hash range will be in the same place. That means you can have foreign keys on the distribution column, and joins on the distribution column will just be executed locally by joining the shards. And if you need foreign keys or joins on different columns, you can create reference tables.
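A hedged sketch of the two table types, with a schema invented for illustration (the same shapes come back in the rollup example later):

```sql
-- Two tables sharded on the same column. Co-location means matching hash
-- ranges land on the same worker, so joins and foreign keys on tenant_id
-- stay local to each node:
CREATE TABLE tenants (
    tenant_id bigint PRIMARY KEY,
    name      text
);
CREATE TABLE events (
    tenant_id  bigint,
    event_id   bigserial,
    payload    jsonb,
    created_at timestamptz DEFAULT now()
);

SELECT create_distributed_table('tenants', 'tenant_id');
SELECT create_distributed_table('events', 'tenant_id', colocate_with => 'tenants');

-- A small table that everything joins against can become a reference
-- table, replicated to every worker:
CREATE TABLE user_agents (
    agent_id     int PRIMARY KEY,
    agent_string text
);
SELECT create_reference_table('user_agents');
```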
Reference tables are replicated across all the workers. These are sort of obvious features that you find in other distributed databases as well, but I want to stress them, because they're actually super important for scaling relational workloads. Especially reference tables, which are themselves just small tables that you replicate everywhere. They're kind of inefficient in that way, but, for example, they help you normalize your data. If you're storing really large volumes of something like website hits, maybe you store a user agent string, and maybe there are only so many distinct user agent strings or substrings. So instead of storing them with every hit, you can store them in a reference table and just store an opaque identifier. Suddenly that can mean your column is four bytes instead of a hundred bytes, and at scale that can matter a lot, especially in terms of cost.

So once you've created your distributed tables and reference tables, when you query a distributed table, Citus answers that query by sending queries to all the shards and then merging the responses. How can Citus change this behavior of Postgres? Here we're actually changing the query engine in some way, and the reason we can is that Postgres has all these function pointers, declared as global variables, for extensions to change. This is almost literally copy-pasted from the Postgres code, probably from planner.c rather than postgres.c: the first thing the main planner function in Postgres does, the function that gets called after parsing the query and building the AST, is check whether there's a planner hook set; if so, call the planner hook, otherwise go into the standard planner. What the Citus extension does, when the shared library gets loaded (and you have to configure it to be loaded on startup), is set that planner hook. Which means that every query that gets planned by Postgres goes through the Citus planner hook, and we can do anything there. We can make every query return hello world, we can make every query send out a tweet. But the nice thing to do is to look at the query: are there any distributed tables in there? If so, we go to Citus query planning; if not, we just call back into the standard planner. That way we preserve all the regular Postgres behavior and only do something special for distributed tables.

So that means you had to write your own query optimizer, right? Because you're operating on the AST, right? So does that mean it's cost-based, or is it heuristics? How comprehensive is the Citus query optimizer?

The Citus query optimizer is, for the most part, pretty simple, but I'll get to it in the next few slides and then I can elaborate.

So the thing the planner function needs to return is a planned statement, which contains a plan tree. Postgres has this Volcano-style executor: there are execution nodes, and it asks the top-level node for a tuple, and that node can have child nodes which it asks for tuples. So we do need to return a valid Postgres plan, but the plan can contain custom scans. Custom scans are how we hook into the executor: when the custom scan gets called, it calls a custom Citus function, which calls into a component we call the adaptive executor, whose job is basically to send queries to worker nodes and return the tuples that come back.
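To see how these pieces show up in practice, here is roughly the shape of an EXPLAIN for a distributed query. This is sketched from the general form of Citus plan output, not copied from a real run, so treat the labels and numbers as illustrative:

```sql
EXPLAIN SELECT count(*) FROM events;

--  Aggregate                           <- regular Postgres node, merges results
--    ->  Custom Scan (Citus Adaptive)  <- hands tasks to the adaptive executor
--          Task Count: 32              <- one task per shard
--          Tasks Shown: One of 32
--          ->  Task
--                Query: SELECT count(*) FROM events_102008 events
```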
And the merging of the tuples that come back is done just by the Postgres executor, by that aggregate node on top, a hash aggregate in this case.

As for the actual optimizer: organically, over time, we've ended up with this four-stage optimizer. At the time, each stage was maybe more out of expediency, to improve Citus for a particular workload, but at some point we realized our planner matched exactly those four workload patterns we discussed at the start. The first stage: if the query is very simple, like a SELECT from a table WHERE the distribution column equals something, or a delete or update or insert, we try to recognize those queries with minimal overhead, construct a query where we replace the table name with the shard name, and send it to the worker node.

The next stage is what we call the router planner. The router planner can deal with arbitrary SQL queries; anything that Postgres supports is also supported by the router planner, as long as all the distributed tables in the query have the same distribution column filter. It's somewhat clever about inferring that: if there's a join on the distribution column and there's a filter on one side, it can infer that the other side has it as well. Because it then knows this query just goes to a single Postgres server, it can take the query and send it to that server, and it does this for deletes and inserts as well.

After that, it means we have to merge data from multiple shards. Then we go through one more stage, where at this point we do construct a logical plan: we convert the AST into a logical plan that we optimize by applying commutativity rules. We figure out how to split up aggregates, what we can push down and what we cannot push down. One observation we had is that, because we have these co-location and reference table features, very often, either organically or intentionally, you get query trees which only have co-located joins, and we can parallelize those in a very efficient manner; I'll get back to that. And if there isn't a co-located join in the join tree, we go through a join order planner, where we decide, okay, we need to repartition this table to match that table; it generates all the possible join orders, with repartition or broadcast operations, and picks the one that it expects to have the least amount of network traffic.

Now, this part of Citus, because we've focused a bit more on real-time analytics and multi-tenant, we've not invested in as much. So there are some limitations there, around correlated subqueries, and we're also actually not exceedingly clever about cost here yet, because we've not focused so much on that data warehousing workload. So that's in a way the answer to your question as well: the data warehousing part of Citus still needs some work, let's say. One thing we're looking at: Postgres also has deeper hooks in the planner, where you can integrate into its cost-based optimizer; you propose paths to it that it can pick between, and it'll pick the lowest-cost plan. We're looking at doing that in the future to make the data warehousing queries a little more optimized.

So you don't even go through the Postgres rewriter step?
Like, there's that one part of the code where it does the rewriting, and then you get to the cost-based search. You're intercepting before you even get to that rewriter phase?

Right, yes. I mean, we basically overtake the whole planner; we do the optimization from scratch.

And what do you send to the nodes then? Are you sending SQL, or the AST?

We're sending SQL. So on the nodes they go through all of that; things like 'one equals two is false' are all handled on the individual nodes.

Okay, that makes sense.

Exactly, exactly. Yeah, one of the interesting stages here is the query push-down optimizer. When there's this co-located join, you can have pretty complex queries in the middle here; we don't even really care what's in there. The only thing we care about is: is there a co-located join between the distribution columns? It could also involve reference tables. There are certain other restrictions on the subqueries, but as long as those conditions are met, we can treat this whole thing pretty much as a table, because we send SQL to the worker nodes. We just take whatever's in there, replace the table names with the shard names, and send that to the worker nodes. Then we only have to optimize what remains of the query, which is relatively straightforward; for example, we push down partial aggregates to the workers. So this would be an example plan: on the coordinator we have a tree that merges the results, but most of the query actually just gets pushed down to the workers, and a worker can use whatever facilities it has for answering the query efficiently, whether that's indexes or partitioning for time filters, et cetera. So we leave a lot of the heavy lifting to Postgres, let's say.

Also, before we get to query push-down planning, we try to catch subqueries where we already know how they're going to be planned anyway, as in: we'll have to merge results on the coordinator and broadcast them. We take those out up front and create subplans for them, then change the query tree to read the results of the subplans, because then we can treat the main query as something push-downable. The subplan gets executed, but the results don't get pulled up; they get pushed, or broadcast, into the cluster. Then we can join the shards with the intermediate results that were created there. We also have some other ways of creating subplans, like repartitioning. This makes the query push-down planner a lot more effective, because we can make many more things go through query push-down. There's a lot more to be said about the planner, especially the join order planner, but I don't think Citus does anything unusual there, and there's still work to be done.

So I want to talk a bit more about transactions. Again, there's a set of hooks, or callbacks, that we can set. In particular, there's a pre-commit callback, a post-commit callback, and an abort callback, plus some other ones around subtransactions. If we're dealing with a transaction that has only touched a single node, we pretty much just delegate the decision to the worker node: in the pre-commit hook we send the commit command to the worker node, and then it decides whether it commits or crashes. So single-node transactions are fairly straightforward.
Transactions across multiple nodes use the built-in 2PC infrastructure in Postgres. Postgres has the ability to prepare a transaction: in that case we're no longer in a transaction block, but all the locks the transaction took are preserved, even across restarts or failovers, and later we can commit or roll back that prepared transaction. That's the infrastructure Postgres has; it doesn't implement a full 2PC protocol itself, so we do that. Basically, in the pre-commit hook we send these PREPARE TRANSACTION commands, and post-commit we send COMMIT PREPARED, or, if anything goes wrong, ROLLBACK PREPARED.

Then, of course, what if the COMMIT PREPARED fails, or we leave a stray prepared transaction? There we use the background worker infrastructure. We have a background worker that periodically compares the prepared transactions on the workers to a local log of commit records. Just before the coordinator commits a 2PC, it writes a record for each of the prepared transactions into a local table, and if it actually commits, those records become visible. That's how the background worker can determine: oh, I see a prepared transaction on that worker and I have a record for it in my local log, so that should be committed; or, there's no record, so I guess the coordinator aborted, and it should be rolled back. There might also be records for transactions that have already been concluded, in which case it can just delete the record. That's how the 2PC recovery is done; and as long as you have replication, the commit log is replicated as well, so we can survive failures too.

The other thing we need to do is deadlock detection. We don't use something like wound-wait to prevent deadlocks. The reason is that most Postgres clients expect Postgres to be in read committed mode, which means transactions aren't expected to restart; they'll just treat a restart as a failure. And we also cannot easily restart transactions internally, which is necessary for wound-wait, because the Postgres protocol is interactive: if you send a BEGIN and then an update, you already get some result back from that update, like how many rows were updated. If you then get to a point where you have to restart the transaction, you cannot go back, because you already told the application you updated two rows, and the second time the transaction runs it might update three rows, and then the application might behave differently. So, given that we cannot use wound-wait, we just do deadlock detection.

That means the background worker that does 2PC recovery also, every two seconds, asks all the worker nodes for their lock graphs. When we open a transaction block on the workers, we call this assign_distributed_transaction_id UDF, which tells the worker that this local transaction is part of a bigger distributed transaction, and that ID is returned as part of the local wait graph. From those we can build a global lock graph of the distributed transactions, and if there's a cycle, we cancel one of the transactions, and keep canceling until the cycle is gone; cancellation is just a local SIGINT. Every coordinator does this for itself: if there are multiple coordinators, each does its own 2PC recovery and its own deadlock detection.

We don't have distributed snapshot isolation.
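To make the two-phase commit flow above concrete, here is a sketch of the command sequence. The transaction identifier is made up; the real GIDs are generated internally by Citus:

```sql
-- Pre-commit callback, on every worker the transaction wrote to:
PREPARE TRANSACTION 'citus_demo_gid_1';   -- locks now survive restart/failover

-- The coordinator then writes one commit record per prepared transaction
-- into a local metadata table and commits locally; the records become
-- visible atomically with that local commit.

-- Post-commit callback, on every worker:
COMMIT PREPARED 'citus_demo_gid_1';
-- ...or, if anything failed before the coordinator's local commit:
ROLLBACK PREPARED 'citus_demo_gid_1';

-- The maintenance background worker periodically compares each worker's
-- pg_prepared_xacts against the local commit records to commit or roll
-- back any stragglers.
```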
So, on that missing snapshot isolation: a concurrent query might see a transaction as committed on one node and not yet committed on another node. There are kind of two reasons for that. One is that the snapshot manager is one of the few pieces of Postgres that is not very extensible. The other reason we haven't really changed that yet is that, within the workload patterns we target, it hasn't really been an issue. Multi-tenant workloads tend to just have transactions going to a single machine, so you just get the Postgres guarantees; same for CRUD workloads. And analytical workloads usually have different concerns; real-time analytics workloads especially tend to have noisy, missing data anyway, so it doesn't really come up. Basically, we haven't had a customer come to us and say, oh, I have this problem, or I really need this. So that's the other reason; but it would also require some changes to Postgres.

Right, so you said that your customers are running read committed, the default isolation level, and for that reason you can't use wound-wait. How is that different from deadlock detection? Like, if my transaction gets aborted, I don't care why.

Yeah, with wound-wait a transaction might have to restart, because if you're waiting for a lock held by a process with higher priority than you...

But how is that different from deadlock detection? Like, I have a deadlock, I've got to get aborted.

Well, if there's an actual deadlock, there's no way forward anyway, so we might as well stop. And with deadlock detection, a deadlock is usually a consequence of how you've ordered the statements in your transaction, so you have the option of reordering your transaction. Whereas wound-wait also restarts for internal reasons, say when some priority inversion occurs. And the reason read committed is relevant here is that, if you were running repeatable read or serializable, you'd actually expect Postgres to occasionally ask you to restart transactions, because that's the way SSI in Postgres is implemented. So if you're assuming SSI, you've already built your application in such a way that it'll restart transactions; but most applications don't do that.

Okay, so from an application standpoint, you're basically saying the applications are not written assuming they're going to have to restart.

Right, right. And we're often in a situation where we're trying to migrate a very complex Postgres application onto Citus, so that level of change is often too much. That's where wound-wait isn't really an option, and deadlock detection at least gives you something actionable: the deadlock occurred because you did things in the opposite order, so you can hopefully change that part of your application, like reorder your update statements.

That's the part I don't think matters, because with wound-wait, again, it's arbitrary when you show up. But I think what you're basically saying is that you expose that there was a deadlock, and that's more intuitive to the application developers than this, you know, technical wound-wait behavior.

Right, right. Yeah, it's not that one is technically better than the other; one is just more Postgres-friendly and compatible than the other, because Postgres itself already does deadlock detection, so it has the exact same semantics.

Yeah, okay. I'm sorry.
Yeah, so sometimes we pick not the best technical solution, but the most Postgres-compatible solution. Everything is kind of like that inside Citus.

I'm showing this also because there's one more important hook that you need to be able to turn Postgres into a distributed database, which is the ProcessUtility hook. It comes in for anything that's not select, insert, update, or delete: those go through the planner, but everything else goes through this hook. One of the commands that goes through there is the COPY command, which in Postgres is just a command to append CSV input to a table. Because Postgres has a heap format, the tables don't have any kind of primary index; everything is just stored in an unordered way, so appending to a table is actually very fast. What happens with Citus is that if the extension sees a COPY command, it actually starts reading from the socket itself. It does some lightweight parsing on the individual rows, extracts the distribution column, picks the shard, and then sends the row to the workers, using the COPY protocol as well. And because the COPY protocol is asynchronous, you don't wait for a row to get written; it can immediately do the next row, and send that, and the next row, and send that. So you get a sort of partial parallelization of your COPY commands as well, which is especially useful when there are a lot of indexes; you can afford more indexes, and that way your reads also get faster.

Another interesting command is INSERT ... SELECT. Especially in real-time analytics use cases, users use this a lot to create roll-up tables, like incremental materialized views, where you pre-aggregate your data along certain dimensions. Because this is very important to us, we actually have three different implementations of it, or three different execution plans for it. One: the tables are co-located, so we can just do the INSERT ... SELECT locally, between pairs, or rather sets, of co-located shards. Two: if they're not co-located, we can do the select part on the shards in such a way that the output of each query on a shard produces a set of temporary files that correspond to the shards of the destination table; then we fetch those files to the destination shards and insert into each destination shard, selecting from the files. This actually also scales pretty well, up to the bisection bandwidth of the cluster. And finally, if we cannot do either of those things, because there's a merge step in the select, then we pull the data to the coordinator and actually use COPY to write it to the destination table, which, again, is pretty fast. So INSERT ... SELECT is a bit of a superpower in Citus, at least compared to Postgres.

So I want to do a short, very simple demo, just on my local laptop. I have a three-node Citus cluster here, with two worker nodes. This is the development branch of Citus, by the way; I won't do anything too fancy. To get started, I create a table, and then I use this create_distributed_table function; I have it set up to actually show what it's sending to the workers. You can see it's doing the CREATE TABLE on each of the workers, multiple times, for all the shards. And in the end it does this PREPARE TRANSACTION, COMMIT PREPARED. Maybe, if I'm fast, I'll catch the commit log.
Okay, so I can see these records it wrote to remember that it did these prepared transactions. Now I can just run a query on my table, and I can see the SQL queries that get sent to the workers. I can also see the Postgres query plan. This corresponds to the slide I had earlier: there's this aggregate node on top, which calls into the custom scan, which uses the adaptive executor, and that has 32 tasks to do, for 32 shards; this is just showing one of the queries it's going to send to the shards.

I can also do more. This is an example of a push-downable query: it has some subquery with joins of other joins, but in the end it's actually joining on the distribution column. By detecting that, we can take this whole subquery and essentially treat it as a table, but then Postgres on the worker nodes does all the optimizations for the subquery. I can also change this query: same query, but I add a distribution column filter. It'll kind of transitively see that, okay, actually both sides of the join have this filter, so I can just send this to a single worker node.

So, let's actually get some data in. The INSERT ... SELECT: if I do this type of INSERT ... SELECT, I can see it goes pretty quickly, and here I can see it's actually inserting into the destination shard, selecting directly from the source shard; in this case it's the same table, and a table is implicitly co-located with itself. But if I, for example, swap that around, it does a whole lot of stuff. This is basically the repartitioning going on. It starts with this worker partition query results call, which runs a SQL query on the shard and has a whole bunch of hash buckets that it writes the results into. Once it's done with that, it fetches the buckets to the destination shards, and then inserts into each destination shard, reading from those buckets. All these things, like repartitioning, are implemented using plain function calls that we can include in the SQL queries we send to the workers. And all of this is also transactional: if I do this and then either cancel it or it aborts, we just send rollback commands. So I can do all this reshuffling in a transactional way.

So, some things we've learned over the years. I guess the first lesson is that being an extension is a pretty good idea. Postgres has all the extensibility you need to build an entire distributed database system, in a very seamless manner, as in: there's basically no operation we cannot cover. There are maybe a few very deep down inside, but we've pretty much created this whole facade on Postgres that makes it look distributed.

Now, we do find it's very important to make good distribution choices: to pick your distribution columns well and pick your co-location well. That can be difficult for users; it's a new skill to learn, on top of indexes and other data modeling features. So what we've done a lot is simply help customers onboard onto Citus, which is not a very scalable thing to do for small customers, but we've noticed that over time they tend to scale out, and then it kind of pays itself back.

Another thing we learned the hard way, and we talked a bit about Postgres compatibility already, with this notion of, okay, we use deadlock detection because it's more compatible:
we found over time that it's very important to look at exactly what ORMs are sending to the database, because they send a lot of, well, very questionable SQL, with weird subqueries, and they use a lot of savepoints and unnecessary transaction blocks and foreign keys where you wouldn't want foreign keys.

What's the worst? What's the worst ORM? Like, what crazy shit do they send you?

Well, recently we saw one which was doing fairly complicated subqueries in the RETURNING part of an insert. And I was like: why, why, why?

Can you name names?

Well, it is a popular GraphQL ORM.

Done, okay, good.

But yeah, before we had addressed all the different things that common ORMs do, it was very hard to actually convert applications from Postgres to Citus. That's become a lot easier, but you also have to do all these little database features that you might not realize are necessary, or even exist, like cursors, and stored procedures, and different ways of using savepoints, and various other things. So most of our time goes into just solving adoption blockers. We're not necessarily looking at, okay, what's the most interesting technical problem to solve, or the most interesting way to solve it, but at what's keeping our customers from adopting Citus, and also what's bugging them in operation. So we've kept this customer workload focus.

And finally, something we learned the hard way is that we have to be very prescriptive about the problem, about the workload we're addressing. Very early on we started with a more generic 'this is scalable Postgres' message, but, if you remember the Venn diagram, we got a lot of users who were more on the left side of that diagram: they cannot really benefit from scale-out, but they do have a performance problem. By being a lot more prescriptive, hey, if you have a software-as-a-service multi-tenant app, or a real-time analytics app, that's when Citus is a great fit, you also make it so that when you start helping customers adopt your database, you spend time on the ones that are likely to actually be successful. So that's something we just learned the hard way.

Then, some open challenges. One thing we've started exploring a bit more is this notion that, well, actually, having a single-server Postgres database is pretty good: it can do quite a lot of things and it can run pretty complex OLTP workloads pretty well. So why would we even change that? Why not just distribute the one table that happens to be big? This is more of a hybrid model, where some tables are still local to the coordinator and some are distributed. It's something we've only just started doing well, but then you have all these choices: do I keep this table on the coordinator, do I make it a reference table, do I make it a distributed table as well? So we have to get better at recommending to users what to do, or even doing it automatically when possible.

Another area which is actually quite difficult is picking good rollups. If you look at a dashboard, it often has some set of queries, let's say 20 or 50 queries, and the way to make those queries really fast is to create good rollups, where preferably one rollup serves maybe 10 of those queries.
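As a sketch of what such a rollup can look like, reusing the invented events table from earlier and the co-located INSERT ... SELECT path described before (a real pipeline would also track which rows have already been aggregated, rather than relying only on a time window):

```sql
-- A per-tenant, per-minute rollup, co-located with its source table so the
-- INSERT ... SELECT below runs shard-by-shard in parallel on the workers:
CREATE TABLE hits_by_minute (
    tenant_id bigint,
    minute    timestamptz,
    hit_count bigint,
    PRIMARY KEY (tenant_id, minute)
);
SELECT create_distributed_table('hits_by_minute', 'tenant_id',
                                colocate_with => 'events');

-- Periodically fold recent events into the rollup; ON CONFLICT makes the
-- aggregation incremental instead of recomputing from scratch:
INSERT INTO hits_by_minute
SELECT tenant_id, date_trunc('minute', created_at), count(*)
FROM   events
WHERE  created_at >= now() - interval '5 minutes'
GROUP  BY 1, 2
ON CONFLICT (tenant_id, minute)
DO UPDATE SET hit_count = hits_by_minute.hit_count + EXCLUDED.hit_count;
```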
But you can easily end up in a situation where your rollups are much bigger than your source data. So we kind of need an optimizer that can look across a set of queries and then decide: okay, this is a good distribution column, this is a good rollup. That's kind of an open challenge.

Another interesting one: PostGIS is quite popular on Postgres; this is almost like a fifth workload pattern. But the SQL queries involved in PostGIS applications are often quite complex, or just unusual; distance joins, for example, are very common. So we're looking at how we can optimize those.

Another interesting area is nested distributed transactions. Okay, I send an insert to a shard, and that shard has a trigger; but if that trigger inserts into another shard, I kind of have this nested transaction, and I cannot actually see that insert from the outer transaction. That's maybe a bit Postgres-specific, or specific to the way we use Postgres, but it's a bit hard to solve. And then there are other things, like arbitrary foreign keys. And connection scaling is also still somewhat of a challenge, because of the process-per-connection model; this is also something we're trying to fix in Postgres itself.

Do you guys use PgBouncer in front of that, or...?

Yeah. So Postgres is currently especially not great at handling large numbers of idle connections, so it's better to have a PgBouncer in front of it.

And then especially if you run multiple coordinators?

Yes, yeah. When you have multiple coordinators, internally we use PgBouncer to manage the internal connections. But PgBouncer could also use some improvements, I guess, because it's sort of single-threaded.

And that's all I had. So here are some links: Citus is on GitHub, so you can track everything we're doing, and if you want to try it out, it's easy to set up locally. You can also, in the Azure portal, if you go to Azure Database for PostgreSQL, use the Hyperscale (Citus) option; this allows you to create a Citus cluster. Finally, there's this quite nice GitHub page that documents all of the hooks in Postgres in one place. If you want to build an extension, it's maybe not a good place to start, but definitely a good place to consult while you're trying to figure out how to hack Postgres.

All right. Okay, awesome. So I will applaud on behalf of everyone else; thank you for doing this. All right, so I'll open the floor to anybody in the audience who wants to ask questions. Unmute yourself, again, say who you are and where you're coming from, and ask. Otherwise, I have a full list and can just take the rest of the time.

I just have a little bit of a high-level question. My name is Lin, I'm a postgrad student here at CMU. So, when Microsoft bought you guys, right, you integrated this into Azure. Did you have to solve additional challenges when you were trying to do that? Or were you already running on Azure anyway, so nothing special needed to be done?

Well, we had our own managed service as a startup, which was built on top of AWS. But, you know, Azure is a much more complex beast; it's also much more integrated, in that there's, let's say, one UI to Azure and one API. So, definitely, it was quite a lot of work to make the managed service work on Azure.
I mean, ultimately a lot of the underlying logic, around auto-failover and backups, is the same. But yeah, it definitely wasn't an instant thing. We had already started integrating Citus into Azure before the acquisition.

I see, thank you.

Anybody else?

I have one question; two questions, actually. Does the application need to connect directly to the coordinator host, or is direct routing to a worker node possible? And the second one is, how does reference table data get synchronized to all worker nodes?

All right, good questions. So the first one is: can you directly connect to the worker nodes? In general, yes. If I, for example, connect here, my worker node is on this port 1301, and here I just see the shards. Also, there is an option for making the worker nodes essentially act as coordinators, and then I actually see the distributed tables on the worker, and I can also do all the queries that we did on the coordinator from the worker. So, basically, either your workers are just storing shards, or they're also kind of acting as coordinators, and you can route queries via the workers.

The second question is: how do reference tables get replicated? That's fairly straightforward: statement-based replication. An insert will go to both workers, in this case, and then we do a two-phase commit. So writes to reference tables, unless you're using COPY, are generally a bit slower, or at least worse; I mean, there are some locks we take around updates. But yeah, we basically use statement-based replication.

Thank you.

You mentioned at the beginning of the talk that you are working on integrating the column store into the Citus extension. Can you talk a little bit more about your plans with that?

Sure. So far we've had another extension called cstore_fdw, which was a separate extension using the foreign data wrapper API. Currently we're adding it to the Citus extension itself, using the table access method API instead, so you can create a table that uses the columnar access method. In that case it's going to be compressed, like the cstore_fdw extension does, but it's a bit more evolved: it also supports the replication that we need for using it with Citus. This is currently not yet entirely released; we're still working on that, but it's coming soon.

Anybody else? Okay, so my first question would be: how do you handle functions that produce serial output, like generate_series, or sequences? Because if I call generate_series and you just send that to each node, they're all going to generate the same series. And if you're building a single logical database that underneath the covers is physically partitioned, that would be incorrect. So do you special-case things like that, or how do you handle it?

Yeah, so where the function gets called is always a little... this is an interesting area. Normally, if we see a generate_series in the query, we do this recursive planning trick: we call it on the coordinator, broadcast the results, and then do the join. But there are other cases, like a lateral join, where we wouldn't do that. So if you're using one of the columns from the table as an argument to the function, we pretty much always have to push it down.
Which is usually actually okay, because it's still kind of semantically correct, but it does maybe create sometimes-unexpected situations, in that it's sort of hard to predict whether a function gets called on the worker or on the coordinator. But in this particular case, it would always get called in the end.

Well, what about sequences, right? If I do an insert, do you still maintain that globally unique counter, or do you do the Cockroach thing, where it's globally unique but not sequential?

I mean, that's one of the advantages of having a coordinator. Let me bring back a table...

Serials, yeah.

So I distribute it, and let's say I insert into test values three. This x column is a bigserial, so it'll just take from a sequence on the coordinator when it inserts, or when we do a COPY. But if I were going through the worker node, you know, I'd get different values; they're much higher. That's because we put the identifier of the worker node into the first 16 bits of the sequence value. So it's still globally unique, just not sequential.

And then if I do a select on that table, now not going through the worker node, do I see the appended one, or do I see... okay.

Yeah, so if you do an insert via a worker node, it generates a weird sequence value, but it's still globally unique. If you're always connecting via one coordinator, you don't have that issue.

Okay, and you're smart enough to just update the sequence on the coordinator, which is just a counter, and not actually materialize the insert into a table there?

Right, right, yeah. I mean, we basically put the value into the actual SQL command that we then send to the workers.

Okay, so then my next question is a sort of broader thing, again about the Postgres extension story; I've asked many others in this series about this. Whether or not you've actually done this: how would you quantify your compatibility, your coverage of Postgres, under distributed operation with Citus? Like, for weird queries you've never seen before. Have you guys ever had a way to quantify that?

Yeah, I mean, there are particular SQL constructs, the usual suspects, like correlated subqueries that do non-co-located joins. We support almost anything as long as a co-located join is in there; we support basically everything as long as the distribution column filter is in there. So it kind of goes through these stages, but if there are cross-shard, non-co-located joins, yeah, then there are limitations, especially around subqueries.

I guess what I'm getting at is, I wonder if there's an automated way to take the grammar file from the parser and be able to say... I mean, yeah, there are built-in functions and UDFs where that's not going to work correctly. Like, from the grammar file, of all the keywords in the Postgres SQL dialect, what percentage do we cover? But I guess it's tied to the data too, so it's tough to say.

Yeah, yeah, because it's sort of specific to the distribution, these zones: if you're in this zone, we support everything, anything; if you're in another zone, we don't support everything. You'd probably have to come up with some tool that, well, in the end would be pretty Citus-specific.
But yeah, I mean, so I have a qualitative description, but we don't really have a model of it right now.

That's right. All right. So my last question is more of a question about the distributed Postgres market. Now, you're biased, obviously; Citus was definitely, you know, a successful acquisition. But I wonder if you can comment on what made Citus the one that succeeded, right? Because you're not the first people to try to do distributed Postgres, right? There was Postgres-XC, Postgres-XL, the TransLattice guys, right? What was it, do you think, that Citus did right, that made you guys the one that outlived the rest?

That's a very good question. I don't know if I have the answer, but some reasons I can sort of propose. Being an extension was definitely sort of a secret weapon: every year our product would just magically get better, because there'd be a new Postgres release. And I think the burden of being a fork is just ultimately too high. But I don't think being an extension is a sufficient condition to succeed. There's an amount of luck involved, of being in the right place at the right time, and having the right set of investors; we went through Y Combinator, which gives you a pretty good starting point. And then I think we also had a very specific customer focus, which allowed us to actually win customers early on. What's especially important there is that, I mean, it doesn't matter how much they pay, but you have to learn from them: what are they actually doing? We tried to focus on that, and then it kind of becomes this flywheel where you can keep converting more customers, because you're applying what you learned from your previous customers. So that's definitely also a factor.

And yeah, there's some amount of luck involved, I guess, but on the technical front I think being an extension is just our superpower, essentially. Where other projects have failed, or are really struggling to keep up, we never had that overhead. If a new Postgres release comes out, okay, it does take us two or three weeks of engineering: you know, all the headers have changed, the signatures of the functions have changed, we end up with a bunch of #ifdef statements in our C code. But still, it's okay; it's not like the whole database changed. And we don't really have to worry about, okay, now Postgres has a different write-ahead log or something like that; it doesn't matter to the extension.

All right. So, Marco, this was great. Thank you so much for doing this. I really appreciate it.