Hi, everyone. Mic picking me up, OK? I'll take that as a yes. So welcome to TOASTing in the Elephant. I told my girlfriend the title of the talk; she's a great artist, had the idea for this, and had some fun with it, so I'm very excited to use it. She did the illustrations; the bad Photoshopping of things together is me.

So this is me. I'm a data engineer at Moat, and we are an advertising analytics company. We measure attention, which means we are not doing the tracking stuff that a lot of online advertising measurement companies may try to do. We're all about trying to measure whether an ad was seen, whether someone hovered over the ad, whether it was actually a human viewing the ad and not a bot, what the context of the ad was, things like that. And we try to be, at least I think of us as, a neutral third party: we measure for publishers like the New York Times or Facebook, and for advertisers as well; a lot of big brands use us. Being a neutral third party means they can agree on a fair price for the ads.

So we're tracking billions of events a day, and those go into our real-time system, which is not what I'm going to talk about. After the data goes through the real-time system, at the end of the day we stick it into CSVs on S3, and then our databases load it, because we're powering an interface where people can look at the data, and those databases do most of the work for that interface. So the CSVs on S3 are the source of truth for us, and we use our databases as a more ephemeral store. For these, anyway, we kind of don't care; we do have OLTP workloads, but we keep them separate. These databases are an ephemeral store to power this thing.

So we have an interface. If you are an advertiser or a publisher, you can go and look at how your ads are doing. We have all of these metrics, and there are actually a lot more than that; we'll get to that later. We have filters, and you can drill into them: you click on an advertiser, drill in, and you see stats for the different sites, something like that. But of course you can also just look at the sites on their own. So that gives you a little bit of an idea of what we're doing here.

This is basically what our schema looks like. We have a couple of what I'll call partition keys, if any of you went to the multi-tenant talk. A client is an advertiser, a publisher, whatever it is. All of our queries for this sort of data are going to have a client in them, and they're going to have some date range; each CSV on S3 corresponds to a date-client combination, and when we load, we bulk load a client-date combination. We have all of these filters, all those things in that left column that you saw, and I'll go over them in a second; we have about 10 of them. The cardinality for different clients depends on how their filters expand or contract. They're not all hierarchical, but some of them are, and there are sometimes different types of hierarchies; normally we have two separate, cross-cutting hierarchies that we can either roll up on or map to various things in our interface. And then we have about 170 bigint metrics. Those metrics are counts that have come out of our real-time system, and they're what we use to produce all the stats we show folks. The metrics aren't exactly the things that show up in the interface; you can almost think of them as the bases for a partial aggregate. All we do with them on the back end, for the most part, is sum them up to whatever level, or over several days, depending on what we're doing. Then, before we actually show anything to the client, they get combined: any divisions, or whatever else needs to happen to actually make the displayed metrics, happens then; it's just order-of-operations stuff.

So a typical query might look like that. Select a couple of filters, sum of metric 1 to metric n. By the way, we select all of our metrics on basically every one of our interface queries; not quite all of them, but most of them. We do the sum of all the metrics, and then the interface gets to figure out which ones go together in the right way. And if not all of them are showing, they're easily cacheable and all lined up right, so you can click and say, I want to see this metric or that metric, or whatever. They all have a client, they have a date range, and then we group by our filters.
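To pin that down, here's roughly the shape of both the schema and that typical query. This is a toy sketch: the names are illustrative, not our real columns, and I'm only writing out a couple of the roughly 10 filters and 170 metrics.

```sql
-- Illustrative sketch of the base schema described above.
CREATE TABLE daily_stats (
    date       date   NOT NULL,  -- partition key: we bulk load per date+client
    client     bigint NOT NULL,  -- partition key: advertiser/publisher/etc.
    filter_1   text,             -- ...through filter_10, some hierarchical
    filter_10  text,
    metric_1   bigint,           -- ...through metric_170, partial-aggregate counts
    metric_170 bigint
);

-- Shape of a typical interface query: client + date range, group by some
-- filters, sum (nearly) all the metrics.
SELECT filter_1,
       filter_10,
       sum(metric_1)   AS metric_1,
       sum(metric_170) AS metric_170  -- ...and everything in between
FROM daily_stats
WHERE client = 12345
  AND date BETWEEN DATE '2017-06-01' AND DATE '2017-06-10'
GROUP BY filter_1, filter_10;
```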
So just to go back to our interface for a second, this is how those map to the data in the interface. We have a client. We have these filters that we can choose up there, and the filters are over on the side there. We have a date range. And we have all the metrics that we have; like I said, there are a lot more metrics, and that list goes on for a long way. People don't necessarily show all of them all the time, but they do have different subsets that they like to view.

And we have a lot of data. This isn't the raw events; these are the rows we're adding at the lowest level of rollup that we keep in our databases. So we have about 170 billion rows, and they're all basically in one table. It's actually three tables, but only because we have three different kinds of metrics that we measure, for display, content, and video. But it's a lot of data. We're adding 10-billion-ish rows a month right now, all of them with all those columns. So, exciting. And we're growing pretty fast.

So the basic requirements are these. We need to ingest large amounts of data quickly. We're summing all the columns of very wide rows. We want to compress the data that we're storing, ideally, both for storage and for I/O; we end up being I/O bound if we're not careful. We need to support pretty reasonable read concurrency: this is a client-facing app that people are accessing on the web, and more than five people can actually query the interface at a time. How funny. So we need to make sure that we can handle concurrency when it comes. The data is derivative, like I said, and somewhat ephemeral. And we need to ETL quickly, because having the data available in the morning is useful.

So, should we choose a row store or a column store? Anyone? But we also have a lot of data, and row stores can be not great at that. So our answer to that question was yes. We were using a combination of column stores and Postgres, but the Postgres side was getting way too big. We were putting seven, eight days on a Postgres instance, and it was getting too big for us. We're up to about a month on a Postgres instance now, but that's pushing 20 terabytes at least, and we have a whole set of things that we do to deal with that.

So we use PMPP. Corey, who's giving a presentation over there right now (they decided to troll us a little bit and stick us in the same slot), actually hired me after I came to this conference last year and accosted him after a talk.
So we use PMPP, Poor Man's Parallel Processing, to send our queries off to various backends when they span a large enough date range that they hit multiple databases. Each backend returns its part of the summation in the query, the results get re-aggregated in another Postgres box, and then everything is sent off to PHP, where it does stupid stuff.

So we have all these systems, and they each have their own strengths and weaknesses. What I found was that the column stores do have problems when you try to do the things we were doing, which really were breaking column stores. You shouldn't be doing 150-column sums in a column store; that's not what it's made for. It's really good if you're doing more traditional BI stuff, where you want to sum all of the rows for one column. But we were summing all of the columns for some of the rows, because, again, our largest client is maybe 2% of the data; that's still pretty big, but it's not a huge share of the whole. So we were breaking column stores. We were also breaking row stores. How do we deal with this?

So let's talk a little bit about row stores and column stores, just so that we understand the difference between the two and why these things work the way they do. In a row store, a table is a collection of rows, and each row is split into columns. On disk we have pages, each page has a set of tuples on it, and each tuple has all the columns that are in the row. OK, cool, so we get that. This is what they're good at; I'm not going to read it all, you can read it yourselves. Good for OLTP workloads, not necessarily as great for really large OLAP-type workloads.

In a column store, we flip that on its head. We say a table is a collection of columns, and you split the attributes out. Each page just has compressed values, and they're related to each other basically by their ordering: as I scan one column, I can scan another column at the same time and tell which row things belong to by position. That tells you a little about why it's hard if you have a lot of columns. It is nice because it compresses your data, among other things, and it can do some things really fast. But it can fall down under concurrency. We noticed with Vertica and Redshift that they would work great, and then suddenly they'd stop working because more users had come on, and you'd see these rather odd nonlinear spikes in latency. We were just like, what is going on here? Now, I don't want to just shit on Redshift and Vertica, because it may well be that we didn't manage them well enough; I'm totally open to that idea. But I also think it was because we were using them for loads that you really shouldn't be using them for.

So the next thing we should talk about is what happens in a row store when you have an attribute that is too big to fit on one of those pages on disk. Probably a lot of you know the answer; in Postgres, anyway, it's TOAST, The Oversized-Attribute Storage Technique. So you have an attribute that's too big to fit into a single page. To be clear, it doesn't actually have to be too big to fit; we start doing this at around 2 KB, and our pages are 8 KB. But that's the basic idea: instead of storing that attribute in the page, we compress it and store it in another table. We split it into chunks; call each of those a segment. And we store each of them in this TOAST table that is automatically created for you. Postgres handles all of it for you, which is really nice. You get some compression, you get out-of-line storage, and all you have to store in the row on the page itself is a pointer. We only go and get the thing out of TOAST when we actually need to access the data inside it. So we can pass that pointer around without passing around the whole large value, which is nice for all sorts of reasons, and it's something we take advantage of here.
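Just to make TOAST a little more tangible, here are a couple of things you can poke at, using the toy table from earlier. Nothing here is specific to our system; this is stock Postgres.

```sql
-- Every table with potentially-large attributes gets a companion
-- pg_toast table; you can find it like this.
SELECT reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE oid = 'daily_stats'::regclass;

-- Per-column storage is also tunable: PLAIN, MAIN, EXTERNAL, or
-- EXTENDED (compress and allow out-of-line storage; default for text).
ALTER TABLE daily_stats ALTER COLUMN filter_1 SET STORAGE EXTENDED;
```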
So the basic idea for us was: can we use TOAST to do something that's somewhere between a column store and a row store, and use that to our advantage? And so we came up with Project Marjorie. Marjorie is Corey's name for it, and it was named, as most of our things are, after a Jim Henson character: Marjory the Trash Heap from Fraggle Rock, who is also a teller of truth, and also stores all of our data for all time.

So again, just as a reminder of what the interface looks like: we have that client, date range, filters, metrics. That's our original schema. Now what we're going to do is take a part of that schema and create a subtype of it. We take those filters and metrics that we have and make them their own type. So it's really CREATE TABLE with just filter 1 through filter n and metric 1 through metric n, and then we put a constraint on it that says: don't put any rows here; it's just a type. I usually put it in a schema that I call types. And then we make arrays of that type and store them in what we're calling a mega-row. So: mini-row, mega-row.
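In SQL, that trick looks roughly like this; the names are made up to match the toy schema from before, but the shape is the thing.

```sql
-- The "subtype": a table that can never hold rows, used purely for its
-- composite type. Table constraints aren't enforced on values of the
-- type, so CHECK (false) just means "don't put rows here".
CREATE SCHEMA IF NOT EXISTS types;

CREATE TABLE types.mini_row (
    filter_1   text,
    filter_10  text,    -- ...plus the rest of the filters
    metric_1   bigint,
    metric_170 bigint,  -- ...plus the rest of the metrics
    CHECK (false) NO INHERIT
);

-- The mega-row table: one array of mini-rows per (date, client, segment).
-- The big array is what gets compressed and pushed out of line into TOAST.
CREATE TABLE mega_rows (
    date      date   NOT NULL,
    client    bigint NOT NULL,
    segment   int    NOT NULL,  -- mostly here so we have a primary key
    mini_rows types.mini_row[] NOT NULL,
    PRIMARY KEY (date, client, segment)
);
```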
And what does that do for us? We get about 8 to 10x compression from doing this. Not everyone is going to get that; our types happen to be pretty compressible. The other thing we do before any of this, by the way, is sort the rows in such a way as to make them more compressible. We want to sort these things so that the compression algorithm says: oh, I can compress this very nicely, thank you. And yeah, that's the basic gist here. And we use something like a segment, that segment column we add there. So we have our partition key, date and client, then our array of the subtype, and then we have a segment, which is really just there so that we have a primary key. We don't really use it in our queries, and maybe eventually there are things you could do with it down the line, if you wanted to store more information about what's in that array.

So I'll pause here for a second and make sure that people are with me before I go on. Are there questions, clarifications? Definitely. Yeah, this subtype? Yeah, sure. So the filters might be, well, actually, let's go back to that. So this might be a publisher's page. A publisher has filters that are like: OK, I know that these advertisers had these campaigns running, or whatever it is. Or this might not even be a publisher, but one of the ad businesses that helps them run their ads, and runs ads for multiple advertisers. So they'll have different campaigns, and they'll have them on different sites in different placements. Those might be some of the filters, plus domain and browser. But we also have other things we can do with that: as part of the JavaScript tag that we put in the ad, you can define what you want stuck into those different filters.

So each event is associated, usually, with a session, and each session has these filters set somehow. Then we aggregate that in our real-time system, and what we get is a daily aggregate across those sets of filters and various other things. Yeah, so your cardinality basically comes from the number of different sets of filter values each client has; you get one row per distinct set. And there are many different values that each of those filters can take. I'd say our smallest clients have a few rows a day, not much, and our largest has three million rows a day, with multiple events going into each row. A normal-sized client is about 150,000 rows a day. Cool? Yeah. Yes.

So yeah, I mean, we did think about that to an extent. It's more expensive; that's part of the reason. There are reasons we didn't do that. We thought this would be more interesting, and we probably would have wound up sharding that anyway; this will end up being a sharded solution to what we need. This actually is a very nice representation of our data. And as a disclaimer for all of this: I'm not going to advocate that all of you use our technique. What I'm trying to give you here are some ways to think about the tools you can use to do this for your own system. What we've done here is look at our data and think carefully about how best to represent it in a way that makes a lot of sense for us, in terms of storage space, in terms of I/O, in terms of cost. All of those things come into play here. We made decisions that not everyone is going to be able to make, or want to make. This all eventually sits on a box that has 36 cores, on AWS with local spinning disks, 24 two-terabyte drives in a RAID 0 configuration. That's stupid if you don't have derivative data, right? But we have five of these, so if one of them goes down, OK, we'll bulk load the rest of it and go. So I'm not saying you should use this technique, but you can use the stuff Postgres makes available to design a system that works for your data. And I appreciate you asking about the data, because it is important that you understand our problem in order to understand how we thought about it, so that you can hopefully say: OK, here are some of the trade-offs they were thinking about; how can I apply this to my own data problems?

So, to make this a little more concrete, this is what a typical ETL query looks like now. We make a temp table for the ETL by COPY-loading that CSV. We then sort the data and assign it a segment by grouping it into basically 5,000-row chunks. Then we make the subtype: we say, OK, filter 1 all the way to filter n, metric 1 all the way to metric n, stick that into our subtype, and then array_agg that, grouping by segment, and date and client, of course. And our reporting queries, well, our real reporting queries of course go through a view that does a lot of this for us, but this is basically what they look like. You say: select client, date, filter 1 to filter n, whatever I'm doing, metric 1 to metric n, from the array table, my crazy mega-row table, unnest the subtype, and hey, I've got it back. And I can filter on it; it's nice.
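Concretely, the two paths look something like this sketch, using the same toy names as before; the real versions live behind views and PL/pgSQL, and the sort keys and file path here are placeholders.

```sql
-- ETL sketch: COPY the CSV into a temp table, sort for compressibility,
-- carve into ~5,000-row segments, and array_agg each segment into one
-- mega-row of the subtype.
CREATE TEMP TABLE etl_stage (LIKE daily_stats);
COPY etl_stage FROM '/path/to/client_date.csv' WITH (FORMAT csv);

INSERT INTO mega_rows (date, client, segment, mini_rows)
SELECT date, client, segment,
       array_agg(ROW(filter_1, filter_10,
                     metric_1, metric_170)::types.mini_row
                 ORDER BY filter_1, filter_10)
FROM (
    SELECT *,
           ((row_number() OVER (ORDER BY filter_1, filter_10) - 1) / 5000)::int
               AS segment
    FROM etl_stage
) sorted
GROUP BY date, client, segment;

-- Reporting sketch: unnest the arrays back into mini-rows and sum.
SELECT m.filter_1, sum(m.metric_1) AS metric_1
FROM mega_rows,
     unnest(mini_rows) AS m
WHERE client = 12345
  AND date BETWEEN DATE '2017-06-01' AND DATE '2017-06-10'
GROUP BY m.filter_1;
```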
So now I'm going to show you some of the strengths and weaknesses of this approach. For us, there were strengths; it's not going to be that way for everyone. This is a comparison to Redshift, for one of our more typical clients: a 10-day query, 150,000 rows a day. The Redshift cluster here has eight nodes, 32 cores each, with spinning disks, so not great at random I/O. You can see what we were talking about before: column stores, if you select all the columns, it's stupid, and they don't work as well. This approach still gets you some of the compression advantages (we're getting 8 to 10x compression just from this) without that particular problem of the column store, which is that all the columns are stored separately, so selecting all of them doesn't work well. So that's cool. That's about 1.5 million rows total.

When we're looking at larger numbers of rows, we don't do as well. Our time is going to scale with the number of rows, right? And in a column store, if you're selecting one column, it doesn't matter that much that you're only summing one column. For comparison's sake, these queries were basically: select some number of metrics, 1 or 150, over the date range, sum it all the way up to the top, and pass back one row. I didn't want any weirdness where we're passing back all the rows and that dominates the timing.

So this is one client, one of our large clients that has about three million rows a day, and we're summing over that. You'll notice that Redshift here, sorry, wrong way, takes basically the same time for that 10-day period whether or not we're summing three million rows, whereas Marjorie scales with the number of rows you need to sum. That's true over the same 10-day period. It's not so true when we have a larger number of rows over a much wider date range; it goes off the scale here. With 18 million rows, Redshift suffers because it's a four-month date range, and date is first in our sort order. We could change that. We could put client first in the sort order, and then this query would work a lot better. The problem is that we add data every day, so then you'd have to vacuum the whole table every day to put it back in the right order, and vacuum was just failing; it was not working well for us. So now we're looking at 18 million rows, and again, Marjorie scales basically with the number of rows you have to sum, while Redshift was scaling with the number of rows it had to scan, which meant four months of data for all of our clients. However, if I wanted to do a single-column scan over four months of data for all of our clients, I would choose Redshift, because it'll do that in 22 seconds, and Marjorie will fall down. That's not what it's good at; it's not what it's made for.

So this is basically the good and the bad. We get a lot of win from this for our client-facing queries, because we know a fair amount about those queries. If you want to do completely arbitrary SQL and you're selecting single columns, this sort of approach isn't for you. You could do something similar where you take out individual columns and store them in TOAST (I know Jim's been talking a lot about this), or use various other techniques that would make Postgres a lot faster. You have to think about your own use case. We built this for our specific use case.
And by the way, all of this is done in PL/pgSQL; I didn't write any C code for this. I put that down as both a strength and a weakness. This would be more optimized if I'd used C, and there are ways of doing that, but I didn't have the time or the expertise right now to pick that up. So I guess part of what I'm saying here is that you can do it with the tools that are built in. There's a very powerful set of things you can do just in basic Postgres to make these sorts of things work well. So we're basically trading generality in our storage for a fit to our use case. And that's what we wanted to do; that's a win for us.

So let's talk about a few more optimizations we did that are kind of fun and interesting, to help you think about how far you can go with this sort of stuff.

Yeah, yeah, yeah. So you stick 5,000 of those rows into an array; it is going to TOAST every time. TOAST stores it and compresses it on its own; I did nothing to make that happen. We could come up with our own data type, tell Postgres not to compress it, and compress it ourselves first with our own algorithm. That would be interesting and something we might do in the future. Yeah, we did about 5,000 per array, but that was entirely a design decision. Yeah, we are doing that at ETL time. Yep, totally. We'll talk a little more about that as we go forward.

So the indexes right now are obviously just on date and client. What we basically have is sort of a clustered table that's compressed and stored in TOAST, and we can index into it just by looking at the date and client, because all of our interface queries come in with a client and some date or range of dates. So we can go and get the specific set of 5,000, or however many, rows, or sets of 5,000 rows, for that date and client, and pull them out and process them. It's almost like a clustered table with a BRIN index; you could think about it kind of like that, except that here the clustering is maintained, because you're TOASTing at this level, if that makes sense. You're TOASTing by date and client, which is the thing I know my access pattern is about. The table is partitioned by date, basically just to keep my TOAST tables from growing too big, and therefore my indexes from not fitting nicely in memory, that sort of stuff. You could think about doing it completely unpartitioned, but I don't know that it would work that well. Also, you eventually have to worry about OID wraparound in the TOAST table. But yeah.

Yeah, you could, but that'd be a lot of maintenance. We have six, seven thousand clients; good luck. So I wouldn't necessarily want to do that. We hardly ever aggregate across clients, and when we do, it's our internal data science team, so we can work with them on coming up with fun ways of doing that.

So let's talk a little bit about rollups. In order to power this interface and make it work a bit better, we also pre-compute rollups along our hierarchies. These filters have hierarchies to them. We know that we're going to get a lot more queries at the top level than at, I don't know, placement, for the most part. People aren't looking at data that granular as much as they're looking at the higher-level things in our filters. So we try to pre-aggregate. The other thing about this is that people kind of understand a query taking 45 seconds if they get back 10,000 rows.
They wonder why a query took 45 seconds when they get back three rows. They're like, really? That took you 45 seconds? So it's kind of nice to say: for these things whose cardinality is lower, we can pre-compute. This is something we do during our ETL: we do a GROUP BY GROUPING SETS, then insert into our rollup tables. Probably a lot of you know this design pattern; it's relatively common. And the thing about a normal Postgres instance is that I have to keep all of my rollup tables. Well, we'll get to that in a second.

So let's look at how I store that in my mega-row. Now I just take each of my rollups, grouped by whatever filter, which is why we call them "by" columns: by filter 1, by filter 2. There's some sort of hierarchy, and the cardinality reduces along it; that's kind of cool. And I stick them all in the same row, so I know they're all there.

The other cool thing that can happen: suppose my cardinality doesn't actually reduce for one of these levels. In the normal Postgres world, because my interface knows it can select from that rollup table, all my rows have to be there. I can't say, I'm going to select from that table, but if I don't have rows there, I'm going to go check this other table over here. So all of my rows have to be in all of my rollup tables, even if I got no cardinality reduction from them. But not here. I can null out the rollups that don't get me any cardinality reduction, on a per-mega-row basis. At ETL time, I ask: did this rollup get me at least 30% fewer rows than the next level down? And then when I select, I just coalesce. I say: oh, do you have that one? Nope? OK, go down to the next one. I have to group by the right stuff on the way out, but our interface was already doing that, so that's not hard. And it really saves us: this gives us another 2x compression over our vanilla Postgres instance, because it lets us store only the rollups that actually give us some value. So that's a cool little technique; kind of fun.

Now, the next little thing you can do, and this may be more useful to a lot of folks, is where I started thinking about this stuff: dealing with time-series data. I was at a battery company, and we were dealing with hundreds of channels of batteries cycling, giving us data at least every minute, and sometimes at five-millisecond intervals. We really wanted that data; it was a very particular time in the cycle of the battery, so we wanted to preserve that level of detail, which a lot of time-series databases would aggregate away. But we liked the aggregate-type things for a lot of stuff. So something like this would let you store the raw data in the array, but pre-compute some of the aggregates, whichever ones you choose, and store them in the actual row. Now I can access just those aggregates, and that access happens in the main table; I never get the actual stuff out of TOAST unless I need to, but it's still there. So when I do a count on this, I don't want to unnest my arrays; I just store, at ETL time, exactly how many rows were in the array. And when I want a count, I say: sum of total rows in my mega-row, and hey, I'm done. That's a hell of a lot faster than unnesting all of them and counting them; that would be silly, right?
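Here's a sketch of both of those tricks against the toy mega-row table: a rollup array stored next to the detail array and NULLed out when it doesn't earn its keep, plus a pre-computed row count so a COUNT never touches TOAST. The "by_filter_1" and "total_rows" names, and the coalesce order, are illustrative.

```sql
-- Rollup arrays live in the same mega-row as the detail; NULL means
-- "this level didn't hit the ~30% reduction bar, fall through".
ALTER TABLE mega_rows
    ADD COLUMN by_filter_1 types.mini_row[],  -- rolled up to filter_1
    ADD COLUMN total_rows  int;               -- length of mini_rows, set at ETL

-- Reading: take the most-aggregated array that exists, fall back to the
-- detail, and re-group on the way out (the interface did that anyway).
SELECT m.filter_1, sum(m.metric_1) AS metric_1
FROM mega_rows,
     unnest(coalesce(by_filter_1, mini_rows)) AS m
WHERE client = 12345
  AND date BETWEEN DATE '2017-06-01' AND DATE '2017-06-10'
GROUP BY m.filter_1;

-- Counting: sum the pre-computed counts; the arrays stay in TOAST.
SELECT sum(total_rows)
FROM mega_rows
WHERE client = 12345
  AND date BETWEEN DATE '2017-06-01' AND DATE '2017-06-10';
```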
I have this way of caching it that's transactional. All of that stuff is computed at ETL time, though, and if I can define some of those right from the beginning, I know a lot of what I'm going to want there. OK, it works. So yeah, that's basically the count query.

No, you can, actually. What I was thinking there, and what I probably would have done, is a date range or a timestamp range, and then the test that was running. We would often look at tests on their own, so it's a similar sort of access pattern, where you have a client. So you'd take a tstzrange and say, this is whatever set of data; maybe on the way into my arrays I actually calculate how much shit changed underneath, and then I just have timestamp ranges based on some delta. Zip up all the underlying data in case I want it later, and then normally only access the aggregated data, and I can compute whatever I want. There are all sorts of little things that you can do with that.

You mean like that? Oh. Thank you, Jim. So one of the cool things here is that I can then also start storing other stuff in this metadata. I can actually kind of index into the arrays themselves. I can say: look, here are my distinct filter values that occur at whatever rollup level. Or I could do ranges of them if I wanted to; I could do ranges on columns. I could create basically a BRIN index that applies just to the mega-row. The stuff you store in a BRIN index is basically the range of various columns for a block of blocks; I can do that with this mega-row. I basically have a compressed data structure there, and I only need to unpack it if I think it's going to help me get out what I need.

Of course, there isn't native support in Postgres for this, so you have to rewrite your queries a little bit. I can't just say WHERE filter 1 equals whatever. I actually have to add something to my where clause that says: I know that I want only this one value of filter 1, so go in and unnest only the arrays that have that value in them. That might reduce the number of arrays I have to unnest to find the thing I want by a lot, depending on how selective that filter is. So there are lots of different ways of thinking about that. It's also going to depend on the sort ordering of your data on the way in; thinking about that sort ordering is also important for how much compression you get, all sorts of stuff like that.
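A minimal sketch of that query rewrite, again with illustrative names: the extra metadata column is a hypothetical array of the distinct filter values inside each mega-row, and the extra qual is what lets whole mega-rows be skipped before their big arrays are ever de-TOASTed.

```sql
-- Per-mega-row "mini-BRIN" metadata, computed at ETL time.
ALTER TABLE mega_rows ADD COLUMN filter_1_values text[];

-- The metadata qual prunes mega-rows up front; how much it helps
-- depends on how selective the filter is and how the data is sorted.
SELECT m.filter_10, sum(m.metric_1) AS metric_1
FROM mega_rows,
     unnest(mini_rows) AS m
WHERE client = 12345
  AND date BETWEEN DATE '2017-06-01' AND DATE '2017-06-10'
  AND 'example.com' = ANY (filter_1_values)  -- cheap pre-check, no unnest
  AND m.filter_1 = 'example.com'             -- the real filter
GROUP BY m.filter_10;
```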
So we can go over that in a minute; let me get through a couple of other things, and then we'll open it up for questions. But basically, we now fit all of our data from 2012 on onto single instances. That takes like 18, 20 terabytes, which is about the same size as the one month of data that we still keep in a normal Postgres instance for recent queries. We have five Marjorie instances, and they are functionally independent. During ETL we do some fun stuff where one of them loads and does all the grouping and so on for a client-date combo, then copies the internal representation back out to S3, and the other Marjories pull that down and load it instead of doing all the computation themselves. That speeds up our load pretty nicely and lets us use lots of CPU for it. And generally, it's a lot cheaper than what we were doing before with all of the crazy Redshift and Vertica stuff.

I don't know whether I said this, but when Redshift was doing that query over the 150 columns over the four-month date range, it was using eight boxes, 32 cores each, at 40% of their CPU, for 12 minutes. Marjorie did it on eight cores, on one machine, in 30 seconds. So thinking about your data, and making sure the underlying structure is right for what you're trying to do with it, is important. And the overall trade-off is that we do more work in the CPU. We have to decompress this stuff; we have to do all sorts of things. But we aren't I/O bound anymore.

I had to do a little bit of work to make sure that the planner and parallelism worked well for this. Make sure you have good statistics on your partition keys. For your clients, the planner needs to know pretty accurately how many rows are in a client, so set that statistics target very high. And make sure you don't have statistics turned on for your arrays, because then your ANALYZEs will never finish. You'd be trying to make, I don't know, basically a histogram of these random 5,000-row arrays; what even is a histogram of that? They're not useful statistics. Having the analyzer de-TOAST stuff to make statistics about it is just silly, so turn that down. My heap tables end up being like 50 megs, with the TOAST tables at like 50 gigs. Those tables work very nicely with what happened in 9.6 with parallel sequential scans, because you never have to de-TOAST anything unless you know it's the right thing to de-TOAST. So you can sequential-scan a really small table really quickly, and those are the partition tables, by the way.

The other thing is that you need to make sure you create specific functions for your data types that declare the right number of rows coming out. If there's a possibility of getting 5,000 rows, just overestimate, because if the planner is off on the number of rows it thinks it's going to get, and then it gets 5,000 rows out, any errors it has get multiplied by 5,000, and you can get some pretty bad plans. So make sure you have your statistics set right. Also, unfortunately, there's inlining. I was using PL/pgSQL, which I didn't really want, so I worked for a while to turn this into a SQL function so that it could be inlined by the planner, and that was great, except that unnest has a standard cost and a standard number of rows it returns, which is 100, and that meant I was getting horrible plans. So I ended up doing this. Jim, of course, told me that all I needed to do was copy the C function, call it unnest2, and set the number of rows it returns to 5,000, and that would have been fine. That would have been smart. So, you have to think about this stuff and basically futz around until you get it to do what you want; that's my basic answer. For me, I had to set min_parallel_relation_size to like one kilobyte, because it seems the planner doesn't really take into account the fact that the TOAST table is much larger than the actual relation. And yeah, there are other things you can play around with there.
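To give a flavor of those knobs, here's roughly what that looks like on the toy schema. The wrapper function is a sketch; it's written in PL/pgSQL precisely so it won't be inlined and the ROWS estimate sticks, which is a different trade-off from the unnest2-in-C approach. The values are ours, not gospel.

```sql
-- Know the partition keys very well; never let ANALYZE chew on the arrays.
ALTER TABLE mega_rows ALTER COLUMN client    SET STATISTICS 10000;
ALTER TABLE mega_rows ALTER COLUMN mini_rows SET STATISTICS 0;

-- unnest's built-in estimate (100 rows) wrecks plans when each array
-- really holds ~5,000, so declare the real fan-out on a wrapper.
CREATE FUNCTION unnest_mini(arr types.mini_row[])
RETURNS SETOF types.mini_row
LANGUAGE plpgsql STABLE
ROWS 5000
AS $$
BEGIN
    RETURN QUERY SELECT * FROM unnest(arr);
END;
$$;

-- The heap is ~50 MB while its TOAST table is ~50 GB, and the planner
-- mostly sees the former, so drag the parallel threshold way down
-- (9.6 name; it became min_parallel_table_scan_size in 10).
SET min_parallel_relation_size = 0;  -- effectively the "1 kB" mentioned above
```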
The other thing that I found interesting, briefly, is that just setting max_worker_processes really high, way over what people might recommend, was actually helpful for me. It meant that sometimes you were swapping between processes in the OS, but Postgres would then plan parallel workers for all of the queries that it thought needed them. It would never say: I don't have enough workers for this one, I can't plan you any workers, because they're already allocated to some 30-second query that's at second 20 and can't give them back. Otherwise, some other big query comes in, one that would take 45 seconds if it had workers, and with only one core it's going to take like five minutes. I would prefer that for the first 10 seconds they both take a little longer, even if we're swapping in and out of the processor between queries, and then that 45-second query speeds up, rather than that query getting no workers at all. By doing that, I was basically able to handle what I've seen under high load. We have people sometimes, like our internal business intelligence team, hitting our API at 50 concurrency or higher, and it still worked. And you see that as you increase the concurrency, your throughput, in terms of the number of rows you're processing, goes up, but the latency on almost all your queries also goes up. That's fine for me: at high-load times your latency is going to go up a little bit, but your throughput still goes up. And with all of that, we really were able to get it to be CPU bound. That is, every one of my cores maxed out, and that happens relatively frequently, which I don't think most people ever see on their databases. That is not a usual thing, but it's nice. It's nice to be CPU bound. We can think about how to scale that better, and how to do all of that, in the future.

So that's it. I think I have a couple of minutes for questions, but I'm happy to stay after and chat also. Yeah. For anybody that's interested in that setup, the easiest way to find it is to go to GitHub and search for Elephant Stack. That's going to get you a repo that has nothing but a readme with a link to the demo. So, for those who couldn't hear, or for the video: Elephant Stack, search for the repo on GitHub, find us, or email me.

Yeah. So, OK. So yeah: don't update. I actually forgot to mention this. I wouldn't update the individual things; that will probably go badly. We did have problems with our column stores, though, because on a fairly regular basis, ideally not too often, we do sometimes have to replay data for a date and a client. We'll get a new CSV because something went wrong in our processing of their data for a day, or whatever it was. So we get that new CSV, and we have to delete everything we had for that date and client and reload it. And that works fine. So if you're doing bulk updates based on your partition key, this works pretty nicely. If you're updating individual rows, it's not going to be so hot. Jim and I have talked a little bit about possible ways you could do that, but that would be something you'd need to think about if that's what you're trying to do.

Sorry, I'm confused. Yeah, no, I mean, that's kind of what happens in our real-time system. Our real-time system uses this massively distributed in-memory thing to process all of our events. But this reporting stuff happens over some arbitrary date range. If I just have the stuff as of today, that isn't going to work; people can go and look at some date range from two years ago, or all of last month, and suddenly that won't work for you. Or you're storing it all in memory, which is incredibly expensive. If I'm following, I might be missing the point here.
Yeah, but if you do that from 2012... I guess I'm not quite following. Oh, well, because they can also look at the lowest level, right? There are different ways that you look at the data, and for a lot of reasons we couldn't do just that, because it turns out the cardinality of even those things is quite high; the rollups can still be pretty large. I mean, sure, you could do that by date and client; I would totally think about doing that, but that's not our usual query. We don't usually look at just date and client; we have a lot of people going down to lower levels very frequently. So.

OK, thank you, everyone. That was a lot of work. Thanks, guys.