Okay, hello everybody, good afternoon. My name is Hans-Jürgen Schoenig — just call me Hans, it's easier. I'm over from Europe; you might have heard of the country of Austria once in a while, which is roughly here. So I'm glad I got the invitation to come to this conference to talk to you about some small amounts of data and how to scale Postgres up a little bit. I'm delighted that the room is somewhat full. Is everybody somewhat comfortable with my accent? Okay. So, this is about one billion rows per second, and the second slide is a bit boring — it says it's about reaching a milestone, which is a bit pathetic. Basically, this whole thing got personal. I was attending a GPU conference in Amsterdam, and every minute somebody walked by and asked the question: can you scale? At some point you start to get angry, right? Because every time you get this bloody question — can you scale? — what do you say? Yes? No? Maybe? Don't care, go away? So I looked up some stuff on the internet and I stumbled over some Oracle paper where Oracle announced breaking a billion rows per second. As I told you, it was a bit personal from that point on, so we decided to see what we could actually do to achieve the same thing in Postgres. What does it take to process a billion rows per second in Postgres? I mean, if you can do one billion rows, you can do five billion rows, you can do ten billion rows, right? So the question really is: what can be scaled? How can it be done? And what's the current state of the project? So, first question for the audience: who has more than a billion rows? Okay, how much more — give me a number? Five, ten? Trillions? Okay, that's nice. We only know numbers like that from government debt at home in Austria.
So, trillions, usually. Okay. As I said, the goal is processing a billion rows, seeing what the path is to get there, and hopefully silencing the discussion about scalability, at least to some extent. The next important thing is to figure out where the limitations in Postgres are: where do we need to improve, where do we need to do things better? And secondly, how can it be done without any commercial tools, without any data warehouse, without any hyper-fancy whatever stuff — use as much vanilla Postgres as you can and see how far you get, right? A word of caution: I gave this talk two weeks ago in Moscow, and somebody approached me: "but it doesn't work for every query." Of course it doesn't. If you have this amount of data, doing *everything* is not what you're aiming for; you're trying to solve the problem you've got. Given a certain amount of data, everything is sometimes harder than it sounds. So if somebody promises "oh, we can do everything at this speed," that's most likely a lie, or a marketing guy, or something like that — which could be equivalent, I don't know. Okay, so as I said, it's a path to enlightenment, if you know what enlightenment is. As you might know, many years ago, back in the old days — that's where I come from — we used to have one core per query, which means that if you have a large amount of data and you process it with one CPU core, you're always somewhat limited, because the ability to build ever-faster single-core systems is — I wouldn't say gone, but it's getting harder and harder to squeeze more out of a single core. So obviously what you want is to use many cores, and starting with PostgreSQL 9.6 we've got the ability to run a parallel query, and as you might expect, we're going to use that. Okay, so that's the basic idea. And it's not the first time in Postgres history that somebody has tried to handle a lot of data.
I think it was 12 or 13 years ago that a brilliant guy from Skype had the idea to build a thing called PL/Proxy, which is an ingenious concept, because the idea is to have a stored procedure language whose only purpose is to scale. You call a function, it does all the nasty stuff: it dispatches, it spreads the query all over your cluster. It was wonderful for that kind of stuff, at least back in the old days. But the problem is that you have to do everything yourself. If you have 50 servers, you have to do everything yourself — and you don't want that. So the question is really: how can we make it so that you issue a query and you get an answer, no strings attached? Okay, so that's the whole plan. Any questions regarding scope, etc.? By the way, there are some seats over here in case you want to be more comfortable. Okay. So, one billion rows. We decided on a fairly complicated table consisting of three columns only, and this is going to be the data. How to approach that? Basically, keep one thing in mind: it makes no difference whether you have three columns, 50 columns, or 500 columns. I mean, if you have more columns, it's going to be slower, right? And how do you counter slowness? Well, you add more capacity. So if people criticize the fact that my table is rather simple: the only reason for doing it in such a simple way is that we didn't want to rent out so many machines. That's the only reason. But imagine you had the same thing with 50 columns. For the general strategy of how to get there, it makes no difference whether there are three columns or nine columns — technology-wise it's the same thing; it's just boiled down to a minimal example in order to see how far we can get. The query we tried — of course we ran some more queries, but you will see along the way all the implications we hit in order to make it work — is a very simple query.
We have a couple of groups, we count them, and there's a demo table — so, one counter per group. A simple analysis. That's actually a pretty common thing: you have a reasonably large table, you calculate a count or an average for a couple of groups, you could add a couple of filters, you could do all these kinds of things. So that's a fairly common type of query, I would suggest. But again, as a word of caution: it's the most simplistic example I could think of to get there. In reality you're going to have wider tables, more filters, other data types, and so on. It makes no difference — the point is, if this made a difference, then something would be wrong by definition anyway. Okay. So before we start scaling out, let's see how much we can actually get out of a single server, because if we know how far we can get on a single server, that shows us how much hardware we are supposed to throw at the problem in order to reach this kind of goal, right? So: single-server performance — how far can we get with a single machine? As you know, or as you might know, or as I hope you know, parallel queries were added to Postgres in 9.6. Of course it's not feature-complete — nothing is ever feature-complete — so features are being added as we speak. For the test, as I said, it's basically PostgreSQL 9.6 with some extensions.
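The talk never shows the actual DDL, but a minimal sketch of the kind of three-column table and grouped count being described might look like this (table name, column names, and the group cardinality are my assumptions, not from the slides):

```sql
-- Hypothetical three-column demo table, as described in the talk
CREATE TABLE t_demo (
    id    serial,
    grp   int,      -- roughly 10 distinct groups
    data  numeric   -- the value being counted / aggregated
);

-- Fill it with sample data: 10 groups, random payload
-- (scale the series down if you just want to try the shape of it)
INSERT INTO t_demo (grp, data)
    SELECT x % 10, random()
    FROM generate_series(1, 100000000) AS x;

-- The query under test: one counter per group
SELECT grp, count(data)
FROM t_demo
GROUP BY grp;
```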
I'm going to talk about those in a couple of minutes. One thing about parallel queries in Postgres: when you fire a query, the system determines the number of CPU cores it's going to use, and the reason is quite simple. If you have 30 cores and you're alone, you might want to use them all, but if more people are working on the system, you don't want one guy to take all the resources. So obviously Postgres has to plan for more people on the same system, which makes some sense — and that's pretty much exactly what I did *not* want. I wanted to be alone. You know, I'm a greedy person, so I wanted all the resources, right? Postgres basically determines the number of CPU cores it uses for a scan via a variable called min_parallel_relation_size, and by default it's eight megabytes. That means that if you have a four-megabyte or a three-megabyte table, Postgres is never going to go for a parallel query, because it's not worth setting up the infrastructure. It's just not worth doing.
So if your data is small, Postgres is not going to go for a parallel query. What happens instead is: every time the size of a table triples, it gets one more worker. So if your table is nine times larger than the threshold, you will get two more worker processes. There's basically a formula which says: take this value, and each time you multiply it by three, you get one more worker. Unfortunately, I did my test with three billion rows, and you cannot tune this value enough to get all the CPUs to work. Fortunately, you can overrule the thing: you can do an ALTER TABLE and set parallel_workers to 32, so you can explicitly tell the system: if you're doing a large scan on this table, use all the hardware you've got. So I'm throwing everything at processing a single table, because with the standard setting — I tried it with, I think, a hundred kilobytes — it never gave me more than 13 cores or so, because if you do the math, you cannot set it low enough to use 500 CPU cores. So basically you have to cheat a bit and tell the system: okay guys, we're alone, let's use all the hardware, it's party time, we really want to cook the main board. So that's what we did. Putting it all together: we did the whole test in memory, and that comes for one very simple reason — we wanted to figure out how Postgres scales, not how your storage system scales. If you've got a single SSD, it's going to give you, let's say, 400 megabytes a second — and guess how far you get with 400 megabytes a second. On my private box at home, in some cases that's not even enough to feed a single CPU core. Intel CPUs, or POWER8, or whatever you've got, are so powerful that for such simple queries you really get into trouble feeding the thing. Okay, so of course you can add more servers.
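The overrides just described, written out (the table name is my assumption; the parameter names are the 9.6 ones — min_parallel_relation_size was renamed min_parallel_table_scan_size in PostgreSQL 10):

```sql
-- Force a high degree of parallelism regardless of the size heuristic
ALTER TABLE t_demo SET (parallel_workers = 32);

-- Also raise the per-query worker cap, which defaults to 2 in 9.6
SET max_parallel_workers_per_gather TO 32;

-- For comparison: the size-based default is driven by this GUC
-- (8MB by default; one extra worker each time the table size triples)
SHOW min_parallel_relation_size;
```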
You can stick 50 SSDs into your box, but at some point, if you want to know how Postgres scales, for the sake of simplicity you do it in memory. That's not necessarily realistic, but the only thing running in memory changes is that you need fewer boxes — otherwise you just keep adding servers, and what's the point? I mean, once you're linear, you don't care whether it's 10 or 12 servers. The question is: can you stay linear up to a certain number of servers? That's the real question. So don't be shocked that everything here is in memory, because you would see exactly the same behavior with simply more disks. We did not want this to be a disk benchmark. And this is what happens on a single server. We rented the largest box we could get on Google, for the staggering price of 28 euros for the whole test. So it's highly expensive — roughly a speeding ticket at home. Yeah, I know, I've paid some of those. In Germany they have those wonderful white round signs on the highway which mean: do what you want. That's freedom — German cars, good thing — I've seen some on the road. So basically: if we start off with one CPU, we are somewhere down here, and we keep adding worker processes — up to this point we added 16 workers — and the important observation really is that it's a line, a linear thing. So if there is no limitation — let's suppose
it's only memory — what Postgres actually gives you is a linear curve for a sequential scan with a GROUP BY and things like that as you add more cores. That's highly important, because if the line bent over like this, you would have a hard time as the table gets larger. You don't want that; you really, really want to be linear here. Everything you see above 16 workers is Intel hyper-threading: Google rents out a "32-core" box, but in reality it's 32 threads on 16 cores — took me two hours to figure that out. So from that point on you see the jitter — if that's the right English word — increase: the variance goes up, and that's the impact of hyper-threading. But basically it's linear up to hardware capacity; that's the main observation. And what you see is that we top out somewhere in the area of 30-something million rows per second — 35, 40, depending on how lucky you are. That's roughly what you can get out of one machine on Google. By the way, as you might have noticed, what we did was count(data) and not count(*). There's a semantic difference and also a performance difference: with the star you'll see a couple of million more. So there is a difference between count(*) and count(data), but basically it just shifts the curve up by a couple of million, so we don't care. The reason is that I was too lazy to write the aggregate we need for the star case — that's the only reason why this is count(data) and not count(*). So this is hyper-threading, which is somewhat interesting. We see that we're at 40-something million.
So if we do the math, we can go for 30-odd servers and see what happens. As I said: a 16-core box, everything provided for the staggering amount of 28 euros and 14 cents on Google Compute Engine. And of course we want more than one node — having one computer is somewhat boring, even your neighbor has that, and you want to be better than your neighbor, of course. What we did, in order to stay in memory and avoid all that nasty stuff, was to put a hundred million rows on each shard: generate a hundred million rows and put them on a shard, which basically means you stay in memory. A row of this size is something like 30 bytes or so — 35, whatever — so it fits nicely on one shard, and that bottleneck is basically eliminated. And of course, on every shard, parallelize the hell out of it: maximum workers everywhere, which with 32 servers of 16 cores each leaves you with roughly 500 cores. That's basically what it takes in a virtual environment — I'll elaborate on the number of cores a bit later. So this is what it looks like: you have a server, and this is where your query ends up. What you do is create a table, and the easiest way we found to do that in 9.6 is basically a partitioned table where every partition is a foreign table pointing to another node. So here we've got an empty table with 32 partitions, and each partition is a foreign table. So you take the single table on this server and shard it as much as you want, and all the data lives on the other nodes — but here it looks like one table, and it's partitioned. So suppose we have only two nodes in our system, which is of course not sufficient: we scan every table, and for every table we want the aggregation to happen on the other side. The key
here is: if you start sending data around, don't even get started. If you want to send three billion rows over the network — don't even get started. There is no way to do that in three seconds; I mean no way meaning no *useful* way. If you're some large company, sending gigabytes of data over the network is okay, but this is a single database installation, so you certainly don't want to exchange data. You can't afford it, because you need all the time to compute something; you can't spend it shipping rows around. So the key here is really to push the aggregation over: the count happens on the other side, 32 times, and at the end of the day the results have to be added up. And here is the first crucial thing: how do you make the thing push the aggregate over? That's not so trivial, actually. Then you just add up the numbers. I went for 10 groups, so 32 times 10 groups — the final stage here is something like 300 rows, which is not even worth thinking about. So the real party starts when you manage to push the aggregate over, and that's not trivial. As long as the partial results are comparatively small, this is going to be somewhat linear. If you've got a million groups, the situation changes, because you have to fetch all those intermediate results and process them, right? So you can really only guarantee linear scalability if the intermediate results are somewhat small. Okay, have I lost anybody so far? Is everybody sufficiently enlightened, or is anybody asleep? Is the neighbor asleep? No? Everybody looks fine.
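The sharding setup being described — a parent table whose "partitions" are foreign tables on other nodes — would, in 9.6, typically be built with table inheritance plus postgres_fdw. A two-shard sketch, with server names, host names, and credentials all my own assumptions:

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- One foreign server per shard node
CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'node1.internal', dbname 'demo');
CREATE SERVER shard2 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'node2.internal', dbname 'demo');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1 OPTIONS (user 'demo');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard2 OPTIONS (user 'demo');

-- Empty parent table; each "partition" is a foreign table on another node
CREATE TABLE t_demo (id int8, grp int, data numeric);

CREATE FOREIGN TABLE t_demo_0 () INHERITS (t_demo)
    SERVER shard1 OPTIONS (table_name 't_demo');
CREATE FOREIGN TABLE t_demo_1 () INHERITS (t_demo)
    SERVER shard2 OPTIONS (table_name 't_demo');

-- Querying the parent scans all shards; the whole point of the talk
-- is getting the count() to execute remotely rather than here
SELECT grp, count(data) FROM t_demo GROUP BY grp;
```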
Okay, so we scale up to a billion rows. Of course it takes a bit, and the first thing you have to keep in mind is that if you use a foreign data wrapper, the bloody thing — well, the wonderful thing — uses a cursor behind the scenes. It has to, because it has to assume that if this is a database link and you fetch all the data over, it could be a lot of data. So it uses a cursor. And the trouble — it's not so much an issue for this kind of query, but it is for other queries — is that in Postgres there's a setting called cursor_tuple_fraction, which tells the optimizer: don't optimize for the whole result, optimize for the first 10%. So when you DECLARE a CURSOR FOR SELECT, it doesn't optimize for the whole thing; it optimizes for the first 10%, assuming you're not going to read the rest. And that's a bit of a problem because it changes the plan — which is not an issue in this case, but it has some impact on how things work. So, easy part: tell the thing to do it all at once, just to be sure all the plans are fine and there are no bad assumptions — get it all at once, don't interact: do your work, send the answer, shut up. That's the first thing, and it's somewhat easy because the knob is there. Next on the list: we have to make all those nodes work at the same time. We cannot say "you do a hundred million rows and we wait for you, then comes the next guy and we wait for him," and so on. We don't want that, because then we'd be sequential, and there is no point in sharding an aggregate if you do all the work sequentially. There's really no point. So what you want, if we go back to the query, is that all those parts work concurrently. And at this point PostgreSQL 9.6 cannot do that, because it simply hasn't been implemented yet — that's the point.
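The "get it all at once, don't interact" workaround mentioned above amounts to something like this (the server name is from my earlier sketch; fetch_size is a postgres_fdw option controlling rows per cursor fetch):

```sql
-- Make the planner optimize cursors for the full result set,
-- not for the first 10% (the default fraction is 0.1)
SET cursor_tuple_fraction TO 1.0;

-- Optionally pull much bigger batches per round trip from each shard
-- (postgres_fdw defaults to 100 rows per FETCH)
ALTER SERVER shard1 OPTIONS (ADD fetch_size '10000');
```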
Fortunately, there is a bit of traffic on the Postgres mailing list, and there are patches around to make this happen in future versions of Postgres. So we took the stuff and patched it in, and everything is fine. Then you need a second thing: asynchronous fetching. You want parallel append, so all the data is fetched and added up in parallel, and secondly you want asynchronous fetching: send off the query, send off the query, send off the query, and then asynchronously consume the data as it comes in. That's something you need. Of course this will be in some future version of Postgres, but that's key — you have to make things work in parallel. So 9.6 cannot run this test out of the box; it needs a bit of patching to get it done. For the future of Postgres this looks very bright, because most of the stuff is basically on the way. Then comes something really nasty. Normally — in the Postgres versions you are using; it's a two-class society, I've got a better version than you — aggregation happens after joins and appends and all that kind of stuff. If you look at an old, traditional plan, there is no aggregate down here at the bottom; it's up at the top. So the plan fetches the data and then aggregates it, and the key to success is really to get this aggregate down below the append, onto the other side. And that is nasty — really, really nasty. It needs some patching, which we did. Actually, there are two patches out there: I think one was published by EDB, and the other one by my colleague Antonín Houska. So we basically had to spend some time patching the optimizer, which is a really nasty area — it's not trivial to make the aggregation happen before you add things up again, with this append node in between. Joe is already looking scared; he knows the implications, right?
Okay, so this is not trivial to do. But as I said, it was for an American conference and we are people from Europe, so we wanted to show that we have technology over there too. And that's key — that's really key — because in order to shard aggregates all over the place, you have to run the aggregations on the other side. And it doesn't mean that *our* patch has to make it into core: if *any* patch providing this functionality makes it into core, that is going to be the real milestone for insanely distributed computation, because everybody does their own work and you just add up the answers. Some people call it map-reduce-style aggregation, if you want to use that term for whatever reason. That's important; I can't stress it enough. Of course, consider something like an average or some more complicated aggregate. If you compute an average locally, you add up the numbers, you count them, and at the end of the day you divide, right? The final function divides. You can't do that anymore, because if 32 nodes each send you an average, you cannot average the averages. It's not correct, right? It's not going to give you the same answer — it gives you *roughly* the same answer, and roughly the same answer is only okay in politics. We are in the database business; we're insanely conservative people, at least on the data side. Therefore — sorry, bad English, I've got this horrible Schwarzenegger accent, haven't I? "I'll be back," or "the query is terminated." Have you seen that? We come from the same area, Schwarzenegger and I — would you have guessed? It's not Australia — no kangaroos, although you can eat kangaroos; we've tried that. So, over there: an easy extension — you put in an extension that sends those partial aggregate states over. So I cheated a bit: what you've seen here is not the standard Postgres count, it's my count — but don't worry.
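The "you cannot average the averages" point is easy to see with two unequal shards: ship the partial state (sum, count) instead, and divide only once at the very end. A tiny stand-alone illustration:

```sql
-- Shard A holds {1, 2, 3}  -> avg = 2,  sum = 6,  count = 3
-- Shard B holds {10}       -> avg = 10, sum = 10, count = 1

-- Wrong: averaging the per-shard averages
SELECT (2 + 10) / 2.0;        -- 6.0

-- Right: combine the partial states, divide once at the end
SELECT (6 + 10) / (3 + 1.0);  -- 4.0, the true average of {1,2,3,10}
```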
It's no fake news — it's real data, and it really works. Okay, so, bottom line: dissect the aggregation, send the partial stuff out, execute it on the shards, add up on the main node, be happy. That's the whole thing. So here it is, for 28 euros and 14 cents — two coffees at the airport. Here's the query, and what you see is roughly 3.2 billion rows — each shard is 100-something million — and we're done in 2.9 seconds, to the tune of 32 times 16 cores. Which is something that can be improved on, and that's basically the next topic. Any questions so far? Yes, sir. [Audience: with the foreign data wrapper infrastructure, can we use postgres_fdw here with the patch applied, so that it pushes down aggregates?] There is a patch that was already committed for 10.0 — the FDW interface itself could do it before, but the specific part teaching postgres_fdw to push down aggregates was needed, and we also added that part, in case I forgot to mention it. You could do similar things with Oracle; I don't know off the top of my head whether oracle_fdw can already push down aggregates. Do you happen to know, Joe? Yeah, me too — so we're drowning in doubts, my friends. But I know the guy who maintains it, who is also sitting in Vienna, by the way. He has a lot of time to work on the foreign data wrapper for Postgres, because he's basically working for city government bureaucrats — but he's a very good guy, you will love him, he's great. [Audience question, partly inaudible.] Well, you can basically overrule it: you can decrease this min_parallel_relation_size value for this particular table — so you could do an ALTER TABLE on it — that's one way to do it. And secondly, fix it in core. Those are the only strategies I'm aware of; basically you can just configure it for that table, since you know what's going to happen there. No, Postgres-XL uses a different strategy here.
We really tried to go as vanilla as we could — to use as much standard Postgres as possible: no external stuff, no magic, whatever. I really want to give you an outlook on what should be possible with standard Postgres, without using a code fork, without using a commercial product, without doing anything special. The point for me — and I hope I got it across — is to do as much as possible with standard Postgres and see what the missing parts are for things like this. Of course there are solutions out there: Greenplum, which does its own stuff; Postgres-XL, which does its own stuff; and some others — I'm short on names, I can't remember anything else right now. Everybody's doing their own thing here, but this is really about what standard Postgres core could potentially get. So, as I said: 28 euros, 14 cents — it was really worth it. Thank you, Raj. As I said, there's a lot more parallelism in PostgreSQL 10. There was a blog post the other week by Robert Haas, and it had so many items on it that I can barely remember them all — there is so much work going into Postgres parallelism for 10 and beyond. This is going to be, I would say, a game changer in this field. But let me add a word of caution: what you've seen is a trivial query — you're counting numbers, that's not too hard. Try the same with ROLLUP, grouping sets, windowing functions, things like WITHIN GROUP (ORDER BY ...), etc. — game over. Okay, game over. That simple. ROLLUP — if you have ROLLUP, it means male, female, total, right? It gives you this summary line: all the groups plus the overall total.
That's ROLLUP, and it requires sorting. Try to sort something across 32 shards without moving data around — I would say: not in three seconds. There is something on the way, I think, to make this work with hashing, like a normal query. But basically what I'm trying to say is that it's not going to work for an arbitrary query. It's wonderful for min, max, things like that — that's easy. But if you go for a median, for instance, that's not so trivial anymore; it's a bit nasty. Or if you go for windowing functions with OVER (PARTITION BY ... ORDER BY ...) — things start to get fishy. So there's a long, long way to go before a larger subset of SQL works on this amount of data. As I said: try to get there, see how far we are from a practical use case — typical data crunching: average, count, sum, whatever — and let's see what's needed to go on from there. But what I wanted to say is: if anybody has the impression that, hey, cool, we're just three patches away from scaling anything to a billion rows — that's not the case. That's not the case. Part of it is theoretically not possible, and part of it is going to take a couple of years. This has to be stressed, so that the point really comes across. Okay — how are things going to speed up? I've seen on the to-do lists that there are, I think, three teams — what's funny about Russia — three teams, I think one in Russia, one out of Germany... [audience remark] Yeah, west of Russia. Yeah, west of Berlin. I had relatives in Germany — my wife was German — and I always told them: you should build the wall behind you, not in front of you. Never got through. So: just-in-time compilation. I think there are three teams out there working on different approaches to just-in-time compilation. There is one team in Russia —
if I got the name right, it's some Russian team — and there's also somebody in the south-west of Germany taking a different approach. What you're going to see is that for this kind of query there is a lot of potential in Postgres to do it with fewer than 500 cores. If this is really nicely JIT-compiled and so on, we might be able to do it with 100 cores — which means not thirty machines, but six, assuming they have enough memory. So this is really going to rock big time once we get it in, in order to speed up this kind of analytics. That's really important stuff going on. The second thing, which could become possible — I was short of a better word here — is deeper execution. At the moment there is one node doing the aggregation, and that's it. But for some calculations you might go deeper: you might have a tree — something happens here, something happens at the next stage, something at the stage after that — so you basically funnel the whole thing through a system spanning multiple servers. If you do some trickery with views and things like that, it would basically already be possible with what we've got here. We did some experiments on some gearbox error-detection stuff, and it's basically already possible to have nodes feeding into nodes — to have some sort of pipeline of servers. So that's certainly going to be
So that's something which certainly gonna be Interesting to uh to examine in the future And then of course there's this huge thing called column store Right the point here is that currently postgres stores rose Okay, and if you want to have the fifth column you have to go through it and fetch it And maybe you don't even need the first three, okay So for many of this kind of workloads for time series, etc It's a lot easier to store things as columns because there's usually a lot more redundancy in a column than you have in the line Right, so if you have time The next line might be two seconds ahead and the next one is three seconds and two seconds, etc So you might only be want to store the deltas and not the whole thing So there's a lot more redundancy in a column usually than you got in the line Okay, and if you only need three columns out of 50 right column store is is Is a big deal for this kind of workloads So it's also going to reduce cpu consumption dramatically in the future as soon as somebody figures out how to implement it the best way Yeah There is a lot of stuff out there is contrip extension, but it doesn't work with parallel query There are forks, etc. There is a lot of stuff So but the aim is clear. It's I think here everybody is is Sufficiently aware of the notion that this is going to boost analytics because that's that's what it does vectorization, etc Okay, so any questions regarding the whole thing? Yes, sir Yeah, if you can show me the query, I can tell you Add if a sub select in the count Why do you want to do that? Oh Why not I can tell you a couple of reasons because maybe you don't want to do it for every row Maybe there is a better way to formulate that Okay, uh more questions Yes, sir. 
Well, there are two ways to approach the problem. Let me start with the inconvenient answer first. If you are in warehousing, analyzing historic data, there's no new sales data or yesterday's temperature readings still coming in. So one approach — and I'm not saying I'm for it, I'm just saying it's a philosophy that exists, put it that way — is that in this case you might not even care that you don't see the same snapshot on all the servers. At this point, with this approach, there is no such guarantee. There *is* a guarantee that you have 32 or 33 separate transactions going on, and MVCC ensures that every shard in itself is consistent. But there is no such thing as a global snapshot, because they all have different transaction IDs; they're all doing their own thing. Basically, it's just like googling three websites: they have nothing to do with each other, but you get the feeling that the data is accurate. There is no global transaction manager involved here, and that's a major difference to some other approaches. But again: when Postgres plans the query, you're in a certain transactional view, and then all those snapshots get opened at pretty much the same time. So it's not guaranteed to be precise, but for analytical work it's good enough in many cases, I would say. Okay, next one. — Postgres's default is always row storage. Always wrong? No, I didn't say that. There is a cstore extension for Postgres, but we haven't tried it with parallel queries. — Okay, any more questions? Okay, so thank you very much for your attention. I hope you liked it. Like us on Facebook, follow us on Twitter, tell all your friends about us, and — I'll be back.