A quick survey: who here has heard about cubes, or dealt with Postgres cubes? Not from TradeGecko? No? Okay. All right, so have you ever dealt with analytical processing, using your database not for small, fast queries but to build an analytical system, a tracking system of user behavior? Nope? All right, good. Then this is going to be useful.

Let me first explain the problem we're trying to solve. You have a sales tracking system, and your customers want to know exactly how things are going, compared across products, locations, warehouses, salespeople. You want to be able to dissect your data in a way that lets you say, for example: get me all locations, given these salespeople, between this period and this period, and give me their total sales. You want to traverse that in almost real time. The problem is that once your data grows big, you have no way to do this in real time in normal scenarios. So that's where cubes come into play: these structures help you do this for big analytical projects, and you can use them in your company or in your side projects. Unfortunately, this hasn't been discussed much in the Rails community, but there has actually been a lot of work on it lately in Active Record and Arel.

The schedule for this talk: a bit of introduction to Postgres cubes, an interactive demo of building up cubes over a sales database, and then some discussion. I actually want your opinions and thoughts on architecture: how would you place this solution within your current project? And I'll touch on how this is now being adopted to, hopefully, refine Active Record.

From a technical perspective, we came across a problem in our reporting system. For customers with an average amount of data, we could handle the analytical processing pretty fast. But as soon as a customer grows to one or two years' worth of heavy data, our system would come to a halt. We'd time out, because we're constrained by Heroku's 30-second response limit, and we'd have memory problems, because Active Record loads a lot of data before doing the calculations, dissecting it, and producing a concrete result. So the solution was to start exploring how to make this faster and get results in almost real time.

Coincidentally, when we started considering this problem, Postgres 9.5 was in alpha, and 9.5 made a foundational change to the internal structures: the introduction of three SQL-standard keywords, GROUPING SETS, ROLLUP, and CUBE. These three concepts are very much interconnected, and I'll show you how. They had been floating around the SQL community for a long time: when is Postgres going to implement them? They're standard, they're important, and they open up a lot of business opportunities. Finally, one brave developer decided to take on that project and implement it throughout the whole database system. So this is what happened.
This is the commit that changed basically everything about Postgres. Post 9.5, Postgres can become a huge analytical database with massive powers. If you look up the author's name, you'll find a lot of articles about him. There have been private companies building these capabilities on top of Postgres or MySQL, like Percona Server, which is a fork of MySQL modified to include them, but now this is in the open-source repository.

His work touched everything: the query optimizer, the query planner, the syntax, plus backend and frontend changes, because Postgres is a process-based engine. Every query you issue spawns a process, and that process has to communicate with the frontend to know when execution is done and the data set is ready to be serialized. So he worked on various parts of the system to deliver this, and if you look at the implementation, it's quite massive in terms of file changes; it touches every part of the Postgres code base.

He explains in his commit the shortcomings of the implementation. It's an initial implementation: it gives you the power of cubes, but there are a few things you have to be careful about when using it. I'll explain that later.

So this new support for the three syntactic components, GROUPING SETS, ROLLUP, and CUBE, enables online analytical processing and multidimensional data analysis, which are the core ideas behind a cube. I'll explain that in detail, but let's start with a very simple scenario. You have a very simple table of six records: an id; a warehouse (warehouse one or warehouse two); suppliers one and two, with multiple transactions across those warehouses; you ship your items to retailer one or two; and there are three items you're tracking. A very simple scenario, right? Let's actually create this table.

I created it as a temporary table, so as soon as the session ends the table is auto-destroyed, and I insert the records I just showed you. Here's our table; this is the data set we'll deal with now. Imagine somebody asks you: can you get me the totals of all items shipped, from warehouse one? You'd basically group by warehouse: you eliminate all other factors, take the sum of your items, and there you go, you've got your per-warehouse totals, including warehouse one. If I ask for the same thing by supplier, you do exactly the same: you just replace the dimension in the GROUP BY with supplier. That's cool.
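A minimal sketch of what the demo table and that first query could look like; the table name, column names, and sample values are my assumptions, not the original slides:

```sql
-- Hypothetical reconstruction of the six-record demo table.
CREATE TEMPORARY TABLE shipments (
  id        serial PRIMARY KEY,
  warehouse int,  -- dimension
  supplier  int,  -- dimension
  retailer  int,  -- dimension
  item_a    int,  -- fact: quantity shipped
  item_b    int,  -- fact
  item_c    int   -- fact
);

INSERT INTO shipments (warehouse, supplier, retailer, item_a, item_b, item_c)
VALUES (1, 1, 1, 10, 5, 0),
       (1, 2, 1,  3, 0, 8),
       (1, 2, 2,  7, 2, 1),
       (2, 1, 1,  0, 9, 4),
       (2, 1, 2,  6, 1, 2),
       (2, 2, 2,  4, 4, 4);

-- Per-warehouse item totals: group by the dimension, sum the facts.
SELECT warehouse,
       SUM(item_a) AS item_a, SUM(item_b) AS item_b, SUM(item_c) AS item_c
FROM shipments
GROUP BY warehouse;

-- The same question by supplier: just swap the dimension.
SELECT supplier,
       SUM(item_a) AS item_a, SUM(item_b) AS item_b, SUM(item_c) AS item_c
FROM shipments
GROUP BY supplier;
```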
So this is very simple; this is what you'd usually do in a real-time application, and with the right indexing it won't take much time. Things get a bit more complex if I ask you for one table that gets me supplier totals and warehouse totals in a single query. Your query starts to look ugly. You basically union the two parts, and because Postgres requires that the parts of a union produce the same columns, you add NULL AS supplier to one part and NULL AS warehouse to the other, so the result looks like warehouse, supplier, item A, item B, item C. That's uglier, and annoying to maintain.

So they introduced GROUPING SETS: you get the exact same result with less syntax. A union forces your query to scan the data multiple times, across two indexes if you have two indexes, before joining the pieces in memory and returning them. GROUPING SETS scans only once, so you get roughly fifty percent better performance just from GROUPING SETS, with the same exact result set.

Now let's go into a slightly more complex scenario. ROLLUP is the concept you usually use when you want to spread out a tree. Say I want the total across all warehouses, then each warehouse's total, then each warehouse broken down by supplier, so I get every contributing factor to the total. You're rolling up from one element all the way downwards; that's a rollup. You'd typically use it when somebody asks: get me all our supplies from person A, and show me the details of all the small transactions from that person. You have totals, then one level of detail, then a second level, until you reach the most granular level of totals. That's the concept of a rollup.

To achieve that, you used to need three unions. You take the grand total (warehouse IS NULL and supplier IS NULL, meaning I'm totaling everything in the table), then warehouse one's total and warehouse two's total, and then each warehouse's contribution per supplier, so you have the details of each level. That's useful, but now I have three unions and I'm scanning three times, and we're only talking about two columns here.

Those columns are called dimensions. Anything that isn't in a summation, that you can't average or aggregate, is a dimension, and anything you aggregate is called a fact. So the item totals, item A, item B, item C, are the facts of this rollup, because they change, whereas your dimensions are the things you dissect by: you roll up by warehouse two, you roll up by warehouse one, and all the facts about them change with the level of granularity you're rolling up against. This is very common in warehousing applications and in a lot of analytical applications that care about details. But as you can see, it looks ugly; imagine representing that in Active Record. It doesn't look pretty.
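To make the contrast concrete, here's a sketch on the assumed demo table: the NULL-padded union versus the single-scan GROUPING SETS.

```sql
-- Pre-9.5: one UNION ALL branch per grouping, NULL-padded so the
-- column lists match. Each branch is a separate scan of the table.
SELECT warehouse, NULL::int AS supplier,
       SUM(item_a) AS item_a, SUM(item_b) AS item_b, SUM(item_c) AS item_c
FROM shipments GROUP BY warehouse
UNION ALL
SELECT NULL::int AS warehouse, supplier,
       SUM(item_a) AS item_a, SUM(item_b) AS item_b, SUM(item_c) AS item_c
FROM shipments GROUP BY supplier;

-- 9.5: the same result set from a single scan.
SELECT warehouse, supplier,
       SUM(item_a) AS item_a, SUM(item_b) AS item_b, SUM(item_c) AS item_c
FROM shipments
GROUP BY GROUPING SETS ((warehouse), (supplier));
```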
So the solution is this: ROLLUP. I get the same results, and better yet, I just have to say what I'm rolling up against. But order matters: the order of the dimensions decides the nesting levels. You say roll up by warehouse, then by supplier; if I roll up in the reverse order, I get the supplier level first and the warehouse level second. And I've cut my scanning by two thirds: I scan once where I used to scan three times. That's great.

Now let's go to a more complex scenario: a cube. A cube is basically every possible combination of your dimensions. If my dimensions are warehouse and supplier, I get: the grand total, totals by warehouse, by warehouse and supplier, and by supplier. You're getting every possible dissection of your data. Imagine how much index scanning that is, how much performance you have to sacrifice to get that result the old way. How many parts does the union version have? For two dimensions, four; for three dimensions, eight. This is exponential growth: every dimension you add doubles the number of groupings, 2^n, so it's not a linear growth in scanning, it's a massive hit to your database.

But with one great line, GROUP BY CUBE (warehouse, supplier), we get the same data with exponentially fewer scans. These are the three new components introduced into the engine, and they allow massive improvements if you're building, for example, your company's sales dashboard: you're probably tracking sales by country, by subscription plan, you want that data fast, and you want your dashboards updated quickly. That's what you get for free just by upgrading to 9.5.

Let me show you the three-dimension version. This is a 50-line union to group by three levels, warehouse, supplier, retailer: 50 lines to get 25 rows out of my simple table. I get the same result with three lines (four, actually, because I add ORDER BY just for clarity). Done: same result. So, 50 lines versus four, same data, fewer scans. Let's remember that.

That's really what the new Postgres features are about. As a company you're probably thinking about how to digest massive amounts of customer data, and Postgres presents this in a very simple manner. You don't have to learn much, because everything lives in the GROUP BY: instead of grouping by one or two dimensions, you say GROUP BY ROLLUP, GROUP BY CUBE, GROUP BY GROUPING SETS. You're just telling the engine how to group, and it does it for you. These are the three examples, this is the last one with three dimensions, and this is what the actual cube looks like.

From TradeGecko's perspective, this was the real motive. We did not want to maintain complex code that hits our indexes and our database more times only to get the same result; that was very inefficient, and we wanted that boost. That's why we made this change. So: three very simple concepts, GROUPING SETS, ROLLUP, CUBE. And as you saw, you can represent them without 9.5; in your current database version you can build a cube.
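The 9.5 one-liners in question, again on the assumed demo table:

```sql
-- ROLLUP nests left to right: grand total, per-warehouse totals,
-- then each warehouse broken down by supplier.
SELECT warehouse, supplier,
       SUM(item_a) AS item_a, SUM(item_b) AS item_b, SUM(item_c) AS item_c
FROM shipments
GROUP BY ROLLUP (warehouse, supplier)
ORDER BY warehouse, supplier;

-- CUBE emits every combination of the dimensions (2^n groupings)
-- in one scan, replacing the exponential pile of UNIONs.
SELECT warehouse, supplier,
       SUM(item_a) AS item_a, SUM(item_b) AS item_b, SUM(item_c) AS item_c
FROM shipments
GROUP BY CUBE (warehouse, supplier);
```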
You'll just do more scans and waste more time. The commits that were contributed actually built on top of what already existed; the nice parts are the performance improvements and the better query plans.

Now I want to discuss some architecture with you. How do you think this would fit into a bigger system? If you're tracking and dissecting data, and you're a company that has users, would you build a cube per user? Per account? Think about the problem you'd be solving with this analytical capability. Would you build it per subscription, for example, to track all aspects of a subscription, all the clicks, all the changes per page, so you have a page dimension and a click fact? Imagine the possibilities. How would that fit into a bigger system?

One problem is that Active Record has a static resolution, in a sense: it expects every model to map to a table. That's the default. But if I'm building cubes per account or per user, I might have one description of the cube and resolve it in real time: based on who's querying, which API is being called, who the user is, what the login is, I resolve dynamically to the correct cube to get the data out.

And remember, cubes are expansive: as you saw, six records generated a 25-row cube. The thing about cubes is that they cover all the possibilities: they pre-calculate every possible grouping you might ask for. So if I had a million-row table, its cube might be, well, like a billion rows, and in our own data sets we have cubes beyond 50 gigabytes each. It's very simple mathematics: you're optimizing for speed, but you sacrifice space. That's the space-time trade-off.

Also, cubes are not really models. They're a calculation: when you do GROUP BY, you expect to crunch some records, do some mathematical operation like a summation or an aggregation, and get a result. A cube is the result of a calculation; it's not a model. So how do you store that? If I have a cube of 25 or 50 gigabytes, I can't regenerate the whole thing every time you run a query; that's not possible. So the question becomes: how do I store this, and how do I cache it? Of course, database engines have facilities for that, like materialized views, which you can refresh every once in a while: the cube recalculates itself and re-caches the data, and all your queries hit the cache, so you don't redo the work every time a user asks.

The same goes for your pipeline. Sometimes your production data is dirty: it's not very coherent, it carries technical debt, and you need to clean it up before it's ready to be built into a cube. So there's your ETL pipeline: how do you extract, transform, and load your data into a cube in the first place? That's another concern. The concept of a cube is very nice, but once you start thinking about what it looks like in an actual production system, how it maps to all the abstractions, you also have to face how huge the number of records can grow.
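A minimal sketch of the materialized-view caching idea mentioned above; names are illustrative, and note that Postgres won't let a materialized view reference a temporary table, so this assumes a permanent shipments table:

```sql
-- Cache the cube as a materialized view; queries then hit the stored
-- result instead of recomputing the aggregation per request.
CREATE MATERIALIZED VIEW shipments_cube AS
SELECT warehouse, supplier, retailer,
       SUM(item_a) AS item_a, SUM(item_b) AS item_b, SUM(item_c) AS item_c
FROM shipments
GROUP BY CUBE (warehouse, supplier, retailer);

-- Recompute and re-cache on whatever schedule suits the data
-- (e.g. nightly); until then, every read hits the cached rows.
REFRESH MATERIALIZED VIEW shipments_cube;
```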
So how do I optimize? Say I have, as I said, a 50-gigabyte cube. Should I create indexes around it, make sure it's heavily indexed per dimension? Should I minimize dimensions? I shouldn't be too generous: the bare minimum I can get away with is what I should build. One other possible solution, from the theoretical foundations of cubes (if you read about multidimensional query processing), is to build smaller cubes and then run queries across them. But the trade-off is that a query across cubes is more complex, so you're sacrificing complexity for speed again. It's a never-ending thing.

Then there's probably the most difficult challenge. You have Active Record migrations, which are very nice, but in a production system you might find a bug in your cube query: that calculation should change, or we can't multiply by this number, we can't use this equation. What do we do? I have a production system relying on a cube, and now I have to migrate it, change it in real time. What does that look like? I've listed a few reasons to change a cube, but I bet you can come up with more. You have business-rule bugs. You want to remove a fact: you A/B tested the cube, nobody's really interested in this type of totals, so maybe remove it, it's just wasted space. Or you want to add a dimension: your product manager comes in and says, I talked to a customer, and if we add this dimension they will love us, they'll subscribe for a year. Okay: that's more space, a migration, and maintenance to think about. So how do you change your system in real time? There's no foundation for this, no real discussion around how to migrate cubes.

From a TradeGecko perspective, we've actually had to build that migration system. We have in place a migration system that lets you swap your cubes in real time, and we rebuild the cubes offline. Imagine you come in and say, okay, I want to rebuild the cube now, but there's already a cube in operation: we rebuild the new cube under a different alias, and when everything is ready, we swap the cubes in one transaction and destroy the old one (there's a sketch of this below). But that's done manually; that code is written as a migration framework around the cube concept. It's not in Active Record, it's not in Rails; it had to be built.

The good news is that adoption has actually started. Arel, the query library that Active Record relies on to build SQL queries, now includes the three keywords for Postgres. It already supports them, so you can build cube queries today if you talk natively to Arel. The real challenge is building the abstraction inside Active Record, because that's going to be very different from your normal models. As I said, cubes are not really models: their storage is different, they're not actually tables, and you've got the problem of Active Record migrations.
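To make the rebuild-and-swap idea concrete, here's a minimal sketch assuming the cube is cached as a materialized view; the talk doesn't show its actual code, and all names here are invented:

```sql
-- Rebuild the corrected cube offline, under an alias, while the old
-- one keeps serving reads.
CREATE MATERIALIZED VIEW shipments_cube_next AS
SELECT warehouse, supplier,
       SUM(item_a) AS item_a   -- the corrected calculation goes here
FROM shipments
GROUP BY CUBE (warehouse, supplier);

-- Swap the two in a single transaction, then drop the old version.
BEGIN;
ALTER MATERIALIZED VIEW shipments_cube RENAME TO shipments_cube_old;
ALTER MATERIALIZED VIEW shipments_cube_next RENAME TO shipments_cube;
COMMIT;

DROP MATERIALIZED VIEW shipments_cube_old;
```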
It's a completely different migration framework. And the resolution can be a cube per user, a cube per customer, a cube per whatever, so it's not really a one-to-one relation with your models. All of these challenges are present when you tackle the problem of extending Active Record to include this. All right, I think I'm done. Thank you very much. Any questions?

[Audience: I avoided this kind of project because I hate databases, so I don't know anything about this. How long did your biggest query take beforehand?]

On some of our biggest accounts, a query that dissects by three or four dimensions would take up to 12 minutes. With cubes we can do it in under 30 seconds. Let me actually show a small demonstration on a production account; this is my test account for cubes. (I hope the internet connection holds up.) Here I'm doing something very simple: we're dissecting your sales orders by customer, so we're grouping by customer and their total sales. Everything is pre-fetched, so you have multiple facts in place; you can bring up all these facts and get the data right away. All of these are facts of the same dimension. Now let's add dimensions and query more complex scenarios: a product dimension, a location, a channel, an assignee, the salesperson. There you go. That's a sizeable cube, about seven gigabytes of data, for a good-sized customer. And this is only for the last 30 days; I can dissect by a year, and that won't take much longer.

The operational power cubes give you is that the answers are pre-generated. The engine just has to traverse your cube to get them out; it doesn't calculate anything in real time. And of course you can apply as many filters as you want and complicate your query however you like, say, total sales of more than three thousand, and it comes right back. Everything is faster when you pre-compute things, but as I said, there are a lot of trade-offs around that, so you have to think carefully about what it looks like in your own systems.

Any other questions?

[Audience question about library and tooling support.]

Cubes as a concept were introduced into Postgres only recently. I think the real gap, and I think it will be resolved soon, as developers interact with the concept more, is that you'll start to find a lot of small extensions and libraries that solve these problems. The real issue is integrating it into your Rails code base: that's a big challenge, because you have to incorporate the concept in a way that feels natural to advanced developers as much as to beginners.

Any other questions?

[Audience: Can you write to a cube?]

You can't really write to a cube. Well, actually, you can't, because you're storing a materialized view, and you can't write to a materialized view. I'm sure you've done your research.
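Going back to the demo for a moment: reading from a pre-built cube is just a filtered SELECT over pre-computed rows. A hypothetical sketch, with sales_cube and its columns invented for illustration:

```sql
-- Nothing is aggregated at query time; the engine only filters and
-- returns rows the cube already holds.
SELECT customer, product, channel, total_sales
FROM sales_cube
WHERE total_sales > 3000;
```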
[Audience: In the past, when I dealt with large, complex calculations, I'd create a smaller table in front, for those dimensions, covering the most-queried queries; the queries users could run mapped to what the UI shows, so we knew what they were going to run anyway. Did you do any analysis of the cube versus that traditional approach?]

Yeah, I think what you were solving is a completely different problem from ours. Our problem was that our users wanted flexibility: they didn't want to be limited in how they traverse their data and look at their operations. So in a sense, we as a team couldn't predict "ah, people are never going to ask for that." We had to build a structure that lets you traverse freely. Imagine yourself as a manager or an admin on this app: let me try this filter, let me see this data, let me see this summation. You're interacting with the system, and you expect it to be responsive. If a system limits you, you think: they're not giving me what I want, I might as well leave.

Any other questions?

[Audience asks whether cubes are built per user or per account.]

We build by account. We chose by account because we assumed we're going to invent a lot of analytical widgets across the app. As a user you might not be interested in customer dimensions, but when you go to the customer pages, you'll see analytical data showing statistics and an understanding of how customers behave. We couldn't limit it per user, because that would mean a lot of duplicated data: if you and I are on the same account, we should share the same cube. And the other part is that we didn't want to restrict it by user, because the really interesting thing is that we can embed a lot of information across the app, in a way that makes the application information-rich.

[Audience asks whether the cubes are built through Active Record.]

Well, actually, no. Active Record caused a lot of loading problems, a lot of data problems, so we used vanilla SQL for cube building, and data synchronization happens at the Postgres level; it never reaches the Ruby level. We use foreign data wrappers, which are extensions to Postgres that let you synchronize data Postgres-to-Postgres, peer-to-peer. Our analytical database uses the production database to pull the data and clean it up, all at the database level, so our application never pulls the data, does the cleaning, and pushes it back.

[Audience asks about the load this puts on the production database.]

That has been resolved by building at the correct time: we did a small survey of when the fewest hits come to our database, and we build then.

[Audience asks whether this can synchronize across many nodes.]

Actually, you can build that. We didn't need to synchronize a lot of nodes, but Postgres provides that ability if you want it; you'd just have to do a bit more DevOps work. We didn't reach that point; we didn't have to.

[Audience asks about disk space during a rebuild.]

Disk space, right. So if the cube is ten, then during a rebuild the old and the new cube exist side by side, so you actually need twenty.
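On the foreign-data-wrapper point: the talk names FDWs but not a specific wrapper, so here's a sketch assuming postgres_fdw, with all connection details and object names as placeholders:

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Point the analytics database at production.
CREATE SERVER production_db
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'prod.internal', dbname 'app_production');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER production_db
  OPTIONS (user 'readonly_reporter', password 'secret');

-- Expose selected production tables locally; cube builds can then
-- read and clean them entirely at the database level, never via Ruby.
CREATE SCHEMA IF NOT EXISTS staging;
IMPORT FOREIGN SCHEMA public
  LIMIT TO (orders, customers)
  FROM SERVER production_db
  INTO staging;
```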
Yes.

[Audience: How big are all the cubes combined?]

Well, we have a bit more than a terabyte.

[Audience: Is there a product requirement to show data in real time, updating in real time?]

Actually, that's the problem: updating in real time is not efficient. When we build a cube, we build it per day: we wait until your sales for the day finish up, and then we rebuild. Because if you think about it, say you have two years' worth of data and suddenly you add one order, one sales operation: I'm not going to rebuild a whole ten gigabytes of data just to include one row. There's also an effort to make the rebuild operation smarter, so that it has its own metric. You can set up your own business rules around it: if you didn't reach this threshold and this update rate, skip rebuilding the cube, wait until tomorrow, then recheck the business rule. You don't want to over-consume, because, as Marcus said, the real problem is hitting your operational database with reads: you're going to read a lot from it to build a cube. It's a trade-off; you have to think about it carefully. Any other questions?
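One way such a "skip the rebuild below a threshold" rule could look, as a rough sketch; the orders table, the time window, and the threshold are all invented:

```sql
-- Only refresh if enough new rows arrived since the last daily build.
DO $$
BEGIN
  IF (SELECT count(*)
      FROM orders
      WHERE created_at >= now() - interval '1 day') >= 1000 THEN
    REFRESH MATERIALIZED VIEW shipments_cube;
  END IF;
END
$$;
```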