 All right, let's get this party started. So this is reporting on Rails, ActiveRecord, and our OLAP working together. My name is Tony. I am Senior Developer at Moby in Indiana. We manage the corporate cell phone accounts for Fortune 500 companies, everywhere from bill optimization to device procurement, including tier one tech support. I work specifically on the billing and reporting team where I basically work in massage with the one million plus devices that we have under management plus all of the billing data that goes back several years. What I essentially do is take information from various sources, carrier billing, tickets from call center, orders placed from carriers, shove them into Postgres, sprinkle some magic on top, and out pops pretty charts and graphs that our clients love. Yeah. So to better explain what reporting is, the best I can do is use my current experience with Moby. Way back in time around 2012, ownership came to the dev team and uttered the dreaded D word. They wanted a bunch of dashboards to report on all the data that we have. And not just one dashboard, dashboards for our client administrators, dashboards for our internal support staff, and dashboards for ownership to actually see how the company's doing. And oh yeah, they want user defined filtering. You can slice and dice by just about any point of data that you want. And various other bits of scope creep that came up over the years. So where do we begin on this? Well, a couple of notes that the dev team laid out in front that we don't want to be bloat in our result set. That means we don't want actual active record objects coming back from the database. Active record is kind of large and we don't really need all that data that's coming back, which means less memory. What we really want is generic uniform data coming back from whatever system we build. This means a plain old bunch of arrays with rows of hashes with the information that we want. That way we can take that, call to JSON on it, shove it into the flavor of the month JavaScript front end framework for charts and we'll be on our way. So what is reporting anyway? Well, I'm going to assume most of us, our apps run against a relational database. A relational database holds data. Data is completely worthless to you and me. Humans can't really work with data. Computers work with data. What humans can work with is information. What reporting in analytics does on a high level is take data converted to information so humans can actually make decisions out of it. And more importantly, reporting answers questions. And this is probably the most important step in preparing your app to generate dashboards is to figure out upfront what questions you want to answer. That means going to all the primary users and ask what do you wanna know? What information do you need? Because if you don't get that and you just start randomly throwing queries against Postgres or MySQL, you'll be throwing spaghetti at the wall and you probably won't end up with a result or a product that people really can use because it doesn't answer the questions that they care about. So here's some examples of questions for each of the three stakeholder groups for mobile reporting. And we wanna answer all of these plus more effectively. So where do we go about doing that? Well, fortunately, there is an industry standard term called OLAP, online analytical processing that is built for data warehousing and analytics. Now, commonly with more traditional OLAP products, everything is rolled up into memory in the form of data cubes, which all the information is pre-slice and diced, pre-group together, everything upfront so you can run queries on it very fast. And OLAP commonly deals with aggregates, counts, max, min, averages. You don't really deal commonly with the individual rows from the data. You care about the grand picture. However, OLAP, you commonly use OLAP with TACON to Oracle products, MySQL products is a bunch of enterprise-y stuff. And when you think enterprise, you think money and we didn't have money. So there's gotta be a better way. Our OLAP to the rescue. This is relational online analytical processing. This is OLAP that runs with SQL, which is what our normal database talks with. It also allows for dynamic queries to be generated on the fly so we can get any information we want out very quickly with just some setup. And what's nice about our OLAP as well is that we can work with both our historical data, which is all the billing information and our transactional data, which is the support tickets that come through to our local support center. The ever-changing lines of service that change on a daily basis. That's transactional, it changes all the time. So our OLAP can work with both while OLAP is more designed for historical information, stuff that happens in the past. Once it's set, you don't touch it anymore. So like anything in a prize, there is a crap ton of terminology that comes with it. Now, we're gonna go through all of these and what's nice about OLAP is that you can relate any OLAP terminology to SQL and you can also relate it to something in Rails. So we're gonna use that as examples to build up the vocabulary because when you work with reporting, you wanna think in OLAP terms, not just simple SQL. The first term is a fact model, also known as a fact table. This is the starting point to get information out of the database. In SQL land, this is the from clause, this is the primary table that has the information that you want. In Rails, this is a standard model. So by looking at questions that our users want to answer, we can extrapolate out pretty easily what a fact model is. In this case, the support tickets table and our lines table. A dimension is a way to take your data and slice and dice it into various chunks. Commonly, these are relations to other tables in the database. So foreign keys linking to other tables. It can also be columns that live directly on the fact model as well. In Mobyland, we have cost center living on our lines table. You can group by cost center to get a report out of that or a line of service has a carrier. We have a carriers table. You can link to that to get information out that way. So a dimension in SQL land is always a group by and if you want to jump to another table, it's also a join. In Rails, this is a standard column or a has one or belongs to relationship. So when you look at questions you want to answer for reporting, I always look for the word by. I want something, I want to sum up something by something. So in this case, we have support tickets by type. So we're going to group by type. We have active lines of service group by carrier. So we're going to join against the carriers table and group by carrier. Next up is a dimension hierarchy. This is a way to go up and down a hierarchy of information of your dimension. The most common example is dates. So when you think about I want a group of stuff, a group of orders grouped by day placed, you just group by the day. But I want to go up a level and I say once I want everything grouped by the past five months, everything from the quarter, everything from the year. And hierarchy basically is a structure you build to go up and down from more general groups of data to more specific groups. In Moby land, we have devices. A device has a model number, a manufacturer, an operating system, and wireless technology. That's another example of hierarchy, where you can go up and down your groups of data. A dimension member, or I prefer calling them dimension labels, is actual information that when you look at a result of a report that you can work with. So lines has a carrier ID on it. We could easily group by that, but when you look at it in a table or a pie chart, a human can't make sense of that. So instead a label that you use and shove into the pie chart is the, in that case, the name of the carrier. If the dimension lives on the actual table, so for example, cost center. The cost center is the same thing you group by and also the label. Next up is filters. This is not really an OLAPI term. This is because OLAP is commonly set up for, once you pre-build and slice and dice your data, you don't really have much maneuverability unless you wanna build a completely different data cube to group by or filter by something else. Because we're working with our OLAP, we can use the where clause in SQL to further shrink the data set down or get more specific information. So in SQL, your filter is your where clause. In Rails, this is your the where method, standard active record scopes, or if you use the ransack gem, that results in a where clause as well, which can come in pretty handy. The measure in OLAP terms is basically the aggregate, your average sum, max, min, count, pretty much any aggregate based function that your database can provide. And what you commonly do is the measure is also the column that you plug into the function. So I want the sum of the total charges for an order. That's the measure. It's the sum and the column total charges. The count is obviously the exception because you don't really count on a column. You most commonly use count star in SQL land. Then finally, we have the metric. This is the report. The metric is just a fancy way of asking is the end result of the question that you want. So in SQL, this is the whole query. This whole damn thing. Rails, it's all of active records shoved together and execute to the database. So using our examples, the entire question can be the metric or part of the question can also be a metric. What you can do with our OLAP is start with a very simple and specific metric that you wanna ask and then tack on dimensions and more filters later. So you can think of the concept of having a bank of pre-built simple metrics that through maybe a user interface or just through your code, through configuration, you can start tacking on more stuff. So you have a base case and you can expand that out however you want. So I know that's a lot. Here's everything shoved together. Hopefully the colors stand out. But we have a complete question. Whole thing in this case is the metric. You have the sum for your measure. Mobile charges would be, we'll extrapolate the table out of that. That's your fact model or fact table. Last billing period is your filter and you're grouping by cost center. So that is your dimension. And a metric can have as many filters and as many dimensions you want. Just keep in mind the more you tack on, obviously the more complex your query is. But this is essentially all of our OLAP in a nutshell with the SQL equivalent and the Rails equivalent. So that's pretty much the industry level of it. How do you go about implementing it? Well first your data has to actually be organized in a way that is conducive to reporting. The most common way is called star schema. There's another setup called snowflake schema which is basically star schema plus plus. I prefer star schema because it's simpler, it's more direct and it's kind of easier to visualize in your head. The idea is if you can take all your tables and chart them out on a graph or link them together, you have fact models or your fact tables in the center of it. And branching off of that is every dimension that you could possibly run against the fact table. And again a dimension can be a simple relation or it can be a column on the fact table. But the end result is when you map them all together it looks like a star. Now a little gotcha in this is pretty much set up for standard SQL is that it's really hard to report on as many relationships. It is possible, you can throw the magical distinct keyword in the front of your query, but if you're on Postgres specifically that can easily result in invalid SQL just because of the way Postgres works. The reason why it doesn't work with has many very well is because when you are aggregating and joining you effectively get multiple rows back, duplicate rows back, and then you're summing against those duplicate rows and then your numbers are off. So avoid wanting to report on has many relationships as much as possible. The other way around this is to use subqueries which is hella slow and I don't recommend that. So using Mobe's example, here's three fact tables that we have identified, support ticket, a line of service, and a row on a bill in all and various dimensions that they can dimension off of. And again some of these are actual relations to other tables and some of them are actual columns on the database, on the fact model. Notice in line, well that actually almost works. We have a carrier and a carrier account. In Mobe, a carrier account has a carrier. So why not just, why is carrier a dimension online and why does line have a separate relation for that? Well, you wanna avoid doing multiple jumps as much as possible with our OLAP. The more joins you do, the slower the result will be. So what we do in Mobe is denormalize a lot of stuff that we wanna group by and put them on directly on the fact model. Now we could get around this with a has one through and effectively have a carrier relationship directly online. The SQL can be generated just fine, but again we're resulting in a double join at that point. Also something interesting to know is created date and the bill date for these two fact tables aren't actual date columns. They're actually separate tables to another relation. This is called a date dimension. And the idea behind this is especially for like a example, a warehouse of sales information. You want information broken down by year, by quarter, by weekday. It's hard to do that with databases, with a common, with a regular date column and have it fast. You can tell MySQL and Postgres take the state column and give me the weekday out and group by that. But it has to be done on the fly and it doesn't use the standard index if you just slap it on the date column. You can make standardized indexes for that, but again Postgres still has to, Postgres and MySQL have to calculate those values on the fly. Instead you have a separate table with a row for every day that you effectively care about. So in case of Moby, it's beginning of Moby's existence to 15 years from now just to cover all of our bases. And instead of having a date column on support tickets, we actually have a created at ID that links to a date dimension. And so what we can do then is say, give me all the support tickets broken down by quarter. So what we do is then we join against the date dimensions table, group by the quarter column, that's the label. And now we can easily report on that. This also allows you to, since each row in the date dimension already has the various parts of the date broken down, you effectively have a very complex hierarchy that you can go up and down the data as you see fit. So great, ActiveRecord can do all that, right? Why is Tony up here with his free ticket to RailsConf? Well it is true that ActiveRecord does have, does provide all the information needed to actually construct our OLAC queries. It can do a join, it can do a grouping. You can ask it for all of the, you can reflect on all the relations. You can ask for all the attributes on the model. And you can select out very specific columns using the pluck method. However, it does have some limitations. There is really no good way to group by the non-aggregate columns programmatically without manually putting this in. This is a specific gotcha for Postgres. MySQL doesn't have this problem, you can cheat. But with Postgres, if you have an aggregate in your select clause in the non-aggregate columns, you must include those columns in the group by, otherwise it considers it invalid SQL because it technically is. And so you have to make sure you balance the select and the group by and ActiveRecord can't really do that. In fact, it's mostly built for the count method, the maximum method, the minimum method. You just plug in one column and that is the number you get back. You don't get the grouping, you don't get the nice dimension label with that. And there's also not a good way to just have your models be described in RO-lappy terms. Like yes, you have a has one, you have has belongs two, but there's no way to actually say that these are dimensions. You can't actually just go out and list all of them without iterating through every possible relation in the table plus the standard attributes. And there's no really good way to just store pre-made queries very well. Yes, ActiveRecord foreign later does lazy loading so you can start tacking on a select, you can override the from, you can put on a where, and it won't execute it until you actually need to iterate over it, but there's no way to just grab something real quick and then start tacking on stuff pretty quickly. So what could we do? Well, we could hard code all of queries. That's great, except good luck trying to define custom where clauses and custom joins because as part of the requirements we had, we could build a courier ourselves, which is actually what we ended up doing. However, for commonly, I would think a normal Rails developer would just start tacking on extra methods to ActiveRecord to say, you know, give me all the dimensions and effectively dirty up the entire class at that point, which we really don't want. Or it could switch to a SQL. There's a, that's a great gem. It's good replacement for ActiveRecord. There's a much better way to build and define very complex queries on the fly compared to ActiveRecord. However, that ship has sailed quite a long time ago and I don't think management was really up for us to rewrite the entire app. So we had to do another way. And so what we did is we defined our own library for reporting. What I did was recently I extracted out most of the non-mobile logic, sort of clean roomed a lot of the stuff and implemented it in an open source gem. I call it Active Reporting because I'm terrible at naming. But basically this provides a DSL-like system so you can tell your app very stuff in RO-lappy terms. It's a very lightweight DSL. And what it does, it uses ActiveRecord and it asks ActiveRecord for bits of information about the database and about all the tables, tells it how to build the query for you and it just executes it directly on the database. And instead of ActiveRecord objects coming back, you just get an array of hashes at that point. Very simple, lightweight and small data set back with the information that you actually want. And it doesn't really dirty up ActiveRecord too much. I think it adds one method and maybe two at this point. I would say it's mostly production ready. Video games got in the way for me to build a demo app. But 0.1.1 is out. DSL's pretty much in a good spot. Documentation is pretty much what I need at this point. But this is effectively how it works. For every model you have in your app, you have a fact model to go with it if you want it to be reported on effectively. I'm calling this effect model instead of a fact table because we're modeling how it's used within our OLAP. Because Rails is convention over configuration, the idea is you have your ActiveRecord model name, fact model as your entire class and it'll just know to link to the proper model. And there's obviously a way to override it if you're so inclined. But the idea is all the reporting stuff gets shoved into these classes and not your regular models at that point. So with your fact model, you then define the dimensions that you want to work with. Now why are we whitelisting all this stuff instead of just saying, hey, ActiveRecord model, give me all of your relations, anything that has one or belongs to, what we can mention by, anything that's an attribute that we could probably group by, let's just use that. Well, what if we want a user interface where we can change the charts and graphs on the fly? Instead of, in Moby, I want all of my lines grouped by carrier, I want them instead grouped by carrier account. And maybe we'll have a dropdown saying I want to change this report altogether, change the dimension. So what we can do then is each fact model we can ask it for what dimensions it can work with. And the gem knows if it needs to join against another table because it's a relation or if it's an actual attribute just grouped by that. You can also define the hierarchy and the actual default dimension label. So if a dimension is a relation, the gem will assume that the default label will be a name. So line joins to carrier, carrier, table has a name, we use that as the label, or we can override that if we see fit. And then we can also define the hierarchy which effectively makes more dimensions. And so we can have a nice line graph of orders over time and I want to see it by date, now I want it by month, now I want it by year, now I want it by quarter. So this allows you to set up a hierarchy to drill up and drill down your data as needed. Your dimension filters, again, these are just where clauses and the fact model can then whitelist stuff that you can filter by. And again, why are we whitelisting this stuff? Because this can be possible user input. Scopes on a active record model are just glorified class methods. What's also a class method? Delete all and destroy all. So we don't want to just blindly allow any input being coming in from a form to call methods that are not really safe. So instead we whitelist on our fact models what a user or whoever's building the report can actually filter stuff by. And this can be done by just listing out pre-built scopes from the model, so it'll just whitelist those. You can define your own dimension filter using the same scope DSL, so throwing in a lambda with an input if you're so inclined. This allows you to not have to tack on all your known filters to your model. If you don't need to use them in the rest of the app you can throw them just all in the reporting. So that keeps your models slimmer. Or if you happen to have the ransack gem loaded up you can whitelist various ransack calls as well. The other benefit of specifying dimension filters manually is now you have effectively full control or mostly full control over what the ware clauses will be in the report. Active record isn't always the smartest in building optimal SQL on the fly. So if you can control that to maybe force using specific indexes or maybe force a union instead of an or you can do that. So that's a setup. Here's the actual execution. The gem has a concept of a metric and again a metric is the question you wanna answer. You build a metric by giving it a name as the first argument. You tell it the fact model you want the metric to be based off of. Then you can pass in dimensions, filters. You can set the aggregate, it defaults to sum. You can set the measure of the actual column that you want to sum max min on. It defaults to value, I think you can override that as well. But this builds an object that holds the question and all the information that it needs to reach out to the fact models to then reach out to Active Record to get all the information and build the query that you actually wanna run. And then finally you shove it into a report object. The report is effectively just a glorified courier that takes the metric and says build the SQL then ActiveRecordBase.connection.execute SQLGo. Yay, here's your very basic result set back. Now why are these separate objects? Well, as I mentioned in a previous slide a metric can be a very simple question you wanna answer and then tack on more stuff. The Active Reporting Report Object will allow you to take a metric and then merge in user input from the interface to say I wanna tack on the carrier dimension or no, I wanna change it to something else. Here's my form of all my filters for these reports. Take that hash, shove it in there. It'll go through the wide list and apply the where clause dynamically at that part. This is the power of our OLAP again because you can define a where clause. You can define pretty much anything on the fly as long as it'll result in proper SQL and get data back on it. So we built those two objects. This is the resulting SQL. You have the select clause is very specific to what you want. We are summing on the total charges column. The gem will give you whatever the metric name is as the aggregate result column. We are dimensioning on carrier. So we are going to then grab the carrier label which is the name. The gem has a nice ID that you can turn off where you can get the identifier back. So if you wanna build filters on the fly, embed the identifier column of the dimension in a pie graph, you click the pie graph and then your filters magically update to have sales impress potential clients which they never use the feature later anyway. But anyway, we built us the clause from the fact table. We then have to join on our dimension. We apply our dimension filter, our where clause and then we finally do the group by because we're in Postgres and we have to be valid query. And the end result for that, we called out run on it and we get back an array of simple hashes. Then we call to JSON on it or whatever we can massage it later with another service object if we so want to and then spit out a pretty chart, a table, a large number, whatever we wanna do. That's effectively all the gem does because again, reporting isn't about getting a table back of rows. We just want back aggregates of actual information that we care about. And so finally, just some pro tips in general for databases if you want to do reporting on them or any form of way of getting information out. As mentioned before, try to avoid double jumps as much as possible with your queries. Sometimes denormalizing is a valid solution. It's much easier in Moby to ask, give me all my active lines by carrier because carrier ID is directly on the table. We don't need to do a double jump at that point. And we just keep the carrier counting, carrier in sync from a very simple, an active record callback or you can even just use a database trigger if you're so inclined. You can also cheat around some has-minis by implementing countercashes, both the built-in countercash plugin or just manually generated pre-built counts through background jobs or whatever processes you want. That way you have some data pre-built and pre-set up for you so you can easily aggregate against that. Also index wisely if you missed the previous talk about some database optimizations with indexes. Before this, look that up. It was pretty informative about when to index, when not to index and current gotchas with that. But the common rule is if you're going into mention by something at the foreign key, you might as well index that. It would probably help if you're indexing, if you have common filters that you're gonna be filtering by a lot. Like I said, you whitelist the filters that you allow users to actually filter by. You can then use that to determine, I'm probably gonna need to index these columns or these groups of columns. And use explain analyze as much as possible. This query is taking 50 seconds. Why is it taking 50 seconds? Well, we have tools that our database provides that tells you exactly what the database is doing. And oh, I missed an index. Now it's suddenly half a second. Yay. So use the tools handy to optimize queries because this is still SQL. It's still a regular database. All we're doing is just dynamically building a query to run. Also as you grow and you start to outgrow, you go from a small app to a medium app to a large app, look into read only replication slaves for a databases. Anything reporting related, have it hit the read only because you're not doing writes and your master can take the day to day operations at that point. Or if you're in Postgres, look into sharding or even schema separation if you're a multi-tenant app. That way you have physically less data for the database to work with per client. So you only have slow queries for your biggest clients and you rest the clients that are smaller don't really have to take a hit. And you can focus on optimizing that one client as opposed to having to take a bunch of unhappy users at that point. So that's about all the rambling I have. Jim's on GitHub and released on Ruby Jim's as well. Copy the slides if you're so inclined. I have a GitHub repo called show and tell. That's where I put all the talks I've been doing. I don't Twitter much, I follow people, but if you're one too, I'm on Twitter. Questions, comments, hate mail, death threats, anybody? Okay, we're done early.