I'm Raimonds Simanovskis and I come from far, far away: from Latvia. Where is Latvia? Well, that was the question many Americans asked three years ago during the London Olympic Games, when the US beach volleyball team was knocked out by a Latvian team. And then there were many questions on Twitter: so these are Latvian guys, but where is Latvia? Do they even have beaches? I thought they just had vampires and castles and stuff. Therefore I wanted to start with a short geography lesson. According to some movies, vampires live here. But Latvia is located across the Atlantic Ocean, in northeastern Europe, as you can see there. According to some other movies this vampire stuff originated in Transylvania, but that's more to the south of Europe; that's not us. But we have a 500 kilometer long beach, and if you didn't know, that's 310.686 miles. So we have a lot of beaches, and therefore beware of Latvians when we play beach volleyball.

Okay, now back to our topic: data warehouses and multi-dimensional data analysis. Imagine we are building a Rails application which will track product sales to our customers. We have several models in our Rails application: customers, which have many orders; each order is placed on a particular date and can contain several order items; each order item contains the price and quantity of the product that was bought; and products belong to product classes. So this is our simple Rails application. We also went to the last talk and learned that we should use PostgreSQL, so we designed and stored everything in Postgres. This is our database schema, with customers, orders, order_items, products, and product_classes tables. And so we are proud Rails developers with our good-looking Rails application.

Then one day our CEO calls us and asks a question: what were the total sales amounts in California in Q1 last year, by product families? Okay, we'll find it out. Let's look at our database schema. Where do we store amounts? We have the order_items table, and there we have an amount column, so we should probably start with that one. And as we like Rails conventions and want to write everything in Ruby, we start with a sum of order item amounts. The next question is "in California": where do we have this geography? Geography we have in the customers table, therefore from order_items we need to join the orders table, then join customers, and add the condition that the customer's country is USA and the state is California. Now, "in Q1 2014": where do we have this time information? It's the order date in the orders table, which we have already joined, so we just need to add a condition. We could translate this condition to "date is between the 1st of January and the 31st of March", but we would like to stick to the original criteria, therefore we will extract the year and the quarter from the date. We will use Postgres-specific functions for that and limit to year 2014, first quarter. And finally we need it "by product families", which means we now also need to join products and product_classes, then group by product family and get the sum.

So we finally got the answer. It's probably not the shortest query in Rails. If we compare what we wrote in Ruby with the generated SQL, what we wrote is a little bit shorter, but not by much. And we could also do it directly in SQL.
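A minimal sketch of what this query could look like in ActiveRecord; the model and column names (OrderItem, orders.order_date, customers.country and state, product_classes.product_family) are my assumptions based on the schema described above:

    # total sales amounts in California in Q1 2014, by product family
    OrderItem.
      joins(order: :customer, product: :product_class).
      where(customers: { country: 'USA', state: 'CA' }).
      where("EXTRACT(YEAR FROM orders.order_date) = ? AND " \
            "EXTRACT(QUARTER FROM orders.order_date) = ?", 2014, 1).
      group('product_classes.product_family').
      sum(:amount)
    # => hash of { "product family" => total amount, ... }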
We presented the result to our CEO. But then he asked the next question: and also the sales cost? Well, we could write a separate query, but that wouldn't be as performant, therefore we'll modify our query. Unfortunately, with Rails relations we can't make a sum of several columns at once, therefore we need to write some tricky stuff: we explicitly select the product families and the sums of sales amount and sales cost, and then map the result rows to just the non-empty attributes. Okay, but then our CEO continues to ask questions: and I need the customers count. Okay, we can also add a count of distinct customer IDs and return that as well.
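A hedged sketch of this combined query, with the same assumed names as before:

    # one query returning several aggregates per product family
    rows = OrderItem.
      joins(order: :customer, product: :product_class).
      select("product_classes.product_family AS product_family",
             "SUM(order_items.amount) AS sales_amount",
             "SUM(order_items.cost) AS sales_cost",
             "COUNT(DISTINCT customers.id) AS customers_count").
      group("product_classes.product_family")
    # map the rows to just the selected (non-empty) attributes
    rows.map { |r| [r.product_family, r.sales_amount, r.sales_cost, r.customers_count] }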
But we start to worry a little bit: these are ad hoc questions, and every 15 minutes our CEO will call us and we will need to write some new query. It would be better if we could somehow teach users to write these queries by themselves. We once tried it and explained how easy it is to write everything in the Rails console and get the result, but unfortunately our business users didn't understand that. So something is not quite good there.

As well, our business is doing pretty well and the number of orders and order items is growing, and we noticed that aggregated queries on large data volumes get slow. For example, for testing we copied a production database to a local computer and got some 6 million rows in the order_items table; aggregating sales amount, sales cost, and the number of unique customers, without any conditions, took 25 seconds, which is not quite good for ad hoc queries. Well, then we asked some consultants what to do, and some consultants came and told us: well, SQL is bad, yeah? You should use NoSQL, yeah? Or introduce some Hadoop cluster and write MapReduce jobs in JavaScript which will calculate everything you need. Well, probably not; we still like SQL, so we probably shouldn't do that.

But let's return to some classics. Already 20 years ago the first edition of The Data Warehouse Toolkit by Ralph Kimball was written; it's now in its third edition, and I would definitely recommend it to anyone interested in the topic. This book talks about dimensional modeling, and the main objectives of dimensional modeling are, quoting this book: to deliver data that's understandable and usable to the business users, and to deliver fast query performance. And how do we do this dimensional modeling? We identify the terms that we see in these business questions, these analytical questions, and model our data structures based on that.

So let's look again at this question: what were the total sales amounts in California in Q1 2014 by product families? The first thing we will always notice are the so-called facts or measures: numeric measures that we would like to aggregate by some other dimensions. And the dimensions we can also identify in these questions: we have California, which is a kind of customer origin dimension; then we see a time dimension; and we see a product dimension, or product family dimension. So just by talking with our business users we can identify which are the facts and which are the dimensions that we will need to use.

These dimensional modeling and data warehousing techniques suggest that we model a so-called data warehouse, where we store the same data but organized according to the dimensions and facts that we see in these queries. The typical database schema used for that is the so-called star schema, because most often we will see one table in the center and then a lot of tables with foreign keys linked to this central table, so it looks like a star. These are the fact and dimension tables. Let's start from the center: this is the fact table, and we are using a naming convention with an f_ prefix, so f_sales for the sales data. The fact table always contains foreign keys to the dimensions, like customer_id, product_id, time_id, and then the numeric measures we would like to analyze, like sales quantity, amount, and cost. It is linked to the dimension tables, for which we use a d_ prefix naming convention. So there is the customers dimension, where we see all the customer attributes. And then there are some special dimensions like the time dimension: instead of extracting the year or quarter dynamically during our queries, we want to pre-calculate them. Therefore, for each date that appears in our sales facts, we pre-calculate and store a corresponding time dimension record, with a time ID as well as the pre-calculated year, quarter, and month, both as integers and as strings that can be presented to users, for example how we want to represent the quarter name or month name, et cetera.

And sometimes we don't have simple star schemas. Sometimes we have so-called snowflake schemas, where some dimensions, like product in our case, are linked further to some classes or categories dimension, like product classes in this case. And if we have a lot of these, then our database schema starts to look like a unique snowflake.

We will store this star schema in a separate database schema, or it could even be a separate database if we want that for performance reasons. And how would we manage it from our Rails application? We create corresponding Rails models on top of these fact and dimension tables: a sales fact, customer dimension, time dimension, product dimension, et cetera. As this is a separate database schema, we need to regularly populate this data warehouse schema with the data from our transactional data. The simplest case would be to just regularly repopulate the whole schema: truncate, for example, the existing customers dimension table, then select from our transactional schema and insert all the necessary fields into our dimension table. In the case of the time dimension we need to generate it dynamically: we select all the unique order dates which appear, pre-calculate which year, quarter, and month they belong to, and store these pre-calculated values in our time dimension table. And finally we need to load the facts: in this case we select the data from the orders and order_items tables, extract the sales quantity, sales amount, and sales cost, and store the corresponding foreign key values to our dimension tables. One thing you can see there: to simplify time dimension ID generation, we use the convention that the time ID is four year digits, then two month digits, then two day digits, so that we always understand which date a time ID refers to.
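For example, a minimal sketch of such a time dimension load, assuming a TimeDimension model on top of the time dimension table and the YYYYMMDD time ID convention described above:

    # repopulate the time dimension from all distinct order dates
    TimeDimension.delete_all
    Order.distinct.pluck(:order_date).each do |date|
      quarter = (date.month - 1) / 3 + 1
      TimeDimension.create!(
        id:           date.strftime('%Y%m%d').to_i, # e.g. 2014-03-31 => 20140331
        year:         date.year,
        quarter:      quarter,
        month:        date.month,
        quarter_name: "Q#{quarter} #{date.year}",   # string representation for users
        month_name:   date.strftime('%B %Y')
      )
    end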
Now if we return to the original question, how would we solve it? All our queries will now be more standardized: we always start from the sales fact table, join the necessary dimensions, like customers, products, product classes, and time, specify conditions on the dimension tables (USA, California, year 2014, quarter one), group by product families, and get the sum. It probably isn't a much shorter syntax, but at least it is more standardized, and we always know how to approach these analytical queries.

But still, we probably wouldn't teach our users to write these queries directly, and we are still limiting ourselves to two-dimensional table models, where we store everything in standard two-dimensional tables. A much better abstraction for these analytical queries is the multi-dimensional data model. Let's imagine we have a multi-dimensional data cube. We can probably visualize three dimensions, but imagine that you can have multi-dimensional data cubes with an arbitrary number of dimensions, and at each intersection of dimension values we store the measures which correspond to those particular dimension values. In our case, imagine we have a sales cube with customer, product, and time dimensions, and at the intersection for each particular customer, product, and time period we store what the sales quantity, sales amount, sales cost, and unique customers count were.

Some dimensions might be just a detailed list of values, but other dimensions could have hierarchies with several hierarchy levels. For example, in the customers dimension, in addition to the detailed customers level, we have all customers together, then we can expand to individual countries, then countries to states, then states to cities, then cities to individual customers. In the case of the time dimension we could even have several hierarchies: sometimes we want to report by year, quarter, month, and individual day, and sometimes we want to do weekly reporting, and then the same dates we can group together by weeks and then by the years they belong to.

And there are special technologies that are better suited for this and which use this multi-dimensional data model. They are typically called OLAP technologies, where OLAP stands for online analytical processing, as opposed to traditional OLTP systems, which do online transaction processing. These technologies concentrate on how to do analytical queries efficiently. There are several commercial technologies for that, as well as open source ones, and one of the most popular open source OLAP engines is the Mondrian engine by Pentaho. That's a Java library where you need to write XML to define the schemas, and we Rubyists don't like Java and XML so much, so a couple of years ago I created the mondrian-olap gem, a JRuby gem which embeds the Mondrian OLAP Java engine and creates a nice Ruby DSL around it, so that you can use it from plain Ruby.

So let's introduce this Mondrian OLAP into our application. The first thing we need to define is the Mondrian schema, where we map the dimensions and measures that our users will use, and which represent the business terms, to the fact and dimension tables and columns where the data is stored. Let's look at an example.
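Here is a hedged sketch of such a schema definition, based on the mondrian-olap DSL; the exact table and column names (f_sales, d_customers, and so on) are assumptions carried over from the star schema above:

    require 'mondrian-olap'

    schema = Mondrian::OLAP::Schema.define do
      cube 'Sales' do
        table 'f_sales'
        dimension 'Customers', foreign_key: 'customer_id' do
          hierarchy has_all: true, all_member_name: 'All Customers', primary_key: 'id' do
            table 'd_customers'
            level 'Country', column: 'country'
            level 'State',   column: 'state'
            level 'City',    column: 'city'
            level 'Name',    column: 'fullname'
          end
        end
        # the Products and Time dimensions would be defined in a similar way
        measure 'Sales Quantity',  column: 'sales_quantity', aggregator: 'sum'
        measure 'Sales Amount',    column: 'sales_amount',   aggregator: 'sum'
        measure 'Sales Cost',      column: 'sales_cost',     aggregator: 'sum'
        measure 'Customers Count', column: 'customer_id',    aggregator: 'distinct-count'
      end
    end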
We define the Sales cube, which will use the f_sales fact table. Then we define our dimensions: we have a customer dimension with its foreign key, which will use the customers dimension table in the data warehouse schema, and we specify all the levels we want to use in this dimension and in which columns they are stored. We define the product dimension and the time dimension in the same way. Finally, we also describe the measures we will use in our schema, like sales quantity, sales amount, and sales cost, which use sum as the aggregator, and then the customers count measure, which uses a different aggregator: a distinct count on the customer ID in the sales fact table, to get the unique count of customers for a particular query.

Now when we look at the same question and how we could get the results using Mondrian OLAP, it's very simple and nice: it almost directly translates the question into a query. We say that from the Sales cube, on columns (as column headings) we want to put the sales amount; on rows we want to put all product families, taking all members from the product family level; and we put a filter that from the customers dimension we want just USA, California, and from the time dimension quarter one of 2014. And we get the result. We don't see any technical implementation details; they are hidden, defined once in the Mondrian schema definition. Internally the Mondrian engine, like several others, uses the MDX query language, which is one of the most popular query languages for these OLAP tools and looks a little bit similar to SQL, but not quite. The mondrian-olap JRuby gem translates this query builder syntax into the MDX query language, which gets executed, and as a result we get a result object which we can query for the column headings, the row headings, and the cell values.
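A hedged sketch of this query and of reading the result, using the mondrian-olap query builder; the connection settings are assumed, and the member names follow the schema sketched above:

    olap = Mondrian::OLAP::Connection.create(
      driver: 'postgresql',
      host: 'localhost', database: 'dwh',  # assumed connection settings
      username: 'dwh', password: 'secret',
      schema: schema
    )

    result = olap.from('Sales').
      columns('[Measures].[Sales Amount]').
      rows('[Products].[Product Family].Members').
      where('[Customers].[USA].[CA]', '[Time].[2014].[Q1]').
      execute

    result.row_names # => the product family row headings
    result.values    # => the cell values, one row per product family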
There are several other benefits of this Mondrian engine. When we execute some large MDX query where we do no filtering (again, I tested it on some six million rows in the fact table), the initial query will take some 21 seconds, but when we execute the same query a second time, it executes in 10 milliseconds, because the Mondrian engine caches the results in its multi-dimensional data cube model. It doesn't cache the queries; it caches the actual results, so that when we make a new query it analyzes: okay, these data cube cells are already cached, and for those that are not, it generates a corresponding SQL statement to populate the data. And as in these analytical solutions we don't need information that's up to date to the latest second, we typically just regularly populate our data warehouse schema with the data, and while it's populated Mondrian can cache all the results, so if many users are asking the same thing, the results will come very fast.

An additional benefit is that we can now much more easily introduce additional dimensions based on additional data attributes. For example, in the customers table we had a gender column which stored F or M as values for female or male, and we want to add an additional gender dimension to our schema. We can easily create a new gender dimension mapped to the gender column of the customers table, and in addition, as for users we want to decode that F means Female and M means Male, we can put in a name expression which will be used for generating the names of the dimension members. Then we can use this dimension in our queries in the same way as any other.

We can do even more advanced, dynamically calculated dimensions. For example, we have a birth date for our customers and we would like to analyze sales by customer age, split into several intervals, for example less than 20 years, 20 to 30 years, 30 to 40, et cetera. But we only have a birth date, so the age needs to be calculated dynamically. We can define a new age interval dimension where we once specify a more complex SQL expression, which dynamically calculates the difference between the birth date and the current date and, based on this interval, outputs either "less than 20 years" or "20 to 30 years", et cetera. In this way we dynamically generate a new dimension with these values, and whenever we make queries it will be up to date based on the current date.

And finally, one of the benefits of this Mondrian engine is that we can also make calculation formulas, these calculated measures: for example profit, which is sales amount minus sales cost, or margin percentage, which is profit divided by sales amount, where we can specify a format string saying it should use percentage formatting. As a result we can query these calculated measures in the same way as stored measures and get the results back, properly formatted. And in these MDX calculation formulas there is almost everything you can do in Excel: for most Excel functions there's a corresponding function in MDX as well, so you can do a lot of more advanced calculations there.
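For illustration, a sketch of such calculated measures in the mondrian-olap schema DSL; the measure names again follow the schema assumed above:

    calculated_member 'Profit' do
      dimension 'Measures'
      formula '[Measures].[Sales Amount] - [Measures].[Sales Cost]'
      format_string '#,##0.00'
    end

    calculated_member 'Margin %' do
      dimension 'Measures'
      formula '[Measures].[Profit] / [Measures].[Sales Amount]'
      format_string '0.00%'
    end

These blocks go inside the cube definition; after that, '[Measures].[Profit]' can be queried just like a stored measure.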
As a result, this data model also allows us to create better user interfaces for ad hoc queries by users, as we don't want users to always have to write these queries by themselves, and the objects used here are the same ones in which customers are asking their questions. This is just an example from the eazyBI business intelligence application that we are building, where we provide a graphical user interface in which users can say: okay, we want this dimension on columns, this dimension on rows, filter by these dimensions, and then view the results in a table or in charts, properly formatted. So this data model is much better for doing these ad hoc queries.

Okay, let's switch to a couple of other topics. We discussed how to do the queries, but let's come back to the ETL process. We talked about the three-letter acronyms SQL and MDX, so let's talk about another three-letter acronym, ETL, which means extract, transform, load. In the simplest cases, like the ones we looked at, we can populate our data warehouse just from the operational, transactional tables in our database, but quite often we need many different data sources for our data warehouse: some are stored in our transactional databases, some are coming from external sources as CSV files or from REST APIs. So there is this process where we extract the information from the other sources; then we transform it, parsing different data formats, maybe unifying and standardizing the data to use the same primary and foreign keys, et cetera (this is the transformation step); and finally we load it into our data warehouse. There are several Ruby tools for doing this ETL: one gem was done by Square, named simply ETL, and I want to mention one newer gem, Kiba, for doing this ETL process, which is oriented towards row-based extraction, transformation, and loading. In the example from the Kiba readme, you can make some reusable methods that do some data parsing, then you define a source as a Ruby class (the source could be CSV or a database or something like that), then you can chain several transformations, describing in this DSL how you would like to transform the data, and finally you load the data into the database.

One more thing I wanted to mention: if you do complex transformations, then unfortunately Ruby is not the fastest programming language, and if you need to process thousands or hundreds of thousands or millions of rows, it might be slow. But sometimes we still want to stick with Ruby, so maybe we should do it in parallel. Therefore I recommend taking a look at the concurrent-ruby gem, which provides several abstractions, and the one which is very well suited for multi-threaded ETL is the thread pool. With a thread pool, we create some fixed or variable size pool of threads and push jobs to it, and when a job completes it gives some result, which can then be processed by the next thread pool. This might suit the ETL process very well: we could have a data extraction thread pool, for example. If you fetch data from an external paginated REST API, it is much faster in terms of total clock time to start fetching the pages in parallel, say the first 10 pages, then the next 10 pages, instead of fetching the first page, then the next one, et cetera. Then, if we need to do complex transformations of the data, we can run these transformations in parallel threads as well. But there is one pro tip: please use JRuby, as JRuby can use all your processor cores. If you try to do it in MRI, unfortunately only one thread can run at a time, or you need to make several processes which run in parallel.

Let's look at one very simple example. Initially we looked at a single-threaded ETL process where we selected unique dates from orders and inserted them into our time dimension table; let's make it multi-threaded. First we create an insert-dates pool as a fixed thread pool with the default size of four, then we select all the unique dates, but then we push them to this thread pool, and in this thread pool we do the date insertion. But please note: if you are using multiple threads, always explicitly check out a connection from the ActiveRecord connection pool, because new threads will automatically check out new database connections, and if you don't give them back, you will run out of database connections. Finally we shut down the pool and wait for the time dimension load to finish.
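A minimal multi-threaded sketch of this time dimension load with concurrent-ruby; the TimeDimension model and the YYYYMMDD ID convention are the assumptions from earlier:

    require 'concurrent'

    pool = Concurrent::FixedThreadPool.new(4)

    Order.distinct.pluck(:order_date).each do |date|
      pool.post do
        # with_connection checks a connection out and back in when the
        # block finishes, so we don't run out of database connections
        ActiveRecord::Base.connection_pool.with_connection do
          TimeDimension.create!(
            id:      date.strftime('%Y%m%d').to_i,
            year:    date.year,
            quarter: (date.month - 1) / 3 + 1,
            month:   date.month
          )
        end
      end
    end

    pool.shutdown
    pool.wait_for_termination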
In this simple case I did a benchmark locally, and I managed to cut the total clock time for loading the data in half. But please be aware that if you start to increase the thread pool size even more, you might start to get worse results, because in this case we are still inserting all the data into the same Postgres table, and Postgres might start to do some locking and slow down the process if we try to do insertions into the same table from too many parallel threads. So please do benchmark. If you use JRuby, there are good standard Java tools for that, like VisualVM or Java Mission Control. And regarding JRuby: you don't need to rewrite your whole application in JRuby, you can use it just for your data warehouse project, where you populate the data and then do the queries.

Finally, I wanted to give a short overview of traditional versus analytical relational databases. When we work with SQL databases, most of us think of the traditional databases which are optimized for transaction processing, like MySQL or Postgres or Microsoft SQL Server or Oracle. They can deal with large tables, but they are optimized for doing small transactions: inserting, updating, and selecting small result sets. As we saw, when we try to do aggregations over millions of records, they are not the best technology for that. And there is a different set of SQL relational databases which are optimized for analytical processing. For example, one of the pioneers is the open source database MonetDB; there are several commercial databases like HP Vertica or Infobright, which also have community editions that you can use up to some significant size of data or with some limited features; and if you are using Amazon Web Services, then Amazon provides the Amazon Redshift database, which is also a SQL database, but optimized for analytical queries.

And what is the main magic trick these databases use for analytical queries? They mostly use a different data storage layout. If we look at the traditional databases, they mostly use row-based storage, which means that if we have a table and a row in that table, then physically in the database all the columns of this row are stored together in one of the file blocks. When we need, for example, to do a sum of some numeric amount, as in the sum of sales amount we saw, the database will need to read practically the whole table, because it needs to pick the sales amount from here, from here, from here, and from here, and therefore it's slow. What most of these analytical databases do instead is use columnar storage: from a logical perspective we still use them as tables with rows, but the physical storage is organized by columns. In the same example, all values of the same column are stored together, then the values of the next column are stored together. And what's the main benefit? If we now need a sum or a count of one column over all records, the values are all stored together and we can read them much more quickly. The other benefit is that, especially in these data warehouse sales fact tables, we have a lot of repeating values, for example the foreign keys, or directly stored classification values which repeat; when they are all stored together, they can be compressed much more effectively, and therefore these analytical databases also do better compression of the inserted data. But the major drawback is that individual transactions will be much slower when using columnar storage: if you insert rows one by one into these analytical databases, it will be much slower than in traditional transactional databases, and the same if you update one by one. Therefore, if you are using these columnar storage analytical databases, you typically prepare the data you would like to have there and then do a bulk import of the whole table, or a bulk import of just the changes, which will be much more efficient.
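As an illustration, the kind of aggregation used for comparing these databases can be expressed as plain SQL over the star schema (a sketch; the f_sales and dimension table names are the ones assumed earlier):

    # aggregate three measures over the whole fact table, by product family
    sql = <<~SQL
      SELECT pc.product_family,
             SUM(f.sales_amount)           AS sales_amount,
             SUM(f.sales_cost)             AS sales_cost,
             COUNT(DISTINCT f.customer_id) AS customers_count
      FROM f_sales f
        JOIN d_products p         ON p.id  = f.product_id
        JOIN d_product_classes pc ON pc.id = p.product_class_id
      GROUP BY pc.product_family
    SQL
    ActiveRecord::Base.connection.select_all(sql)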
I also made a simple test on my local machine. As I said, I had generated the sales fact table with six million rows, and I ran a query which just does the aggregation of sales amount, sales cost, and the distinct count of customer IDs over more or less six million rows, grouped by product families. On Postgres, whenever I ran it, it took approximately 18 seconds on my local machine. Then I installed HP Vertica in a virtual machine, without doing any specific optimization or configuration there. The first query run took about nine seconds, because it just needed to load and cache the data in memory, but each repeated query took just 1.5 seconds. So with exactly the same amount of data I got 10 times faster performance. In reality you probably won't get 10 times better performance all the time, but studies of real customer data quite often report some three- to five-fold improvement in query speed for these aggregate-and-group-by queries. I also did the testing on Amazon Redshift and got similar results with the same dataset.

And my very unsophisticated, unscientific recommendation on when to consider this: if you have less than a million rows in your fact tables, you probably won't see any big difference. If you get to 10 million, complex queries will become slower on Postgres or MySQL, and at 100 million you won't be able to manage these aggregation queries in realistic time. So when you have 10 million or more records in your fact table, then for analytical queries you might need to consider these analytical columnar databases.

A short recap of what we covered: problems with analytical queries using traditional approaches, dimensional modeling, star schemas, Mondrian OLAP and MDX, ETL, and analytical columnar databases. Thank you very much for your attention. You can see all these examples posted on GitHub: on my rsim profile there is a sales demo application, so you can find what I showed there, and later all my slides will be published as well. Thank you very much, and I still have some two minutes for questions. Thank you.