Good morning everyone, and welcome to the first session of the day. I will introduce a system called Grill, a unified analytics platform. Before that, a little about me: I am Amareshwari Sriramadasu. I work as an architect on the data platform team at InMobi, and I have been working on Hadoop and its ecosystem since 2007. I am an Apache Hadoop PMC member and an Apache Hive committer, and before InMobi I was with Yahoo for five years. That's my background.

Here is what I am going to cover today. I will introduce how InMobi uses analytics in its business, the problems it is facing, and why we built an analytics platform at InMobi. Then I will introduce the platform, Grill, and the OLAP model that is built into it. We will see some examples, I will spend some time on Grill's design, and then I will conclude.

InMobi is a digital advertising company that shows ads on your phone: smartphones and tablets. What it does is provide a marketplace. Out there are app developers, publishers, and website owners who own the ad space on your phone, and advertisers who want to promote their business, with the phone as the medium they want to use. InMobi provides a marketplace where it buys the real estate on your phone and serves the ads that the advertisers want to show. In between, InMobi is getting to know the user: understanding which apps the user is interested in, and more about the user, so that it can show better ads and make money for both publishers and advertisers, taking a share from both. That is the main business of InMobi.

Now let's look at the analytics use cases. There are many personas at InMobi who want to use analytics. There are advertisers, publishers, and the account managers who manage specific accounts. They want to do analytics on a UI-based platform; they don't want to write programs or run queries against a system. They just go to the UI and look at what is happening. For example, an advertiser will log in every day and see how many ads were shown and how many users he could reach in a day; a publisher wants to log in and see how many ads were served on his pages and how much he earned. On a daily basis they want to track what is happening to their business, and their account managers look at specific accounts over a day, a week, or a month. Then there are business and product analysts who analyze the business end to end: what has happened to revenue overall, or, if a new product has come out, how it is performing across the whole business, or how it is performing in one particular geography. There are data scientists at InMobi who want to understand the user: to spot trends, infer more about what is happening in the business, do forecasting, and detect anomalies. And there are the engineering systems themselves, which want feedback from what has happened.
How did we perform over the day, and what should the system do the next day? The engineering systems are built on top of how we perform ourselves; they are feedback systems, like ad relevance, which decides whether an ad is relevant to the user or not. And then there are the developers themselves, who want to go deep into what is happening in the business. So these are the broad use cases where analytics happens.

If we categorize these use cases, they fall broadly into canned, or dashboard, queries and ad hoc queries. The canned ones are where you have a few fixed fields and show the user a dashboard; he just looks at it, drills down, and does some more interactions on the page. Ad hoc queries are where a user comes and fires his own query. Both canned and ad hoc queries have interactive flavors, and both can be batch-based as well. What I mean by batch-based queries is that you are not looking at a small amount of data: if you are looking over a month of data, you can't have it in an interactive fashion. You fire the query, go away, and come back to look at the result, because a query over a month of data won't be answered in a few seconds or minutes. Then there are scheduled queries, where a user wants to run a particular query, say, every Monday to see how the performance was over the last week, or on the first of every month to see what happened over the last month. Rather than having to remember to run such queries every day, week, or month, they just schedule them, go back to their usual work, and an email arrives in their inbox saying: today's performance is this, last week's performance is this. That is what they want. And the data scientists want to infer insights through the machine learning algorithms that you will see throughout the conference.

To serve all these use cases, InMobi has many warehouses in its system: a 170 TB Hadoop-based warehouse, a 5 TB SQL-based columnar data warehouse, and a 70 TB HBase cluster. The scale at which InMobi works is billions of ad impressions served per day, and all that data is put into these warehouses for the users to query.

So why both Hadoop and SQL? This is a bigger question, and it could be a whole talk on its own, but I'll try to put it in one slide. The world has understood that we need Hadoop and SQL together; we can't live with Hadoop alone or a SQL warehouse alone. When you look at the queries coming in, some scan very little data and can be answered with sub-second latency; queries that scan a little more data might come back in a minute or so; and queries that scan a lot of data will take minutes or hours to finish. For the queries that need sub-second latency, Hadoop is not the system. Even with the improvements happening in Hadoop, it is not there yet, because Hadoop is scan-based, and a process has to be launched for each query, and the launch alone
takes a few seconds. So those queries cannot be answered by Hadoop; we need a SQL data warehouse to answer them. We have categorized such queries as canned queries: you have a fixed number of fields, you put them in a data warehouse, and you just hit it for the answer. That is where you mix your Hadoop warehouse and your SQL warehouse, and this is what exists at InMobi today.

So what are the problems InMobi is facing? There are multiple systems: one based on Hadoop, one based on the columnar data warehouse, and two or three different systems users go to to run their queries. First of all, that is confusing for them: where should I go to get an answer for my query? And even though the systems are built, each can talk to just one storage. The Hadoop-based system can talk only to HDFS, and the other can talk only to the columnar data warehouse. So a user who has gone to the columnar-warehouse system cannot get an answer to his query even though the data is available in Hadoop, because that system doesn't understand what Hadoop is, and vice versa. They want to make use of the data available there, but they can't. And then schema management becomes hard. When you have so many systems, if you have to add a new field, a new column exposed for analytics, you have to modify each system independently. For the end user to see that there is a new attribute he can query takes at least a month, for all the systems to make it available, test it end to end, and finally expose it. Moreover, there is a big SQL-on-Hadoop community building out there, and we were not making use of it: with the newer things coming up in Hadoop, interactive queries are getting answered there too.

So Grill was born to solve all these problems and provide a unified analytics platform. Let me introduce Grill here. The two goals of Grill are to provide analytics as a service, and to unify the query layer and the catalog layer. What do I mean by analytics as a service? If there is any team, any business, holding some data, they can create a schema for their data, load it, and open it up for analytics. All their queries can then come through the service APIs it provides. It is very similar to AWS: AWS provides S3, a file system as a service that you hit entirely through REST APIs. Grill wants to achieve something like that on the analytics side, where you create and update tables and fire queries through the API. The other goal is to unify the catalog and the query layer for everything we have just seen, ad hoc and canned queries, batch and interactive, behind a single interface, so that the user shouldn't have to worry that if the data is in the SQL warehouse he must write a SQL query against the warehouse, and if it is in Hadoop he must write a MapReduce program to read it.

Here is a simple architecture of Grill. In the middle we see the Grill server, which provides the endpoints for the user: REST APIs for all the queries and for the schema updates.
On top of that there can be a CLI, a Java client, and a JDBC client, or applications can talk to the REST API directly. Bigger applications or a full UI can be built on top of it, or a complicated system like the ad relevance engine we were talking about, which would issue its queries through the REST API. The Grill server is backed by something called the OLAP metastore; we will see the OLAP model available in Grill in the coming slides. It is built on top of the Hive metastore, if you have heard of it: what the Hive metastore, with HCatalog, does is catalog the data available on Hadoop, so that the data is available as tables. We are building the OLAP model on top of that. The Grill server is also backed by drivers. Each driver here is an execution engine that can talk a SQL-like language: for talking to Hadoop we use the Hive engine, for talking to a DB we use a JDBC engine, and it could be anything like Shark or Impala as well; if you have heard of those, they also accept HiveQL-like languages for doing analytics on data. Below the drivers sit the storages, which can be HDFS, HBase, a columnar data warehouse, S3, or a Redshift database. You can add a new storage as long as there is a driver that can read data from that storage; if no driver can talk to that storage, we can plug in one that can.

Before I build up the OLAP model, I want to give a typical data layout for how you store your fact data and dimension data. What is fact data? A fact is an event that has occurred. In InMobi's business, serving an ad on a phone is an event that has occurred: InMobi served an ad on a particular phone, at this time of day, at this location, on this device type, say an iPhone, and it served, say, a Nike ad. Which ad was shown, when it was shown, and how long it took to render: that is an event that occurred, and that is the data we are trying to capture in a table. How do you put that in a table? You represent it as some dimension cuts and some measurable quantities; we are calling them dimensions and measures here. Then we have all the raw data coming in. At InMobi's scale, that is billions of ads getting served, so there will be billions of entries, one for each ad served together with its associated parameters. Now if a business analyst queries this raw data directly, it is going to be huge, and so much more data arrives every day; if you hit the raw data for everything, it is going to be huge. How do we solve that? Build an aggregated fact on top of your raw data, where you drop some dimensions, and some measures as well. How do you decide which measures and dimensions to drop? By what you are interested in. There might be many parameters associated with the ad we are talking about, around 500 parameters for a single served ad, but are you interested in all of them? What is the final user actually interested in?
That is what makes up the aggregated facts: the fields of most interest are pulled up. And when you pull them up, the measurable quantities can't just be copied: if you are dropping a dimension, then the measures have to be rolled up, aggregated over the dimension you are dropping. You can build many aggregated facts this way. In InMobi's case, we have around five aggregated facts built on the raw data, and from the top of the pyramid we answer about 70% of the queries; we don't hit the raw data for everything. The other axis of the pyramid you are seeing here is time. Once you have decided to group a set of measures and dimensions together into a single aggregated fact, you can store it as a monthly, daily, or hourly roll-up, so that your data is available at monthly, daily, or hourly granularity. When a query comes, you want to answer it from the top of the pyramid and from the time granularity with the lowest cost: for example, a monthly query answered from a monthly fact will be faster than picking up 30 days of daily data for the same answer. This is the typical fact data layout you see in other data warehouses as well, like the columnar and traditional ones.

Then what is dimension data? Dimension data is something describing your entities. Take location: a location is associated with more parameters, like which city it is, what its points of interest are, whether it is crowded, how much population it has; all the things associated with an attribute, a field, a domain. You can call it domain data. This data is not fact data; it is not an event that has occurred; it describes the domain. And it is slowly changing data: the population of a city might increase over a week or a month if you watch it, but it won't be updated every day, every hour, every second. Here too we have all the attributes associated with the domain, but again, if we hit this raw data for all the queries, we are again going to incur cost. So there are cases where you build a subset of the fields available in your raw data. What we do here is just drop columns; there is no aggregation happening while dropping, you are just dropping some fields. For example, if I am not interested in population, only in the city and the interest points it is famous for, I might drop population, keep interest points at one level, and use that to answer queries. Again, when a query comes, you would rather answer it from the top of the pyramid than from the bottom.

How are the fact data and dimension data related? The typical relation is that in your fact data you do not store the actual dimension fields; you keep keys that can refer to a bigger dimension table, much like a foreign key in an RDBMS, so you can join with it and get the fields. This is the data warehouse model of storing your big data; a small sketch of it follows.
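As a concrete illustration, here is a minimal HiveQL sketch of this layout. All table and column names are hypothetical, not InMobi's actual schema, and the partition-naming scheme is an assumption:

```sql
-- Raw fact: one row per served ad, carrying keys into dimension tables
-- (the foreign-key-style relationship) rather than full dimension fields.
CREATE TABLE raw_ad_fact (
  city_id     INT,      -- key referring into the city dimension table
  device_type STRING,   -- dimension cut
  ad_id       INT,      -- fine-grained dimension cut
  render_ms   BIGINT,   -- measure: how long the ad took to render
  impressions BIGINT    -- measure
)
PARTITIONED BY (dt STRING);   -- hourly partitions, e.g. '2014-03-10-03'

-- Slowly changing dimension (domain) data describing cities.
CREATE TABLE city (
  id              INT,
  name            STRING,
  population      BIGINT,
  interest_points STRING
);

-- Aggregated fact at daily granularity: ad_id and render_ms are dropped,
-- and the surviving measure is rolled up over the dropped dimensions.
CREATE TABLE agg_ad_fact_daily (
  city_id     INT,
  device_type STRING,
  impressions BIGINT
)
PARTITIONED BY (dt STRING);   -- daily partitions, e.g. '2014-03-10'

INSERT OVERWRITE TABLE agg_ad_fact_daily PARTITION (dt = '2014-03-10')
SELECT city_id, device_type, SUM(impressions)
FROM raw_ad_fact
WHERE dt LIKE '2014-03-10-%'   -- the 24 hourly partitions of that day
GROUP BY city_id, device_type;
```

A per-city report can then join `agg_ad_fact_daily` to `city` on the key instead of scanning billions of raw rows.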
Now, if you have heard of SQL and Hadoop, the first thing that comes to mind is Hive, right? So why is Hive itself not enough? Why did we build Grill? What is Hive giving, and what is it not giving? What Hive provides: it associates structure with data that is simply present in a file system; it provides a catalog-like service called HCatalog; and it provides a pluggable storage interface through which it can talk to other systems. Its language is not exactly SQL but SQL-like, called HQL, the HiveQL, which has been adopted by other systems as well, and there is a strong Apache community sitting out there building Hive.

But what is missing? What we want to represent is the pyramids we have just seen and the snowflake relationships between our tables. What Hive provides right now is just simple tables with rows and columns. What we want to say is that the data sitting out there in your warehouse is not simple tables with rows and columns: the tables are related to each other, in that one is an aggregated fact built on top of something else, and a fact table can refer to a dimension, to domain data, through keys. There is a relationship between them, and we want to capture that relationship in your schema so that every user need not know it. And now that we have multiple storages, the traditional warehouse and Hadoop together, what we want is a logical schema: I design the schema once, and I can store the same thing on Hadoop as well as on a traditional warehouse. Also missing is a pluggable execution engine, like a DB engine that can just go talk to the database and serve the query. And there are other features: query lifecycle management, on top of which you can build things like query caching, so that if the same query comes again, I have the result ready and just serve it to you; quota management, so that when the system is shared across so many people it is not eaten up by just one person but used properly by all your stakeholders; and the ability to schedule queries, since, as we have seen, people want to fire, schedule, and forget, with the result coming to them every day, every week, or every month. We are building all these missing parts in Grill; that is what Grill adds.

Now we will see more of the OLAP model. Grill adds three basic constructs on top of Hive: something called a cube, something called a dimension, and a storage. What is a cube? A cube consists of the dimension cuts and measures that we saw in our data layout. A dimension consists of all the attributes of a domain. And a storage is a storage endpoint you have, like your DB or the Hadoop distributed file system. A cube consists of fact tables. When we call something a table here, we mean a fact table: a table with a list of columns in a declared order. For the cube itself we have not enforced any order; it is a set of fields available for you to query, so when you have to query something, you do not have to know the order in which it is stored. A hedged sketch of this mapping follows.
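To make the cube-versus-fact-table distinction concrete, here is a hedged HiveQL sketch with hypothetical names; in Grill the cube metadata (measures, attributes, default aggregates) lives in the OLAP metastore rather than in table DDL:

```sql
-- One physical fact table backing a cube. The cube exposes an unordered
-- set of queryable fields; this table fixes one column order and storage.
CREATE TABLE daily_ad_fact (
  city_id     INT,      -- reference dimension attribute -> city dimension
  device_type STRING,   -- simple dimension attribute
  impressions BIGINT,   -- measure, default aggregate SUM
  revenue     DOUBLE    -- measure, default aggregate SUM
)
PARTITIONED BY (dt STRING);   -- time dimension, here at daily granularity
```

The same cube could be backed by a second fact table with a different layout on the columnar warehouse.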
The ordered columns live inside a table, which is our fact table here, but what you expose to the user is just the cube, and you can build one layout for your HDFS and another layout for your data warehouse as fact tables of the same cube. Similarly for dimensions: what we expose to the user is the set of attributes that can be queried on the dimension data, with the dimension tables corresponding to it underneath.

So what does a cube contain? It has measures, dimension attributes, and expressions on top of them. A measure is the plain English word: anything that is measurable. You associate a name and a type with it, and a default aggregate for when it is rolled up in the pyramid, as on the earlier slide. The dimension attributes can be a simple dimension attribute, which is a name and a column; a reference attribute, which points to another dimension; or a hierarchical dimension, something like location, where you have zip code, city, state, country, and region, and you can store any one of those levels in your fact. And there is the time dimension, through which a time range is associated with your queries. Expressions are any complicated expressions you want to build on top of these; you can store them in the metastore so that not everyone needs to remember the complicated formula being calculated. Some of the concepts here are borrowed from Mondrian, if you are familiar with it. For a dimension, we again have attributes and expressions, and the attributes are again simple, referenced, or hierarchical.

If we look at the relationships: a cube refers to dimensions, and a dimension can refer to dimensions below it. A fact table belongs to a cube; a cube consists of fact tables, and a fact table can be present on multiple storages. A dimension table belongs to a dimension; a dimension can consist of multiple dimension tables, and a dimension table is present on storages. To map this data model to the earlier layout we saw: the bigger pyramid is what we call the cube, and each box in it is what we call a fact table. The cube refers to dimensions, and dimensions in turn have dimension tables. These are all part of the logical schema we have just built, and the tables here can be present on any storage; you associate storages with the tables.

Now we will see a couple of examples. Before that, what is the language we provide for the cubes we have just built? It is still SQL, almost SQL. We add the word "cube" at the front to differentiate it from other queries that go to the base tables. So it is: cube select, with the select expressions, from a table reference, with a where clause in which a time range function is used to specify the time range you are looking at, and then group by, having, order by, and limit. Let us see an example. Here the user wants to get all the city names from the city domain. Grill looks at that query and figures out that the city table is present on a storage C2, where it can pick up the C2 city table, or that it is present on a file-system storage; both rewrites are sketched below.
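Reconstructed from the spoken description, the first example and its per-storage rewrites would look roughly like this; the storage and table names follow the talk, while the exact surface syntax is an assumption:

```sql
-- User's cube query: no storage, physical table, or partition is named.
cube select name from citytable;

-- Possible rewrite for storage C2 (say, the columnar warehouse):
SELECT citytable.name FROM c2_citytable citytable;

-- Possible rewrite for storage C1 (file system): the dimension data is
-- snapshotted periodically, so only the latest snapshot partition is read.
SELECT citytable.name FROM c1_citytable citytable
WHERE citytable.dt = 'latest';
```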
Why the latest partition? If you have to store dimension data on a file system, then because it is slowly changing data, what you do is snapshot it every hour or every day. Grill figures out that it is getting snapshotted, so it says: I am picking up the C1 city table, and within it the latest partition of that city table.

We will see one more query. Here the user is asking: for each city, give me the measure msr2 from the test cube, for a two-day range, March 10th hour 3 to March 12th hour 3. Grill looks at this query and figures out it has to pick up the C2 test fact to answer it. There is a relationship between the test cube and the city dimension, so it adds an inner join with the city table. And it sees that it has to pick up a few hourly partitions and a daily partition to answer the query: the range given is two days, and in between there is one full day, so it can get the fact data for that day from a daily roll-up. For the city table itself it picks the latest data, and it adds a group by on your city name. A sketch of this rewrite follows at the end of this part.

I think I should wrap up, so I will be brief on the design part. If you have to add a new engine, you implement an interface consisting of explain, execute, and fetch-results operations. And if a new query comes to the Grill engine where you have multiple execution engines, what does Grill do? It rewrites the query for each execution engine, the rewriting we have just seen, gets the cost from each engine, and picks the execution engine with the least cost. Here is more of the Grill server design. The REST API is provided through JAX-RS, and we have authentication and authorization being built for all the services it provides, along with metrics and statistics for each service. The services are session, query, metastore, and scheduler. The query service is where you can fire your queries, track them, and list all your queries. The metastore service provides all the CRUD for your cubes, dimensions, and storages, and lets you add a storage to a fact table and add partitions to your storage tables. The session service is where you open a session and add configurations and jars, which stay available throughout your session; that is available not only in the shell but also through the REST API. And there is an event service that can talk to all the services so that they can exchange messages.

The current state of Grill: the query service, the metastore service, and metrics are available; statistics and query scheduling are in progress; and the clients and execution engines are done. What is coming next: we are looking to normalize query cost across engines and to load-balance across them, so that even if the cost on one engine is lower, if that engine is more loaded we give the query to another. We want schema alterations that percolate to all the storages a table has; authentication and authorization; a machine learning service as part of Grill, where you can build models on top of your queries; and quota management.
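Coming back to the second example, here is a rough sketch of the query and of the kind of rewrite described above. The cube, measure, and storage names follow the talk; the time-range function's signature, the split into separate hourly and daily storage tables, and the generated SQL are all assumptions for illustration:

```sql
-- User's cube query: msr2 per city over 2014-03-10 hour 03 up to
-- (but not including) 2014-03-12 hour 03, i.e. exactly two days.
cube select citytable.name, msr2 from testcube
where time_range_in(dt, '2014-03-10-03', '2014-03-12-03');

-- Possible rewrite: edge hours come from the hourly fact partitions, the
-- full day in between from the daily roll-up, joined to the latest city
-- snapshot and grouped by city name.
SELECT citytable.name, SUM(testfact.msr2)
FROM (
  SELECT cityid, msr2 FROM c2_testfact_hourly
  WHERE dt BETWEEN '2014-03-10-03' AND '2014-03-10-23'   -- 21 hours
     OR dt BETWEEN '2014-03-12-00' AND '2014-03-12-02'   --  3 hours
  UNION ALL
  SELECT cityid, msr2 FROM c2_testfact_daily
  WHERE dt = '2014-03-11'                                 -- 24 hours
) testfact
INNER JOIN c1_citytable citytable ON testfact.cityid = citytable.id
WHERE citytable.dt = 'latest'
GROUP BY citytable.name;
```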
Here are the references. Grill is already open-sourced by InMobi; it is available on GitHub at github.com/InMobi/grill. It is well documented, you can have a look at it, and the mailing list is open: you can post to it anytime. And soon we are incubating into Apache; most likely in a month we will be incubating into Apache. So if you are interested in contributing to open source, you are always welcome to contribute. Yeah, so that's what I have; we can take questions.

[Host] Ladies and gentlemen, do you have any questions? It's time for questions. If you have a question, put your hand in the air and we'll have someone come hand a mic to you. We'll go with you, sir.

[Audience] Yeah, you have talked about cubes. You could have explained what a cube is in terms of shifting the dimensions: if you have time, location, and category on the X, Y, and Z axes, when you turn the cube around it will be the opposite; instead of time, location, and category it will be category, location, and time. That is the concept of a multidimensional cube. I wrote a textbook on this. Many OLAP servers, like Microsoft SQL Server's OLAP server, take care of all this query processing of OLAP queries on the regular SQL Server database.

[Speaker] Yeah.

[Audience] And there are things like Big SQL, where Postgres SQL queries are converted into Hadoop big-data queries and back. When that happens, the effect is similar to what you are describing in Grill. I mean, things are already happening in that direction.

[Speaker] There are proprietary solutions, but we haven't found an open-source solution that brings both of them together.

[Audience] There's a thing called Big SQL; I don't know if you have heard about it.

[Speaker] Big SQL doesn't talk to a traditional data warehouse. Here we are talking about unifying both the traditional warehouse and Hadoop.

[Audience] So, in other words, you are converting the Hadoop data?

[Speaker] We're not converting; we're keeping them separate and processing where the data lives.

[Host] We're going to hand the mic to this side, please.

[Audience] What's your comment on PrestoDB? Would it have solved the problem, or what were the limitations?

[Speaker] Pardon?

[Audience] PrestoDB. PrestoDB is doing something similar to what Grill is trying to achieve. Would it have solved the problem?

[Speaker] I'll have to check what PrestoDB is doing, but what Grill wants to achieve is to give data warehouse features on top of Hadoop data as well, and to have a single unified interface for both your DB and Hadoop queries.

[Host] Okay, we have a question to the right and then forward here. You can hand the mic there, please.

[Audience] Hi. You talked about the dimension and fact tables. Can you give any insight into the levels of dimensions you are working with, for ad hoc queries as well as pre-aggregated queries?

[Speaker] At InMobi we have around 125 dimension tables of domain data, sitting both in the Hadoop warehouse and in the columnar data warehouse, if that is what you are asking.

[Audience] Yeah, yeah. So what's the maximum and minimum cardinality you are playing with?

[Speaker] There are things like the creatives that keep getting added, which are at a huge scale, with millions of rows on the dimension side.

[Audience] Millions?

[Speaker] Yeah.

[Audience] Okay. Sure. Thanks.

[Host] The question goes here.

[Audience] So, if I understand correctly, when you have a schema, you are pre-deciding the dimension cuts and the facts, right?
So why not, when the data is being written, stream it using Spark, store your dimensions anyway, and use Impala to query? I don't get why you need this whole orchestration of things.

[Speaker] Okay. So suppose you are looking over a month's data: I want to analyze over a month of data, or a quarter of data, right?

[Audience] You could go the Cassandra way also, right?

[Speaker] Impala or Cassandra won't solve your batch queries, right? You would have to have a system like MapReduce to run those queries.

[Audience] No, it doesn't.

[Speaker] Impala has its own execution engine, which won't go well with batch queries.

[Audience] Okay. What is the shortfall there? What happens? Why?

[Speaker] Impala is built to solve the interactive-queries part, not the batch queries. So Impala can be one of the drivers we have, answering your interactive queries. Grill wants to unify your system: it is a single solution for both your interactive queries and the pre-decided queries that you want to run in batches with reporting on top. It wants to unify your interactive, batch, dashboard, and scheduled queries.

[Audience] So who is the target audience for Grill? Who uses Grill, per se?

[Speaker] Grill is used by everyone who wants to do analytics: the personas we saw in the beginning, the business analysts, the product analysts, the data scientists, and the advertisers and publishers themselves.

[Host] I'm going to have to stop you there. We're going to go to the person at the back with the question. This is going to be the last question. Can you keep your question as concise and to the point as possible? Thank you.

[Audience] Hi. I just want to understand how real-time your framework is. How quickly can you integrate events into this framework?

[Speaker] We are working on a Spark driver as well here, which would give it the real-time notion. From the user's point of view it is still SQL that is spoken, but it can be run with a Spark SQL driver that we would build, which can run in a Spark cluster environment and answer the real-time queries.

[Audience] Suppose I take a real-time scenario: I want to track the number of times I wait in traffic outside. Can we get that real-time in this framework, or is that a restriction?

[Speaker] Sure. Grill actually doesn't do much heavy lifting; it doesn't have an execution engine of its own. It provides a single interface where you can plug in more engines and plug in more storages. So as long as it can fit into that paradigm, then yes.

[Audience] Do you have a number?

[Speaker] We don't have that engine yet, if that's what you are asking about. But yeah.

[Host] Okay. Ladies and gentlemen, that's all the time we have for questions. You can catch up with the speaker once you're done with the session; she will be around for a while. Can we please put our hands together for her? Thank you so much, madam.