The first part of this talk will be about the evolution of what we did at our company and where we are now. The second part will be about a particular system, which we call LODA. For the third part I want to keep about fifteen minutes for a demo of the system, because the demo shows what is happening better than slides can, and I will take questions towards the end.

Talking to people across companies, I noticed that most of us who start out with large-scale data processing go through much the same phases, and we thought it would be good to share our experience: what mistakes we made, what worked for us and what did not, so that you can avoid repeating those mistakes at your own expense, just as we learned from talking to others. This is not about comparing the merits and demerits of everything available out there; it is simply what worked on the ground for us and what did not.

To give a little history of the problem: I know you have seen plenty of big-data numbers since yesterday, but this is what we handle every day, about 3 billion records coming into the system. These records come from ad serving; we are in the mobile ad-serving business, so when we get a request for an ad and serve it, we log it; when a click happens, we log that; and there are several downstream events, such as a download resulting from the click, which we log as well. It comes to roughly 2.5 terabytes of uncompressed JSON a day. Treat that number with a pinch of salt, since not everything in it is essential input; there is a lot of bulk, like the user agent string of the device and so on. Still, it is a lot of data if you want to do anything meaningful with it: about 100 primary dimensions and 300 derived dimensions.

Whenever you are in the business of analysing data, that second point, the number of dimensions and measures, is very important. The raw record count and the size of the data do not mean much on their own. If all you are doing is logging for, say, a regulatory purpose, you can just log it and archive it. The key point is that we want to query the data very frequently over that many dimensions and measures, and that is what makes organising it hard. Another very important aspect is how long an analysis takes. If you do a proper daily roll-up, 3 billion records aggregate down to a few million, probably 10 to 20 million at most, and we could analyse that quite easily in an ordinary database. But if you have an ad hoc idea of your own and want to follow it day by day across the entire raw data, the problem suddenly explodes and becomes close to intractable.

To continue on that point: data sizes and record counts by themselves are not the complete picture, and what you do with the data matters even more. If you are in medical sciences or computational biology, even one gigabyte of data can mean hours of crunching to extract meaning from
all of it, whereas in a reporting domain like the one this talk is about, one gigabyte is not much; you could probably handle it with a standalone Java application. That is one extreme.

The data we get becomes harder to handle for a few reasons. The first is a very frequently changing, dynamic data model. When we started off we were serving feature phones, the kind of phones where you download ringtones and wallpapers, and that was the business model. The data we log is a reflection of the business we are in: as the business becomes more complicated, the data you log and the way you model it become more complicated too. We started two or three years back serving ringtones and wallpapers; slowly the smartphone business crept in, the iPhone arrived and became popular, and suddenly you could serve very different kinds of ads, from bigger banner ads to somewhat interactive ones. Those force you to capture very different aspects of the advertisement that you then want to analyse; I will come back towards the end of the presentation and show what kinds of things that makes possible.

Second, the query patterns are not fixed. Within those 300 or 400 dimensions you do not know what a user or an analyst inside the company will want to query. Today impressions in America went down: is that a worldwide phenomenon, or is it specific to America? If it is just America, is it a particular state, a particular segment of users, some operating system? Or did one of the big advertisers, who was spending hundreds of thousands, suddenly go away? You do not know what you are looking for. The reason I dwell on this is that if the query patterns were known, you would create some views over the data and store them in a very efficient, query-friendly data store. It could be as simple as MySQL, or a columnar store like Infobright, or even something like Cognos, anything where you have some sense of what the queries are going to look like. But you cannot build views for 300 dimensions.

Third, we want both canned and ad hoc reports, and the distinction between the two is interesting. A canned report is something a user, or a large segment of users in the company, wants to see day after day without change. A campaign manager wants to look at his accounts every morning and see how they are doing, whether he should go back and ask for more, or whether something is wrong. An ad hoc report is when you do not know exactly what you are after; it is an exploratory process. What we have noticed is that, very frequently, an analysis starts out ad hoc, people iterate on it over a few days until it makes sense, and then they crystallise it and materialise it: I want to see this report every day, I do not want to keep rebuilding it. So there is no clear line between the two:
what we can call canned we definitely know; what ad hoc will look like, we never know in advance.

Then, as I said, we have phase-shifted data. By phase-shifted I mean that a request can come in this hour, say the twelfth hour of the day, a click on it can come half an hour later, and an app download resulting from that click maybe two hours after that. So data arrives at different points in time, but it is all part of the history of the same event, the original ad request; what happened to that event is shifted in time. If you want an analysis that combines all of those events into a coherent history, a trace of the event, you have to bring them back together.

Lastly, the problem is different because we have many different kinds of users for the same system. I should not leave out the developers: they want data for their own reasons, whether for modelling their forecasting engine, for predicting inventory, or whatever else. There are sales and operations people doing operational analytics: why is my account not running well, and what do I do to make it run better? There are analysts doing company- or organisation-level analysis, executives who want a dashboard kind of view every morning, and there are other machines, because you can feed the output back into systems that query it directly. So there is a wide spectrum of users for this kind of system.

A little about where we started. Back in 2010, and before that, we used to get all the data as logs recorded on the serving machines, and there was a Perl script that ran every hour, crunched the data and put it into a database; a user interface then queried the database for the reports. It was only canned reports, with no ad hoc, so you knew which reports you wanted to show the users, and a good database did the job. It was a perfectly valid model at the time. We had just entered the business, and the dimensions were few: country, publisher, advertiser. Operating system was not even a very meaningful dimension, because all the Nokia or LG feature phones ran the same OS or some variant of it. There were not many dimensions or measures; all you wanted to know was how many requests came in, how many impressions were served, and how many clicks there were. The problem was simple, and the volume at first was on the order of 100,000 events per day.

Then the volume grew into the millions, and that Perl processing would just keel over and die; it could not keep up. We had good developers, and they said: Hadoop looks pretty stable, everyone is jumping on it, why don't we try it? So we set up a small cluster of three machines in a data centre somewhere in India. A transfer service went and collected logs from all the serving machines and brought them onto the cluster, and a small MapReduce job then ran on the cluster and processed them. It took longer, because the cluster was small, but at least it would not fall over. It was still manageable, and things ran fine for a while.
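Just to give a feel for what those early jobs looked like, here is a minimal sketch of that kind of hourly roll-up written as a Hadoop MapReduce job. It is not our actual pipeline: the tab-separated log layout, the field positions, the paths and the class names are all assumptions made up for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Hourly roll-up: count events per (country, event type). */
public class HourlyAggregation {

  public static class AggMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Hypothetical log layout: country \t eventType \t ...
      String[] fields = line.toString().split("\t");
      if (fields.length < 2) return;              // skip malformed lines
      outKey.set(fields[0] + "\t" + fields[1]);   // composite key: country + event type
      ctx.write(outKey, ONE);
    }
  }

  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable c : counts) total += c.get();
      ctx.write(key, new LongWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "hourly-aggregation");
    job.setJarByClass(HourlyAggregation.class);
    job.setMapperClass(AggMapper.class);
    job.setCombinerClass(SumReducer.class);   // pre-aggregate map-side to shrink the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. that hour's log directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Run once an hour over that hour's log directory, a job like this produces the per-dimension counts that an hourly loader could then push into the reporting database; the combiner is what keeps the shuffle volume tolerable even on a three-machine cluster.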
But then things changed as the business kept growing. People said: I want more views of the data, two views are not enough. So the developers had to write more jobs to populate those views, and every time a new requirement came in they had to write a new MapReduce pipeline that would crunch the logs and populate the views. This was hard, time-consuming and error-prone; it is not easy to get a MapReduce job right if you do not write them all the time. So we said: writing raw MapReduce for every pipeline is not working for us.

Ad hoc, at this stage, was still handled by creating more views. If you had a requirement for these three dimensions and these two measures, fine, we created another materialised view for it and concentrated on getting those views right. We experimented with Pig, and Pig worked out well for us: you could quickly write a new job that would populate your database views, the data got aggregated with Pig on an hourly basis and posted to the database for analytics, and everyone was happy. One caveat, if anyone is taking this as advice: it works fine as long as your jobs boil down to a group-by, a sum and a few filters. If you are doing anything more complicated (I can give a concrete example offline), anything that warrants multiple MapReduce phases, it turns out to be expensive. A custom MapReduce job gives you much more control, and you can often compress several of those phases into one; for these heavy jobs the I/O is the biggest cost, so reducing the number of stages in the pipeline is the single biggest win in almost every case.

Then things changed again: enter tablets, enter Android with its thousand different versions, and the analytics just gets complicated. You add formats, you introduce rich media ads, and the queries become things like: how many users watched this video from one point to another, and how many watched at least half of it? If it is an interactive car ad, you want analytics on how people actually played with it, whether they used the steering a lot or tried the other features. You can collect all kinds of things from the ad itself; it is like building a heat map over an ad, showing which parts of the ad are interesting to the user, not just which ad is interesting to the user. So the analytics gets much more complicated. And, as I said earlier, the database still suffers from the limited-view problem: if you have 300 dimensions, creating a hundred three-dimensional views is nowhere near enough.

So we tried different approaches. At this point we took on Hive; that was about late 2010, early 2011, and it was a big surprise and a big disappointment, because it did not work out as we expected. I cannot comment on exactly where Hive stands right now, but I would be very surprised if it were drastically different. Think of your infrastructure stack: you have HDFS at the bottom, then Hadoop MapReduce built on top of it, then Pig and Hive and those kinds of things, and at the top you have distributions that package everything
together and hand it to you. What we have noticed from our experience is that things at the bottom of the stack are pretty stable. If you are talking about HDFS or MapReduce from Hadoop, they are stable; you do not need to muck around with them too much to get them working. To get a good grid running, if you are maintaining a grid of machines on your own, takes a little effort to tune the parameters, but it is not that bad. As you move up the stack, though, the quality of the software available out there drops off sharply. Take Pig: I would say it sits below Hive in terms of the abstraction it provides. It gives you more elementary operations (load this table, run a group-by on it, do some filtering), so it is pretty stable, it does the work, and you can still exercise a fair amount of control while getting the work done. When you come to Hive, there are many more assumptions baked in: how a query is going to be planned, which operations are going to be optimised, and to what level. Systems like this, we have found, are not written so that one size fits all; I do not know whether that is even possible. Hive works very well at Facebook, where it was conceived, along the lines of Google's internal implementation, which they have not open-sourced; outside the environment where such a system was conceived, there are a lot of challenges.

In our case the particular challenge was that the grid was small: when we tried Hive we had about 20 to 25 machines. And I am not picking on Hive as a bad system; I am just sharing what we tried and what we faced. It is hungry for resources and it spawns an inordinate number of jobs. It does not try to reduce the data on the map side as much as it could, so a lot of data moves around, and if you have a low-bandwidth setup, by which I mean something like 100-megabit or gigabit connectivity between your machines, it does not scale well. The inordinate number of jobs is what I mentioned: for a simple query it would spawn five jobs to crunch, say, 100 gigabytes of data, and that same 100 gigabytes would be read and written in a sequence of five operations, which is very bad for performance. And since one query runs as multiple jobs, it takes all your resources for a long time and you cannot run any other query meanwhile. If you had a cluster of 5,000 machines or more, it would run in some corner of the cluster and you would not even notice the delays I am talking about; on a small, bandwidth-constrained cluster you feel every one of them.

The second problem we faced was the immensely large code base and the very limited expertise available within our organisation to get it right. At the time it was something like 120,000 to 150,000 lines of code, and as with any open-source system of that size, it is really hard to just go in and change or fix anything. If you attended the talk by Kulkarni from Google yesterday, he brought up the same point: why do people still use something like Google's BigQuery interface even though it is limited? Because it takes far less time. It takes about six months to get a Hadoop-based stack right and start getting value out of it, and
by the time you have got it right, if there has been a big change in your data model, you have to start all over again. We worked on Hive for about three to four months, with a team of three or four very bright engineers, but it was just turning out to be too hard a problem for us, so we went back to Pig. We said: we cannot spend forever on this; we will just go back to Pig.

That is workable if your setup is five or ten engineers who take requests from the business, write a Pig script, and hand a report back. But, as you can see, there are multiple problems with it. First, there is a delegation of requests: someone files a request that comes to the data team, saying this report needs to be produced; it gets assigned to a developer; he or she writes a Pig script; and in many cases, very frequently, what goes back to the business is not what they wanted, so there is a lot of back and forth. We saw roughly a three-to-four-day turnaround for every query, and about one developer-day consumed per query on average. The second problem is that the performance of anything like Pig is only as good as the person writing the script. If the author is not very familiar with the data model, or does not know Pig that well, one script can bring down the entire cluster, and a small cluster is easy to kill. We had plenty of these incidents: someone wrote a bad Pig query, the whole cluster came down, and we had to get up in the middle of the night and reboot the machines.

So what lessons did we take away at this point? Frequently, the tools available out there do not work as-is; they require a lot of tweaking, constant tuning and a lot of maintenance. It is very difficult for them to absorb the dynamics of your data: if your data changes very frequently, you have to be very careful that the system, or the way you are using it, can absorb the changes and variations in the business. You cannot afford to spend six months standing up a system and then, when the business changes, find that your data model and everything built on it no longer fits and you have to start the discussion all over again. These systems are also built to be generic: they are not meant for a particular person, organisation or need. They are developed the way databases or operating systems are developed, where you make one version, give it to everyone, and everyone uses it. Unfortunately, in this game that kind of generality does not quite work; you have to be somewhat specific to your own needs, and you cannot simply take a generic shared system and expect it to fit. You may manage for a while, but in most cases, and in most conversations I have had with people, it is very difficult, almost impossible, to get it right that way.

The other part, which many of you will have noticed, is that there is no coherent end-to-end system available. There are parts of the stack developed by different people and different organisations. One could argue that this is the beauty of open source, that everything comes from everywhere, but when you try to put them all together it is a very hard task. Think about the problem: you would have to take a data-transfer layer from somewhere, one of the log-collection tools out there; then a scheduler, which could be Oozie or Azkaban
or something else; then an ETL layer, which could be Pig or custom MapReduce jobs; some data storage format; Hive to run the queries; something on top of that to build the queries; then Crystal Reports or the like to visualise the data; and so on. As you can see, that is already a lot of moving pieces. You could go with one of the integrated suites out there, which we tried and are still experimenting with, but even there things are not as quick as they appear on the face of it; it takes a lot of effort and sometimes it simply does not work. And the last thing, which is obvious and hardly even counts as a realisation: when the system is meant to be used by the business, if you use Pig, or rather overfit Pig, for all the requirements, both the pipelines and the ad hoc requests, it is bad, because it consumes a lot of developer time that would be useful elsewhere.

Just to give a small example of the complexity of reports I was talking about (I will come to this tool a little later), what you see here are the dimensions one person requested. I could not get them to fit on one screen, so I took three screenshots of the report. There are about 36 dimensions here and about ten measures; this is one screen, it continues onto a second screen, and it continues onto a third. So you have a huge report, and this is just one that I noted down this morning; there are reports even bigger than this. How do you even start, whether by building database views or with specialised, million-dollar commercial software? How do you get a report with 37 or 38 dimensions and that many measures to work? Forget answering this query in an hour or two; most systems will not be able to answer it at all if they do not know a priori what the requirement, or the shape of the report, is going to be.

Okay, I was saving the last fifteen minutes for the demo, so I will quickly run through what we built and then let the product speak for itself. We developed LODA, which you can think of as sitting in the same place in the stack as Hive. It does many things similar to Hive, and we borrowed from Hive wherever it did not make sense to reimplement things, such as the RCFile format for storing records. It is an in-house query system for fully ad hoc analysis, and it is a complete stack, not just one piece of an end-to-end stack: it has the query processors, a query builder so you can assemble a query visually, and a visualisation layer on top, all integrated. It supports the usual operators, far fewer than a fully SQL-compliant database but most of the important ones: selects, group-bys, distinct expressions, complex filter conditions and user-defined functions. Where it differs is that it is very heavily optimised, for both storage and queries, around the data model that lies underneath it; it exploits that data model in every way possible to optimise the queries, and I will give some examples of that.
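The storage point deserves a small illustration. The toy sketch below is plain Java with a made-up layout, not LODA's actual format; it only shows the general idea behind column-oriented formats such as RCFile: when a table has a few hundred dimensions but a query touches only a couple of them, a columnar layout lets you read just those columns instead of reconstructing every full record.

```java
import java.util.List;
import java.util.Map;

/** Toy comparison of row-oriented and column-oriented reads. */
public class ColumnarSketch {

  // Row layout: the query has to materialise every field of every record,
  // even though it only cares about two of the few hundred columns.
  static long clicksForIndiaRowWise(List<Map<String, Object>> rows) {
    long clicks = 0;
    for (Map<String, Object> row : rows) {
      if ("IN".equals(row.get("country"))) {
        clicks += (Long) row.get("clicks");
      }
    }
    return clicks;
  }

  // Column layout: only the "country" and "clicks" columns are ever read;
  // the remaining dimension columns can stay on disk untouched.
  static long clicksForIndiaColumnWise(String[] countryColumn, long[] clicksColumn) {
    long clicks = 0;
    for (int i = 0; i < countryColumn.length; i++) {
      if ("IN".equals(countryColumn[i])) clicks += clicksColumn[i];
    }
    return clicks;
  }
}
```

The same reasoning lies behind the filter pushdown described a little later: if the country column alone rules a record out, the rest of that record never needs to be reconstructed at all.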
All the facts are time-shifted facts, as I described earlier, and alongside them we have about 100 different metadata tables holding ID-to-name mappings and similar attributes. They are all brought together into a single combined view and presented to the user, so the user does not have to worry about which table a query should go to or which joins are needed; that will be clear in the demo, and it is very simple. It is also exposed as a platform: there is a UI as well as an API interface, so it can be plugged into other systems, and other systems can query it directly and consume the results.

Instead of going through the whole design, I have a slide on the life of a query: I will walk through the stages a query goes through when it arrives, how it gets rewritten, and what finally happens. We use protocol buffers from Google extensively, both to store the data and for communication between components. It is a very elegant piece of technology and I encourage you to take a look at it; it seems like a tiny thing in the big picture, but it solves a lot of problems. So the UI generates a request and transmits it to the server in JSON format. The first stage is where we do some optimisation, and the first optimisation concerns whatever the query needs from the metadata tables. In any system like this there are two kinds of data: fact streams, which keep arriving from the log servers where the actual action happens, and slowly changing metadata, where you keep mappings from small IDs to longer names and other attributes. When you query on dimensional attributes that come from metadata, that normally requires a join, and joins are expensive; so if we can cut the joins out very early in the query's life cycle, it helps a lot. We do metadata-to-fact promotion by converting those group-bys into decodes. If the query says group by country name, that would require a join from country ID to country name; instead I can do a decode, which is like an if-then clause in SQL where you say: if the value equals 2, emit India. If you convert the group-bys into decodes, you no longer need a join against the metadata, and we do the same for the filter conditions.
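To make that rewrite concrete, here is a minimal sketch of the decode idea, with made-up IDs, names and method names; it illustrates the principle only, not LODA's actual planner.

```java
import java.util.HashMap;
import java.util.Map;

/** Metadata-to-fact promotion: replace a metadata join with an inline decode. */
public class DecodeSketch {

  // Without the rewrite, "GROUP BY country_name" forces a join such as
  //   facts JOIN country_metadata ON facts.country_id = country_metadata.id
  // With the rewrite, the tiny id-to-name mapping travels with the query plan,
  // roughly like SQL's DECODE(country_id, 1, 'USA', 2, 'India', ...).
  private final Map<Integer, String> countryDecode = new HashMap<>();

  DecodeSketch() {
    countryDecode.put(1, "USA");    // hypothetical ids and names
    countryDecode.put(2, "India");
  }

  /** Called per fact row in the mapper; no metadata join is needed any more. */
  String groupKeyFor(int countryId) {
    return countryDecode.getOrDefault(countryId, "UNKNOWN");
  }
}
```

The same trick applies to filters: a condition like country name equals India can be promoted to country ID equals 2 before the facts are ever scanned.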
Once those optimisations are done, we automatically select a cube. There are multiple data cubes lying underneath, all stored as flat files (I will not go into that here), and instead of the user specifying which cube to use, we pick the cube, at the appropriate level of granularity or aggregation, that can answer the query in the least time. After that we build what we call join chains, which support the metadata-based dimensional queries, and we also estimate the cost of the query and assign it a priority before submitting it to the job tracker. This keeps small queries and large queries segregated, so a large query cannot walk in, take over the entire system and make everyone else wait. It is a very important piece: without it, one user could affect the whole system for everybody. Once that is done, we reformulate the query a little to remove redundancies; if you have asked for a particular measure such as valid requests in four columns, once on its own, once as part of fill rate, and once inside some other formula of yours, we compute it once rather than repeatedly.

Then, in the mapper, there are a number of optimisations. One is filter pushdown: as we reconstruct records from disk we keep applying the filters, and as soon as we can tell that a record cannot pass, we abandon the reconstruction of that record. That lets us bring much less data off the disk. After that we aggregate in the mapper, which is also very important. We do a lot of the aggregation on the map side instead of waiting until the reducer, which means far less data is transferred from mappers to reducers; that matters on a system without a lot of network bandwidth. The reducer itself is fairly simple: it applies the formulas the user has asked for (fill rate, CTR and so on), applies the HAVING clause if the query has one, and does the ordering if the user has asked for, say, only the top seven countries. Once that is done, the result is dumped into a CSV file, a link is sent to the user, and it also becomes visible in the UI, which you will see in a minute.

Two slides on what, out of all this, really worked for us. The first is that we concentrated very hard on the efficiency of the system and the modelling of the data. If the data is modelled correctly, the system can cope with the dynamism in the data: it can take a lot of new feeds coming in without you having to stop the system or redo anything. The system we have absorbs new feeds as a matter of course, and nobody even notices that anything changed. Second, we did a lot of work on optimising joins. Joins are the killer in any data-warehousing system built over MapReduce: every join needs its own map-reduce phase, and if you have many of them, as many of the tools out there do, the time increases drastically. So we optimised map-side joins very heavily: we load only what is required into memory, filtered by several different criteria, horizontally, vertically, and also by whatever the user has supplied, and that is what makes loading it feasible at all. If you are familiar with Hive's map-side joins, you know they apply only to small tables. Small is not precisely defined; it varies from machine to machine and environment to environment, but the documentation talks about tables of around 10 or 20 megabytes, under 100,000 rows or so, and beyond that it falls back to a reduce-side join. With these optimisations we can handle dimension tables of four or five million rows. We do not load all five million rows, because we filter aggressively, but the table itself can be that big, which is a very different scale. The other technique is that when you are joining fact tables with dimensions and you have a billion fact rows, you do not need to perform the join every time: you can pre-compute those joins, keep them in memory, and reuse the pre-joins.
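Here is a rough sketch of that filtered map-side join, using the creative-metadata example that also comes up in the questions later. Everything in it is hypothetical (the Creative fields, the date-based filter and the method names); it is only meant to show the shape of the technique: load just the dimension rows that can possibly match the query, then enrich fact rows with a hash lookup instead of paying for a reduce-side join.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Filtered map-side join: keep only the joinable dimension rows in memory. */
public class MapSideJoinSketch {

  record Creative(long id, String name, String adType, LocalDate createdOn) {}

  private final Map<Long, Creative> creativesById = new HashMap<>();

  /** Load only creatives that can appear in the queried window (hypothetical filter). */
  MapSideJoinSketch(List<Creative> allCreatives, LocalDate windowEnd) {
    for (Creative c : allCreatives) {
      // A creative created after the end of the window cannot have served any
      // impression inside it, so it never needs to be loaded at all.
      if (!c.createdOn().isAfter(windowEnd)) {
        creativesById.put(c.id(), c);
      }
    }
  }

  /** Per fact row, in the mapper: enrich with creative attributes via a hash lookup. */
  String enrich(long creativeId, long impressions) {
    Creative c = creativesById.get(creativeId);
    String name = (c == null) ? "UNKNOWN" : c.name();
    String type = (c == null) ? "UNKNOWN" : c.adType();
    return name + "\t" + type + "\t" + impressions;
  }
}
```

In the real system the filtering is more aggressive than a single date check, with horizontal, vertical and user-supplied criteria all applied, but the effect is the same: a dimension table of millions of rows shrinks to something that fits comfortably in a mapper's memory.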
The last point is simplicity, and then we will get to the demo. If the system is not simple, it will not get used; people will not use it, and it becomes just another technical showcase that gets built and then sits there. It should be very intuitive when you are building a query, and the analytics on top of it should be good too. We stayed at the stage of handing people flat files for some time, but we realised that just dumping data on people is not enough; you need something richer than that. And, as I said at the beginning, the line between ad hoc and canned queries is very blurry, so you should be able to support scheduled ad hoc queries: once a user decides that an ad hoc query is not ad hoc any more, they can simply schedule it. That is all for the slides; I will jump into the demo, and please ask questions as I go.

The first question is about the aggregation ratio on the feeds I showed: what is the input size versus the output size? We have multiple cubes. The cubes we anticipate will be used very frequently are aggregated heavily, so they come down from roughly three billion raw records to a few million; then we have increasingly detailed cubes where the aggregation ratio is not as good. Approximately, the largest cube we have created goes from about three billion records down to maybe fifty million. On response times: the fastest we can get is not less than about 45 seconds, which is simply the time it takes to spin up the MapReduce machinery. Bad queries can be genuinely bad; they can run for half an hour, and a few have gone on for hours. But 90 per cent of the time queries come back within two minutes.

The next question is whether it is hard to generalise at that scale: can I give examples of things a generic system would not know about and therefore could not optimise, whereas we can? Take loading the metadata into memory. In a generic system you can partition it horizontally and vertically, but you still do not know what else you could leave out, and that is exactly why it is hard to optimise there. Take the example where we are connecting events back to creatives: what was the creative name, which advertisement and which advertisement type did this impression belong to, and I want to do it for the last one month of data. If I store in the metadata the creation date of each creative, I can afford not to load into memory any creative that lies outside that one-month window. That is what gives us confidence that the system can keep handling more and more data: we bring into memory only what is required and leave the rest out. That was just one example.

On whether we will open-source it: yes, we are actually in talks to do it. There are a couple of considerations. First, as I just said, the system is customised to a data model; we do not want to put out something that makes sense only to us and not to others, so we have to pull that out of the system and expose it as an interface, so that if you want these capabilities you get your data into this format and can then use them. Second, we are short of people even for our own work, the usual situation, and making a system open-source-ready, documenting it and making it usable by others takes real effort. There is also the question of how to open-source it: we could release the system as it is, or we could break it apart. There are many pieces, such as the automatic cube selection, the component that manages the catalog, the query operators and the data layouts, that could be given back to existing projects,
so we could break the system apart and contribute the pieces to multiple projects, so that users do not have to learn yet another system; or the easier way would be just to clean it up and release it whole. We are still thinking about the best way to do it.

On whether the data stays in JSON: no, the data does not live in the system as JSON. It arrives in that form, but JSON is a very temporary, one-hop format for us; after that we store everything as protocol buffers, and we do not store anything in JSON. On how we estimate the cost of a query: basically there are three parts to it, namely how many dimensions and measures you have requested, the complexity of the cube the query will go to, and how long a time range you have queried. Some mixture of those three attributes tells you what the cost of the query is.

So, the demo. This is the front page when the user logs in. This is an empty canvas, and here you have a list of all the feeds, glued together; there is no concept of which cube you want to go to. All you do is say, for example, that you want to see the number of clicks. These are dummy data, for representation purposes only. One piece of feedback we got is that when you are building a query you often realise only later that it is not what you wanted, and if you present a visual structure of the query to people it becomes much easier to get it right. So I select, say, two measures; I group by country and by timestamp, and from the timestamp I take the date level; and I can drag filters in here, so if I drop country here I can add one or more countries and multiple filter conditions, and then I say run. At this point you can do three things: run it, schedule it for periodic runs, or just save it as a template query for later. Once you run it, you select the date range, the duration of data you want to run over, and submit.

Now, coming to the running queries, I have run quite a few here. This query is currently running, submitted by someone else; it is a bad query, or rather not bad, just long. Mine has just been submitted, so let us see how long it takes. I have prepared a query to show you; the data has been normalised and partly fabricated, so do not read too much into the actual numbers, but look at the query times: one minute, one minute seventeen, two minutes, under a minute, two minutes, four minutes, 44 seconds, 57 seconds. In general the query times are quite fast, and for the queries that matter to people they are very fast. Let me pull up one query I ran earlier in the day. For a finished query you can either download the result (you can see my mouse here) or view it; those are the two options, and we are just getting started with visual analytics on top of the data. The first tool simply pulls the data out of the report and shows it to you: you can filter and sort and so on, but as you can see there is so much data that it is hard to make sense of it, and you would probably end up downloading it into Excel to do anything more with it. To overcome that we have another tool, which shows the same data as charts, and this should be the more interesting part.
What I requested was 31 days of data, for 16 countries, 7 operating systems and a few other dimensions, with three measures. The first thing I want to see is how the business is doing over time, on a line chart. You can see very quickly how it has been doing; there is a drop here, and we want to know why. I keep this chart (the whole thing slides back so I can return to it) and do a drill-down: I want to see the same picture by country, to know whether one particular country is responsible or whether every country dropped. So I draw it again. This red line here is India; this is July of last year, on restricted data, and India suddenly dropped around the 10th of July and virtually became non-existent. This other line is Thailand, and Thailand dropped as well. So the drop in impressions I saw on the first chart came from India and Thailand.

Before I go further, I want to see the relative distribution across these countries for impressions and clicks, so let me take clicks and switch to a pie chart. This is again interesting: Thailand has the largest share of impressions and clicks, so Thailand is the important country and I will probably want to examine it further; India comes second, and we also want to see why it is there. This view is even more interesting: the inner pie is clicks and the outer pie is impressions, so India is giving many more clicks on average relative to its impressions, whereas this country, Indonesia, gives a lot of impressions (the green slice) but relatively few clicks on the inner pie.

That was a side observation; I still want to see why the drop happened across my business. I select Thailand here and drill down, so now all my charting is restricted to Thailand, and I check how the different operating systems contribute to impressions and what happened to them. This gives a much better picture: Symbian OS contributes a large share of the impressions in Thailand, then Nokia. If I want to see how they behaved over time, I select time and operating system, and now it becomes much clearer: Symbian OS suddenly dropped and never really picked up again, and this other bucket, the devices we do not recognise and label as others, dropped as well; those two drops are what pulled the impressions down. These are the kinds of things you can do. One last thing: to repeat this entire analysis for India instead of Thailand, all you need to do is select India here, and you can walk back up and through the same steps just as easily. So you can do a lot with it, and ultimately you can export everything: export the charts, embed them in a PowerPoint, send them out to customers, whatever you like. Thank you.