My name is Vinayak Hegde, and I am VP of Engineering at Helpshift. Prior to this I worked at InMobi, where I built the first few generations of the data stack that Amrishwari just talked about. I contributed to the data warehouse as well as the first Hadoop installation, which I think was one of the first in India and probably one of the biggest at the time, perhaps alongside Flipkart's. Before that I was doing analytics at Akamai, a content delivery network, where we built an analytics stack and a complete product line out of it; I was the architect there, and at InMobi I was the first person on the data team. Here at Helpshift, which makes an in-app mobile help desk, we are building a data stack from scratch again. Having built data stacks at three different places, what I have come up with is a framework for making decisions: what components to choose and what trade-offs you need to make. A lot of it is hard-fought advice from times when we didn't take the right decisions and then had to wake up in the middle of the night to fix them. I am going to share some of those experiences.

I am going to divide this into three parts. In the first part I will talk about the nature of data, because a lot of people are not aware of it. In the second part I will talk about the framework, and in the third part I will use that framework to discuss very specific examples of software that fit into each layer of it, comparing and contrasting why one is better than another. It is a lot of material to cover, so let me see how well I do, but see this not as a talk but as the start of a conversation. Feel free to connect with me; I will be hanging around at the conference today and tomorrow, and also at the Nexus Venture Partners booth.

So, let's get started. This is an oil rig; data is the new oil, and the metaphor is quite rich. Even the terminology carries over: we talk about data mining or refining data. Just like oil, everybody wants to look at raw data, glean insights out of it, refine it, maybe sell it at a higher price, or use it to fuel their engines of growth. On the other hand, I think we are also at peak big data, just as we are at peak oil, and you can already see the next buzzword coming up, the Internet of Things, which is also fueling big data.

So, how do we understand data? There are certain aspects of it that we should know. A single data point does not tell us much; it may be a sensor reading or the bid rate for an ad network. But if you understand the relationships, or take two different streams and combine them, you get more. For example, in an ad network you have an impression and you have a click. If you have a click, it tells you there is intent; that is when it becomes information, and you know something more than you did from a single data point.
On that slide, connectedness is on one axis and understanding is on the other. You can take some of this information and aggregate it, and you start seeing patterns; that is where machine learning and statistical analysis fit in. Once you have done that, you can take the same understanding and knowledge and apply it to different domains; that is when it becomes wisdom: if I know these are the patterns I continuously keep seeing, and I can use those patterns in a different industry, that is wisdom. So this is the hierarchy of data, and we should always try to move up the hierarchy: start from data itself, see how we can combine it, and move towards wisdom, because that is where the real action is.

This is what the data stack looks like. At the bottom you have data generation, which could be a sensor network, a mobile phone, a click stream, or a web browser. Then there is data collection and transport: if it is a web browser, it could be JavaScript sending events or a pixel firing; it could be low-powered sensor networks that collect data; it could be data transferred at web scale between data centers, or between an ad server and the Hadoop ecosystem; or, in the case of mobile, something that collects events from the device and sends them back. Once you have collected all of that data you need to store it somewhere physically, so I will talk a little about the decisions you need to make when you are designing storage. Then, once you have all the data in, you have to transform it, do various kinds of processing, and put it in a format that is ready for analysis, because the raw data may not be good enough to perform any analysis on; maybe you need to do cross-feed validations and so on. Finally, at the highest level, you have visualization, which is a data narrative: you have gone from a single data point, taken the connectedness of the data, and now you are telling a narrative from that data using visualization.

I am mainly going to talk about the bottom four layers. I am not going to concentrate so much on the top two, partly for lack of time, and also because I feel that will be covered in tomorrow's talk by Shailesh Kumar towards the end. So I will cover the bottom four and he will probably cover the top two in more detail, though I will say a little about those as well.

There are two different approaches to getting insights. One is to start with a hypothesis and then find data that can support or refute it. For example, if you are in retail, you might feel that changing the arrangement of SKUs, the arrangement of items on the shelf, can increase your sales, perhaps by putting higher-margin items at a higher level and lower-margin items at lower levels.
Typically most people will say, what is available in this category, buy that, and just go; not everybody will look at all the alternatives and evaluate them, so you are nudging the user. You start with that hypothesis and see whether a certain shelf arrangement works or not. That is one way of doing it. The other way is a bottom-up approach, like the slide I showed before: you start with a data point, then try to see if you can combine it with anything else, then try to find out whether there are any patterns, and then see if there are principles you can apply to enhance the data. That is the other approach you can use.

So what do I mean by the nature of data? Nature of data could mean, if it is a social network, the connectivity of friends of friends; that could be one level. It could also mean interrelationships: for example, if you are using Twitter, not only the message but also the location is important, where you are tweeting from. Twitter exploits exactly that; they can surface trends on a localized basis because they have used the nature of the data. You should also look at the relationships between different entities: is it a one-to-many relationship, is it a peer-to-peer relationship, and so on. Often what is useful is to look at the ratios, distributions, variances, and medians, at the shape of the distribution, and that can have a huge impact.

Rather than talking about this in a very dry way, a great example is the retail industry, which all of you can relate to. Suppose you have a chain of stores and you want to optimize revenue; how do you do that? There are so many approaches. If you are a Pantaloons or a Westside, you could try to increase the number of footfalls, in which case you have to do a bit more outreach and marketing and then measure how many people are coming into your store. Or you could say: once the customer is in, I am going to try to maximize how much money I make from that customer, which is the classic average-revenue-per-user, average-revenue-per-customer argument. The metric you try to optimize is very, very important, because that metric can make your business grow like crazy, or it can destroy the business because you optimized for the wrong thing. It can have a huge impact on your business.

So let me start going up the stack, beginning with the most basic thing. The first thing you need to know is what data needs to be generated. If you are in a retail store, for example, you could capture the number of items, the average price of an item, whether things are bought in groups; shaving blades or diapers would typically be bought in groups. What is it that is actually important?
So you have to understand the domain and see what needs to be generated. You also have to look at the frequency of generation: for meteorological data, do you need to take samples every minute, every hour, or is it okay to take readings just once a day and record the max and min? This is one of the most fundamental things that people miss, and it is extremely important; I cannot emphasize it enough, because the quality of the data that goes into any of the upper layers matters. You can take good data and make garbage out of it, but you cannot take garbage and make gold out of it. I often find people get obsessed about frameworks, should I use Hadoop, should I use Spark, and they do not look at the nature of the data; that is a huge fallacy that is unfortunately common in our industry, and in this talk I am going to counter that mentality.

It is also possible to pre-aggregate, or to sample, but when you sample you have to be sure the sample is representative of the population. The population is all the possible values of the outcome, and the sample is a small representation of that. For example, if you do an exit poll by asking people on the internet whom they voted for, is that a good sample? Probably not, because it represents the section of people who have access to the internet, which means they belong to a certain economic stratum; maybe all of them voted for the BJP while the poorer people, who are larger in number, voted for the Congress. In that case you will have issues getting insights out of the data because your sampling itself is biased. That is another thing to watch out for, and it has a huge impact at the analysis stage.

Also look at the metadata you can attach. A tweet has a huge amount of metadata: it has a location, whether you replied to somebody, probably which client you are using. Look at what you can gather around a single data point; that metadata is really useful. Some examples of such data are sensor readings, itemized store purchase data, and impression data.

These are some data formats; I will quickly go through them and tell you why choosing a good data format is important. CSV, or tab-separated values, is the bare minimum; it emphasizes a row-based approach, and it is very hard to represent network data in a CSV, for example. The next, slightly more evolved format is JSON. JSON has a notion of hierarchy, you can embed elements inside it, and that is slightly better. The third format is Thrift. Thrift has rich data types, and it does not care about the language, because it serializes to a compact binary format and has adapters for different languages that can read that data back.
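Before moving on to the columnar formats, here is a minimal Python sketch making the row-versus-hierarchy point concrete; the event fields are invented purely for illustration, not taken from the talk.

```python
import csv, json, io

# One hypothetical ad-impression event (fields invented for illustration).
event = {
    "ts": "2014-11-21T10:15:00Z",
    "campaign_id": 42,
    "device": {"os": "Android", "model": "Nexus 5"},   # nested metadata
    "geo": {"lat": 12.97, "lon": 77.59},
}

# CSV forces a flat, row-oriented view: nested fields must be squashed into
# ad-hoc columns, and every row has to agree on exactly the same columns.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["ts", "campaign_id", "device_os", "device_model", "lat", "lon"])
writer.writerow([event["ts"], event["campaign_id"],
                 event["device"]["os"], event["device"]["model"],
                 event["geo"]["lat"], event["geo"]["lon"]])
print(buf.getvalue())

# JSON keeps the hierarchy intact, so optional or nested metadata
# (device info, location, replies) travels with the data point.
print(json.dumps(event, indent=2))
```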
Finally, there is RCFile. I think most of you might not be aware of it, but especially those who are looking at Hadoop for analytics should look at it. What RCFile does is take a huge amount of data, with tons of rows and tons of columns, dimensions and measures, and split it in a columnar fashion, which is especially important for analytics. I will dig a little deeper into that style of thinking later in the talk, but I have put it here so you can Google it and look at these alternatives, because you should not look only at CSV.

Next, data collection and transport. You can do some aggregation at the source or send every data point; you can also do some pre-aggregation locally, or store and forward. There are two methodologies people typically use: either you push the data, or you accumulate it and some network API or callback keeps pulling it. Both have their pros and cons. A push can be synchronous, though it does not have to be; pull is often better because it is easier to scale. For example, if I use a push methodology and the server that is collecting the data is down for updates, I have a problem; but if I do some local aggregation and use a pull methodology, I can take the server down for maintenance and pull the data later, and that works much better. So push versus pull is something fundamental you need to think about when you are collecting data.

There are also factors in the choice of the underlying transport protocol. Some of these I think you are already aware of, so I won't go deep into TCP and UDP: TCP is connection-oriented and reliable, UDP is connectionless and hence unreliable, and HTTP APIs are what most people know. Something interesting that most people don't know about is MQTT. MQTT is a transport protocol that is useful for sensor data and resource-constrained environments. I have also seen people successfully use it as a pub/sub mechanism for collecting data from mobile devices. So when I say resource-constrained, it need not just be sensor networks out in the desert; it can be mobiles or equipment that is moving around, GPS units for example, or telemetry data, which is what it was built for. So there are alternatives to TCP and UDP, which are the most popular ones, and MQTT is one of them.

Going back, there are a couple of other factors in the choice of software here. For some data you might be okay missing a measurement: maybe you are taking a temperature reading every 5 minutes, but what you actually need is how the temperature varies over an hour, so if you miss one 5-minute reading it is fine; the reliability constraint is relaxed there. The delivery policy is also something that needs to be thought through: you can have at-most-once delivery, at-least-once, or exactly-once, and that has huge implications for your design. This is something a lot of people do not think through when they build systems, and we will go into an example where it matters. The final one is durability: can the message be stored, is it fault tolerant, can I recover from failures if a node goes down?
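Since MQTT's QoS levels map directly onto those delivery policies (QoS 0 is at-most-once, 1 is at-least-once, 2 is exactly-once), here is a minimal sketch using the paho-mqtt Python client; the broker host, topic name, and QoS choice are assumptions for illustration, not anything from the talk.

```python
# pip install paho-mqtt  -- minimal sketch; broker, topic and QoS are assumed.
import paho.mqtt.client as mqtt

BROKER = "broker.example.com"      # hypothetical broker
TOPIC = "sensors/temperature"      # hypothetical topic

def on_message(client, userdata, msg):
    # Called for every reading pushed by a sensor or mobile device.
    print(msg.topic, msg.payload.decode())

# Subscriber: the collection side of the pub/sub pipe.
collector = mqtt.Client()
collector.on_message = on_message
collector.connect(BROKER)
collector.subscribe(TOPIC, qos=1)   # qos=1 => at-least-once delivery
collector.loop_start()

# Publisher: a resource-constrained device pushing a single reading.
device = mqtt.Client()
device.connect(BROKER)
device.publish(TOPIC, payload="27.4", qos=1)
```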
So, how many people here use Kafka? And how many use RabbitMQ? Fairly split. I hope you made those choices carefully, because I am going to tell you where each of them excels, and why we use both. Typically you do not want software proliferation: from an operational standpoint you want to use one piece of software everywhere, because understanding how it behaves under load, and deploying and maintaining one system, is much easier than managing a bunch of them.

Kafka is very producer-centric. It assumes the producer will keep on producing and it will keep on queuing; the consumer can come at any time, pull the data, and consume it. Its whole design philosophy is built around that. RabbitMQ, on the other hand, says: I am going to receive a lot of data, act as the central broker, and just keep routing it all the time. That is where RabbitMQ is good. Kafka does not have very sophisticated routing capabilities, whereas with RabbitMQ you can set up fairly complex rules to route data around, combine data, and replicate it as well.

Kafka is also better for durable messages. You could send anything from a CPU reading to an email through Kafka, and the same through RabbitMQ, but which would you use for which? If you need durability, you cannot lose an email, though maybe you can lose a reading, I would highly recommend Kafka, because it is designed with disk as the primary store in mind. RabbitMQ also does durability, by the way, I know a lot of people will say that, but the design decisions it took were not made for durability, so it supports it and the performance really suffers when you use disk as the backing store. Kafka is also better for large messages: if you are sending emails, or a bunch of PDFs, or anything else with a large message size, Kafka works really well. Kafka is also extremely performant for large volumes of messages, whereas RabbitMQ, because it assumes messages are transient, does not handle very large volumes well in our experience. The difference is roughly this: you can do 100K-plus messages per second with Kafka, but with RabbitMQ you start hitting the barrier and maxing out the CPU for the same workload at maybe one-fifth or one-tenth of that.

One last thing: we use both. For operational data, where we do not mind losing some readings, we use RabbitMQ; for data we really don't want to lose and where we want at-least-once processing, for example email, we use Kafka. So you can see the clean split: one is optimized for one thing and the other for something slightly different. They both do queuing and they have similar design goals, but the design decisions and trade-offs they have taken are quite different.
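A minimal sketch of that split, using the kafka-python and pika clients; the broker addresses, topic and queue names are assumptions. Durable must-not-lose data (email) goes through Kafka, which persists to disk and lets consumers replay; transient operational readings go through RabbitMQ, whose broker-side routing makes fan-out and filtering easy.

```python
# pip install kafka-python pika  -- sketch only; hosts and names are assumed.
import json
from kafka import KafkaProducer
import pika

# Durable path: emails go to Kafka, which writes to disk so consumers can
# come back later and replay (at-least-once processing).
kafka = KafkaProducer(bootstrap_servers="kafka.example.com:9092")
kafka.send("outgoing-email",
           json.dumps({"to": "user@example.com", "subject": "Hi"}).encode())
kafka.flush()

# Transient path: operational readings go to RabbitMQ, where exchanges and
# bindings give you the richer routing the talk mentions.
conn = pika.BlockingConnection(pika.ConnectionParameters("rabbit.example.com"))
channel = conn.channel()
channel.exchange_declare(exchange="metrics", exchange_type="topic")
channel.basic_publish(exchange="metrics",
                      routing_key="host1.cpu",
                      body=json.dumps({"cpu": 0.42}))
conn.close()
```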
The next thing I am going to talk about is data storage, so let's dive a little deeper into that. There are different kinds of storage media: SSD, memory, hard disk, network, and, often because of AWS, combinations of these. There are also different storage formats. The most common one is the B+ tree; the newest one in databases is called fractal trees. How many of you use MongoDB? How many of you are facing problems with MongoDB around transactions? If you want the best of both worlds, transactions plus a document store, there is a piece of software called TokuMX that implements fractal trees and uses an MVCC-style model for storing data, so it can support transactions even with a document store, which MongoDB doesn't.

I don't know how many of you can read this at the back, but this shows data access latency, all the way from the CPU's L1 cache and a branch misprediction onwards; all these times are in nanoseconds, with a comma separator at thousands. This one is something like 150 milliseconds, and this one, compressing 1 KB of data, is about 3 microseconds. You can see where SSD fits in: a 4 KB random read from SSD versus going over the network. A lot of you might know this, but a lot of people do not think it through when they design systems, and they pay very dearly for it.

A great example I have seen is with AWS, which is quite common with startups and even enterprises that don't want a scale-out architecture: they use EBS. What is EBS? EBS goes over the network, and, I don't like to say this, but there is a new kind of snake oil, to some extent, that Amazon has come up with: they let you have drives that are not rotational, not magnetic, but SSDs, yet the catch is that it is still over EBS. So you incur the network cost in spite of being on SSD; you basically take the network latency and add it to the SSD latency. Maybe it is better; I am not saying it is bad, it is obviously better than a magnetic disk, but the amount of work you are doing over the network just kills the performance, and honestly there is not much point. So if you are using AWS, I would recommend looking at designing your architecture around ephemeral disks. There are problems with ephemeral disks, if your instance goes down you lose all the data, but you can make up for that by having multiple replicas. If you design your architecture for performance, use ephemeral disks; don't use EBS. EBS is the low-hanging fruit, because snapshotting becomes easy and if your instance goes down you still don't lose the data, so it is very attractive, but keep the performance characteristics in mind when you use it.

For those at the back who cannot read all the numbers, here is the same thing in a visual format. As you go from partition to partition there is an increase by a factor of 100, so you can see that an L1 reference takes a couple of nanoseconds, and all the way at the other end, multiple orders of magnitude later, a packet from California to the Netherlands and back is about 150 milliseconds. This is something people don't think about. There was an interesting discussion about this yesterday at the Aerospike talk.
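For reference, the slide appears to be based on the commonly cited "latency numbers every programmer should know"; the figures below are order-of-magnitude approximations from that list, not measurements from the talk, and exact values vary with hardware.

```python
# Approximate access latencies in nanoseconds (order-of-magnitude figures).
LATENCY_NS = {
    "L1 cache reference":                        1,
    "branch mispredict":                         5,
    "L2 cache reference":                        7,
    "main memory reference":                   100,
    "compress 1 KB (fast compressor)":       3_000,     # ~3 us
    "send 1 KB over 1 Gbps network":        10_000,     # ~10 us
    "4 KB random read from SSD":           150_000,     # ~150 us
    "round trip within same datacenter":   500_000,     # ~0.5 ms
    "read 1 MB sequentially from SSD":   1_000_000,     # ~1 ms
    "disk seek":                        10_000_000,     # ~10 ms
    "read 1 MB sequentially from disk": 20_000_000,     # ~20 ms
    "packet CA -> Netherlands -> CA":  150_000_000,     # ~150 ms
}

# EBS-style storage roughly adds a network round trip to every SSD access,
# which is why locally attached (ephemeral) disks win on latency.
for name, ns in LATENCY_NS.items():
    print(f"{name:38s} {ns:>12,} ns")
```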
Can I use the full 45 minutes and maybe take the questions later? It is a lot of stuff to cover; I can take the questions offline. Let me quickly move on to the next layer, but this is something you have to keep in mind as we go through the rest.

These are the different paradigms in the data processing layer. The data processing layer is where you take all of the data, fill in missing values, do validations, and combine different streams. For example, on an ad server you have request data, impression data, click data, and finally the landing page; you can take all of those streams and combine them, and this is the layer where you do it. You take in data, process it, aggregate it, maybe do some windowing on it and find a median or a mode, do any kind of statistical analysis; this is the layer where that happens. There are various ways of doing it. You can use a message passing interface if you have a cluster. You can use micro-batches, which are becoming quite popular due to Spark, which I will cover a little later. You can use Storm for real-time streaming, and obviously with Hadoop and other systems you can use batch processing.

A similar comparison to the one before: Storm versus Spark. Storm uses a task-parallel architecture, where you spawn different tasks, whereas Spark Streaming, not Spark itself but Spark Streaming, uses a data-parallel architecture, where it splits the data across machines. The primitives in each case are slightly different. In Storm you have a topology, which is a set of processes that keeps running and processing data; spouts are where you ingest data, and bolts are where you actually process it. Storm is extremely good for working on individual items: an item comes in, you check whether it meets a threshold, and if yes you move it forward, otherwise you drop it. Storm is also quite good if you want to keep accumulating data as it comes in. If you want micro-batches, where you take a list of data points and do operations on them, then Spark is quite good; Spark really shines for machine learning workloads. You can do that with Storm as well, but it is not designed for that. There is also a big difference in latency: Storm can guarantee sub-second latency at very high throughput, while Spark, because it is batching, has a latency of a few seconds. In terms of fault tolerance, Storm assures at-least-once processing, whereas Spark Streaming gives you exactly-once. That is the difference there.
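To show what the micro-batch model looks like, here is a minimal PySpark Streaming sketch; the socket source, port, and the idea of counting clicks per campaign are assumptions for illustration. Each 5-second batch of events is processed as a small RDD, which is why latency is seconds rather than the sub-second figures Storm can give.

```python
# Minimal Spark Streaming sketch; the socket source and click format are assumed.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="ClickCounter")
ssc = StreamingContext(sc, batchDuration=5)   # 5-second micro-batches

# Each line is assumed to be "campaign_id,user_id,..." coming from a collector.
lines = ssc.socketTextStream("localhost", 9999)

clicks_per_campaign = (lines
                       .map(lambda line: (line.split(",")[0], 1))
                       .reduceByKey(lambda a, b: a + b))   # per-batch aggregation

clicks_per_campaign.pprint()   # print counts for every 5-second batch

ssc.start()
ssc.awaitTermination()
```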
I think part of this was covered in the previous talk, but quickly, for those who do not know OLTP versus OLAP: OLTP is when, for example, you go to a bank branch and do a transaction; it typically goes to something like a MySQL, Postgres, or Oracle database, where every individual item has a lot of detail associated with it. OLAP is where the other side shines: if you want to know the average amount of money a certain kind of depositor has in their account, you would not run that query on the OLTP database, simply because it is designed for a high volume of writes and not for grouping. That is where you would use something like Infobright. At InMobi, the data warehouse was initially built on Infobright, which was touched on in the earlier talk about Grill.

MySQL is extremely good for transactional workloads; Infobright is quite good for analytical workloads. When you ingest CSV data into MySQL, because there is structure around it, you have to create indexes and take on a whole lot of other overhead, so the size typically increases when you load the data. Infobright works differently. Say we have an analytical database of all the cricket matches that have ever happened, and I want to find the average of, say, Sehwag versus Tendulkar versus Ganguly. How many players could there be? Maybe 5,000 or 6,000. This is all categorical data without high cardinality, so Infobright takes it and compresses it: it creates a dictionary, replaces each value with a symbol, and every lookup just dereferences that symbol, which gives extremely high compression. At InMobi, and Amrishwari can back up these numbers if she is here, we got about 40x compression: 40 GB of already-aggregated data would compress down to about 1 GB in Infobright. The flip side is that loading is slow and CPU-intensive. It is also great for machine-generated data, because most of it is repetitive and within a certain range; for anything range-bound and analytical, something like Infobright is quite good. You need not use Infobright itself, I have just used it as the example here; alternatives are Vertica and, I think, MonetDB.

The best part is that it is very easy to do sampling and approximate queries. As I said, you can use a top-down approach: if I want to know whether my distribution, my hypothesis, is at least in the right direction, I can run an approximate query without scanning the whole data, and if I am going in the right direction, I can then run the full query and slice and dice and analyze the data further. That is something unique about columnar databases. There are also differences in how they are tuned for performance: MySQL and Postgres both use indexes and aggressive caching, whereas Infobright uses something called a knowledge grid, a metadata layer, for better performance. There is a very good paper on the knowledge grid; for those of you with access to the ACM digital library, it is worth reading how they actually implement it.
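A toy Python sketch of the dictionary-encoding idea behind that 40x figure; the column values are invented, and a real columnar engine like Infobright does this per column before applying further compression, but the shape of the saving is the same.

```python
# Toy illustration of per-column dictionary encoding (values are made up).
batsmen = ["Tendulkar", "Sehwag", "Ganguly", "Tendulkar", "Sehwag",
           "Tendulkar"] * 1_000_000          # low-cardinality categorical column

# Build a dictionary: each distinct value gets a small integer symbol.
dictionary = {name: i for i, name in enumerate(sorted(set(batsmen)))}
encoded = [dictionary[name] for name in batsmen]   # column stored as symbols

# Rough size comparison (ignoring container overhead): raw strings versus
# roughly one byte per row plus the tiny dictionary.
raw_bytes = sum(len(name) for name in batsmen)
encoded_bytes = len(encoded) + sum(len(name) for name in dictionary)
print(f"raw ~{raw_bytes / 1e6:.1f} MB, encoded ~{encoded_bytes / 1e6:.1f} MB")

# A query like "average for Sehwag" only needs to scan this one encoded
# column (plus the measure column), never the whole row -- the columnar win.
```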
Two minutes? Okay. Finally, there are NoSQL data stores. One of them is HBase and the other is MongoDB, and both are useful in places. Neither has a schema, but HBase is really a slightly different category, a wide-column store, whereas a document store can have things embedded and can have some structure. HBase would have to simulate that structure, and it is really painful to do. If you have one key and tons of columns that can be associated with it, especially if they are sparse, then HBase is extremely good. But if you have documents without that much sparseness, then MongoDB is good. The other big difference is that MongoDB does not have triggers: if an event or a data point comes in that matches a certain criterion, it is very hard to process that inside MongoDB, whereas HBase has observers, so you can have a callback whenever an event happens. HBase's scalability is also good because of HDFS. MongoDB's scalability is okay, but as you scale out, what we have seen at Helpshift is that performance suffers pretty badly. If you want to dig deeper into MongoDB, there is a link to a blog at the end where we have written about all the operational problems we have faced with it; we use it at a huge amount of scale, almost 50,000 documents per week and 2.7 billion events, some of which go into MongoDB, about 7 billion events per month.

Quickly, since I have talked about most of this already: another piece of software you probably want to look at, but may not know about, is something that does complex event processing. If you want to perform operations on a window of data rather than on individual data points, then something like Esper, which is open-source software, is quite good. You can also use Spark, because Spark has the idea of windows over micro-batches; it is similar, though not exactly the same thing. I think most of the other stuff has been covered; Neo4j is good if you want a graph database.
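A small, self-contained sketch of what window-based processing means; this is the kind of logic you would express declaratively in Esper or with Spark's window operations, and the window size and threshold here are invented.

```python
# Conceptual sketch of windowed event processing (threshold and size invented);
# Esper or Spark windows would express this declaratively and at scale.
from collections import deque
from statistics import median

WINDOW_SIZE = 12          # e.g. last 12 readings (~1 minute at 5 s intervals)
ALERT_THRESHOLD = 0.8     # alert if the windowed median CPU exceeds 80%

window = deque(maxlen=WINDOW_SIZE)

def on_reading(cpu_fraction):
    """Called for every incoming CPU reading from the collection layer."""
    window.append(cpu_fraction)
    if len(window) == WINDOW_SIZE and median(window) > ALERT_THRESHOLD:
        print(f"ALERT: median CPU {median(window):.0%} over last {WINDOW_SIZE} readings")

# Simulated stream of readings.
for reading in [0.3, 0.5, 0.82, 0.85, 0.9, 0.88, 0.91,
                0.87, 0.92, 0.9, 0.86, 0.89, 0.93]:
    on_reading(reading)
```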
Finally, the last couple of slides. The data analysis layer is where you take all of this data and combine it with third-party sources. For example, if you are collecting marketing data, there is something called designated market areas in the US: the country is divided into DMAs, where one city, say SF, can contain several DMAs, while several states, for example in the Midwest, may be consolidated into one DMA, based on demographics, income levels, and the money spent in each area. That, or even geopolitical boundaries, can be overlaid on your data: from an IP you can find the coordinates, or at least the state it lies in, and then do an overlay on that. So you can take one data point and amplify it by merging in other kinds of metadata and third-party data, using geocoding, or by incorporating human input, which is quite useful if you want to do supervised learning. I am not going to talk much about this slide, as I said before; let us move on to visualization.

Why is visualization important? Maybe you look at the numbers on a slide like the one I showed before and you do not get a sense of their scale, but if I show them to you visually, you do: of all the senses, the largest area of the brain is devoted to processing visual input. Visualization can be a huge factor in building a narrative, and you can see that happening more and more; I think there are some talks around that in this conference as well, where visualization is used as a narrative to show how a certain industry is doing.

Towards the end there is a link to a stream graph. If you search for the box office receipts of the most popular movies over, I think, seven or eight years, it shows you, along an axis of time, how much each movie made. In one shot you get a lot of information: you can colour-code the genre, the height can be the revenue, the other axis is time, so you can see which movies did well, and you can overlay additional input on top of it, for example the distributor. Visualization is very useful for bringing out different aspects of the data and the interrelationships between them. The other one that is quite good is the sunburst. Suppose you have a hierarchy, for example disk space usage over a directory structure, and you want to find some pesky movie, which I have never watched, lying in some folder somewhere while I keep deleting small files; how do I find it? In a sunburst the innermost ring is the top-level components and the hierarchy radiates outwards, so you can zero in on the files you need to delete to make space on your disk. You can use it for relative populations of administrative blocks, or the market cap of sectors listed on a stock market index. In just one visualization you can encode tons of data using the different layers.

There is an extremely good library for those of you who have not used it; the killer application for R, just like Rails is for Ruby, is a library called ggplot. We used it fairly heavily at InMobi for finding outliers in fraud detection. You take a bunch of data, do a scatter plot, colour each of the dots based on certain metadata, and you can see which bands they fall into. Just by running those scripts you get a good idea of what your distribution looks like.
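The talk refers to R's ggplot; here is a rough Python analogue of that outlier-spotting scatter plot using matplotlib, with entirely synthetic data: colour the points by a piece of metadata and the odd band stands out visually.

```python
# Rough matplotlib analogue of the ggplot outlier scatter plot (data is synthetic).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
clicks = rng.poisson(50, size=500)                        # normal traffic
conversions = clicks * 0.1 + rng.normal(0, 2, size=500)   # roughly proportional

# A handful of synthetic "suspicious" points: many clicks, almost no conversions.
fraud_clicks = rng.poisson(200, size=10)
fraud_conversions = rng.normal(0, 1, size=10)

plt.scatter(clicks, conversions, c="steelblue", alpha=0.5, label="normal")
plt.scatter(fraud_clicks, fraud_conversions, c="crimson", label="suspicious")
plt.xlabel("clicks")
plt.ylabel("conversions")
plt.legend()
plt.show()   # the off-band cluster is immediately visible
```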
Mr. Hegde, if you don't mind, my apologies, but we're going to have to cut you off there.

Yeah, I'm done. Thank you so much; my timing was good. You can check out the links to the blog where we talk about MongoDB, for example, and also some benchmarks we have run. Follow me on Twitter, or send me an email if you want to follow up with questions.

Actually, we have time for one question. If you think you have a mind-blowing question: your hand went up first, the gentleman please.

Hi Vinayak, I'm Sandeep. I wanted to ask about that MongoDB versus HBase comparison. You specifically made the point that HBase has triggers whereas MongoDB doesn't. Could you give an example of where you would use triggers, some real example where you have used them?

A trigger could be, for example, when you have outliers coming in: say it is an operational database and the CPU spikes above, say, 80%, and you want to do some alerting based on that, or bucket the event based on a certain criterion; triggers are quite useful for something like that. So yes, it is based on data events. The other place where I see them used very heavily, this is in Postgres rather than MongoDB, but triggers in general, is table-based replication. Thank you.
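To make that trigger example concrete, here is a tiny conceptual sketch of the kind of logic a trigger or HBase observer would run on each write; the threshold and alerting hook are invented, and in HBase this would live in a Java coprocessor rather than in application code like this.

```python
# Conceptual sketch only: what a write-time trigger/observer might do.
# In HBase this logic would sit in a RegionObserver coprocessor (Java);
# MongoDB, at the time of the talk, had no equivalent hook.
CPU_ALERT_THRESHOLD = 0.80

def alert(metric, value):
    print(f"ALERT: {metric} at {value:.0%}")   # stand-in for a real pager/alert

def on_write(row_key, column, value):
    """Invoked by the store for every incoming data point."""
    if column == "cpu" and value > CPU_ALERT_THRESHOLD:
        alert(f"cpu on {row_key}", value)
    # ...could also bucket the event, update a counter, or fan out replication.

on_write("host-17", "cpu", 0.91)   # -> fires the alert
```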