Good morning everyone. I'm Varun, from Minjar, which is basically a cloud services company built around these technologies. This talk might be a little out of place because, to be truthful, I'm not going to talk much about maps. It's more about processing the data, getting meaningful information out of it, and the challenges we faced doing that, so I hope you don't get bored.

This is just one slide on what we do at Minjar. We have good experience in the design, development and management of a lot of location-based data. We manage Hadoop clusters for a lot of organizations, help them move to Hadoop, crunch location-based data on Hadoop and get meaningful information out of it. We have a private cloud offering, a hyper-scale backup solution, and optimized stacks on the cloud where you can come and deploy your applications. What I'm associated with more is the big data team, which looks at location-based data, crunches it and gets information out of it.

Everybody knows what hyper-local is, so to start with: what we do with location-based data is take a lot of data from other sources and extract the meaningful information from it. The data we get is very raw; it just says that at this particular latitude and longitude, at this time, some event happened. The latitudes and longitudes come at an accuracy of six or seven decimal places. If you look at all those points and try to get something out of them directly, it becomes too messy, so that kind of data isn't very useful to anybody. So we make sense of it in a slightly different way.

I think all of you know what a tile is. When we get this data we break it up into tiles, and we get the hyper-local information, which is not just where the event happened but also in what time period it happened; that is what makes it a hyper-local event. From that we calculate scores for the data, and that is where we are.

So how do we represent this data? When we get it, it is basically a latitude, a longitude, a point in time, and what event happened at that point. We have tiles, and each tile is a rectangular piece at some resolution, covering a range of latitudes and longitudes to which we can map the event. The data can be events, photographs, locations, videos and so on, and all of it needs to be mapped to a particular tile at a particular time. When we serve that data back, we can say that this is an event which happened at this particular location for this time period, and based on the number of events coming in we keep scoring them using various algorithms, which determines how good that data is.

The way the processing happens is that we get raw data from various sources, and there are a lot of them everywhere. For example, we get photos from Flickr, which come with tags and keywords.
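To make the shape of that incoming data concrete, here is a minimal sketch of what one raw record might look like before any processing; the field names and layout are illustrative, not our actual schema.

    // Illustrative only: one raw hyper-local event as it arrives from a source,
    // before any tiling or scoring has been applied.
    public class RawEvent {
        double latitude;     // e.g. 12.935412 (six or seven decimal places)
        double longitude;    // e.g. 77.624108
        long   timestampMs;  // when the event happened
        String eventType;    // photo, video, listed event, ...
        String[] keywords;   // free-text tags, e.g. "Bangalore", "Koramangala", "conference"
    }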
We extract the location and the context: the keywords, whatever keywords are there, Bangalore, Koramangala, "there's a conference going on", which is an event that is happening. We get all this information, we adjust the scoring for this location and this keyword, and we get the time information, the time period in which it happened, like morning hours or weekday nights. So something like a beer club would fall into a weekday-night bucket: if something happened at a beer club, it would come in as a weekday-night event. Then we use this data for prediction, by running the scoring algorithms and so on.

Getting into how we would do this for Bangalore, as an example let's consider an area like Koramangala. To predict what is happening there, we break it up into a set of tiles at some resolution. If you want a 10 square meter resolution, or even smaller, you break those ranges of latitude and longitude into a set of tiles. Then when we get an event, say a beer club, it falls into a particular tile. If you look at Church Street, most events happening around Church Street revolve around pubs and so on, so an exact latitude and longitude to six decimal places doesn't add much; somewhere on Church Street it wouldn't make too much of a difference whether an event happened at .001 or .002. So we don't keep that precision. We take all this data, derive a tile out of it, and serve the data for that particular tile. You can also change the tile size, so maybe Koramangala can be one tile, Church Street another tile, Shantinagar another, and so on.

Now, the thing with tiles is that we deliberately made them rectangular. Why did we do that? Because if you start drawing each locality as an arbitrary multi-polygon shape, it becomes very difficult to render that information. With rectangular tiles it's very easy: you can use Google Maps, OpenStreetMap, whatever maps, and layer that information on top, and it stays fast. If you start making multi-polygons of everything, it becomes too slow, which is why we broke it up into tiles.

Okay. Most of my presentation revolves around processing this amount of data. We never really set up our own Hadoop cluster; locally, on a local system, we did, but most of the processing of all this data has been on Amazon EMR. Why did we choose EMR? Especially to start with, the biggest advantage was that we didn't have to do any of the hard work of setting it up or procuring machines. Initially we didn't even know how we wanted to do it, or whether it would actually work. So we started running on EMR and it gave us several advantages. One was that we had no headache at all of setting up clusters: just go to EMR, launch the number of machines and clusters you want, and then focus on the logic. That is an advantage EMR gives you. All the storage we have is currently on S3 or Glacier, which is another advantage, because we don't have to download much data; we read it from there directly.
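As a rough illustration of the rectangular tiling described above, here is a small sketch of how a latitude and longitude can be snapped to a tile, and an event time to a coarse bucket; the grid size and bucket names are example values, not the ones we actually use.

    // Illustrative tile mapping: snap a latitude/longitude onto a rectangular grid
    // so that nearby events (say, everything on Church Street) share one tile id.
    public final class TileGrid {
        // Roughly 0.0001 degrees of latitude is about 11 meters, so this gives
        // tiles roughly 10 meters on a side; pick whatever resolution you need.
        private static final double TILE_SIZE_DEG = 0.0001;

        public static String tileId(double lat, double lon) {
            long row = (long) Math.floor(lat / TILE_SIZE_DEG);
            long col = (long) Math.floor(lon / TILE_SIZE_DEG);
            return row + ":" + col;  // one key per rectangular tile
        }

        // Coarse time bucket ("weekday-night", "weekend-morning", ...) so an event
        // is hyper-local in time as well as in space.
        public static String timeBucket(java.util.Calendar c) {
            int day = c.get(java.util.Calendar.DAY_OF_WEEK);
            boolean weekend = day == java.util.Calendar.SATURDAY || day == java.util.Calendar.SUNDAY;
            int hour = c.get(java.util.Calendar.HOUR_OF_DAY);
            String part = hour < 12 ? "morning" : hour < 18 ? "afternoon" : "night";
            return (weekend ? "weekend-" : "weekday-") + part;
        }
    }

Rendering stays simple because each tile id maps back to an axis-aligned rectangle, which any map library can draw as an overlay without dealing with arbitrary polygons.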
The challenges we faced were mostly these. Long-running clusters were becoming too expensive: we had clusters running for over a week on m2.4xlarge machines, which was just way too expensive, consuming too much time and money. It's also a pain to manage all the workflows; initially we had just two or three jobs running, but when they started becoming workflows depending on each other, that was another problem we faced. And of course, if you're reading from S3, it takes a lot of time to read. The biggest problem was that failure of one job would cause the rest of the jobs in that cycle to fail. So how did we solve these problems?

The biggest performance lesson was that, when you solve a problem of that magnitude with that much data, every line of code actually counts. We optimized to that level: if, instead of a local variable, you declare that variable outside the method at the class level, it gives a huge performance difference, saving almost a day of processing for us. That was the level of performance optimization we did.

One of the biggest things you need to do with Hadoop is the configuration. In the MapReduce framework you can configure almost everything: the number of mappers, the number of reducers, where you want to output the data, the size of the sort buffer, how much HDFS you want to use. We optimized all of it, and we have different configurations for different kinds of jobs; if it's a memory-intensive job, we give more memory to the mapper or the reducer.

A lot of people, when they work on Hadoop, start writing in Pig or Hive, or in Ruby or Python as streaming jobs. We started using Java to write all our code; we didn't use Pig, Hive or the other high-level tools (Java is high level too, of course), and it actually brought the cost down a lot. Another area where we cut down was HDFS usage and the disk characteristics. We also realized that the mapper and the reducer can't share one common configuration across every job, so that is something we changed. This maximized the I/O bandwidth: where you output the data onto HDFS, when you read that data, and how much of it you read in what batches, for both the mappers and the reducers. And we experimented a lot with JVM reuse, keeping the same JVM around to process many tasks; it actually helps a lot.

This is a test we did comparing Pig with Java on Hadoop. If you look at the data (I should have marked the boundaries here, but that's okay), these are the times we had for Pig on an 11-machine cluster running around two mappers and reducers on each machine: the load, the filtering, the reduce and so on, coming to around 1,072 in total. After all the optimization I've talked about, in the mappers and the reducers and so on, this went down by almost 40%, so we cut our time to almost half of what it used to take earlier.
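The "declare the variable outside instead of as a local" point above is the usual Hadoop object-reuse pattern. Here is a hedged sketch of what that looks like in a mapper, reusing the TileGrid helper from the earlier sketch and assuming a simple tab-separated input layout (our real format differs).

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class EventTileMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // Created once per task and reused for every record, instead of allocating
        // new objects inside map(); over billions of records this avoids a huge
        // amount of garbage collection work.
        private final Text outKey = new Text();
        private final IntWritable one = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");  // assumed layout: lat <tab> lon <tab> ...
            outKey.set(TileGrid.tileId(Double.parseDouble(fields[0]),
                                       Double.parseDouble(fields[1])));
            context.write(outKey, one);  // count one event against its tile
        }
    }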
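And a rough sketch of the kind of per-job tuning being described, using the Hadoop 1.x-era property names that EMR exposed at the time; the values shown are illustrative, not our production settings.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class JobTuning {
        public static Job memoryHeavyJob() throws IOException {
            Configuration conf = new Configuration();
            conf.set("mapred.child.java.opts", "-Xmx1536m");    // bigger heap for memory-heavy tasks
            conf.setInt("io.sort.mb", 200);                     // bigger sort buffer, fewer spills to disk
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);  // reuse one JVM for many tasks (-1 = no limit)
            Job job = new Job(conf, "tile-scoring");            // this constructor is fine on Hadoop 1.x
            job.setNumReduceTasks(24);                          // sized per job and cluster, not one default
            return job;
        }
    }

A lighter, I/O-bound job would get a different set of values rather than sharing this one configuration.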
For data storage, what was really important was tuning io.sort.mb, which controls how the data gets spilled out to disk, and controlling the number of parallel mappers. That is something very important: if you have a lot of input, for example 1,000 files, and you process them in parallel, it happens much faster, but it also has a negative impact, because with all that processing happening the memory consumption and so on changes. So for each kind of job we optimized the number of mappers we should run. We also started using a custom input format, a combined input format: if you have 10,000 different small files, instead of sending them all to different mappers, we combine them, effectively turning them into 1,000 inputs handled by shared mappers. That helped us reduce a lot of the data transfer that was happening. We also experimented with different codec choices: we used LZO, we tried bzip2 and so on, and optimized that as well. And after a certain period of time your processed data itself becomes a big data problem; we started using Glacier to store it, and that is another problem we are facing right now.

The savings we achieved through this amounted to almost a 100x performance increase across all these tweaks. The machines we used earlier were m2.4xlarge instances on Amazon EMR; that has now come down to m1.xlarge, which has about a quarter of the memory and half the processing power, which is a big saving. We do a minimal amount of data transfer outside of S3, so the only data transfer that happens is between Amazon EMR and S3, which saves a lot of cost. The work which used to take around two weeks to complete now takes just a few days, around two or three, which is another huge saving on top of the ones I talked about.

There are still quite a few bottlenecks we need to solve. One is dependent workflows: on EMR the existing workflow schedulers and so on don't really work well for us, so we are experimenting with that and coming up with something of our own. And as I said, the amount of data that we process and store in S3 has itself become a big data problem. So that is it; a short presentation. I didn't go much into all the EMR tweaks we did because I realized this is more of a maps conference. Okay, questions?

Could you show some of the data products, the map data from the slides? It's there on the machine, but I don't think it will work here; it only works in a few specific setups, so I'm sorry. I have a lot of data that I could show, but I can't demo it right now.

Can you share some dollar figures on the savings? Okay, so I actually talked through the dollar figures on this slide. If you look at it, we're talking about m2.4xlarge machines which have turned into m1.xlarge machines. An m2.4xlarge costs around, say, $0.25 or so per instance-hour there, and an m1.xlarge costs around $0.1, so that is a saving of, I would say, about $0.15 per instance-hour. And the job time, which used to be almost two weeks, has gone down to three or four days now. Plus the amount of data that we transfer has gone down to around 20% of what it used to be.
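Stepping back to the combined input format and codec choices mentioned before the questions, here is a hedged sketch of that kind of setup. It assumes a newer Hadoop release that ships CombineTextInputFormat; on the older versions we ran, you would subclass CombineFileInputFormat yourself, which is roughly what our custom format did.

    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class InputOutputTuning {
        public static void apply(Job job) {
            // Pack many small S3 files into a few large splits so one mapper handles
            // many files, instead of one tiny mapper per file.
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // about 256 MB per split

            // Compress the final output before it goes back to S3. BZip2 ships with
            // Hadoop and is splittable; LZO is faster but needs the hadoop-lzo package.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        }
    }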
So overall I would say around 80 to 90% of the cost has been reduced from earlier. Just to give you an estimate of the amounts: earlier we were probably spending around $200,000 on EMR for running all the jobs, and now it has come down to around $25,000 or $30,000, which has been a huge saving for the client. All of this is per week that I'm talking about. So that is what it has come down to.

Right, so how is all the processed data tested? What is the local environment? The local environment is a small Hadoop cluster on a local machine for local runs, but when you actually want to test at scale, the advantage of EMR is that you don't have to launch an 11-machine cluster running for a week. We just start a two-machine cluster for, say, two hours, do the testing and shut it down, and it costs you just $0.2 and nothing much. That's the advantage I was talking about with EMR: you don't have to think about all the setup, where you do the runs, where you do the testing. With EMR you just create your job, upload it, run your cluster, a two-machine, ten-machine or hundred-machine cluster, do the testing, get the output and look at it. That is something huge: you don't have to own the infrastructure, or think about whether a machine is available and where to run this now. It's all on demand; you can run as many as you want.

Why Java? We are using Java because earlier we started with Ruby; we did use Ruby and Python as well. This is what I was talking about: when we started using Java, we got huge savings in time, around 50% of the processing time. What you're saying is right, that the higher level the language, the more time it takes. The advantage of Java is that Hadoop eventually converts all your jobs to Java anyway: if you write in Pig, what it will do is create a jar out of it and then run that jar, because Hadoop is written in Java itself and expects a Java input. So we write the Java ourselves, including our custom input format, build the jar, and that's why Java helps us save some time. I'm not saying that Pig or anything else is a bad language; it's just that since Hadoop is written in Java and is distributed, and it wants the input to be Java, going native saves us time.

Do you use something like Mahout, or do you build things in-house? We're building most things in-house. We did look at the existing engines and so on, but there isn't that much in them that actually helps us out. Basically our aim is to save as much time, and as much processing and memory, as possible, so we prefer to develop our own solutions.

Earlier you said you were running a private cloud; are you still using Amazon's cloud services? We're using Amazon's cloud services right now. Earlier we did set up a small system to start with, but when we moved to EMR we realized there's much less headache and we can focus on development, which is a big plus for us.

That's good. So that's the technology you're using, but in one of your slides you said that you were making a prediction at the end. Is that something you can actually predict?
I was wondering what kind of predictions you can make. So from all this data, after we process it, the predictions we make are basically for hyper-local advertising: what kind of ads to serve where. It's simple machine learning: if you look at the past data, this is what has been served here, these are the things people are interested in, these are the kinds of events that happen here, pub events or conferences. Then you can predict that over the next month, or maybe on a Sunday, these are the kinds of things that will happen. For example, normally most people frequent a pub on a Saturday night, so it doesn't make sense to show someone an ad for a pub on a Wednesday afternoon. The prediction would be that on Saturday night these people will probably go out, even to the extent of saying that on a Saturday night people living in Shantinagar go to a pub on, say, Church Street, and people living in another area go to a pub in, say, Koramangala. Those are the kinds of predictions we mean, and those are the kinds of ads we show.

I see. So your clients are both individuals as well as small businesses? Right, yes. Okay.

A short presentation, but it's awesome. Thank you. Thank you.