People registered, plus we have a lot of InMobi employees around you who are interested in the meetup, so I think we have a crowd of about 300. I think all of you know the agenda, but I'll just run through it once. Before I do that, let me introduce myself. My name is Vinay. I head the data development team, and some of the products that people will talk about today were built by my team.

A little more about the agenda: we are going to have four talks today. We are keeping the introduction short because we are already quite late, so maybe towards the end we can stay back and talk a little about personal experiences.

The first talk is on real-time analytics on HBase. This is something we have been doing for about nine months now, and we already have solutions deployed. Two people from InMobi are going to talk about it, Rohan and Kishan; you can see their presentation is already up there. The second talk is on NameNode high availability, by Suresh Srinivas from Hortonworks. As you know, Hortonworks was spun out of Yahoo recently, and they are now completely focused on supporting the Hadoop platform. The third talk is again from InMobi: we have Srikanth, who is a principal architect here. At InMobi we have been through about five or six iterations of our Hadoop implementation, and we have grown to such a scale that we now need a complete data management platform; that is when the need for Ivory arose, and Srikanth will talk about that in detail. Finally we have Arun C. Murthy, who is also with Hortonworks, and he is going to talk about Hadoop 2.2 and the future of Hadoop. Those of you who follow the Hadoop user and dev mailing lists will have seen some of his comments; he is pretty well known in the Hadoop community. So, over to our first speakers, Rohan and Kishan.

Hey, good evening everyone. We are here to talk about a system we developed on top of HBase, hence the topic of the presentation: real-time analytics on HBase. Basically, over a span of a few years InMobi has grown, and so has the volume of events flowing through our systems. We grew from one machine to N machines; we grew from one data center to multiple data centers. But one requirement has remained the same, which is wanting real-time visibility into our systems. Hence we started to explore the various possibilities out there. This was the basic problem we had: getting near-real-time visualization of granular-level data. It should work at scale; we have multiple producers and multiple streams; and we should be able to extend it across the various streams we have, since we are across data centers.
We should also be able to manage it easily. So we explored various open source technologies that were out there, and finally, after going through a set of iterations, we found a certain set of open source technologies we thought we could leverage. Using them, we created this whole pipeline which answers all the problems we had.

So, what did we use? A combination of technologies: Scribe for log transfer, and HBase, but accessed through OpenTSDB, the open time series database by StumbleUpon, with HBase of course sitting on top of Hadoop. And we manage our workflows using Oozie. This is the combination which helped us create the whole pipeline and gave us our goal: a real-time view of our systems.

So how did we do that? This is the very simple basic architecture which gave us a free hand to achieve what we wanted. Locally, we transported all our events using Scribe, which is by Facebook, doing local aggregation wherever possible, and then we moved all the aggregated data to a central HBase so that we have a unified view. At the center we have an HBase cluster, with multiple OpenTSDB instances in front of it; data ingestion happens from one side, and through the same interface we also bring in various other events and aggregated information.

So after evaluating all these technologies, what did we manage to achieve? If you look here, we get minute-level granularity and a network-level view; this is just a snapshot, it does not represent our real systems. At minute level we were able to plot everything that was available, with a certain latency of course; the latency varied from one minute to five minutes, but we were able to achieve it even as our scale moved on from a million events and kept growing. These are the various aggregation methods over different dimensions; again, this is a very small snapshot of what we have.

The real learning from this product was not only what we achieved, the real-time visualization of data, but also how to use HBase to solve business problems in general: what to do, what not to do, what kind of setup you require, what the best practices are, and how to get the maximum out of HBase. So the three main learnings were: how you design a schema that fits the business problem, how your clients can work efficiently with HBase, and of course how you maintain HBase in your production system.
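Before getting into those three, here is a minimal sketch of the ingestion interface, assuming OpenTSDB's line-based telnet protocol on its default port 4242; the host, metric, and tag names are illustrative placeholders, not values from our setup.

    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    // Pushes a single data point into OpenTSDB over its telnet-style protocol:
    //   put <metric> <timestamp> <value> <tagk=tagv> [<tagk=tagv> ...]
    public class TsdbPut {
        public static void main(String[] args) throws Exception {
            try (Socket sock = new Socket("tsd.example.com", 4242);  // hypothetical TSD host
                 Writer out = new OutputStreamWriter(sock.getOutputStream(), StandardCharsets.UTF_8)) {
                long now = System.currentTimeMillis() / 1000L;       // epoch seconds
                out.write("put requests.count " + now + " 42 dc=blr host=web01\n");
                out.flush();
            }
        }
    }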
As you all know, we used OpenTSDB for abstracting both the schema and the client used against HBase. The OpenTSDB schema is a real gem in its own right, because it really shows how you should design a schema when you are porting a business solution onto HBase.

There are broad-level things you should take care of. HBase on its own is very efficient at range scans, and very efficient when you are doing bulk puts; it lags behind to a certain extent when you are doing random reads or random writes. There are other points as well: for example, if the keys and values you are putting are of varying sizes, then of course your space will be used sub-optimally. So what the OpenTSDB schema does is try to maximize compression and maximize the encoding applied when you put data, so that there is minimal repetition of your data and less space used, and it also tries to map a variable-size key set onto fixed-size encodings.

If you can see it up there, this is the generic command you use to put data into OpenTSDB: you give a metric, and then you mention the various tags you want to associate with, or index, your record. The tags are really useful because when querying you can segregate your data and say: I want data only for certain tags, and I want to aggregate over them. That is the general API you use.

Now, OpenTSDB does this in two phases; it actually builds two tables. The first is an encoding table, where it encodes all your tags and metrics into identifiers of a fixed, minimal size. This table does a two-way mapping: one direction from the encoded form to the verbose form, and the other from the verbose key to the encoded one. When OpenTSDB stores a record into the HBase fact table, it only stores the encoded UIDs, never the verbose strings. Hence it not only saves on wasted space, because the encodings are smaller, but also avoids repetition: you don't waste n bytes storing a key that occurs very frequently, you store it using a smaller encoding. That is the UID table.

The second table it creates is the actual fact table, where it stores all the records. The fact-table row key is in fact a concatenation of the UIDs from the previous table, and it stores one row per hour: for each combination of your tags it creates one row, and for each second in that hour it constructs as many qualifiers as needed. If you know a bit about HBase, it is a sparse, sorted map with multiple levels of lookup, where each row key maps to qualifiers, and a row can carry many qualifiers within a particular column family. A qualifier can be anything; you do not declare it at table-creation time in HBase, you only declare the column family, and the qualifier is simply whatever you generate at runtime. So the OpenTSDB schema optimizes the number of rows that you create.
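As a rough illustration of that layout, here is a sketch of how a fact-table coordinate could be composed; the 3-byte UID width matches OpenTSDB's default, but the helpers themselves are assumptions for illustration, and the real schema additionally packs value-format flags into the low bits of the qualifier.

    import java.nio.ByteBuffer;

    // Sketch of the OpenTSDB-style fact-table coordinates described above:
    // row key = [metric UID][hour-aligned base time][tagk UID][tagv UID]...,
    // with one column qualifier per second of offset within that hour.
    public class RowKeySketch {
        static byte[] rowKey(byte[] metricUid, long epochSeconds, byte[][] tagUids) {
            // tagUids alternates tagk UID, tagv UID, already resolved via the UID table
            int baseTime = (int) (epochSeconds - (epochSeconds % 3600)); // align to the hour
            int len = metricUid.length + 4;
            for (byte[] uid : tagUids) len += uid.length;
            ByteBuffer buf = ByteBuffer.allocate(len);
            buf.put(metricUid).putInt(baseTime);
            for (byte[] uid : tagUids) buf.put(uid);
            return buf.array();
        }

        static short qualifier(long epochSeconds) {
            // seconds offset inside the hour (0..3599); see the flag-bit caveat above
            return (short) (epochSeconds % 3600);
        }
    }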
It minimizes the rows, basically, and leverages the concept of qualifiers. The real learnings from here are that an HBase schema should really optimize for the repetition in your use case, because that is important, and the number of rows and the number of qualifiers you create should also be well defined. You really should not put a million qualifiers in the same column family: qualifiers are not indexed the way row keys are, so there is real effort involved in scanning for individual qualifiers. For each second there will be one cell stored in HBase in each of the rows.

The next really important thing is the client design, and knowing the strengths of HBase so you can really leverage them. The real meat of OpenTSDB, besides the schema, is asynchbase, the asynchronous HBase client. They wrote a new client to interact with HBase, taking the concept from Python's Twisted library, which is an abstraction where you hide away the synchronous nature of the underlying API and expose an asynchronous one; it does not block on the synchronous calls being made to HBase. Hence you can multiplex your requests, and your client response times come down. So asynchbase is based on that Twisted deferred concept, and the client interacts with the TSD: you start up a TSD server, you register callbacks for the different commands OpenTSDB supports, and each of those callbacks in turn hooks into asynchbase to talk to HBase. The real learning there is that there are multiple ways to interact with HBase, and some really bad ways have a very low barrier to entry, so a lot of naive users take them; but to really get power from the platform you have to experiment a bit more and figure out what the different clients are, and as you grow into that stack you really start to get the benefits.

The last thing, and not the least, that we got from this product was learning how to maintain a very stable HBase in production. If you look into the HBase user mailing list, you will figure out that a lot of people have difficulty crossing the first hurdle: getting HBase stable, up and running in production, without significant downtime. What we figured is that yes, there are a few things you have to know up front, it is not a black box, but those things are generic enough that you can use them across your different products.

The first thing is that HBase is memory-hungry, memory-intensive: it caches a lot of information and uses memory extensively, and it is a JVM-based process, so the problems of a huge heap and the corresponding GC collections really start nagging after you have run it for a significant amount of time. So understanding and controlling memory overheads is a very important task; you should figure out what tuning you want to apply.
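To show the non-blocking style this enables, here is a minimal sketch against asynchbase's Deferred-based API; the ZooKeeper quorum, row key, and value are placeholders (the 't' family is OpenTSDB's default), and a real client would keep multiplexing requests rather than join() on each put.

    import com.stumbleupon.async.Callback;
    import com.stumbleupon.async.Deferred;
    import org.hbase.async.HBaseClient;
    import org.hbase.async.PutRequest;

    // Issues one non-blocking put and attaches success/error callbacks
    // instead of blocking on the RPC, which is what lets a single client
    // multiplex many outstanding requests.
    public class AsyncPutSketch {
        public static void main(String[] args) throws Exception {
            HBaseClient client = new HBaseClient("zk1.example.com");  // hypothetical quorum
            PutRequest put = new PutRequest("tsdb".getBytes(), "row-key".getBytes(),
                                            "t".getBytes(), "qual".getBytes(), "val".getBytes());
            Deferred<Object> d = client.put(put);
            d.addCallbacks(
                new Callback<Object, Object>() {
                    public Object call(Object arg) { System.out.println("put acked"); return arg; }
                },
                new Callback<Object, Exception>() {
                    public Object call(Exception e) { e.printStackTrace(); return e; }
                });
            d.join();                  // demo only; real clients do not block per RPC
            client.shutdown().join();  // flush buffered RPCs and release resources
        }
    }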
On the GC side it is actually fairly trivial once you get the tuning right; after that it is up and running without any problem. The other part is enabling MSLAB. I will give you a very brief account of what MSLAB is and why it is important. MSLAB is simply an effort to divide your heap into fixed-size chunks so that you reclaim heap very efficiently and do not create variable-size holes in it, which would result in GC trouble. MSLAB is simply that, nothing more: chunkify your heap into fixed-size chunks, write variable-size records into those chunks, and reclaim them as a whole.

Those two are very important. Then, when you update or delete records, HBase leaves holes; it does not do active cleanup. So you can enable auto-cleaners, namely major compactions. A major compaction simply rewrites your store files, the files in which your data is stored over Hadoop; they will have holes, so it rewrites them, compacts them, merges them together, and prepares a very optimal store. But it carries a lot of overhead, so if your cluster is already bogged down by application load, a major compaction at the wrong time can have the wrong results on the cluster. Then there are online merges: an online merge is simply where you have empty regions and you want them removed, so that those empty regions are cleaned up. What we learned is that we can disable automatic major compactions and run them manually on our own, and online merges are really run via a Ruby script. There is a bit of a learning curve there, but once you figure out how to run online merges, it is really a piece of cake.

Then there are other problems with the number of connections and the number of clients you have. There can be too many clients connected to your HBase, and you can have significant overheads from the number of threads it spawns. So things like the xcievers configuration on the DataNode, where you limit how large a thread pool is maintained for serving requests, are also very important, to ensure that your cluster is really busy serving requests rather than being bogged down by the overhead of serving them.
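As a concrete starting point, here is a sketch of the knobs just mentioned; the property names are the 0.9x-era HBase/HDFS ones, and the values are illustrative assumptions you would tune for your own cluster.

    <!-- hbase-site.xml -->
    <property>
      <name>hbase.hregion.majorcompaction</name>
      <!-- 0 disables time-based automatic major compactions;
           trigger them manually during off-peak hours instead -->
      <value>0</value>
    </property>
    <property>
      <name>hbase.hregion.memstore.mslab.enabled</name>
      <!-- allocate memstore data in fixed-size chunks to limit heap fragmentation -->
      <value>true</value>
    </property>

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <!-- cap on concurrent DataNode threads serving block requests -->
      <value>4096</value>
    </property>

With automatic major compactions disabled this way, you trigger them yourself off-peak, for example with major_compact 'tsdb' from the HBase shell ('tsdb' being OpenTSDB's default fact table).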