Yeah, so we'll be talking about how to build a stack for a big data system, a generic stack. (Am I audible? Yeah.) Obviously the actual compute and logic will depend on your application, but I'll try to pull out the things which are common to most of these systems.

This is a bit about the company I run, and semi-true stuff about me. What we do is crawl and extract large-scale data and convert it into a structured format. So we can give you a data feed for all the mobile reviews on the web, or all the travel reviews, in real time; there's an appropriate feed for each, and clients connect to our API and get the data. This is a sample output; it doesn't really matter here.

This is the agenda of the talk. We'll set up the context for what we're going to cover and how deep we'll go. Each of these topics would take more than half an hour on its own, so taken all together we can't cover much; I'll just give a very high-level overview of what systems should be there, what choices you have to make, and what the key things to take care of are. Some of the things here would really need a longer lecture of a couple of hours, and I'll note that wherever they come up.

If you build a system for big data, you will eventually end up with all of this. Multiple nodes, or rather incoherent sets of coherent nodes. What that means is you will have 10 servers which deal with DB-related stuff, 14 servers which deal with job scheduling or one kind of job, and another 13 which deal with something else. And then, where the actual processing logic lies, you'll have multiple processes, quite often interdependent on each other: the output of one is consumed or processed by some other process, so they have to communicate, and more often than not they'll be on different servers or different nodes.

You'll have a data storage layer and multiple pieces of middleware. Data storage is a critical component. Traditionally people had been using SQL-based storage, but with the advent of big data that layer is not very easy to scale: requirements like joins can't easily be met. They can be met in other systems too, but you make multiple trade-offs: how important is consistency, how important is high availability. Those questions come up in the storage layer. Of the middleware, we'll talk only about queues; it's almost always the case in these systems that multiple processes across the nodes talk via some messaging layer.

Then tools for installation, monitoring, and scheduling; these are actually pretty important. Once you operate hundreds or thousands of servers, monitoring and managing them becomes a real thing, and unless you think well about these things, systems become pretty painful pretty soon. There are other things which could have been covered, like source control and version control, code review, and continuous integration, but they're not closely related, so we'll leave them out.

We'll start with what to take care of when you try to build or install such systems, then take one example from each layer, one from the messaging layer and one from the storage layer, and then a sketch of a sample demo. So, installation.
The simplest thing is to keep images: this is our DB-system image, this is our queue-node image, and whenever you have to install one you bring it up, or if it's part of some cluster you add it by the appropriate mechanism. Individually this works pretty well: there's no maintenance cost and it scales too. The main problem is how you save back changes. Let's say there's a system with some ten pieces of custom software and you want to modify two of them: you have to make a new image and save it back. Tomorrow you make a small modification to something else, and again you have to save a new image. That doesn't scale; soon you'll have same-server version 3 and same-server version 3.2.2, and you'll be confused about which has to be used where.

At the individual level we have pretty robust and well-tested software like apt and yum, which work, and work well, for installations on an individual system, that is, individual package installations. I don't know how many of you use etckeeper, but it saves your config data in git or some other version control system. These things work at the individual-machine or individual-package level. They work for one purpose, but if the purpose changes they don't, and there's no apparent or easy way to scale this to multiple servers. What people do is write scripts to manually install Apache on all 10 servers, and if you then have to modify 15 of them it becomes slightly painful. There's no silver bullet; if you really want to do it well it's a difficult problem, and there are solutions for it which have been attempted recently. Chef and Puppet are the popular choices. How many of you have used one or both of these? You will agree that they have a pretty steep learning curve and their own issues, but the point is that once you set them up, the kinds of things they promise, these tools do well. Others could have gone on this list too, but I'll pick just these two and talk about only one.

Before going further I want to introduce something else. These things are difficult to get started with, and pretty often you will make mistakes. So rather than making mistakes on a real system, we set up something virtual, try out everything, and if it works well, replicate it on the servers. Many of you will have used VirtualBox on your laptops. It's pretty easy to download and install, and inside Debian you can run Ubuntu, or inside Ubuntu you can run something else; for all practical purposes it gives you the illusion of a real system, something like what EC2 does, except on your desktop. Similar solutions exist for the cloud: the whole of EC2, if I'm not mistaken, is built using Xen. And then there is Vagrant, which is more of a wrapper around VirtualBox and allows you to configure it in a headless way; it's three commands to init a box, bring it up, and SSH inside it. For the initial setup, two of its features are pretty useful. You can set up port forwarding: let's say Apache runs on port 80 of the internal machine and you want to access it from outside; this is the easy way to forward from internal to external. And there's a provision for shared directories: you mark some directory as shared, and whatever you write in that directory appears inside the internal machine as well, so you don't have to copy data between your machine and the virtual one.
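To make that concrete, a minimal Vagrantfile for that kind of setup might look like this; the box name and paths here are just placeholders, and the syntax assumes a reasonably recent Vagrant:

    # Vagrantfile: a minimal sketch; box name and paths are placeholders
    Vagrant.configure("2") do |config|
      config.vm.box = "ubuntu/trusty64"

      # forward port 80 inside the VM to port 8080 on the host, so a
      # server running inside is reachable from outside
      config.vm.network "forwarded_port", guest: 80, host: 8080

      # share ./data on the host as /vagrant_data inside the VM, so
      # nothing has to be copied back and forth
      config.vm.synced_folder "./data", "/vagrant_data"
    end

With that file in place, the three commands are: vagrant init to generate the file skeleton, vagrant up to bring the instance up, and vagrant ssh to get inside it.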
So out of those two, Chef and Puppet, I'm giving the example of Chef. The basic idea behind these tools is to turn installation into code: whatever you do in the process of setting up a new server, they want you to put into code, and it goes into version control. An individual installation file looks pretty simple: for example, to install a package you just declare it with an install action, and that goes in some file. That's the lowest-level piece (a sketch follows below). Then there are things like which machine has which roles: the web-server machine will have three pieces of software and the DB machine will have four. You just name the roles each machine has, and when you run the chef-client it installs those packages. In the simplest form you can run Chef against your local machine alone, which is called Chef Solo; if you want a centralized setup, which a large installation actually needs, the tool talks to a Chef server and gets all the data and metadata from there. The individual installation units are called cookbooks, and they can be stored either locally or on the server.

So in this first part we tried to cover how to install new software and what the tools of choice are; which software to install depends on your application layer and logic. Next I'll take a few things from the other layers, storage and messaging, and note down some points about what to take care of.

The first thing is that when you run multiple processes, there should be somebody taking care of them. You want to know what's happening and how much memory each one is taking; you want to kill a process, restart it, kill it every three hours, or restart it whenever its memory usage goes beyond a point. Again I picked two tools, god and monit; there could be others too. This is a simple god snippet which I'll mostly skip over, but you can see that you specify what command to use for start and what command to use for stop, and, though it's not shown here, you can write things like "if memory usage goes beyond 300 MB, restart it" (a sketch follows below). Once god runs on a machine, you can remotely kill or restart its processes.

And once you have multiple processes running, it's no longer feasible to schedule them manually or hard-code their scheduling, because machine usage varies. If you have multiple servers, more often than not it will be a surprise when you plot their usage or look at the average: it's hardly more than 50%. That's where something like a job scheduler comes into the picture, and it allows you to utilize the machines better. Again there are many choices, Resque, beanstalkd, Celery, and in the simplest form people use cron or plain queues. Things to take care of when you pick a job scheduler: see if it has a notion of priorities, because some jobs you want to run with more priority, they're important and others can wait. Tags are something I really like: a process can have multiple tags, say host-specific or cache-specific, and only some workers will run on them. Also look for retry options and the ability to inspect the queues. A given system may tick only two or three of these criteria; none may satisfy all of them.
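Going back to Chef for a moment, here is roughly what that lowest-level piece and a role look like; a minimal sketch with made-up cookbook, package, and role names:

    # cookbooks/web/recipes/default.rb: the lowest-level piece is a
    # resource, a package name plus the action to take on it
    package "apache2" do
      action :install
    end

    # roles/web_server.rb: a role names the recipes a machine should
    # get; chef-client on a machine with this role installs them all
    name "web_server"
    run_list "recipe[web]", "recipe[php]"

In the simplest form, chef-solo applies this from a local directory; with a Chef server, the client pulls the cookbooks and role data from the server instead.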
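And since I skipped over the god snippet, this is a minimal sketch of the kind of thing it lets you write; the worker command here is made up:

    # worker.god: watch one process, with the "restart if memory goes
    # beyond 300 MB" kind of condition mentioned above
    God.watch do |w|
      w.name     = "worker"
      w.start    = "ruby /srv/app/worker.rb"   # command used to start it
      w.interval = 30.seconds                  # how often to check

      w.restart_if do |restart|
        restart.condition(:memory_usage) do |c|
          c.above = 300.megabytes
          c.times = 2   # above the limit on two consecutive checks
        end
      end
    end

Once the god daemon is running with this loaded, god stop worker and god restart worker control the process.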
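As a sketch of how tags and priorities play out in practice, here is what a job looks like in Resque, where the queue name effectively acts as the tag: workers are started against specific queues, so only some workers pick these jobs up. The class name, queue name, and URL are made up:

    # A Resque job: @queue decides which workers will pick it up, so
    # queue names can act like tags (host-specific, cache-specific, ...)
    class CrawlPage
      @queue = :crawl_high

      def self.perform(url)
        # fetch and process the page here
      end
    end

    # enqueue from anywhere:
    Resque.enqueue(CrawlPage, "http://example.com/reviews")

    # and start a worker that handles only that queue:
    #   QUEUE=crawl_high rake resque:work

Priority falls out of the same mechanism: a worker started with QUEUE=high,low drains the high queue before touching low.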
Now the data storage layer. You'll see a lot of buzzwords and a lot of articles about this layer, because a lot has been done here; the seminal paper by Amazon, I think it's called Dynamo, really laid down the rules, and many of these systems build on it. Again, the question is what you want to give priority to. Do you give priority to consistency, so that once a piece of data has been written it is written in all the replicas? Or do you give priority to availability, so that if data has to be on four machines, it's okay if it's written on two now and the other two copies land later? And what are the use cases like: is it a graph-based system or a document-based one? Do you want to handle your data at the level of documents, or at the key-value level? For key-value, I guess most of us misuse SQLite or MySQL as key-value stores; there are other systems which do it properly at a distributed level. There is software for full-text search, where I guess Lucene and Solr are the most popular ones, and I'll briefly mention one of the popular key-value stores. This is the thing I wrote first: in the long run, the maintenance cost of these systems outweighs everything else. The amount of time you have to spend on things like replicating the data, or on what happens when a node dies, which happens pretty often, all of that should be considered, and in that respect I really like Voldemort. It's a simple key-value store; it does only one thing and does it nicely. It's pretty simple to add or remove a node, and you can specify, when you write some data, how many nodes it should go to, and when you read, how many nodes it should be read from for the data to be considered valid (a sketch of a store definition follows below).

In the first part we saw that multiple processes talk to each other and are interdependent. The most common way to model this interdependency is to talk via some other layer, and more often than not that happens to be the messaging layer. If you look, compared to the other components I haven't written any choices here: RabbitMQ. I haven't yet understood a case where you wouldn't use it, except when you have people on the team who are pretty familiar with, or are gods of, some other system, in which case you should just use that. It implements AMQP, and I guess it implements some other protocols as well; it's actually pretty robust. You can have models like direct, fanout, and topic exchanges. Direct is simple: I send a message, you receive it, nobody else knows. Fanout is a broadcast to all the workers. Topic means a set of workers is interested in one kind of thing; for example, if you try to model something like the BSE or NSE, you'll pretty often use a topic exchange, because many consumers are interested in the stock prices of one company (a sketch follows below). It has good options for high availability: you can use DRBD, which has been well tested; many people have used it and it hardly ever fails. There is one catch, though: you should not try to use DRBD on Amazon EC2. DRBD is disk-level replication across machines, and EC2 is itself a virtual layer; you don't want to do disk replication on top of a virtual layer.
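Coming back to Voldemort for a second, those write and read counts sit right in the store definition. A minimal sketch of what one store looks like; the store name and serializer types are placeholders:

    <!-- stores.xml: one store definition (a sketch) -->
    <stores>
      <store>
        <name>sentences</name>
        <persistence>bdb</persistence>
        <routing>client</routing>
        <replication-factor>3</replication-factor>  <!-- copies kept (N) -->
        <required-writes>2</required-writes>        <!-- nodes a write must reach (W) -->
        <required-reads>2</required-reads>          <!-- nodes a read must come from (R) -->
        <key-serializer><type>string</type></key-serializer>
        <value-serializer><type>string</type></value-serializer>
      </store>
    </stores>

With a replication factor of 3 and required reads and writes of 2, a write is acknowledged once two of the three replicas have it, and since 2 + 2 > 3, any read overlaps the latest write.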
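For the topic model, here is a minimal sketch in Ruby using the Bunny client, with the stock-exchange example from above; the exchange name, routing keys, and price are made up, and it assumes a RabbitMQ broker on localhost:

    require "bunny"

    conn = Bunny.new   # assumes a broker on localhost, default credentials
    conn.start
    ch = conn.create_channel

    # a topic exchange: consumers subscribe to the routing-key
    # patterns they care about
    prices = ch.topic("stock.prices")

    # a consumer interested in every NSE ticker
    q = ch.queue("", exclusive: true)
    q.bind(prices, routing_key: "nse.#")
    q.subscribe do |_delivery, _props, body|
      puts "NSE tick: #{body}"
    end

    # a producer publishing one company's price
    prices.publish("1412.60", routing_key: "nse.INFY")

    sleep 1   # let the subscriber fire before tearing down
    conn.close

A direct exchange would bind on an exact key instead of a pattern, and a fanout exchange would drop the routing key entirely and broadcast to every bound queue.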
So I guess these were the components I wanted to talk about, and I had a sketch for a demo, which I can't run. The simple demo we could have done: individual workers generate Markov sentences randomly and store them in a Voldemort store, which is globally available; after storing each one, we enqueue a job in RabbitMQ or some other job scheduler, and another set of workers waiting for jobs picks them up and processes them. But this is a kind of cheating, and the basic cheat is here: there is no interdependence in the processing of those records, they are individually processed, and that makes the whole thing simpler. If there were interconnections between the records, modeling such a system would become difficult. In any case I can't show you the demo, because my laptop doesn't work.

There are other things which could have been part of this talk (I'm almost done) but which I didn't cover in detail for lack of time. The thing is, at this scale you need a monitoring system for alerts and dashboards, otherwise, once again, you'll have no idea what's going on. For monitoring I'll just talk about Sensu and Graphite. You can think of Sensu as a routing layer: each of your individual processes sends some data, some report, some log to Sensu, and it decides what to do with it. It can send it to Twitter, to your IM, to PagerDuty, or, if it's something like time-series data, to a graphing engine like Graphite, which will draw the graph so you can see how something is being utilized over time.

There was one more thing I think you should try to build, or take care of if you design such a system: distributed log collection. I think one of the speakers today in this series talked about Splunk, so I'll leave that out; the others are Flume and Scribe. These tools can be used for multiple things, but their common goal is to aggregate data from multiple processes: they can be used to route or aggregate logs, or simply to move data from one system to another. Some points about Flume: there's a source of data, which happens to be multiple nodes or multiple processes, and there's a sink, where you may want to store the data in HDFS or somewhere else. How the data goes from source to sink is described by the flow, which we have to design: which servers carry the load, or whether, for log data, you want to spread it randomly.
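To give an idea of how simple the Graphite side is: its plaintext protocol is just a metric path, a value, and a timestamp over a TCP socket (port 2003 is the default listener), so a process can report a metric like this; the host name and metric path are made up:

    require "socket"

    # Graphite's plaintext listener takes "metric.path value unix-time\n"
    sock = TCPSocket.new("graphite.example.com", 2003)
    sock.puts "workers.node1.memory_mb 212 #{Time.now.to_i}"
    sock.close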
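And a minimal sketch of that source-to-sink wiring as a Flume configuration; this is the newer, property-file style rather than what I showed, and the file paths are placeholders:

    # flume.conf: one agent tailing a log file into HDFS via a
    # memory channel (source -> channel -> sink)
    agent.sources  = applog
    agent.channels = mem
    agent.sinks    = tohdfs

    agent.sources.applog.type    = exec
    agent.sources.applog.command = tail -F /var/log/app.log

    agent.channels.mem.type = memory

    agent.sinks.tohdfs.type      = hdfs
    agent.sinks.tohdfs.hdfs.path = hdfs://namenode/logs/applog

    # wire them together
    agent.sources.applog.channels = mem
    agent.sinks.tohdfs.channel    = mem

I guess that's all. It was pretty fast, and I didn't go deep into anything, because we couldn't have; each of these topics would take more than half an hour on its own. I'll be available here today as well as tomorrow, so we can talk more about any of this. Thank you.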