Hi everybody. I'm Jabir, from InMobi. I work on the big data team, as part of the operations team, and I've been with InMobi for three years now. I'll be talking about some techniques I've found to be a must-have in a production environment, and how I intend to address the problems we've seen. The product I'm going to talk about, which we intend to open source, is not yet in production. It's in a very early beta phase: we're testing it in a sandbox and we're still in the coding phase. So it is not ready, but I can walk you through what we are looking at doing going forward. I initially worked mostly in Perl and am now moving to Python, and before this I was with Yahoo's infrastructure team, on the corporate infrastructure side. Why do systems need to self-heal, why do we have to automate things in production? That's a pretty critical question. When we started, we had 10 or 15 servers; now we've grown, and it's very difficult, really impossible, to do things manually. I'll talk about the scale we're at in a moment. Generally, what solutions do people reach for? Either hire more people, or use technology to fix the problem. Automation matters because we don't want our engineers doing the same thing day in and day out. It becomes monotonous; people start leaving the organization, they complain that they need to do better work, that they don't get time to focus on the things they actually want to do. That's not good, right? If a production engineer is awake at 12 at night fixing an on-call issue, he's going to give up at some point. Automation also helps improve the time to detect and the time to resolve.
Time to detect is very critical because you don't want to discover issues in production after two hours, especially when you have something critical like reports going out every hour; you don't want your system down for those two hours. So this is the general scale I'm talking about, how we've grown over the last three years. I generally work on the Hadoop platform, so HDFS and MapReduce, and I'll mostly relate things to Hadoop MapReduce; I won't focus much on the web-serving front. In 2010, just when I joined, we were at a scale of 15 servers in one colocation. It was pretty manageable: with a two-person team, we could stay up a little longer at night and look for issues, or have the operations center ping us or call us when there was an issue. But over time, in 2011 we added another 40 or so servers, then we doubled, and we kept adding regions too: from U.S. East to U.S. West, then Hong Kong, China, London. Now we're at around 450-plus servers, and that's likely to grow with what's planned this year. With this kind of growth, if I have an outage in one colocation, it's impossible for me as an engineer to log in to, say, 100 machines and fix issues by hand. Say a network issue caused all the TaskTrackers to go down. I could use cssh or pdsh today; with 100 servers I can batch them and manage. But what if the number goes up to 1,000, 1,500, or 3,000, like at Yahoo? So we tried to identify the best way to fix the problem. We tried a lot of monitoring tools, Nagios, Zabbix, and we had some cron jobs to run checks and send emails regularly.
The thing about these tools is that they're proven and well tested, but they can only send you an alert; they cannot fix the issue or trigger any follow-up events themselves. Obviously you can hack around Nagios to trigger events, and that is one thing we intend to do. The configuration is also not easy to maintain: when a new batch of hardware lands, you have, say, 100 machines to add, so the Nagios configuration has to be generated automatically. There's also no API in Nagios that I can query; if I want to build a solution outside the Nagios server, I need a way to run queries against it. And it's quite hard to integrate with Chef, Puppet, and similar tools. So here's the general process I think most companies follow in case of outages. You have Nagios as your primary monitoring server. An alert goes out by mail, SMS, or a call to the operations center. The operations center looks up the issue — the alert name — and follows the runbook. They try to see what the problem is, correlate, identify it, and fix it. Otherwise it gets escalated to a production engineer, someone more familiar with what the problem could be. He comes online, fixes the issue, and reconfirms with Nagios. The problem with this model is that there's a lot of manual intervention. What if, point one, Postfix was down on the Nagios server and the mail never went out? What if it's a new alert that isn't in your runbook? What if you have a new NOC engineer on shift, or your L2 operations engineer isn't reachable? There are plenty of steps, and the whole process can take 30 to 40 minutes to close out an issue. So what we're trying to do instead is have an event sent from Nagios to an application server, and the application server send a message to a client.
The client executes the custom script and returns the status to the application server. The application server records the stdout and stderr, so that if somebody comes back later they can see what was actually done, and it raises another check if required. So say I start getting an alert: check TaskTracker. What could the problem be? There are multiple possibilities: it could be a TaskTracker issue, the networking could be down on that box, the disk could be full, or the switch or the whole rack could be down. So I start with one check: check the TaskTracker process. OK, the TaskTracker process is not coming up. Why? Check the disks. The disks are fine. OK, then trigger another check: check the JobTracker. The JobTracker is up. I can keep triggering checks in a loop, and each one can raise another check if required, as a chain. Another thing we could do with this: say you get a CPU-load-at-100% alert every day at 12 o'clock, but it's a small glitch; by the time you log in, the spike is over and there's no load on the machine. You could instead create an event that is sent to the client, and at that very moment it records the state of the machine. So here's the general use case. Say there's a report that needs to be sent out, and it's built on Hadoop, where there are multiple dependencies. Data does not come from one place; Hadoop is the big data platform and you have data coming in from everywhere: from MySQL, from Postgres, from clusters in other regions, from the front-end servers. Say that report is delayed and has to be shipped to the customer. What could the potential problems be?
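The chain of checks I just described can be sketched as a small dispatcher. The check names and the follow-up map below are hypothetical, just to show the shape of the idea:

```python
# Hypothetical follow-up map: when a check fails, these checks fire next.
# The names here are illustrative, not the real check scripts.
FOLLOW_UPS = {
    "check_tasktracker": ["check_disks", "check_network"],
    "check_disks": ["check_jobtracker"],
}

def run_check(name, runner):
    """Run one check via `runner(check_name) -> bool`; on failure,
    walk its follow-up chain. Returns a {check_name: passed} map."""
    results = {}
    queue = [name]
    while queue:
        check = queue.pop(0)
        if check in results:
            continue  # don't re-run a check already seen in this chain
        passed = runner(check)
        results[check] = passed
        if not passed:
            queue.extend(FOLLOW_UPS.get(check, []))
    return results
```

So a failed TaskTracker check automatically fans out into disk and network checks, and each result is recorded for the application server to store.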
The script could still be running, or running slower than usual — and by running, I mean running forever. Or it could have started late. The CPU load on the machine could be high, or the machine could be running low on memory. MySQL issues, network issues; maybe the job failed but the operations-center guy was out for dinner and missed the alert. These are the kinds of things that happen — not every day, once in a while — but it's still bad for the business if you miss a report. It makes a bad impression. Or it could be something totally unknown. So let's go back. This is the same flow that happens right now, which I described earlier. Why don't we just put everything into a script and have the script execute and fix things by itself? If it's not able to fix it, then fine, wake people up at 12 at night and have them look at the issue. So this is what we're looking at. We would still continue to use Nagios as the monitoring server, but Nagios can be replaced with anything that can make a REST call. We have the application server, which runs on Flask and SQLAlchemy. Then there are the nodes: the client package runs with ZeroMQ as a dependency. Alongside are your other applications, MySQL, PostgreSQL, your servers. The application server uses a database just to store the events and record the transactions. The servers could be anything: Apache or Tomcat, your Hadoop clusters, or your databases. And then we have an admin UI written with Bootstrap and jQuery. The advantage here is for trivial tasks: you don't want somebody to get up because of a false alarm. If you got a check-load alert, it would go check if everything is fine, wait 60 seconds, rerun the check, and just close the alert if it's okay. So it reduces the time taken to respond.
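That wait-and-rerun behaviour for flapping alerts might look something like this; the function names are illustrative, and in the real system the re-check runs on the client, not in the server process:

```python
import time

def handle_transient_alert(check, retries=1, wait=60):
    """Re-run a possibly-flapping check after a pause; only escalate
    if it still fails. `check` returns True when the metric is normal.
    The 60-second default mirrors the wait-and-recheck idea in the talk."""
    if check():
        return "closed"          # false alarm, nothing was wrong
    for _ in range(retries):
        time.sleep(wait)
        if check():
            return "closed"      # the glitch cleared itself, close the alert
    return "escalate"            # still failing, wake a human
```

A load spike that disappears within the wait window closes itself; only a persistent failure pages anyone.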
Some tasks can be done across many servers at once. This is the example I was talking about, where all the TaskTrackers go down at once. A typical case is the NameNode going down. What happens if the NameNode goes down? All your services are basically down: anything that writes to HDFS, anything that reads from HDFS, anything that processes data from HDFS. Your entire system is down, and at that point I don't think we could manage even with 10 to 15 engineers with expertise. But with something like this, you can spawn checks in parallel across 15, 20, or 100 machines and get them fixed immediately. So what we tried to do is build a framework instead of a single solution: a check framework where I can just plug in checks. Today I'm using HDFS and MRv1; what if I move to YARN? I don't want to redo the whole thing; I just build a framework around it that can trigger a chain of events for me. Let me show you here. This is a sample check: I trigger a check on localhost, saying check DN. What eventually happens — if you can see the console at the back — is that there's a listener I run on every node, and it receives that event. It has a script associated with that check, runs the script, records that it ran it along with the status code, and sends the data back: the PID, the stdout, and the status code. So let me take you through a small UI. You just have your server running here; this would be your client, and that's your server. It's pretty naive right now.
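The run-and-report step the listener performs could be sketched like this; the real client ships the result dict back over ZeroMQ, which is omitted here:

```python
import subprocess

def execute_check(script_path, args=()):
    """Run a check script and capture what the node reports back to the
    application server: the PID, stdout, stderr, and the exit status.
    (In the real client this dict goes back over ZeroMQ; here we return it.)"""
    proc = subprocess.Popen(
        [script_path, *args],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    out, err = proc.communicate()
    return {
        "pid": proc.pid,
        "stdout": out,
        "stderr": err,
        "status": proc.returncode,
    }
```

Whatever the script prints and however it exits, the application server gets the full record, so someone can come back later and see exactly what was done.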
We're still working on the interface; the interface is the last thing I want to focus on right now. If you look at the successful checks, for example, this is a check_ssh. It ran the SSH check, and it shows the event ID, the time the check was actually triggered (18:25), the PID it ran with, and the stdout, which just says it restarted sshd. So let me show you how it works. Every check is basically a curl URL: you could put this curl URL into your Nagios, so that where Nagios sends a notification by email, you could add another script that triggers this check. And, if you see, the server should have received this check: it says it's sending a message to this IP over ZeroMQ, and it has actually done an sshd restart on your client just now. So this is what we're trying to do, in short. I'll take some questions now, because I think I've finished a little fast. So, yeah — I think somebody's there. Hello. Yeah. Question from the audience: this thing looks similar to Monit; Monit, if there's an event, runs a set of commands. Is it something like Monit, or am I wrong? Okay. So we looked at Monit. Monit basically just watches things and takes a fixed action, so it might not be usable for everything. So far I've been speaking only from the infrastructure perspective, but what if you have some application logic built in? Say your reports are generating and, after an RFC, things are different and you want to do some kind of validation. You could put that validation script into your pipeline and possibly even do a rollback of your deployment.
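The curl call wired into Nagios is just a REST request. A minimal standard-library sketch of the same call follows; the URL layout, host names, and function name are assumptions for illustration, and the real server side runs Flask:

```python
import json
import urllib.request

def trigger_check(server_url, host, check_name):
    """Fire a check event at the application server, the same call the
    talk wires into a Nagios notification command via curl.
    The /check/<host>/<name> URL layout is a hypothetical example."""
    req = urllib.request.Request(
        f"{server_url}/check/{host}/{check_name}",
        data=json.dumps({"host": host, "check": check_name}).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status, resp.read().decode()
```

Because it's only an HTTP call, anything that can reach the server — Nagios, a cron job, or a shell one-liner with curl — can kick off a check.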
So maybe you want to run a query and a custom script on it and ask: does the report match the previous release, or does this release actually have the kind of data you were expecting to have after the release? You could do things like that with this. So this is just going to be a framework where you can plug in anything you want to do and execute it remotely. We've looked at Monit, but I don't think it matched our requirements as of today. Next question: I was just wondering, where in your application is the business logic — if you receive an event of something going down, what checks are supposed to be run and what services are to be restarted, and so on? Are you developing a framework around that? So: your business logic, whatever you want to check, can just be another script that you want to execute. It goes into your scripts and you trigger it. It doesn't even have to be driven by an alert; you could have a cron that kicks off the event at, say, 4 p.m., when you're expecting that report to complete, and it just checks. Basically, we don't want too many cron jobs all over the place; we want one centralized place from which you can kick off anything. Salt is also out there, and we evaluated Salt, but it really didn't match what we wanted to do. We wanted something very flexible; we know what we're writing, and we wanted total control over things. Next question: for example, if a networking issue takes down a certain number of servers, you would get events saying 50 servers are down, but you don't know whether they're down because of a SAN frame failure taking down all the hosts, or because a network switch really went down.
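A report-validation script of the kind mentioned above could be as simple as a per-metric comparison against the previous release's report; the report format (metric-to-value dicts) and the tolerance are assumptions for illustration:

```python
def validate_report(current, previous, tolerance=0.1):
    """Compare a freshly generated report against the previous release's,
    metric by metric. Any metric missing or deviating beyond `tolerance`
    (relative) is flagged, which could then trigger a rollback event."""
    problems = []
    for metric, old in previous.items():
        new = current.get(metric)
        if new is None:
            problems.append(f"{metric}: missing from new report")
        elif old and abs(new - old) / abs(old) > tolerance:
            problems.append(f"{metric}: {old} -> {new} deviates beyond tolerance")
    return problems
```

Dropped into the pipeline as just another pluggable script, an empty result means the release looks sane; a non-empty one can raise the next event in the chain.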
So you would then have to write the logic yourself, for example checking whether the hosts respond to ping. Yeah, that's exactly it: it's a chain of events that you build. As an on-call engineer, I have some knowledge of the system. Network issues don't happen every day, but something could potentially be a network issue, so I can encode that knowledge: if this happens, this could be the cause. If I'm not seeing this event in my report, it's possibly because this server is dropping events while writing to HDFS. For those kinds of checks — let me show you. These are just the scripts: you can put in any kind of logic you feel is required to keep your system up. You just plug in any script you want and run it. Thank you, guys.