So, I am here to talk about what to do when stuff doesn't work. I mean, there are enough people here talking about what to do when a system works. But from what I have seen, most of the real work I have done happened when something was on fire, and the rest of my time has been spent preparing for those moments. So, what admins mostly deal with are escalations, and escalations are probably a bigger pain point than architecture setup. There are various solutions that people have created which work great for large scale deployments. But when you set up a cloud-based system and you start growing, what ends up happening is that no matter how well you set up the system, as your application goes from, say, 100k users to a million or two million users, the week when that happens is probably when the cloud actually pays off to use. Because even though you probably haven't made a capacity plan for a few million users, the day it hits the top of some big news site, Slashdot, Reddit or whatever, is the day that you have to actually get the system to work, and the day when you will get the maximum amount of work. And on that day, it is very unlikely that you will have a plan for the load, or the number of servers, or the kind of errors you would see. So, the spare time that I get is spent building stuff so that I can monitor systems without using any of the standard mechanisms available. There are solutions for some of these problems. People do use Puppet, people do use built-in crons to do restarts, people have agents pushed to systems, and there are companies like Splunk which sell you a lot of monitoring systems.
But those are all solutions which either require a project or time to set up, or at least a couple of days to open the ports if you have a secure system. So, everything that I am going to demonstrate here pretty much started off as an attempt to mine access logs. At a particular point, when a particular deployment went from about 100 servers to 200 servers, most of what we had done to analyze and understand the system completely went for a toss, because suddenly we couldn't really look at 200 graphs on one screen. It was not something that we expected would happen. The big screen that we had could show about 50 graphs; beyond that point, trying to look at 100 graphs by scrolling up and down every 15 minutes just seemed like a completely pointless exercise. And the spike lasted for about three days. For those three days, I needed 100 extra boxes. And for those three days, I didn't have any real monitoring other than people looking at graphs manually. This is not really a good problem to have when the load disappears in two days. But the fact that we needed 200 servers at that point was because nobody had planned for that kind of load. Basically, the solution was: throw more servers at it, wait till Monday. If by Monday we can't fix stuff, we will worry about it then. So, what do you do from 5 p.m. Friday, when something goes viral, till Monday morning? That's this whole presentation. And it is not really a presentation; it's mostly code. So, how many of you here have a basic understanding of Python? Okay. And you have used ssh and grep, and have a root account, I'm guessing. So, what I'm planning to demonstrate here is how to go beyond something like pdsh and use something that runs across a cloud of servers to aggregate your data, and how to build, let's say, a fake MapReduce system on short notice, without a project. So, the challenge at that point was to figure out how many users are getting errors.
How many of those errors are because of something bad on our side? And this data was thankfully being logged in places where I could grep, and in patterns that I could grep. I don't know if anybody can read the code. Can everybody see it correctly? So, basically, there is a tool set called Fabric which is very popular, and Fabric comes with the library it uses to run jobs across machines. It is called Paramiko. What Paramiko does is let you start an SSH connection, run a command on the other side using your SSH agent running locally, so that you don't have to type a password, and collect the output back as a stream. So, what this code does (the part at the bottom collects Tomcat stats, which is probably better done with other tools) is go through the entire Tomcat log, figure out which URL is taking what time in this particular system, run that through all of these commands across all the boxes, and collect the data onto one machine, into one Python program. Now, up to this point, this is no different from something like pdsh. How many of you have used pdsh? Okay. How many of you have tried it? So, when we have four machines, I could go ahead and do this by hand on those four machines, right? The real problem, I hate to say this, is that I have no control over how many machines are required to handle the load right now. I can make a plan for it, but those plans are pure guesswork. The moment the site hits number one on Reddit, whatever plans I have are pretty much dead. So, the point is to take a single command that I would run on one machine, run it across, say, 600 machines, and pull all the data back in a format that a program can understand. pdsh is great if you just need to run a command and forget about the output itself. You only care about success or failure.
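A rough sketch of that Paramiko fan-out, under stated assumptions: the host names, log path and awk command here are made-up placeholders, not the original code, and the aggregation is a simple group-by on URL keeping the worst time seen.

```python
def slowest(lines, n=5):
    """Group by URL and keep the worst time seen for each; return top n."""
    worst = {}
    for line in lines:
        url, ms = line.split()
        worst[url] = max(worst.get(url, 0.0), float(ms))
    return sorted(worst.items(), key=lambda kv: kv[1], reverse=True)[:n]

def run_everywhere(hosts, cmd):
    """Run one command on every host over SSH, authenticating via the
    local ssh-agent, and yield output lines as they stream back."""
    import paramiko  # third-party: pip install paramiko
    for host in hosts:
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host)  # keys come from the running ssh-agent
        _, stdout, _ = client.exec_command(cmd)
        for line in stdout:
            yield line.rstrip("\n")
        client.close()

# Usage (hypothetical hosts and log layout):
#   cmd = "awk '{print $7, $NF}' /var/log/tomcat/access.log"
#   print(slowest(run_everywhere(["app-01", "app-02"], cmd)))
```

The point is that the output comes back into one Python process as data, not as text on a terminal, so it can be grouped and sorted instead of eyeballed.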
So, if you want to run a command and actually process the output, like I'm doing here, basically a group-by on the results that I get back, something like this needs to be written and kept aside. I'll probably be sharing this code at the end of this presentation and put it up. But, as a responsible admin, it's everybody's duty to write basic framework tools that let you run commands across your systems, and not just that: collect results across your systems, collect data across your systems. At some particular point, you're going to hit the limit of how much you can process on one box. The real challenge of an 800 or 900 server deployment is that even if I run this across 900 servers, at some point or the other, I'm going to have too much data to process it centrally. We hit that. We ended up writing two data files per machine, shipping one half to one machine and the other half to another machine, having both of them aggregate, and then having the right machine aggregate onto the left machine, to finally get an output within a period of five minutes for every 500 servers or so. So, being able to monitor the system in real time for a custom query was what this particular bit of work helped me do. And the fact that you can do this with a handful of code written during an actual crisis is the point: the cloud infrastructure bites you if you can't do that. If you can't write a script to quickly monitor your entire system for a parameter you don't know about ahead of time, you cannot really use the cloud and sleep comfortably. I've spoken before about some of these issues. So, the fact that these kinds of tools can be built and kept aside for cloud systems is one part of the equation. The other part of the equation is that the cloud keeps changing. There is no real fixed membership in the cloud beyond a given point. For example, today I can say that my system has 800 boxes and these are the IDs, right? Which gives me no guarantee that after two hours those are the machines that are there.
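That two-step aggregation, partial counts on each box, merged on a collector, is essentially a hand-rolled map and reduce. A minimal sketch, with made-up log lines, counting HTTP status codes:

```python
from collections import Counter

def map_count(lines):
    """'Map' step, run on each box: count HTTP status codes in its log."""
    return Counter(line.split()[-1] for line in lines)

def reduce_counts(partials):
    """'Reduce' step, run on the collecting box: merge per-box counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Two boxes' worth of pretend access-log tails:
box_a = map_count(["GET /x 200", "GET /y 500"])
box_b = map_count(["GET /z 500"])
print(reduce_counts([box_a, box_b]))  # Counter({'500': 2, '200': 1})
```

Because merging Counters is associative, you can stack as many intermediate aggregation layers as the fleet size demands, which is exactly the left-machine/right-machine trick above.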
Not only do you not get to see what machines are there, most likely you do not get to see where all these machines are, or how they are connected, either. The topology of the network is out of your control in the cloud. If you happen to use EC2, whenever you spin up a node, you have really no idea which router it is connected to, or how far it is from any of the other nodes. So, we had a system which periodically used to take the data nodes, check the link to each of them, figure out the bottom 10% of machines which fall below the latency requirements, and just terminate them. This was sort of a continuous process which created an optimized system over time, but which also meant that every 20 or 30 minutes, two or three of my machines would die off and two or three new machines would spin up; it would basically toss the dice again. So, what happens when you have machines coming up and down every 20 minutes? How do you keep track of which machines have data, or which machines exist and how old they are? One of the solutions is to hit the EC2 API, get the data out, and try to figure out whether to use each machine, and how much metadata I need for each of them. Because for any instance ID that I have in the cloud, I'm not happy to just get an IP, right? I probably want to know which rack it sits on, whether it's top of rack or bottom of rack, which availability zone it sits in. And trying to keep track of all of this by hitting the API whenever something happens created a different issue: we kept getting throttled on those APIs. If you keep hitting the API every half an hour, or every 10 minutes, just to check whether all the machines you have are still all the machines you have, the guys who are hosting the APIs are probably not going to be happy with that.
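The pruning loop just described can be reduced to one small decision function. A sketch, assuming latencies have already been measured somehow; the instance IDs are invented, and the actual terminate call is left out:

```python
def worst_decile(latencies):
    """latencies: {instance_id: measured_latency_ms}. Return the slowest
    10% of the fleet (at least one machine) as termination candidates."""
    ranked = sorted(latencies, key=latencies.get, reverse=True)
    n = max(1, len(ranked) // 10)
    return ranked[:n]

fleet = {"i-01": 12.0, "i-02": 230.0, "i-03": 15.0, "i-04": 11.0}
print(worst_decile(fleet))  # ['i-02']
```

Run every 20 or 30 minutes, this is what keeps killing and replacing the laggards, and is also what guarantees the membership churn the next section deals with.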
They expect you to hit the API when a machine goes down, or when you want to spin up a new machine or do some activity, not as a fixed polling load in the background. So, what we ended up using was what Hadoop uses to coordinate large scale systems, which is something called ZooKeeper. How many of you have looked at ZooKeeper? ZooKeeper is not ultra complicated, and the way we use it is not ultra complicated. It provides the equivalent of a Windows registry for a large cluster, which is a horrible thing to say about it, but that is what it does, because it gives you a slash-path based naming system. For example, I have a /nodes path which contains all the instance IDs. Each of those nodes contains a JSON dump of what I need to know about that machine. Whenever a machine boots up, it goes and creates its entry, and every five minutes it goes and updates that entry inside my system, without actually going to an external vendor like Amazon's API. (Are there any synchronization problems there? Because it's a separate database from, let's say, Amazon's, and Amazon's database tells you the real state of your systems.) Yes, I do end up keeping the data elsewhere, but since that data is updated by the machines themselves, and because it is initialized at boot time, I have a better idea of what machines I have than I would from the Amazon system, which I would have to query, and then actually wait for each request to come back. So, there is a little twist to this part of the story. What we ended up needing was for each machine to know which other machines existed at any given moment. If I have 800 nodes, I would have to fire 800 requests to get that information. Whereas in this particular system, I do not need any of that: ZooKeeper has a very nice feature called a watch. I can set a watch on /nodes for any change.
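A sketch of that registry pattern using the kazoo client library, an assumption on my part, the talk does not name a client. The payload fields and instance IDs are illustrative, not the original schema. An ephemeral znode gives the "entry disappears when the box dies" behavior for free:

```python
import json
import socket
import time

def node_payload(instance_id, zone, rack):
    """The JSON blob each machine publishes about itself under /nodes."""
    return json.dumps({
        "instance_id": instance_id,
        "host": socket.gethostname(),
        "zone": zone,
        "rack": rack,
        "updated": int(time.time()),
    })

def register(zk_hosts, instance_id, zone, rack):
    """Create an ephemeral /nodes entry; it vanishes if the box dies."""
    from kazoo.client import KazooClient  # third-party: pip install kazoo
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    zk.ensure_path("/nodes")
    zk.create("/nodes/" + instance_id,
              node_payload(instance_id, zone, rack).encode(),
              ephemeral=True)
    # Watching membership instead of polling it:
    #   zk.ChildrenWatch("/nodes", lambda ids: print("fleet is now", ids))
    return zk
```

Each box calls `register()` at boot and rewrites its payload every few minutes; everyone else just watches /nodes instead of hammering the cloud vendor's API.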
At any point when that changes, and only then, the rest of the scripts trigger. An external vendor's API probably doesn't give me the ability to watch for things like that. And the fact of the matter is that this got reused multiple times. For instance, once this was set up, the Nagios alerts suddenly got better, because they could also figure out whenever a machine got added. We had a memcache cluster where, whenever an IP is added, you need to not only change the IP in the list that you have, you have to actually go into each of the web servers and update the IP list there. These workflows basically started becoming simpler to run the way they are supposed to. But how does each web node check whether the IPs it has for the data nodes are relevant and current? Obviously, they are refreshed every few minutes. I could write a cron script like the one you saw to do that, but cron is probably not the best way to run complicated scripts. Some people prefer Perl; I personally prefer Python. And more than just using Python, I have a ridiculous hack which I've been using for a few years. This bit of code loads up a Python module, sends it over the wire, over standard input in SSH, and runs my bit of Python code on that server. It's basically remote code execution directly over SSH. Now, what this lets me do is that I no longer need any agent running on each host. Another problem with large cloud deployments is that if you need anything done, you probably need something running on that box. If you have 800 boxes, you need something running on each of those boxes. Most of the time, that basically means that you have to push an RPM, push some code, at least SCP some code over there to execute.
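The SSH-stdin hack can be sketched like this. The payload and function names are my own illustration; locally, the current interpreter stands in for `ssh host python`, and the remote case only changes the command list:

```python
import json
import subprocess
import sys

# The code we want executed remotely; in real use this would be read
# from a local module file, e.g. open("probe.py").read().
PAYLOAD = """
import json, platform
print(json.dumps({"host": platform.node(), "answer": 6 * 7}))
"""

def run_via_stdin(python_cmd, code):
    """Feed Python source to an interpreter over stdin, capture stdout.
    For a remote box, python_cmd would be ["ssh", host, "python"]."""
    result = subprocess.run(python_cmd + ["-"], input=code,
                            capture_output=True, text=True, check=True)
    return result.stdout

# Local demonstration of the mechanism:
data = json.loads(run_via_stdin([sys.executable], PAYLOAD))
print(data["answer"])  # 42
```

Nothing is written to disk on the far side: the code travels down the SSH pipe, runs in memory, and only the result comes back.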
But when you are debugging stuff and you don't know what you are looking for, you probably can't rely on something that somebody has already deployed. So, this bit of code opens up a module, turns it into a chunk of source, figures out the length of it, builds a Python one-liner which loads it as a temporary module on the far side, and runs it there. Normally, in a production scenario, I would need an SCP step: SCP the file to each of the boxes, wait, then run the command across each of the boxes. That is how it is usually done. But at this point, I basically say: here's my code, here, in memory, run it, give me the result. And I don't really care about writing anything on that box. This basically meant that I no longer needed to stick to one-liners. I can write more complicated scripts. I can write stuff that, for example, takes an IP, figures out what subnet it comes from, does a GeoIP lookup. I can figure out, for a particular geo, how many hits I got in the last 10 minutes, which is ridiculously complicated information to get out of any kind of pre-built system that we had at that point. We had something that would mine the logs within 20 or 30 minutes, put the results in a DB, and let somebody query them. But by that time, my ability to block a particular subnet at the edge was completely gone. There are scenarios where a particular IP subnet will flood you with requests. Being able to figure out which IP subnet is flooding you with requests, and ban it, is very critical to the stability of the platform. A DDoS will happen, and a DDoS did happen while I was on watch. So we had a list of commands running perpetually, just mining this data, looking at it and writing ban rules.
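The "which subnet is flooding us" question is, at its core, another group-by. A minimal sketch, grouping client IPs by /24 with naive string slicing (fine for dotted-quad IPv4; the sample addresses are invented):

```python
from collections import Counter

def top_subnets(client_ips, top=3):
    """Group client IPs by /24 subnet and return the busiest ones."""
    hits = Counter(ip.rsplit(".", 1)[0] + ".0/24" for ip in client_ips)
    return hits.most_common(top)

ips = ["10.1.2.3", "10.1.2.4", "10.1.2.9", "192.168.7.1"]
print(top_subnets(ips, top=1))  # [('10.1.2.0/24', 3)]
```

Feed it the client-IP column grepped out of the last ten minutes of access logs across the fleet, and the ban candidate falls out of the first entry.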
And that kind of agility only really matters if you have a system that you don't control as it grows. A few years back, I used to work at Yahoo, and things were very planned: two weeks ahead of time you would know which IPs you were getting. Compared to a system like that, the cloud ecosystem that I am in right now is completely scary, because the moment you put your foot down in one place, that server is gone. The moment you put critical data on one server, it so happens that you get a notice saying they are terminating it: can you please move it out and take backups? Things will keep changing all the time. So the agility of doing random things at a random time, when you have just woken up, is the reason why stuff like this gets written, and hopefully gets thrown away one day. Because most of what I have done here has probably survived a week at best. After that point, other people have come along and written better things: proper loggers, indexers, parallel runners. That is pretty much all I really have to talk about, but I will take all kinds of questions. (Someone could change their user name and push that change out to everybody else; would it be possible to catch that kind of thing?) Yeah, well, that is not quite what we saw. The case was: what happens when you see a user that is spamming for a particular user name and you are not expecting it in the system, and you have to figure out that it is happening at all. If I knew the variables ahead of time, there are all these metrics that I could prepare for. Here, both the fact that it was happening and why the actor was doing it had to be figured out on the fly. Usually the query itself would then run and not even take a minute. But no, without this I wouldn't be able to look through the entire system across 800 boxes just by doing something like this.
The whole point is to run stuff ad hoc. You don't know what is coming up. The next thing somebody comes up with, there might be something else to identify it by. Like maybe it's not sending a user agent. Maybe it's not sending the right user agent headers. That is why all this exists: to figure things out. So, how many of you happen to use the Facebook API? Do you know what happened yesterday? Yesterday, Facebook accidentally turned on IPv6 for about two hours. Actually, I'm pretty sure it's still turned on. Yeah, still turned on. As you can see, this is Facebook; there are AAAA records and whatnot. They turned it on in preparation for IPv6 day. But that basically meant it was suddenly like, oh, we don't have IPv6 routes set up already. And because they gave out three IPs, and the PHP extension as such doesn't figure out which one is reachable, it just connects to whichever one comes up, I pretty much had to go disable IPv6 resolution on every box that I had access to, within a matter of 20 minutes of the DNS resolvers starting to return those records. Right? And what matters is the fact that I can run jobs across hundreds and thousands of machines whenever I want to, and the fact that I can do it not in shell. Shell is a horrible language for anybody who needs to operate on data. You put a space in the middle of something and everything breaks. A filename has a space inside it and everything breaks. Which is the reason for all this effort of sending Python code across: so that I don't have to worry about whether or not to put a backslash here. It was a simple change, but the fact that you have to do it at that scale within a few minutes is a challenge that only the cloud brings.
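The "maybe it's not sending a user agent" check above is the kind of one-off that gets written in this style. A sketch with a hypothetical parsed-log shape, (client IP, user agent or None) pairs:

```python
def agentless_clients(requests):
    """requests: iterable of (client_ip, user_agent_or_None) pairs.
    Return IPs that never sent a User-Agent, a cheap bot signal."""
    seen, sent = set(), set()
    for ip, ua in requests:
        seen.add(ip)
        if ua:
            sent.add(ip)
    return seen - sent

reqs = [("10.0.0.1", "Mozilla/5.0"), ("10.0.0.2", None), ("10.0.0.2", None)]
print(agentless_clients(reqs))  # {'10.0.0.2'}
```

Tomorrow the tell will be something else, a wrong header, a weird geo, a bad referrer, so the value is in the fan-out machinery, not in any one check.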
People might disagree with me, saying that there were large scale systems deployed by enough people before the cloud as well. But the only thing I have to say is that the cloud tells you that you don't have to plan for scale. It says that you can start on 10 servers and then just grow organically, right? I am sort of dealing with the end result of that organic growth, with all these random things. Okay. (Does it use an extra port?) That is my fundamental issue: yes, it does. Anything that requires an agent, anything that requires opening a port, is probably not a thing I want in a crisis, because for that to be ready, somebody would have had to say it would be useful a week before the crisis happens. (GNU parallel?) Okay, it is similar to parallel. Yes. Except what I end up getting is not just text output. For example, I could echo JSON out from each of these, parse the JSON in the fetcher, and get structured data out of the remote systems. My real problem is how to get structured data out of the system. For example, the user agent example was user agent plus number of hits, and it was pulled out of a JSON structure. So the final thing is getting structure out; that's pretty much all I have, that's my wrap-up. It is great to work on the cloud for a lot of reasons, because, as I said before, the 100-to-200 jump might be a problem. The way I look at it today, for somebody who has 100 boxes and cannot put in another 200 at any given time, something like that spike is a business-ending catastrophe. I am not really complaining, but I am just saying that the cloud basically brings out a lot of bad behavior in people: refusing to plan for scale.