 of a debugger than a programmer, and that is sort of what I have to talk about here, in a roundabout way. So, how many of you actually program on systems which run on multiple machines, more than two machines? Most of the time, when people talk about debugging things, they do not talk about debugging in production. What I want to talk about is the basic toolkit that you have to carry to have any hope of debugging anything when stuff doesn't work in production. This is the lowest layer of the platform, because most of the time nobody is going to tell you what went wrong in any explicit sense. The difference between a good program and a bad program is that the bad program doesn't work. The difference between a good program and a great program is that when the great program doesn't work, it tells you why it didn't work. And personally, I have spent more time debugging an issue that could be fixed by one line than writing hundreds of lines of code by myself from scratch.
So, this is one slide that I found, and I don't really think I should try to make it again: all the tools that I have always used, packed into one single slide. Let's list them out. Raise your hand if you know one. Okay. What is netstat useful for? It is network communication between two machines, right? It gives the status of the network packets. That is not the only thing it does. What netstat gives you that is very useful is that it tells you what the open connections are right now. So when you log into a machine and you think it should be connected to machines X, Y and Z, and you suspect it is connected to the wrong machine, you can immediately figure that out by running netstat -tn, which says: TCP only, and do not resolve the domain name of each IP. The other thing netstat is really good for is finding out what domain sockets are open; who is reading them is not given by netstat.
Can somebody tell me how to figure out which process owns a listening socket? Netstat does that as well. Netstat, when run under sudo, tells you which port is bound to which PID and what the command is. This is not terribly useful where I work, because everything says Java. But after you get the PID, you get to use something very interesting called the proc filesystem. Any guesses on what proc stands for? The kernel exposes it to the user to give you an idea of what each process is doing. So, after you figure out which socket a process is listening on, you can go to /proc/PID/fd and find all the files kept open by that process. Maybe your process is reading the wrong configuration file; maybe your process is writing to a file you do not think it is writing to. That is how you figure out all that stuff.
Now, let us say top. Okay, what does top do? Top gives you the processes and their use of the resources of your system. Okay, top gives you which process is taking up CPU. Not only that: if you open top and press 1, it will tell you which CPUs are being used.
Vmstat. What is the useful thing about vmstat? How many page faults. What is bad about page faults? If you see many page faults, your virtual memory space needs to be increased. Yeah, more than anything else it basically tells you that you are running out of memory in some fashion.
Let us see. Strace. Okay, I will tell you why I use strace the most. I will edit a configuration file, I will run a command, and it will have no clue that I edited the configuration file; it will be reading some other configuration file. I will Google for it, and the answer will say Red Hat has this configuration file and Gentoo has that configuration file, without telling me exactly which configuration file I should edit.
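The /proc/PID/fd lookup mentioned above can be sketched in a few lines of Python. This is a minimal illustration, assuming a Linux machine with /proc mounted and permission to inspect the target process; the function name `open_files` is made up for the example.

```python
import os


def open_files(pid="self"):
    """Return {fd_number: target} for every descriptor a process holds open.

    Reads the symlinks under /proc/<pid>/fd, so it only works on Linux and
    only for processes you are allowed to inspect.
    """
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return result
```

Calling `open_files(1234)` for a PID taken from netstat would list regular files by path, and sockets and pipes as entries such as `socket:[...]`.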
So, I use strace -e file and the command to figure out exactly which files are being opened. Can you think of anything else that strace will be very useful for? Network connections. Yes, it will actually print out the size and the buffer of each read call. So if you have given a 600 MB buffer to read and it is reading 4 KB at a time, you know that you are doing something stupid. My personal funniest experience was using a 900-byte buffer: the first time it read 500 bytes, the next time it read 400 bytes. Anybody want to guess why? Because the MTU was 500 bytes, and that was it. Yes, so the first read came out of the first 500-byte packet I got, the next 400 bytes out of the next packet, and then instead of waiting for the next packet, it exited immediately. So strace helped me quickly figure out that I was doing two read calls when I should have done one read call with the right buffer size.
Iostat. Has anybody had to use iostat? Yes. So, right now what I am debugging is balancing disk usage in Hadoop. We run with something like 24 or 25 disks, and when you run a command, it round-robins across all the disks; there is no real global scheduling. Now, the ideal would be to pick a disk which is not being used, which is very hard. The alternative, how we end up figuring out whether it is non-optimal or not, is by running iostat and looking at a few things: the block size being read, the number of blocks being read, and the bandwidth being read. If the number of blocks you are reading is very large and you are getting a small bandwidth, it basically means that you are reading from many, many parts of the disk instead of reading from one part of the disk in sequence. Anybody want to guess why reading in one sequence is faster? No, not the cache. Because you don't seek too many times. Yes.
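The short-read story above (a 900-byte buffer filled as 500 bytes and then 400) comes down to the fact that a single read on a socket may return fewer bytes than requested. One common way to deal with that, which is not necessarily the fix the speaker applied, is to loop until the full amount has arrived. A minimal sketch, with `recv_exact` as a made-up helper name:

```python
import socket


def recv_exact(sock, nbytes):
    """Keep calling recv() until exactly nbytes have arrived.

    A single recv(900) is allowed to return fewer than 900 bytes, for
    example one packet's worth, so the caller has to loop.
    """
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            # recv() returning b"" means the peer closed the connection.
            raise ConnectionError(
                f"peer closed with {remaining} bytes still missing"
            )
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

Under strace, a loop like this shows up as several read or recvfrom calls with shrinking sizes, which makes it easy to confirm that the program waits for the rest of the data instead of exiting early.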
Basically, the disk is written in cylinders, and you just keep following that cylinder without moving the head. Yes, as long as the file is not fragmented. You don't even need the file to be fragmented. The other thing is if you accidentally happen to run RAID 0 in software mode, you are writing data, and you want to figure out which among the 24 disks is the one that is writing or reading data more slowly. Use something like iostat and blktrace.
Iotop is slightly different. What is iotop? Which process is reading how much? Yes, which process is reading how much, but it doesn't tell you which process is reading how much from which file. As anybody would guess, this is guesswork after you figure out which process is causing the problem. But I was a DevOps guy at Zynga, and most of the time when you are called in to fix something, nobody is going to tell you what all is running. They are just going to tell you it is not working, please fix it, and you have half an hour to fix it. At that point you at least have to be able to triage the bug and say, within that half hour, this is the person who is to blame. For situations like that, these tools are tremendously useful.
Let's pick perf. It is too high tech for most production systems, because you don't want a production system to be slowed down by perf counters. But when you are developing stuff, it makes a bigger difference than when you actually have it in production.
Tcpdump dumps TCP data, but does it only dump TCP? What level does it capture? It gives you link-layer frames, which basically means it gives you raw Ethernet frames, and you need it on the day when two people have put the same IP on different machines. You do a tcpdump and you realize that on this machine this IP has this MAC address, and on that machine this IP has that MAC address. Until you run into something like this, you will never realize that you need a tool like this.
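The duplicate-IP check described above amounts to grouping what you see on the wire by IP and looking for more than one MAC address. A minimal sketch, assuming you have already pulled (IP, MAC) pairs out of a capture yourself; the input format and the name `find_ip_conflicts` are hypothetical, not something tcpdump produces directly:

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Return {ip: sorted list of MACs} for every IP seen with more than one MAC.

    `observations` is an iterable of (ip, mac) string pairs, a made-up
    format for this example.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalise case so "AA:BB" and "aa:bb" count as the same address.
        macs_by_ip[ip].add(mac.lower())
    return {
        ip: sorted(macs)
        for ip, macs in macs_by_ip.items()
        if len(macs) > 1
    }
```

An IP that appears in the result is being claimed by more than one network card, which is the situation the speaker describes.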
Nobody really needs to think about ping. Can you think of something better than ping to use? TCP traceroute? Yes, but there is a very nice one on Linux called mtr, which tells you where in the entire packet chain packets are getting dropped. And very soon we will need something even better, because multipath TCP is coming, which basically means that a packet leaving here is not always going to go through the same devices in every direction. So if you are on a phone and you are connected to both Wi-Fi and a data plan, your TCP transaction gets started on the data plan and switched to the Wi-Fi without dropping a packet. When similar systems get added for internet servers, it is going to become fairly painful to just use ping.
And does anybody know how ping works? It's an ICMP packet. ICMP packet, fine. The ping part is easy: you send it, you get it back. So how do you go from ping to traceroute? You set the time to live. After that many hops it expires. You keep increasing it by one per packet. So you send with a TTL of one. The immediate router says no and sends it back. And it is because you get that negative acknowledgement that you know which server is in the middle.
And sorry, I think I am done with most of the list. The most fun bit about traceroute is the packet size you can use: you can try a larger and larger packet to see if the network in the middle passes it. Can anybody tell me why you would ever run something like this? If you have worked with TCP, there is something called Nagle's algorithm, which is meant to live with older network cards, but not with the routers in the middle. So somewhere you are using 9000-byte MTUs, big jumbo frames. Except somewhere in the middle there is one router that doesn't understand this. It says, I cannot process it. And you have to find it. That is the day when you actually use ping with a size.
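The "ping with a size" hunt described above can be sketched as arithmetic plus a search. The header sizes are the standard ones for IPv4 without options (20 bytes) and ICMP echo (8 bytes). The `probe` argument is a hypothetical callable you would supply yourself, for example by running ping with a given payload size and fragmentation prohibited and reporting whether a reply came back.

```python
IPV4_HEADER = 20  # bytes, assuming an IPv4 header with no options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def largest_passing_payload(probe, low, high):
    """Binary-search the largest size in [low, high] for which probe(size) is True.

    Assumes every size up to the answer passes and every larger size fails.
    Returns None if even `low` does not get through.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always shrinks
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

With jumbo frames on both ends, `icmp_payload_for_mtu(9000)` gives 8972 as the upper bound to try; if the search settles at 1472, something on the path is still limited to a 1500-byte MTU.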
You can say how many bytes you want to send in the payload and see if it goes and comes back. And ICMP is not enough. One of the tools that is not on the slide is hping3, for when you have to deal with somebody who turns off pings. What hping3 does is let you send a TCP packet as a ping. It sends a TCP SYN packet, which is a connection initiation. So hping3 can tell you whether, from this machine to that machine, port 80 is open through the firewall. And when you deal with somebody misconfiguring the network in production, remembering that these things exist can be a complete lifesaver. Every one of the things in that list is very simple in a very Unix way: a very small, does-one-thing tool. But when all of them are put together, they make an amazing toolkit for working with stuff in production. That's what I have come up with.
Who's next? We have 10 more minutes, so if anyone wants to talk, we can do that, or we can take follow-up questions. Yeah. A quiz for everybody. Yeah. So you have multiple interfaces on a machine. Yeah. And you want to find out how much bandwidth you can actually get. Yeah. It may be on a LAN. It may be on a WAN. What would you use? Uh, netperf. Yeah, netperf will do that. But any other ways? Netperf is probably the best way.
The other tool that I had in mind is netcat. Netcat is the do-anything tool of networking. So nc -l starts a TCP server, and nc with a host and port number, with data piped in, sends it there. On some networks I have had to use netcat to copy files from one machine to the other simply because SSH was too slow. A classic example is inside EC2, where you have two private IPs, but the private IPs don't have SSH listening on them, so you can't log in from one private IP to the other. But port 80 is open while the SSH port is not. So shut down the web server, start nc -l 80, and send from here to there. A tar file. Untar it there.
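The idea of pushing a tar stream through a plain TCP connection can also be sketched in Python, for machines where netcat is not installed. This is an illustration under assumptions, not the speaker's method: `send_dir` and `receive_dir` are made-up names, both take an already connected socket, and the receiver should only unpack archives from a host it trusts.

```python
import socket
import tarfile


def send_dir(sock, src_dir):
    """Stream src_dir as a tar archive into a connected socket."""
    with sock.makefile("wb") as out:
        # Mode "w|" writes a non-seekable stream, like tar writing to stdout.
        with tarfile.open(fileobj=out, mode="w|") as tar:
            tar.add(src_dir, arcname=".")
    # Half-close the connection so the receiver sees end of stream.
    sock.shutdown(socket.SHUT_WR)


def receive_dir(sock, dest_dir):
    """Read a tar stream from a connected socket and unpack it into dest_dir."""
    with sock.makefile("rb") as src:
        # Mode "r|" reads a non-seekable stream, like tar reading from stdin.
        with tarfile.open(fileobj=src, mode="r|") as tar:
            # Only do this with archives from a host you trust.
            tar.extractall(dest_dir)
```

The sender would obtain its socket with something like `socket.create_connection((host, 80))` and the receiver from `accept()` on a listening socket, mirroring the two ends of the netcat pipe.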
So what you're basically running is tar -cf - (the hyphen means the output is standard output), piped into netcat with the hostname and port 80. And on the other side, netcat -l 80, piped into tar -xvf -. So here you're tarring onto a pipe, into netcat, over the network, out of netcat, into tar, into a directory.
Also, socat isn't mentioned; you might prefer it to netcat because it does even more. Yes, socat is even better when you're dealing with sockets. Socat, socket cat. Okay. If you use HAProxy, HAProxy has something called a socket interface. So you run socat against the HAProxy socket, and it gives you a terminal where you can actually disable a server by hand without ever editing a configuration file. In a running HAProxy instance, you can go in and say, server number three is no longer working, disable it. We used socat at Zynga to poke HAProxy to get the counters for each thing. You poke the stats socket and it gives you the numbers for a graph without actually having to run anything inside it.
Pstack. Pstack is another thing that you can use in this case. If you are dealing with a stuck server, you can use pstack to dump the stack of that process. And pstack is much more useful than gdb because you can pstack hundreds of processes in one go, in a for loop. I used to work with locks on shared memory in PHP, which basically meant a hundred processes locking on the same block of shared memory, and you had to figure out which one had the lock before you could even start debugging. So you take the hundred PIDs, pstack all of them, write the output to files, grep through them to figure out which one is the one you're looking for, and then gdb into that one to start debugging.
In this context, gdb is an even bigger tool, but nobody really wants to install gdb in production. You shouldn't install gdb in production, because it basically lets you load up a process, go over its memory, change something and come out. Don't you have to use gdb before you start the process? No. You can use it on any process.
In gdb it's attach and the PID number. And gdb can do it. It's open any other point of view. Yes. Which basically then means that you're doing it from a slightly lower-level point of view. Well, you have to be root. Right? Ptrace is a capability. So if you don't have the capability marked for your process or executable, it won't work. I'll show you the details. Just a minute. Okay. We have the special side of this. Yes. But we don't have the capability. Yes.