about GlowRoot and diagnosing problems is that GlowRoot only gives you part of the solution, all right? If you're troubleshooting, trying to track down particular issues, there are exactly three places you need to be looking. You need to be looking at the application itself, and GlowRoot is a good way to look at it. You need to be looking at the database, at what's happening on the database; again, Munin gives you a very good idea of what your basic database health looks like. But the other thing is the proxy. Your proxy is creating access logs, and the proxy access logs are full of very important information, both for generating reports at the end of the day and for tracking down particular trouble spots. Often you need to look at your access log file on the one hand and at what GlowRoot is telling you on the other hand, and correlate those things to get a full picture of what's going on. Let's talk a little bit about the access log. There's a default access log, and in addition to the default access log I've created a slightly modified access log. It's modified basically in two ways (I want to modify it a little bit further, in fact, but at the moment it's modified in two ways). First of all, instead of the standard Apache "common log format", as they call it, which is space separated (Nginx also uses the Apache common log format by default), I've made it tab separated, like mine you can see here. So there's the IP address, then these two fields; I think one would be the username if you're using basic auth, the other I can't remember. Then the timestamp on the request. This is the request itself, then 503, that's the status code, and 299, that's the number of bytes returned. This is what we call the referer, the page the request has come from. This is the user agent; we've been talking about the user agent string, so this request is actually coming from an uptime robot, as you can see.
And this last number here, this is the request time: how long the request took. So I modified the default format to produce this alternative format, which is tab separated instead of space separated (it's just easier to process that way) and includes the request time. On Apache, the request time by default comes out in microseconds; in Nginx, I think the request time comes out in seconds. You need to remember that when you're processing files. One of the reasons for doing this is that this format is much faster and easier to parse than the default format. Because you don't see tabs anywhere else except between fields, a tab is a much better field delimiter to use. Now, one of the best tools for processing data that's in columns, besides Excel, or loading it into a database, which you can also do (in fact, I've done it myself, I must confess: you can read a log file in this format into Excel and then pivot around on things and have fun). But the real tool of choice for dealing with column data like this, in the Unix world at least, is something called awk. A lot of people are familiar with grep. Grep is one of the command line tools that's great for searching through log files for things. Awk is a little bit like grep on steroids: it can do everything that grep can do, but also quite a bit more, particularly when your data is nicely sliced up into columns like this. I've got a little quote off the internet that's a bit pithy, but it's very true. I think the real reason for learning awk, in fact, is to have an excuse to read this very, very nice book. It's not new; awk's been around for a while. The K in awk is Brian Kernighan, one of the authors of The C Programming Language. The AWK Programming Language was written in 1988. It's very easy to read; in fact, you only need the first 50 pages. The first 50 pages teach you pretty much everything you need to know to process files with awk.
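For reference, here is a sketch of what such a tab-separated log format directive might look like in the Apache configuration. The format name `perflog` and the exact field order are assumptions based on the fields described above; `%D` is Apache's request time in microseconds:

```apache
# Tab-separated variant of the common log format, plus request time (%D, microseconds)
LogFormat "%h\t%l\t%u\t%t\t\"%r\"\t%>s\t%b\t\"%{Referer}i\"\t\"%{User-Agent}i\"\t%D" perflog
CustomLog ${APACHE_LOG_DIR}/perf.log perflog
```

Nginx can produce the same layout with a `log_format` directive, with the caveat mentioned above that its `$request_time` variable is in seconds rather than microseconds.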
I'm going to give you a couple of examples now of looking at an actual log file and how you can use awk to make your process of investigating it a little bit more flexible. Maybe before we do that, I'll just show you the book. If you don't learn anything else from this academy, if you come away from it having learned ten pages' worth of awk, you'll have benefited hugely. The book is very old and out of copyright; I think that's the link there. It's the very famous awk programming book, written by these three fellows here. Scroll down a little bit. Yeah, read the first chapter. That'll probably teach you most of what you need to know in the first ten pages. It's not very long, about 160 pages, and it explains the basics of awk. It's much better than any of the odd tutorials and things that you find out there on the internet, so go to the original awk book. But let me just show you a few things to motivate you. We'll spend a little bit of time looking at this log file and maybe leave, I don't know, 15 minutes at the end to take any questions people might have in reference to the assignment that you all have coming up. Okay. So let's look at a log file. I've put one in the tab here. Martin, I hope you don't mind this; I hope people in Laos don't mind. This is an old log file that I actually have from Laos. It doesn't have any identifiable information in it, and we're not going to look at the log file in great detail anyway, just use it as an example. So the log file here, perf.log: you'll find this under /etc/apache2... or no, under /var/log/apache2/perf.log. You can create exactly the same format with Nginx too, so it's not Apache specific. The first thing is how to look at a compressed log file. This comes back a little bit to my talk on databases the other day. I mean, one thing we could do is just gunzip the log file like that. But I don't know.
The problem with doing that is that I'd just be unzipping what's potentially a very large file onto my disk. As you know from two weeks back, I hate unzipping large things onto my disk unnecessarily. There's no need to unzip it onto the disk, because gzip is a nice format. It's quite different to zip, in the sense that it's a stream compression, whereas zip is a combination of compression and packaging. That's useful too: you can zip a number of files together, whereas gzip simply takes a stream and compresses it. So you can decompress the stream on the fly without necessarily writing anything to the file system. Normally, when you're looking at a text file, you'll cat it; for a compressed text file, zcat will dump it out decompressed. But I don't really want to dump the whole thing. This is a pattern you'll see often: we slow it down by piping whatever we dump into less, and then we see it a couple of lines at a time. So we can see inside the compressed file without unzipping it onto the file system. It wouldn't make much of a difference in this case because it's not a very big file, but sometimes these files are quite enormous. Okay, so this is an example of the perf.log format, similar to what you've seen on the slide. Basically, you've got all of those fields, tab separated. So one of the things we can do is take the zcat output and pipe it into awk. And one of the first things we want to do is tell awk that the field separator is a tab. By default awk is going to assume that the field separator is whitespace, so we have to tell it the field separator is a tab. And then we write a little awk program. Here's the simplest awk program we can write: print $0, which prints the whole line. So this is an awk program that does nothing at all really. It's going to take the output of the perf.log and simply print it out again. That's not very exciting.
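As a minimal, self-contained sketch of this pattern, the snippet below fabricates one line in the tab-separated perf.log format described earlier (the field layout is an assumption based on the lecture's example), compresses it, and then streams it back without unpacking anything to disk:

```shell
# Fabricate one log line in the assumed tab-separated layout and gzip it.
printf '203.0.113.5\t-\t-\t[10/Oct/2020:13:55:36 +0700]\t"GET /api/events HTTP/1.1"\t200\t299\t"-"\t"uptime-robot"\t0.042\n' \
  | gzip > perf.log.gz

# zcat decompresses to stdout; pipe it into less to page through it,
# or into awk to process it. The simplest awk program prints every
# whole line ($0), so this just echoes the file back out:
zcat perf.log.gz | awk -F'\t' '{ print $0 }'
```

On a real server you would of course point zcat at the rotated, compressed log rather than building one by hand.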
It's just got us back to where we were. But we can now take advantage of the fact that awk understands that this data is arranged in columns. $0 is a sort of special column identifier that refers to the whole line, but I know that my IP addresses are in $1, so let's print just $1. Now you see I'm pulling all the IP addresses out of the file. This in itself can be a useful thing to do if you want to find out how many unique IP addresses have accessed your server. I don't know if any Unix gurus on here want to volunteer the best way to do that. Basically, what I would do is use awk to cut the first field out, and then I can sort. That sort might take a little while because there are quite a lot of IP addresses in here, I think, so let's wait a little bit. Awk is very quick, sort is very quick, and now you can see we've sorted the IP addresses; 18 is a busy one, there's 101. So just cutting that field out and piping it into sort arranges them all in order. Having arranged them all in order, we then use this little program called uniq. uniq will now get rid of all the duplicates. Let's see what we have left there. And now we've got all the unique IP addresses, arranged in order. We don't actually need them arranged in order; the only reason to sort them is so that we can remove duplicates with uniq, since uniq only collapses adjacent lines. Now we can take that further and count the number of lines. wc stands for word count; wc takes some options, and wc -l will count lines. So that little pipeline is going to tell us that on this server, for the log file for that day, there were 833 unique IP addresses. Let's develop that one little step further. We've used awk to pull out a particular column, in this case the IP addresses, but looking at some of the other columns is interesting as well. Sometimes I have to go back and look at the file itself to count them. So, number one was the IP address.
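The unique-IP pipeline described above can be sketched like this. The log content is fabricated (hypothetical private IPs) so the example is self-contained:

```shell
# Three requests from two distinct client IPs, tab separated as before.
printf '10.0.0.1\t-\t-\tts\t"GET /a"\t200\n' >  sample.log
printf '10.0.0.2\t-\t-\tts\t"GET /b"\t200\n' >> sample.log
printf '10.0.0.1\t-\t-\tts\t"GET /c"\t200\n' >> sample.log

# awk cuts out field 1, sort groups the duplicates together,
# uniq collapses them, and wc -l counts what is left: 2 unique IPs.
awk -F'\t' '{ print $1 }' sample.log | sort | uniq | wc -l
```

On a compressed production log, the front of the pipeline would be `zcat perf.log.gz | awk ...` instead, exactly as in the session.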
Number two and three are those fields I couldn't remember; number four is the timestamp. Number five is the request. Number six will be the response code, and seven, eight look like the... Let's look at five and six. If we just look at fields five and six, that's going to show us the request and the response code. This is now one step beyond the very simplest awk: the very simplest awk prints out the whole line, and taking it a little step further, we can pick out particular columns to print. But we might only be interested in requests for events, for example; we might want to pick out everything to do with /api/events. Now, we know we can do this with grep, but we don't need to use grep, because we can also do it like this. You can put a pattern in here (we're going to have to escape these slashes), and now, basically, I'm only going to be pulling out all the requests for /api/events and their response codes. Fortunately, this server looks quite healthy; we're getting lots of 200 response codes. Looking at the response codes themselves is an interesting thing to do. The response code is field six. A good indicator that your server is performing well is that it's responding to most requests with 200s or 300s, not 400s and not 500s. So it can be useful to look at your log file and find out how many errors you're getting, and what kind. I've got a nice little example that's more complicated; let's save that one and do a slightly simpler one first. Remember, we've just removed the filter for the moment. We have a field that I added to the default format, I think it's field number 10; let's just whack it on here. It is the request time. I can see this is the request time; we've actually converted the request time into seconds.
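Filtering on a pattern while picking columns can be sketched as below: print only the request ($5) and status code ($6) for lines that mention /api/events. The slashes in the URL have to be escaped inside awk's /.../ pattern, and the field positions are assumptions based on the format shown earlier:

```shell
# Two fabricated log lines; only one touches /api/events.
printf 'a\t-\t-\tts\t"GET /api/events HTTP/1.1"\t200\n' >  sample.log
printf 'b\t-\t-\tts\t"GET /api/me HTTP/1.1"\t200\n'     >> sample.log

# Pattern before the action block: only matching lines reach the print.
awk -F'\t' '/\/api\/events/ { print $5 "\t" $6 }' sample.log
```

This is the grep-like half of awk: the /.../ pattern selects lines, and the action decides which columns of each selected line to emit.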
I might be interested in just the slow requests. All these requests that we can see here are actually being handled very quickly. The requests which take more than a second, or more than five seconds, may be the ones that I actually want to investigate further. Again, you can put a pattern in front here to match: either a regular expression pattern like we did earlier, between slashes like that, or a numeric condition. So I want to say: show me all the lines in my log where the request processing time was greater than 10 seconds. Do we have any of those, I wonder? You'd wish we didn't, but yes, we have quite a few. Look at these. This was an example where there was obviously a lot of trouble going on with /api/events, and /api/me in there also. It must have been to do with some kind of blocking, I think. Anyway, you can see here that we can filter: we're now looking at all the requests which have taken longer than ten seconds to execute. Why don't we sort them? Let's sort them and then we can look at what our worst-running requests are. Again, just like before... let's make it a bit simpler. I'm not really interested for the moment in the response code, so just the time and the request itself. Yeah, so there's the first field and the second field. Let's flip it around; I want to put the time on the front: the request time and then the request itself. Okay, with the time along the front it's a little bit easier to sort. Let's sort those. Oops, I'll pipe it into less. And put a few switches on sort: sort numerically and sort in reverse, otherwise it's upside down; let's look at the long ones first. Oh, there we can see. There are some horrible ones: 200 seconds.
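The slow-request filter and sort just described can be sketched like this. Instead of a /regex/ pattern, the filter is a numeric comparison on a field; the time is printed first so the numeric sort is easy. The field numbers ($10 for request time, $5 for the request) are assumptions from the format above, and the log lines are fabricated:

```shell
# Three fabricated requests; two of them took longer than 10 seconds.
printf 'a\t-\t-\tts\t"GET /api/events"\t503\tx\tx\tx\t212.4\n' >  sample.log
printf 'b\t-\t-\tts\t"GET /api/me"\t200\tx\tx\tx\t0.03\n'      >> sample.log
printf 'c\t-\t-\tts\t"GET /api/trackedEntityInstances"\t200\tx\tx\tx\t45.1\n' >> sample.log

# Keep only requests slower than 10s, time first, then sort
# numerically (-n) in reverse (-r) so the worst are on top.
awk -F'\t' '$10 > 10 { print $10 "\t" $5 }' sample.log | sort -rn
```

On a big log you would pipe the result into less, as in the session, so the slowest requests appear on the first screen.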
This was a server that was suffering. I think many of you have been in a position where you're looking at servers which are suffering. Possibly this was a case where we needed to do some rate limiting, or something else. Ah, and you can see the problem here straight away: this thing is using skipPaging=true. What's that about? It's not paging=false; it's skipPaging=true. Not that it's necessarily a bad thing. Usually I see paging=true or paging=false; skipPaging, I'm not too sure what that means in the API. Anyway, with a very quick little one-liner we've pulled out the request time and the request and sorted them in order, so you're seeing your very longest requests here. And it's fairly obvious, looking at this server, that if you're going to look anywhere for where it's suffering, or where it's in need of some attention, it's /api/trackedEntityInstances. These are requests on trackedEntityInstances, and POSTs as well, interestingly enough. And events. There are a number of URLs here which are resulting in long delays. Some of these are possibly effects rather than causes: perhaps the server was being completely overwhelmed, in which case it becomes very slow responding to anything. So that's a way of doing two things very simply. One is matching lines in your log file: we've seen a numeric match, and the other kind of matching, a regular expression match. The other thing awk allows you to do is refer to the different fields of your log as $1, $2, $3, etc. As I said, that allows you to do quite a lot of interesting things. The book itself, if you read it, covers a lot more than that; awk is a lot more than just matching and printing lines. But this is probably 80% of what you use awk for most of the time: just a more sophisticated way of doing what you'd otherwise do with grep. But it can do much more.
The structure of an awk program, if you like, actually has three parts. There's a BEGIN section, then there's this section here, which matches an expression and then does something on each matching line, and then there's an END section. And I've got a little example somewhere in my history that makes use of all of those sections. Let's see if I can find it again; it's the one I was going to show you the first time. Let's grep for it. Yeah, it's here. This is quite a nice example of using a little bit more functionality; you can see it's almost more of a programming language. Okay, I'm going to try to explain this one for you. First of all, I have a zcat, because I've got a compressed log file, so I'm going to zcat my log file into awk. My awk is going to count events. For some reason I can't remember why I was running this initially, but basically I'm trying to find out what the results of my PUTs to events are: how many are resulting in a 200, in a 300, in a 500. So I'm filtering here on /api/events. And then for each line that matches /api/events, I update a variable here called status, which is what's called, I guess, an associative array, indexed by $6; you might remember $6 is the status code, so that'll be 200, 201, 400, 404, 429, whatever it might be. So I'm creating an associative array called status and incrementing the count for each status code that comes in. And then I've got an END block in my awk program, which just runs through that associative array and prints out all the values. Basically, what this does, fairly quickly I hope, is run through all of my PUTs to /api/events and tell me how they all ended up. So I got 2408, I got 69401s, 1666302s. This shows my rate limiter was at work here: 28,000 requests got rate limited, but in the process, 371,000 requests actually got through.
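The pattern-plus-END shape just described can be sketched as below: a pattern with an action that fills an associative array, and an END block that walks the array and prints the tallies. (No BEGIN block is needed here; -F sets the separator from the command line.) The log lines are fabricated, and $6 being the status code is the same assumption as before:

```shell
# Four fabricated requests: three to /api/events, one elsewhere.
printf 'a\t-\t-\tts\t"POST /api/events"\t200\n' >  sample.log
printf 'b\t-\t-\tts\t"POST /api/events"\t429\n' >> sample.log
printf 'c\t-\t-\tts\t"POST /api/events"\t200\n' >> sample.log
printf 'd\t-\t-\tts\t"GET /api/me"\t200\n'      >> sample.log

awk -F'\t' '
  /\/api\/events/ { status[$6]++ }                 # bump the counter for this status code
  END { for (s in status) print s, status[s] }     # dump all the tallies at the end
' sample.log
```

On the real server the input comes from zcat, but the program is the same: one line per distinct status code, with its count.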
This kind of little awk breakdown is very useful if you've decided to implement rate limiting and you want to know what effect it's having. It gives you a very quick, easy indication: you can see most of the requests are getting through, and some of them are getting rate limited. What you might find if there was no rate limiting at all (and I seem to remember in this case that's exactly what was happening) is that the number of 200s was actually much lower, but we were getting more errors: more 503s or whatever; I can't even remember what all the error codes were. It's otherwise quite hard to get this kind of information easily and visually, but you can see here it's quite a simple little one-liner really, which pulls the information line by line out of the log file, puts it into an array, and then just loops through the array and prints out the results. Learning awk is good for your health, particularly if you're looking at log files. As I said, you could do all of this in an Excel spreadsheet, but I'd have it done in awk before you've even got your spreadsheet up and running. The other advantage of doing it in awk, of course, is that it lends itself to being scripted. You could run this as a cron job every morning, so you can have a look in the morning at what the performance was like during the day. As I say, the best way to learn awk is to read the first 10 pages of that little book rather than listening to me giving you little examples. I'm going to put some of my examples up so that you can look at them; I'll put them onto the resources channel on Slack.
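As a sketch of what scripting this as a cron job might look like, the crontab entry below runs the status-code breakdown every morning at 06:00 against the previous day's rotated log. The rotated-log path, schedule, and report location are all hypothetical, not from the session:

```
# Hypothetical crontab entry: summarise yesterday's /api/events status codes.
0 6 * * * zcat /var/log/apache2/perf.log.1.gz | awk -F'\t' '/\/api\/events/ { s[$6]++ } END { for (c in s) print c, s[c] }' > /tmp/events-report.txt
```

The same one-liner could equally be wrapped in a small shell script and emailed or posted somewhere, which is exactly the scriptability advantage over doing the analysis in a spreadsheet.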