on GitHub, I blog at thagamizer.com, and now that I am paid to do some of this blogging, it might happen more often than three times a year. I work as a developer advocate at Google. You can come ask me what that means later, but the short version is that I help Google learn how other people write code, and I help other folks learn about Google Cloud Platform. My email is thagamizer at google.com, and I am way too happy about the fact that I got that as my alias. It's so awesome.

So, survey. Raise your hand if you have had your site taken down by non-malicious traffic. Raise your hand if you were that non-malicious traffic. Raise your hand if you've had a generally bad internet day where stuff just wasn't routing right and things didn't seem to make sense. Raise your hand if you've had a site with a missing index that took you too long to find. Yeah, me too, all of those things.

So, story time. At one of my previous jobs, I went on vacation to go speak at a conference, because that's how I take vacation, and while I was gone we had half a million requests, 500,000 requests in a little under five minutes, for a site with fewer than 100 active users. Something was not right, as you might guess from that statistic. It took two days of people following their intuition, following hunches, trying to figure out what had happened. We knew the system really well, so we had some pretty good ideas about what was going on, but we couldn't figure it out. When I came back, they were working on importing the data from the app. It was a microservices app, so there were actually seven applications, because why have just one, in three different languages. They were importing the app logs, the syslogs, and the database logs into a Mongo database so that we could query, try to correlate, and figure out what the heck had gone wrong.

There were a couple of problems with that. One, not everyone on the team knew Mongo query language. In fact, I was one of the people on the team who knew Mongo query language best, and I don't know Mongo query language, let's be clear about that. I can read docs like everyone else; I'm just a little better than most at reading the docs. Two, it was time consuming: importing all those logs took most of a day, and we were importing them onto one developer's local machine, so not everyone could access them easily, and we couldn't all pile on and try to come up with conclusions at the same time.

At another company I worked at, we had one user sending 10 requests a second for the login page. Our average user age at that company was six, so our best guess was kitten on keyboard, constantly hitting refresh.

But no matter how it happens, some days your traffic is going along, and you know you're busy, but you want to be busy, and then something bad happens and everything just starts to pile up. So what do you do then? Well, the first thing you need to do is get running again; you need to get your site back up. Then you need to figure out why, and sometimes that's easy, but sometimes it's not. It might be easy if you have a tool that says, hey dummy, you're missing an index right here, and I'm glad that I have had those tools to help me during my programming career. Sometimes you can say: this person at this IP address, they're done, we're just going to shut them off for the next 24 hours, let them cool down and come back, have that kitten get off the keyboard.
And sometimes you have a problem that you can't figure out, because you've got a really complicated service-oriented architecture thing going on, and everything looks good at the first step and everything's broken by the time it gets to the second step, and all that's between them is copper, and you don't know how it could be going wrong in copper, but it is. And sometimes you just can't reproduce the bug in your staging or test environment. So how are you going to figure it out?

There are a bunch of possible tools to help with this. I bet most of the folks here who've done any sort of ops stuff have a pile of log extraction scripts, ones that pull the interesting bits out of your logs, just sitting around on your hard drive. I've got dozens, and I've never officially done ops. Sometimes you can look at spreadsheets or whatever visualization tools you might have, and sometimes you can do really, really low-tech stuff like this. I was working at a company where my boss was convinced that the problem was that Rails was not good enough for our system, and that's why our site was slowing down. What this is is a timeline: each colored bar is a request, the different colors are different request types, the length shows how long each request took, and on the y-axis are the PIDs that were running Unicorn workers, so you can see how long each request sat on a given PID. The big conclusion I wanted people to draw from this wasn't "here's our problem, point at it." It was that the server was mostly idle, so clearly the problem wasn't Rails. It wasn't that Rails wasn't fast enough, or that we needed more servers, or that we needed to be on a different web framework. The problem turned out to be a missing database index, but I needed to draw this thing to show them that the problem wasn't actually Rails; on that front there wasn't actually a problem at all.

But none of this stuff I've talked about scales very well. With that half a million requests in a little under five minutes, part of the reason it took us so long to figure out was that it was just so much data, and the vast majority of that data was: your site's too slow, your site's too slow, your site's too slow, timeout, timeout, your site's too slow.

So there are more tools you can use. There are log aggregators, and there are monitoring tools, and I'm sure anyone who's got a production site probably has at least one, if not several, of these. But they're expensive, and when you've got a site that's still pre-alpha or alpha and you've got a small team working on a shoestring, maybe you don't have the money to buy those tools. Also, if you're not in production yet, you may not have set them up. And sometimes they're just the wrong tool for the job. Often, when you're dealing with those strange issues, you need to be able to do ad hoc queries; you need to be able to use your expertise in how your system works to figure out how the pieces are breaking and how things fit together, and some of those tools don't support ad hoc querying. And one of the biggest problems I've found is that oftentimes you need forethought. You need to know that something is going to go wrong before you can figure out what went wrong, because you need to set up all the analysis stuff ahead of time.

And that brings us back to the title of this talk, forensic log analysis. This is the kind of log analysis you can do after something went wrong, even if you didn't set up the right tools ahead of time.
And I think it's a really handy tool to have in your toolbox. I don't expect you'll need it very often, but like many things that live in the back of your toolbox, sometimes it's exactly the right tool for the job.

So Google BigQuery is the tool I'm going to use for this. I didn't know it existed three months ago, and I'm guessing a good hunk of you didn't either, but now you do. So what is it? It's a tool for querying large data sets, and by large I mean up to several terabytes. So yes, large for most common definitions of large. It has two main use cases, in my experience. One is to aggregate and summarize millions or billions of rows. The other is to find the needle in a haystack: if you have five bad records in five million, it's really good at finding those very, very quickly. And speaking of quickly, it can query across all of the births in the US from 1968 to 2008 in less than 10 seconds. That's fast.

It's easy. It uses SQL. How many people in the audience know SQL, or can at least do a select statement? Great, you can use BigQuery. That's all you need to know. And you can upload CSV or JSON files: formats you already have, formats you already know, formats you can already build. Best of all, it was purpose-built for querying logs at Google, at Google scale, and it's publicly available now. I've personally found that when I use tools for their intended purpose, I tend to have better luck. It's like why we don't hammer in screws, as one of my CEOs once did. And then the pictures fell off the wall. So, yeah.

So, a quick demo: a rain-off, Seattle versus Salt Lake City. To do this rain-off, I am going to query NOAA weather data covering at least the last 50 years. It's a data set called the Global Summary of the Day from NOAA, abbreviated GSOD, and it's publicly available on BigQuery. You don't have to have an account; you can just go in and play with it if you want. Also, just for reference, it's 115 million rows.

So here's the data set. We've got station numbers, and rain is a boolean, so it's 1 or 0 depending on whether it rained that day or not. I'm going to select the station number, count all the rows so we have a total, and count the number of rainy days by summing the rows that have rain set. We're going to select that from the really long name for the public data set, GSOD, blah, blah, blah. And we're going to pull two specific station IDs; I'm getting these from a text file that I got off a public data website. That one is the Salt Lake City airport, and then we'll search for Seattle, and that one is the Seattle airport. We're going to group by the station, so we can summarize Salt Lake City versus Seattle, and order by the station, so they stay in the same order, Salt Lake City first. And I typed it wrong, because you don't use WHEN, you use WHERE in SQL. Then we run the query, it takes a second and a half, and we see that Seattle has about 3x, almost 4x, the number of rainy days that Salt Lake City does, which shouldn't surprise me, because it's kind of a desert here. It's actually really cool to play with this; if you look at Atlanta, it actually has more rainy days than Seattle. For reference, here's roughly the query I just built, reconstructed after the fact in legacy BigQuery SQL; the two station IDs are my best guess at the airport IDs I pulled from that text file:
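```sql
-- The rain-off, reconstructed (legacy BigQuery SQL against the public
-- GSOD sample table; the station IDs below are assumed, not from the talk).
SELECT
  station_number,
  COUNT(*) AS total_days,            -- all recorded days for the station
  SUM(IF(rain, 1, 0)) AS rainy_days  -- days where the rain flag was set
FROM [publicdata:samples.gsod]
WHERE station_number IN (725720, 727930)  -- Salt Lake City, then Seattle
GROUP BY station_number
ORDER BY station_number
```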
You can also download the results as a CSV and play with them on your own machine, or push them to Google Sheets if you want and use the visualization tools there. There are a lot of things you can do with the data once you get here. But this is just to drive home the point that we're querying 115 million rows in less than two seconds. It's fast.

So, how to. I'm going to show you how to use this for your own stuff, because while weather data and the CDC's birth data are all very cool, they're not actually something that will get you paid a lot of money, unless you're a data scientist, and I'm not, sadly. If you're familiar with any sort of data warehousing, everything I'm going to tell you is going to sound very, very familiar. Step one, we extract the data. Step two, we upload the data. Step three, we query the data. Step four, we understand the data. A lot of times people leave that last step out of the process. They're like, "data!" And I'm like, "that's nice. What is this data telling me?" It really does need to be "understand the data," because I've made many visualizations that mean nothing; they just look really pretty.

For reference, for the rest of the talk I'm going to be using a very simple blog app that I wrote. It's the classic 15-minute Rails blog: it has posts, posts have titles, that's it. I whipped up the blog app in about 15 minutes using the Rails scaffold, then I used Mechanize to make a load generator, shoved a bunch of load at it, and ran it over the weekend. I ended up with 30 megabytes of Rails log and seven and a half gigabytes of Nginx log. There is probably something wrong with my load generator, because I filled the disk. But 30 megabytes of Rails log is enough to at least demonstrate what I'm doing here. You can definitely do bigger data sets.

So the first thing we need to do is extract. A Rails log looks about like this, and we're humans, we pattern match, so you can probably tell that right about there something bad happened, and that's a stack trace, because there was a 500 error. But what if your Rails log looks like this? Or maybe like this? Or maybe like this? Yeah. Most likely that is the case you're actually dealing with, and I don't pattern match well at that scale. I'm a good pattern matcher, not a great one. So you're going to need code.

I spent an hour writing code that extracts the "interesting bits," and I have that in air quotes on my slide notes because the interesting bits are going to depend on your app. You know your app; you know which things are interesting and which aren't. I selected some stuff that I thought was interesting, but you can write this code yourself. It's really simple, and we'll walk through it.

First thing to note: if you're dealing with Rails, you need to read the log as UTF-8, because of that check mark character in the params (the utf8=✓ parameter Rails forms send) that I always forget is there, and then things completely fail on me, because UTF-8. Before BigQuery can help us, we need to get the logs into a format it understands; two formats it accepts are JSON and CSV. I'm going to use CSV, because I've used it before and I know the CSV library in Ruby. You can use JSON if you want to. The other reason I like CSV is that I can open it in Numbers or other spreadsheet tools, so I can check pretty easily that I'm actually extracting the bits I want. Next, I need a file name to read the logs in from, and a file to write the CSV out to. Great: ARGV. And if I don't have an out file, I can't actually do much, so I'm going to abort. Here's a compact sketch of the whole script; this is my reconstruction of what was on the slides, and the regexes in it are illustrative guesses tuned to a default Rails log, not the exact ones from the talk. We'll walk through it piece by piece:
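```ruby
#!/usr/bin/env ruby
# extract.rb: pull the "interesting bits" out of a Rails log into a CSV.
# A sketch reconstructed from the talk; the regexes are illustrative
# guesses for a default Rails log, not the exact ones from the slides.
require 'csv'

in_file, out_file = ARGV
abort "usage: ruby extract.rb RAILS_LOG OUT_CSV" unless in_file && out_file

# One of these per log line; it becomes one row of the CSV.
Record = Struct.new(:body, :timestamp, :ip_address, :verb,
                    :resource, :status, :processing_time)

# The pile of regexes. Each one pulls a field out when the line has it.
TIMESTAMP_RE = /\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})/
IP_RE        = /for (\d{1,3}(?:\.\d{1,3}){3})/
VERB_RE      = /Started (GET|POST|PUT|PATCH|DELETE|HEAD)/
RESOURCE_RE  = /Started [A-Z]+ "([^"]+)"/
STATUS_RE    = /Completed (\d{3})/
TIME_RE      = /Completed \d{3} .* in (\d+)ms/

records = []

# Read as UTF-8 so the utf8=✓ param in the logged params doesn't blow up.
File.open(in_file, 'r:utf-8') do |f|
  f.each_line do |line|
    line.chomp!
    next if line.empty?

    rec = Record.new
    rec.body            = line
    # String#[] with a regex and a capture-group index is the
    # square-bracket match syntax: it returns the capture, or nil.
    rec.timestamp       = line[TIMESTAMP_RE, 1]
    rec.ip_address      = line[IP_RE, 1]
    rec.verb            = line[VERB_RE, 1]
    rec.resource        = line[RESOURCE_RE, 1]
    rec.status          = line[STATUS_RE, 1]
    rec.processing_time = line[TIME_RE, 1]
    records << rec
  end
end

# Straight out of the CSV docs: header row, then one row per record.
CSV.open(out_file, 'w') do |csv|
  csv << Record.members
  records.each { |r| csv << r.to_a }
end
```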
I like Struct for quick and dirty scripts like this. So there's a struct that holds a record, and each record is going to be one line of our CSV. We have a body. We've got a timestamp, an IP address, and a verb, that's the HTTP verb: GET, PUT, POST. A resource, which is going to be the URL. The status, meaning the status code: 400, or 200 I hope. And the processing time. And I'm just going to put all the records into an array, because why not?

Then there's the pile of regexes. This is where most of that hour of writing this code was spent. I'm not going to claim these are the best regexes for each of these tasks, but they work, and in quick and dirty scripts that's all that matters. The whole idea of doing forensic log analysis is that you want to be done as fast as possible.

And then there's the meat of the thing, so let's go through it line by line. Open the file; again, UTF-8, that's important. Go through each line. First we chomp it, and if it's blank at that point, we skip it. Then we make a new record from our Record struct, and I just apply each of the regexes in turn, pull the bits out, and store them in the appropriate fields of the struct. If you don't know the square-bracket regex match syntax, the string[regex, capture] form, it's awesome. I did not know it before I wrote this script, and then I felt a little weird about that, because I hadn't seen it before. It's great for stuff like this: it pulls the capture groups out really easily and lets you do each of these as a one-liner. Go spend some time with the Ruby docs and learn it. And once we're done, we put the record into the records array. That's it. There's some code at the end to write out the CSV, but it's literally straight out of the CSV docs. And here's what the basic CSV looks like when you're done.

So now I've got a CSV, and it's time to upload it. If you have a file that's less than 10 megabytes, you can upload it directly to BigQuery from the BigQuery UI, but our file is a little bigger than 10 megabytes, so I'm going to upload it into Cloud Storage. You can use other storage options; I think you can even pull from a website. I'll check on that and let you know. But Cloud Storage is easy. Here's how you do it from the UI. We go to Cloud Storage, Storage Browser. If you're familiar with AWS, you know about buckets and files. I'm going to make a bucket called production-logs and create it. There are a couple of different kinds of storage; I'm just going to use standard, because I'm lazy. Then I upload the file from the UI, pull it off my hard drive, and it starts uploading the production CSV. That's it. And once the file's uploaded... we'll just let that do its thing. No, really. I want to go on to the next slide.

You need to give BigQuery a schema. We're loading something into a database, so we need a schema. It's pretty simple, though. The supported types are string, integer, float, boolean, record, and timestamp. String, integer, float, and boolean do exactly what you think they do if you know databases at all. Record is just a collection of fields; if you wanted to use record, JSON might be a better option than CSV. And timestamp has a lot of valid formats; the one you need is probably there. I haven't found one it doesn't support yet. So for my example, body is going to be a string. Timestamp, oddly enough, is going to be a timestamp. IP address is going to be a string. Verb is a string, as is resource. And I'm doing status and processing time as integers. I very specifically did not pull the units off of the processing time in the CSV, because I want to be able to search by ranges: I want all the things that took longer than 60 milliseconds, for example. You can do the same thing with status codes. People go back and forth on status codes, but being able to say "anything greater than 200, or greater than 300, is interesting to me" is a pretty valuable thing to be able to do. By the way, everything I'm about to click through in the UI also has command-line equivalents, roughly like this; the bucket, data set, and table names here are my reconstructions of the demo's, so swap in your own:
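```sh
# Command-line equivalents of the UI steps, sketched from the demo.
# The bucket (production-logs), data set (log_forensics), and table
# (rails_log) names are reconstructions, not from the talk.
gsutil mb gs://production-logs                   # make the bucket
gsutil cp production.csv gs://production-logs/   # upload the CSV

bq mk log_forensics                              # create the data set
bq load --skip_leading_rows=1 \
  log_forensics.rails_log \
  gs://production-logs/production.csv \
  body:STRING,timestamp:TIMESTAMP,ip_address:STRING,verb:STRING,resource:STRING,status:INTEGER,processing_time:INTEGER
```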
So we're going to create a data set by clicking into BigQuery, and we're just going to make a data set here. I was going to try to pretend to do these as live coding demos, but no. So: make a new data set called log_forensics, and make a new table in that data set. All I have to do is give it a name, give it the path to the file in Google Cloud Storage, and paste in the schema. I have one header row on my CSV, so I just tell it to skip one row. Then it does some processing to load everything up, which takes a couple of minutes, but not very long at all. And now we can query it.

So, this is a really stupid Rails app, and I threw a lot of load at it. By a lot, I mean a sufficient amount that there were well over 2,000 blog posts within a couple of hours, which was great. I tried to load the page in my browser and found that it was actually really, really slow, and I had added some basic search functionality, and that was even slower. So I was pretty sure there were going to be some requests that took a long time to process. Let's actually do a query and see what's going on there. All I'm going to do is count the number of rows that had a processing time greater than 60 milliseconds, so just a simple WHERE clause. I run the query, and in two and a half seconds I have 10,000 rows that took greater than 60 milliseconds. So, you know, there might be something going on.

Let's actually see what some of those are. I'm going to get the body and the processing time, and I'm going to sort by the processing time, descending, so I can see the worst ones first, right? Very, very simple stuff. And if you look at the processing time on those first rows, it's 36,000 milliseconds, 35,000 milliseconds, and each request is spending the vast majority of that, like 34,000 milliseconds, 34 seconds, rendering the view, and over 100, almost 200, milliseconds in the database. So there are clearly some problems here. I made an educated decision, based on experience, that the fact that I was trying to render 2,000-plus blog posts on a single page and then send that down the pipe might be the problem. So I added pagination, and it turns out things got better. I also added a basic database index on updated_at, because the posts are ordered by updated_at, and on title, so I could search titles. That got a lot better too, and the problem was fixed very, very quickly. And there are lots of records here; you can look through all of them. They're paginated, because BigQuery does not have the problem my blog did. So yeah: rendering was the biggest problem, and the database was also a problem. The two queries I just ran look about like this, reconstructed; log_forensics is the data set from a minute ago, and rails_log is my stand-in for whatever you named the table:
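```sql
-- Reconstructions of the two demo queries (legacy BigQuery SQL).
-- rails_log is an assumed table name; substitute your own.

-- How many requests took longer than 60 ms?
SELECT COUNT(*)
FROM [log_forensics.rails_log]
WHERE processing_time > 60

-- And what were the worst offenders?
SELECT body, processing_time
FROM [log_forensics.rails_log]
ORDER BY processing_time DESC
```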
And now, on to why. Why should you even bother trying this? The biggest thing is that you can do it retroactively. You have a problem, your current tool set isn't working, and having this in your toolbox is a great thing. It's also simple. You already know SQL, you already know how CSVs work, and you don't have to learn new things beyond some basic UI stuff; it's actually pretty self-explanatory to use. It's fast. If you're not on a cafe network, and I was on a cafe network, you can go from having your CSV to something you can act on in less than 10 minutes, and generating your CSV shouldn't take that long either, especially if you've already written your pull-out-the-interesting-bits script, which I'm guessing a lot of you have. I've been on that team that set up the Mongo database to dig into a problem, and that took a lot longer, as in orders of magnitude longer.

It's also really easy to share. You can pass links on to people. You can add people to the project so they can query the same data set you can. You can make visualizations from it. It's much easier to share this than some data sitting on a developer's machine that you can only access on their network, and only if you have their password or they've added you.

And cost. This demo, including storing and uploading the data, cost me less than $5. Queries are free, so you can query the data as much as you want. The only thing you have to pay for is storage, and the storage rates are competitive. It's going to depend on how much data you have, but I can point you at a blog post that will help you figure that out.

If you want to learn more: I will tweet a link to that blog post about storage. I will also tweet a link to a post by one of my colleagues in Japan about using Fluentd to live-stream your Rails logs into BigQuery; it's really cool, with a lot of cool stuff in it. And there's the BigQuery web UI quick start, which is just how to BigQuery using the sample data sets, and I'll tweet a link to that as well. I'm happy to show folks more stuff with the public data sets especially; there are a lot of really cool things you can do. There are the Wikipedia logs, and there are GitHub commit logs, all publicly accessible for interesting queries. I have a blog post that's mostly written tracking how multiple births have increased in the US since 1968, because one of the pieces of data the CDC records is whether a birth was a singleton, twins, triplets, and so on.

So, thank you. I want to thank the conference organizers for organizing this conference. It's awesome; I like coming here; Salt Lake is great. I want to thank my coworkers for helping me with this talk. And I want to thank everyone who helped me deal with the fact that my laptop had issues, and then my work laptop had issues, and Ryan, for letting me use his laptop to actually finish this talk. Thank you all very much.