Thank you all for coming. We'll be talking about Bash, CSVs, and command-line data analysis in general. I work on database infrastructure at GitHub, and here's my social media stuff. But relevant to this conference, my first job out of school was at the Delaware Valley Regional Planning Commission. They do a lot of urban, suburban, and rural planning for the greater Philadelphia region, and I was one of maybe two developers at a hundred-person organization. That really sparked my interest in open data and GIS, and it gave me a feel for how having a little bit of programming experience can go a long way toward reducing manual toil.

So this talk is about using the right tool for the job. Bash is not going to be great for everything, but in certain cases you can get a pretty good first pass on your data or logs without having to rely on external tooling.

So why use Bash? There's an article, a little tongue in cheek, about how command-line tools can be 235 times faster than your Hadoop cluster, and I think it brings up an interesting use case: when you're in the multi-gigabyte range, opening two gigabytes in Excel or Google Sheets is not going to be a fun time, but it's also not enough data to make big data tools like Hadoop worthwhile. So what are you left with? You can import it into a SQL database or something like that, but that can be a bit of a hassle if you don't have one up and running already, or you can do a pass with your command-line tools.

So when is Bash a good choice? When your data is line-based: all the core utilities understand lines really well. If you have multi-line data, like a stack trace, Bash is probably not going to be a good fit. It's also a good choice when you have a consistent structure. CSVs are a great example of that, and I'd also recommend it for things like server logs. But if you have a nested structure like JSON or XML, it's probably easier to use a tool built specifically for that.

The first step in any kind of data analysis is to peek at your data and see what you're working with. head is a utility that shows you the first n lines of a file; if you're more familiar with SQL, it's kind of like a SELECT * ... LIMIT n. tail is the opposite: it shows you the last n lines of your file, so it's like an ORDER BY ... DESC LIMIT n. wc stands for word count, but here we're going to pass the -l option to count lines. wc can count a lot of different things: lines, characters, words. Characters and words are probably not the most relevant for our CSV file, but if you want to write out a tweet ahead of time to make sure you're under the limit, it can be useful.

So to recap, from peeking at our population CSV we have a three-column file: city, state, population. Big cities are at the top of the file, smaller cities at the bottom, and there are 761 cities in total.

Now let's say you don't want all the lines in the file; you're looking for specific lines. grep is going to be your tool for that. It searches for a string and returns all the lines that match. You can use the -v flag to do the opposite and exclude lines you don't want to see.
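As a rough sketch of those first commands, assuming the file is named population.csv like the one used later in the talk:

  $ head -n 5 population.csv           # first five lines, roughly SELECT * LIMIT 5
  $ tail -n 5 population.csv           # last five lines
  $ wc -l population.csv               # how many lines (cities) are in the file
  $ grep 'Portland' population.csv     # only lines containing Portland
  $ grep -v 'Portland' population.csv  # every line except those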
You can see in this file that this is not the population of Honolulu the city, it's the census-designated place, and the same thing with Anchorage and Lexington here, so that's kind of interesting to see.

Now that we have the lines we want, let's say we don't want everything within the line; we only want certain fields. We can use cut to chop up the lines in a consistent manner. Here we're getting the first and the third columns, city and population, say because I don't want to see state. cut gets really interesting when you do multiple passes with it. Let's say I have a server log, but I only want the time portion of each line. We can do that with two passes of cut: first we chop up the line on spaces, and then we do another pass that chops up the result on the colon character. That way we can easily isolate the exact portion of the data we want, even when it isn't in an easily parsable format.

sort is one of those utilities that does what its name says. By default it works on the whole line, alphabetically, but in general we don't want to sort on the whole line, so we can give it a key column to sort on. In this example we're sorting by state, alphabetically. It gets more interesting when we use multiple key columns; now we can do fairly advanced things on the command line. Here we're sorting by state alphabetically and then by population numerically; that's what the n is for, so it treats the field as a number, not a word. The r is for reverse, so it orders descending and you get the biggest cities at the top.

Now we can chain even more things together and start to answer questions like: what is the biggest city in each state? We take the sort command from the previous slide and sort the result again, giving it the -u flag for unique and the state column as the key. That gives us one line per state, which is essentially the biggest city in each state.

Another question we might have: how many cities per state are there? We can cut out just the state column, that's the second one, and then use uniq with the -c flag to count how many times each state occurs. The important thing to note here is that uniq expects sorted input, so try to think of sort and uniq as a pair; otherwise you will not get the output that you want. We can see that in this file 178 cities are in California, 65 are in Texas, and 25 are in Michigan, which is not what I was expecting, so that's cool to see.

Another question we can answer is: where does Portland rank? We can sort the file by population and grep for Portland, passing the -n flag so grep returns the line number each match occurred on. We can see that Portland, Oregon is the 26th most populous city in our file, and Portland, Maine is the 537th.

Eventually we'll probably have to do something with multiple files. Let's say I have a file with a state FIPS code and a city name, and another file with the state FIPS code and the state name. If I want to see city name and state name together, how can I join them? There's a join command that works pretty much exactly how you'd expect a SQL join to work, so it's useful when you have these kinds of identifier columns.
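Sketches of those pipelines, assuming a comma-delimited population.csv with columns city,state,population, plus a hypothetical server.log and hypothetical cities.csv / states.csv files keyed on a FIPS code in their first column; the exact column numbers will depend on your data:

  $ cut -d, -f1,3 population.csv                                 # keep only city and population
  $ cut -d' ' -f2 server.log | cut -d: -f1                       # second space-separated field, then the part before the colon
  $ sort -t, -k2,2 population.csv                                # sort by state, alphabetically
  $ sort -t, -k2,2 -k3,3nr population.csv                        # by state, then population descending
  $ sort -t, -k2,2 -k3,3nr population.csv | sort -t, -k2,2 -u    # biggest city in each state
  $ cut -d, -f2 population.csv | sort | uniq -c                  # how many cities per state
  $ sort -t, -k3,3nr population.csv | grep -n 'Portland'         # Portland's rank by population
  $ join -t, cities.csv states.csv                               # SQL-style join on the first column; both files must be sorted on it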
Another way to work with multiple files is paste. paste is kind of weird in that it basically zips two files together line by line, which is handy if you don't have an ID column or anything to join on. If you pass the -s option it'll transpose the rows and columns; that's generally only useful if you have a very specific output format you're looking for, but it's there. A good example of paste: let's say I wanted to add a number to each line of this file, like a row number. seq is a command that prints out a sequence of numbers, one per line, so we can zip the output of seq with our population CSV, and now the output has a number on each line.

comm is an interesting way to compare files. comm produces three columns of output: the first column is lines that are only in the first file, the second column is lines that are only in the second file, and the third column is lines that are in both files. So it's good for diff-type operations. comm is also weird in that you pass it the columns you do not want printed. So if you wanted only the first column of output, you would pass it 2 and 3. Yeah, it's weird. In general I find myself using this last one most of the time, comm -12, which returns only the lines that are in both files. Think of it like a Venn diagram.

We don't always get data in the format that we want, so we'll have to do a little bit of cleanup. sed is a good option for that. sed is really powerful, but I find it's best for these simple search-and-replace type operations. We know that in our file the word "city" is repeated a bunch of times; if I don't want to see that anymore, we can search for " city" and replace it with nothing, and that's exactly what it'll do.

tr is kind of a niche command. It stands for translate, so the way to think about it is that you're translating one character set into another. The first example is translating spaces into newline characters, so if you have one big long row of things you can break it out into a bunch of lines. tr also has the concept of character classes, so you can even do things like upper- and lowercasing on the command line. And you can do things like shifting a bunch of numbers by one, and it works with letters too.
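Rough sketches of those, reusing the hypothetical cities.csv and states.csv from before and adding hypothetical a.txt, b.txt, and some_row.txt files:

  $ paste cities.csv states.csv                     # zip the two files together line by line
  $ seq 761 | paste - population.csv                # number each line (761 is the line count from earlier)
  $ comm -12 <(sort a.txt) <(sort b.txt)            # only lines that appear in both files
  $ comm -23 <(sort a.txt) <(sort b.txt)            # only lines unique to the first file (comm needs sorted input)
  $ sed 's/ city//' population.csv                  # search for " city" and replace it with nothing
  $ tr ' ' '\n' < some_row.txt                      # one long row becomes one item per line
  $ tr '[:lower:]' '[:upper:]' < population.csv     # upper-case everything
  $ echo 12345 | tr '0-8' '1-9'                     # shift each digit up by one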
That's a lot of commands, so let's go through an example. Throughout this talk I've used this population CSV file. It's a nice three-column file, pretty easy to see what's going on. Where did I get it? I did not download it directly from the American FactFinder website, I'm sure you won't be surprised to hear, but I did download the raw data from the census. This is what you can download directly, and it's a little bit hard to see what's going on. There's lots of great data in there, but let's say I don't want all of it.

First, you'll see there are two header rows in this file, which is nice, I guess. So the first step is to get rid of those header lines, and we can use tail for that. With this version of tail I'm basically saying start at the third line and give me everything from there down. Next, there are a lot of columns here, so let's get only the ones we're looking for. Through trial and error we can figure out that we want the 9th, the 10th, and the 19th columns. This is an important one, because you'll see that city and state are actually a quoted field, so technically that would be one column of CSV input. Bash doesn't really understand that; in this case it happens to work out, but if you have more complex CSV data you might have to do some interesting things to get Bash to handle it.

Next, let's get rid of the quotes in our file. We can use sed to do that: search for a quote character and replace it with nothing. We have the g at the end because there are multiple quotes per line; by default sed only replaces the first match and then moves on to the next line, and with the g it searches and replaces throughout the whole line. Now we have a leading space in the state name, so let's clean that up with another pass of sed, and we have our nice three-column CSV. We can save that to population.csv and we're all done.

Another fun example is the Shakespeare word count that you see in all the Hadoop tutorials. We can download the complete works as shakespeare.txt from Project Gutenberg, which clocks in at about five megabytes, so not all that much. Then we can run a handful of commands and get a word count. The interesting things here are that we use tr to delete everything that's not a letter, a number, or a space; this is using the idea of those character classes. That's the first tr. The second tr squeezes all the whitespace together, and we use the space character class because it deals with tabs, newlines, and things like that, and then it turns the spaces into newline characters so you get one word per line. Remember, all the core utilities like working with lines. Then we lowercase everything, sort and uniq it, and this is the output we get. We see pretty much what you'd expect for the most common words; it's once you get down into the single-digit counts that it starts to get interesting.

Thank you. All the materials, the CSV files and the slides, are at this GitHub repo. Feel free to ask me questions on Twitter or in person. Thank you.
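For reference, the two worked examples come out to something like the following; the raw census file name (raw_census.csv) is assumed, the column numbers 9, 10, and 19 are the ones from the talk, and the Shakespeare file is the Project Gutenberg text:

  # census cleanup: drop the two header rows, keep city/state/population,
  # strip the quotes, and drop the leading space before the state name
  $ tail -n +3 raw_census.csv | cut -d, -f9,10,19 | sed 's/"//g' | sed 's/, /,/' > population.csv

  # Shakespeare word count: keep only letters, digits, and whitespace,
  # put one word per line, lowercase, then count occurrences
  $ tr -dc '[:alnum:][:space:]' < shakespeare.txt | tr -s '[:space:]' '\n' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head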