All right, so welcome to today's lecture, which is going to be on data wrangling. Data wrangling might be a phrase that sounds a little bit odd to you, but the basic idea of data wrangling is that you have data in one format and you want it in some different format. And this happens all of the time. I'm not just talking about converting images; it could be that you have a text file or a log file, and what you really want is the data in some other format, like a graph, or statistics over the data. Anything that goes from one piece of data to another representation of that data is what I would call data wrangling. We've seen some examples of this kind of data wrangling already previously in the semester. Basically whenever you use the pipe operator, which lets you take output from one program and feed it through another program, you are doing data wrangling in one way or another. But what we're going to do in this lecture is take a look at some of the fancier and really useful ways you can do data wrangling. In order to do any kind of data wrangling, though, you need a data source; you need some data to operate on in the first place. And there are a lot of good candidates for that kind of data. We give some examples in the exercise section for today's lecture notes. In this particular lecture, though, I'm going to be using a system log. I have a server that's running somewhere in the Netherlands, because that seemed like a reasonable thing at the time. That server is running the regular logging daemon that comes with systemd, the relatively standard Linux logging mechanism. And there's a command called journalctl on Linux systems that will let you view the system log. So what I'm going to do is some transformations over that log, and we'll see if we can extract something interesting from it.
You'll see, though, that if I run this command, I end up with a lot of data, because this is a log that has just, like, a lot of stuff in it, right? A lot of things have happened on my server, and this goes back to January 1st, and there are logs that go even further back. So there's a lot of stuff. So the first thing we're going to do is try to limit it down to only one piece of content. And here, the grep command is your friend. We're going to pipe this through grep, and we're going to grep for ssh. Now, ssh we haven't really talked to you about yet, but it is a way to access computers remotely through the command line. And in particular, what happens when you put a server on the public internet is that lots and lots of people around the world try to connect to it and log in and take over your server. And I want to see how those people are trying to do that. So I'm going to grep for ssh, and you'll see pretty quickly that this also generates a bunch of content, at least in theory. Oh, this is going to be real slow. There we go. So this generates tons and tons and tons of content, and it's really hard to even just visualize what's going on here. So let's look at only what usernames people have used to try to log into my server. You'll see some of these lines say 'Disconnected from invalid user' and then some username. I want only those lines; that's all I really care about. I'm going to make one more change here, though. Think about what this pipeline does: if I add another grep for 'Disconnected from' here, this pipeline at the bottom will send the entire log file over the network to my machine, then locally run grep to find only the lines that contain ssh, and then locally filter them further. This seems a little bit wasteful, because I don't care about most of these lines, and the remote side is also running a shell.
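A rough sketch of the local-filtering pipeline described above. The hostname and journal lines are made up here so the example is self-contained; on a real systemd machine the data would come from journalctl.

```shell
# Simulated journal output standing in for `journalctl` on the server.
journal() {
  cat <<'EOF'
Jan 17 03:13:00 myhost systemd[1]: Started Session 7 of user jon.
Jan 17 03:13:05 myhost sshd[1234]: Disconnected from invalid user admin 46.97.239.16 port 55920 [preauth]
Jan 17 03:13:09 myhost sshd[1235]: Disconnected from authenticating user root 121.18.238.98 port 33537 [preauth]
EOF
}

# Keep only the ssh lines, then only the disconnect lines -- the same
# two-stage grep as in the lecture, just over the fake data.
matches=$(journal | grep sshd | grep "Disconnected from")
count=$(printf '%s\n' "$matches" | wc -l)
echo "$count"
```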
So what I can actually do is have that entire command run on the server. I'm telling ssh that the command I want it to run on the server is this pipeline of three things, and then what I get back, I want to pipe through less. So what does this do? Well, it's going to do the same filtering that we did, but it's going to do it on the server side, and the server is only going to send me the lines that I care about. And then I pipe it locally through the program called less. less is a pager; you'll see some examples of this. You've actually seen some of them already: when you type man and some command, that opens in a pager. A pager is a convenient way to take a long piece of content and fit it into your terminal window, and have you scroll down and scroll up and navigate it, so that it doesn't just scroll past your screen. And so if I run this, it still takes a little while, because it has to parse through a lot of log files, and in particular grep is buffering, and therefore it decides to be relatively unhelpful. Let me do this without; let's see if that's more helpful. Why doesn't it want to be helpful to me? Fine, I'm going to cheat a little, just ignore me. Or the internet is really slow. Those are two possible options. Luckily, there's a fix for that, because previously I have run the following command. This command just takes the output of that pipeline and sticks it into a file locally on my computer. I ran this when I was up in my office, and what it did is download all of the ssh log entries that match 'Disconnected from'. So I have those locally, and this is really handy. There's no reason for me to stream the full log every single time, because I know that that starting pattern is what I'm going to want anyway. So we can take a look at ssh.log, and you will see there are lots and lots and lots of lines that all say disconnected from invalid user, authenticating user, et cetera.
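The server-side variant can be sketched like this. The ssh invocation is shown as a comment since it needs a real host (myserver is a placeholder), but the redirect-to-file trick works the same locally:

```shell
# On a real setup you would run the whole pipeline remotely and save it once:
#   ssh myserver 'journalctl | grep sshd | grep "Disconnected from"' > ssh.log
# Here we fake the remote output so the redirect itself can be demonstrated.
printf '%s\n' \
  'Jan 17 03:13:05 myhost sshd[1234]: Disconnected from invalid user admin 46.97.239.16 port 55920 [preauth]' \
  > ssh.log
lines=$(wc -l < ssh.log)
echo "$lines"
```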
So these are the lines that we have to work on, and this also means that going forward, we don't have to go through this whole ssh process. We can just cat that file and then operate on it directly. So here I can also demonstrate this pager. If I do cat ssh.log and I pipe it through less, it gives me a pager where I can scroll up and down. Let me make that a little bit smaller, maybe. So I can scroll through this file, and I can do so with what are roughly vim bindings: Ctrl-U to scroll up, Ctrl-D to scroll down, and q to exit. This is still a lot of content, though, and these lines contain a bunch of garbage that I'm not really interested in. What I really want to see is: what are these usernames? And here the tool that we're gonna start using is one called sed. sed is a stream editor; it's a modification of a much earlier program called ed, which was a really weird editor that none of you will probably want to use. Yeah? Sorry, go ahead. Oh, tsp is the name of the remote computer I'm connecting to. So sed is a stream editor, and it basically lets you make changes to the contents of a stream. You can think of it a little bit like doing replacements, but it's actually a full programming language over the stream it is given. One of the most common things you do with sed, though, is to just run replacement expressions on an input stream. What do these look like? Well, let me show you. Here I'm gonna pipe this through sed, and I'm going to say that I wanna remove everything that comes before 'Disconnected from'. This might look a little weird. The observation is that the date and the hostname and the process ID of the ssh daemon are things I don't care about; I can just remove them straight away. And I can also remove that 'Disconnected from' bit, because that seems to be present in every single log entry, so I just wanna get rid of it. And so what I write is a sed expression.
In this particular case it's an s expression, which is a substitute expression. It takes two arguments that are basically enclosed in these slashes. The first one is the search pattern, and the second one, which is currently empty, is the replacement string. So here I'm saying: search for the following pattern and replace it with nothing. And then I'm gonna pipe it into less at the end. So you see that what it's done now is trim off the beginning of all these lines. And that seems really handy, but you might wonder what this pattern I've built up here actually is, right? This dot star, what does that mean? This is an example of a regular expression. Regular expressions are something that you may have come across in programming in the past, but it's something that once you go into the command line, you will find yourself using a lot, especially for this kind of data wrangling. Regular expressions are essentially a powerful way to match text. You can use them for things other than text too, but text is the most common example. In regular expressions you have a number of special characters that say: don't just match this character, but match, for example, a particular type of character, or a particular set of options. It essentially generates a program for you that searches the given text. Dot, for example, means any single character. And star: if you follow a character with a star, it means zero or more of that character. So in this case this pattern is saying: zero or more of any character, followed by the literal string 'Disconnected from'. I'm saying match that, and then replace it with nothing. Regular expressions have a number of these kinds of special characters with various meanings you can take advantage of. I talked about star, which is zero or more. There's also plus, which is one or more. So this is saying I want the previous expression to match at least once. You also have square brackets. Square brackets let you match one of many different characters.
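The substitution in question, run over a single made-up log line (the hostname and IP are invented for the example):

```shell
line='Jan 17 03:13:05 myhost sshd[1234]: Disconnected from invalid user admin 46.97.239.16 port 55920'
# s/pattern/replacement/: delete everything up to and including
# "Disconnected from " -- .* is greedy, but there is only one occurrence here.
trimmed=$(echo "$line" | sed 's/.*Disconnected from //')
echo "$trimmed"
```

This leaves just the tail of the line, starting at "invalid user".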
So here let us build up a string, something like aba, and I want to substitute a and b with nothing. Okay, so here what I'm telling the pattern to do is to replace any character that is either a or b with nothing. So if I make the first character b, it will still produce ba. You might wonder, though: why did it only replace once? Well, it's because what regular expressions will do, especially in this default mode, is match the pattern once and then apply the replacement once per line. That is what sed normally does. You can provide the g modifier, which says: do this as many times as it keeps matching, which in this case would erase the entire line, because every single character is either an a or a b. If I added a c here, it would remove everything but the c. If I added other characters in the middle of this string somewhere, they would all be preserved, but anything that is an a or a b is removed. You can also do things like add modifiers to this. For example, what would this do? This is saying I want zero or more of the string ab, and I want to replace them with nothing. This means that if I have a standalone a, it will not be replaced. If I have a standalone b, it will not be replaced. But if I have the string ab, it will be removed. Which... oh, what did I... oh, sed is being stupid. The dash E here is because sed is a really old tool, and so by default it supports only a very old version of regular expressions. Generally you will want to run it with dash capital E, which makes it use a more modern syntax that supports more things. If you are somewhere where you can't, you have to prefix these with backslashes to say: I want the special meaning of parentheses. Otherwise it would just match a literal parenthesis, which is probably not what you want. So notice how this replaced the ab here, and it replaced the ab here, but it left this c, and it also left the a at the end, because that a does not match this pattern anymore. And you can group these patterns in whatever ways you want.
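The toy substitutions from this part of the demo, reproduced as one-liners:

```shell
# Without g, sed replaces only the first match on the line:
one=$(echo 'aba' | sed 's/[ab]//')
# With g, it keeps going -- every a or b is removed, only c survives:
all=$(echo 'acba' | sed 's/[ab]//g')
# -E enables the modern syntax, so (ab)* means zero or more of "ab";
# standalone a's and b's that aren't part of an "ab" pair survive:
grp=$(echo 'ababc' | sed -E 's/(ab)*//g')
echo "$one $all $grp"
```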
You also have things like alternations. You can say: anything that matches ab or bc, I want to remove. And here you'll notice that this ab got removed. This bc did not get removed, even though it matches the pattern, because the ab had already been removed. This ab is removed, but the c stays in place. This ab is removed, and the c stays, because it still does not match. If I remove this a, then now the ab pattern will not match this b, so it'll be preserved, and then bc will match bc, and it'll go away. Regular expressions can be all sorts of complicated when you first encounter them, and even once you get more experienced with them, they can be daunting to look at. And this is why very often you want to use something like a regular expression debugger, which we'll look at in a little bit. But first let's try to make up a pattern that will match the logs that we've been working with so far. So here I'm gonna just extract a couple of lines from this file; let's say the first five. So these lines all now look like this, right? And what we want is to only have the username. Okay, so what might this look like? Well, here's one thing we could try to do. Actually, let me show you one thing first. Let me take a line that says something like 'Disconnected from invalid user Disconnected from', maybe some port number, whatever. Okay, so this is an example of a login line where someone tried to log in with the username 'Disconnected from'. Missing an s? Missing an s, very beginning? First 'disconnected'? Disconnected, thank you. You'll notice that this actually removed the username as well. And this is because when you use dot star, and any of these sort of range expressions in regular expressions, they are greedy. They will match as much as they can.
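The alternation example can be checked the same way; note how the leading ab is consumed first, so the bc that overlapped it never gets a chance to match:

```shell
# (ab|bc) is tried as sed scans the line left to right.
# In "abcabc": "ab" matches at position 0, so the overlapping "bc" is
# never considered; scanning resumes at "c", finds "ab" again, leaves "c".
out=$(echo 'abcabc' | sed -E 's/(ab|bc)//g')
echo "$out"
```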
So in this case, this was the username that we wanted to retain, but this pattern actually matched all the way up until the second occurrence of it, or rather the last occurrence of it. And so everything before it, including the username itself, got removed. And so we need to come up with a slightly cleverer matching strategy than just saying dot star, because it means that if we have particularly adversarial input, we might end up with something that we didn't expect. Okay, so let's see how we might try to match these lines. Let's just do a head first. Well, let's try to construct this up from the beginning. We first of all know that we want dash capital E, because we want to not have to put all these backslashes everywhere. These lines look like they say 'from', and then some of them say 'invalid', but some of them do not, right? This line has invalid, that one does not. Question mark here is saying zero or one. So I want zero or one of 'invalid' space. Then 'user'. What else? That's gonna be a double space; we can't have that. And then there's gonna be some username. And then there's gonna be what, exactly? There's gonna be what looks like an IP address. So here we can use our range syntax: zero to nine and a dot, right? That's what IP addresses are, and we want many of those. Then it says 'port', so we're just gonna match a literal 'port', and then another number, zero to nine, and we're gonna want plus of that. The other thing we're gonna do here is what's known as anchoring the regular expression. There are two special characters for this in regular expressions. There's caret, or hat, which matches the beginning of a line, and there's dollar, which matches the end of a line. So here we're gonna say that this regular expression has to match the complete line. The reason we do this is because imagine that someone made their username the entire log string. Then if you tried to match this pattern, it would match inside the username itself, which is not what we want.
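Putting those pieces together gives roughly this anchored pattern, run here over an invented log line. At this stage a successful match erases the whole line, which is what the next step fixes with capture groups:

```shell
log='Jan 17 03:13:05 myhost sshd[1234]: Disconnected from invalid user admin 46.97.239.16 port 55920'
# ^ and $ anchor the match to the whole line; (invalid )? is optional;
# [0-9.]+ covers the IP; [0-9]+ covers the port number.
rest=$(echo "$log" | sed -E 's/^.*Disconnected from (invalid )?user .* [0-9.]+ port [0-9]+$//')
echo "[$rest]"
```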
Generally, you will want to try to anchor your patterns whenever you can, to avoid those kinds of oddities. Okay, let's see what that gives. That removed many of the lines, but not all of them. This one, for example, includes this 'preauth' at the end, so we'll wanna cut that off if there's a space 'preauth'. Square brackets are special, so we need to escape them. Now let's see what happens if we try more lines of this. We still get something weird: some of these lines are not empty, right? Which means that the pattern did not match. This one, for example, says 'authenticating user' instead of 'invalid user', okay? So it has to match 'invalid' or 'authenticating' zero or one times before 'user'. How about now? Okay, that looks pretty promising, but this output is not particularly helpful, right? Here, we've just successfully erased every line of our log file, which is not very helpful. Instead, what we really wanted to do is: when we match the username right over here, we really wanted to remember what that username was, because that is what we want to print out. And the way we can do that in regular expressions is using something called capture groups. Capture groups are a way to say: I want to remember this value and reuse it later. And in regular expressions, any parenthesized expression is going to be such a capture group. So we already actually have one here, which is this first group, and now we're creating a second one here. Notice that these parentheses don't do anything to the matching, right? They're just saying: this expression, as a unit. But we don't have any modifiers after it, so it just matches one time. And the reason capture groups are useful is that you can refer back to them in the replacement. So in the replacement here, I can say backslash two. This is how you refer back to the value of a capture group.
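The full pattern with capture groups, including the preauth handling, ends up close to this (again over an invented log line); \2 in the replacement pulls out the second group, which is the username:

```shell
log='Jan 17 03:13:05 myhost sshd[1234]: Disconnected from invalid user admin 46.97.239.16 port 55920 [preauth]'
# Group 1: optional "invalid " / "authenticating ".
# Group 2: the username.  Group 3: the optional " [preauth]" suffix.
user=$(echo "$log" | sed -E 's/^.*Disconnected from (invalid |authenticating )?user (.*) [0-9.]+ port [0-9]+( \[preauth\])?$/\2/')
echo "$user"
```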
In this case, I'm saying: match the entire line, and then in the replacement, put in the value you captured in the second capture group. Right? Remember, this is the first capture group, and this is the second one. And this gives me all the usernames. Now, if you look back at what we wrote, this is pretty complicated, right? It might make sense now that we've walked through why it had to be the way it was, but it's not obvious that this is how these lines work. And this is where a regular expression debugger can come in really, really handy. So we have one here. There are many online, but here I've prefilled the expression that we just used, and notice that it tells me what all the matching does. Now, this window is a little small with this font size, but this explanation says: dot star matches any character between zero and unlimited times, followed by 'Disconnected from' literally, followed by a capture group, and then it walks you through all the stuff. And that's one thing, but it also lets you give it a test string, and then it matches the pattern against every single test string that you give, and highlights what the different capture groups, for example, are. So here, we made the user a capture group, right? So it'll say, okay, the full string matched; the whole thing is blue, so it matched. Green is the first capture group. Red is the second capture group, and this is the third, because preauth was also put into parentheses. And this can be a handy way to try to debug your regular expressions. For example, let's add a new line here, and I make the username 'Disconnected from'... oh, that line already had the username be 'Disconnected from'. Great, past me was thinking ahead. You'll notice that with this pattern, this was no longer a problem, because it got matched as the username. But what happens if we take this entire line and make that the username? Now what happens?
It gets really confused, right? So this is where regular expressions can be a pain to get right, because it now matches the first place where a username appears, or in this case the second 'invalid', because this is greedy. We can make this non-greedy by putting a question mark here. If you suffix a plus or a star with a question mark, it becomes a non-greedy match, so it will not try to match as much as possible. And then you see that this actually gets parsed correctly, because this dot star will stop at the first 'Disconnected from', which is the one that's actually emitted by the ssh daemon, the one that actually appears in our logs. As you can probably tell from the explanation so far, regular expressions can get really complicated, and there are all sorts of weird modifiers that you might have to apply in your pattern. The only way to really learn them is to start with simple ones and then build them up until they match what you need. Often you're just doing some one-off job, like when we're hacking out the usernames here, and you don't need to care about all the special conditions, right? You don't have to care about someone having an ssh username that perfectly matches your logging format. That's probably not something that matters, because you're just trying to find the usernames. But regular expressions are really powerful, and you want to be careful if you're doing something where it actually matters. You had a question? Regular expressions by default only match per line anyway; they will not match across newlines. Yep. So the way that sed works is that it operates per line, and so sed will apply this expression to every line. Okay, questions about regular expressions or this pattern so far? It is a complicated pattern, so if it feels confusing, don't be worried about it; look at it in the debugger later. Yep. What if, after adding the question mark suffix, a user had two copies of that log line as their username?
Oh, so keep in mind that we're assuming here that the user only has control over their username, right? So the worst that they could do is take this entire entry and make that the username. Let's see what happens, right? So that's the worst case, and the reason is that this question mark means that the moment we hit the 'Disconnected' keyword, we start parsing the rest of the pattern. And the first occurrence of 'Disconnected' is printed by sshd, before anything the user controls. So in this particular instance, even this will not confuse the pattern. Yep. So this looks very much like a security application, but is it really? Well, this sort of odd matching, when you're doing data wrangling in general, is usually not security related, but it might mean that you get really weird data back. And so if you're doing something like plotting data, you might drop data points that matter, or you might parse out the wrong number, and then your plots suddenly have data points that weren't in the original data. So it's more that if you find yourself writing a complicated regular expression, double-check that it's actually matching what you think it's matching, even if it's not security related. And as you can imagine, these patterns can get really complicated. For example, there's a big debate about how you match an email address with a regular expression, and you might think of something like this. So this is a very straightforward one that just says: letters, numbers, dots, underscores, percent, and plus; the plus character is in there because in Gmail you can have pluses in email addresses with a suffix. Then there's a plus quantifier after the whole class, meaning any number of these, but at least one, because you can't have an email address that doesn't have anything before the @. And then similarly for the domain, right? And the top-level domain has to be at least two characters and can't include digits. You can have .com, but you can't have .7.
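The straightforward email pattern being described is, to the best of my recollection, something like the classic one below; it's easy to test against a couple of strings with grep -E. As the next paragraph notes, it wrongly rejects and accepts plenty of real-world cases:

```shell
# A common "good enough" email regex -- not RFC-correct, just illustrative.
pat='^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
ok=$(echo 'user+tag@example.com' | grep -cE "$pat")
bad=$(echo 'not-an-email' | grep -cE "$pat" || true)
echo "$ok $bad"
```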
It turns out this is not really correct, right? There are a bunch of valid email addresses that will not be matched by this, and there are a bunch of invalid email addresses that will be matched by this. So there are many, many suggestions, and there are people who have built full test suites to try to see which regular expression is best. This particular one is for URLs; there are similar ones for email, where they found that the best one is this one. I don't recommend you try to understand this pattern, but this one apparently will almost perfectly match what the internet standard for email addresses says is a valid email address. And that includes all sorts of weird Unicode code points. This is just to say: regular expressions can be really hairy, and if you end up somewhere like this, there's probably a better way to do it. For example, if you find yourself trying to parse HTML or JSON with regular expressions, you should probably use a different tool. And there is an exercise that has you do this. Not with regular expressions, mind you. Yeah, there are all sorts of suggestions, and they give you deep, deep dives into how they work, so if you want to look that up, it's in the lecture notes. Okay, so now we have the list of usernames. Let's go back to data wrangling. This list of usernames is still not that interesting to me. Let's see how many lines there are. If I do wc -l, there are 198,000 lines. wc is the word count program, and dash l makes it count the number of lines. This is a lot of lines, and if I start scrolling through them, that still doesn't really help me. I need statistics over this; I need aggregates of some kind. And the sed tool is useful for many things. It gives you a full programming language; it can do weird things like insert text or only print matching lines. But it's not necessarily the perfect tool for everything. Sometimes there are better tools.
For example, you could write a line counter in sed; you just should never do it. sed is a terrible programming language for anything except searching and replacing. But there are other useful tools. For example, there is a tool called sort. On its own, this is also not gonna be very helpful, but sort takes a bunch of lines of input, sorts them, and then prints them to your output. So in this case, I now get the sorted output of that list. It is still 200,000 lines long, so it's still not very helpful to me. But now I can combine it with a tool called uniq. uniq will look at a sorted list of lines, and it will only print those that are unique: if you have multiple instances of any given line, it will only print it once. And then I can say uniq -c. This is gonna say: count the number of duplicates for any lines that are duplicated, and eliminate the duplicates. What does this look like? Well, if I run it, it's gonna take a while. There were 13 'zzz' usernames; there were 10 'zxvf' usernames, et cetera. And I can scroll through this. This is still a very long list, right? But at least now it's a little bit more collated than it was. Let's see how many lines I'm down to now. Okay, 24,000 lines. It's still too much; it's still not useful information to me. But I can keep paring this down with more tools. For example, what I might care about is which usernames have been used the most. Well, I can do sort again, and I can say I want a numeric sort on the first column of the input. So dash n says numeric sort, and dash k lets you select a whitespace-separated column from the input to sort by. The reason I'm giving one comma one here is because I want to start at the first column and stop at the first column. Alternatively, I could say I want you to sort by this list of columns, but in this case, I just want to sort by that one column. And then I want only the 10 last lines. Because sort by default will output in ascending order.
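The collation pipeline just described, run over a tiny stand-in list of usernames (the real file has ~198,000 lines):

```shell
printf '%s\n' root admin root user root admin > users.txt
# sort puts duplicates next to each other, uniq -c collapses and counts
# them, sort -nk1,1 orders numerically by the count column, and tail
# keeps the most frequent entries (sort is ascending, so they come last).
top=$(sort users.txt | uniq -c | sort -nk1,1 | tail -n1 | awk '{print $2}')
echo "$top"
```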
So the ones with the highest counts are going to be at the bottom, and then I want only the last 10 lines. And now when I run this, I actually get a useful bit of data. It tells me there were 11,000 login attempts with the username root; there were 4,000 with 123456 as the username, et cetera. And this is pretty handy, right? Now suddenly this giant log file actually produces useful information for me. This is what I really wanted from that log file. Now maybe I want to quickly disable root login over ssh on my machine, for example. Which I recommend you all do, by the way. In this particular case, we don't actually need the -k for sort, because sort by default will sort by the entire line, and the number happens to come first, but it's useful to know about these additional flags. And you might wonder, well, how would I know that these flags exist? How would I know that these programs even exist? Well, the programs you usually pick up just from being told about them, in classes like this one. For the flags, it's usually something like: I want to sort by something that is not the full line. Your first instinct should be to type man sort and then read through the page, and very quickly it will tell you here's how to select a particular column, here's how to sort by a number, et cetera. Okay. Now that I have this top 20 list, say, let's say I don't actually care about the counts. I just want a comma-separated list of the usernames, because I'm gonna send it to myself by email every day or something like that: these are the top 20 usernames. Well, I can do this. Okay, that's a lot more weird commands, but they're commands that are useful to know about. awk is a column-based stream processor. We talked about sed, which is a stream editor; it primarily tries to edit the text in its input. awk, on the other hand, also lets you edit text, and it is still a full programming language, but it's more focused on columnar data.
So in this case, awk by default will parse its input into whitespace-separated columns and then let you operate on those columns separately. Here, I'm saying: just print the second column, which is the username, right? paste is a command that takes a bunch of lines and pastes them together into a single line (that's the dash s) with the delimiter comma. So if I run this, I will get a comma-separated list of the top 20 usernames, which I can then use for whatever I might want. Like maybe I wanna stick this in a config file of disallowed usernames, or something along those lines. awk is worth talking a little bit more about, because it turns out to be a really powerful language for this kind of data wrangling. We mentioned briefly what this print dollar two does, but it turns out that with awk, you can do some really, really fancy things. For example, let's go back to here, where we just had the usernames. Actually, let's still do sort and uniq, because otherwise the list gets far too long. And let's say that I only want to print the usernames that match a particular pattern. Let's say, for example, that I want all of the usernames that only appear once and that start with a c and end with an e. This is a really weird thing to look for, but in awk, it is really simple to express. I can say: I want the first column to be one, and I want the second column to match the following regular expression. Actually, this could probably just be dot. And then I wanna print the whole line. So unless I messed something up, this will give me all the usernames that start with a c, end with an e, and only appear once in my log. Now, that might not be a very useful thing to do with the data. What I'm trying to do in this lecture is show you the kinds of tools that are available. And in this particular case, the pattern is not that complicated, even though what we're doing is sort of weird.
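Both awk uses can be sketched against a small invented counts file (the format matches uniq -c output: count, then username):

```shell
printf '%s\n' '11156 root' '1 cruise' '1 core' '2 cole' > counts.txt

# Column extraction: $2 is the username; paste -sd, joins lines with commas.
joined=$(awk '{print $2}' counts.txt | paste -sd, -)
echo "$joined"

# awk as a filter: lines where the count is 1 AND the username starts
# with c and ends with e. The pattern selects lines; the block prints $2.
matched=$(awk '$1 == 1 && $2 ~ /^c.*e$/ { print $2 }' counts.txt)
echo "$matched"
```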
And this is because very often, with Linux tools in particular and command-line tools in general, the tools are built around lines of input and lines of output. And very often those lines are gonna have multiple columns, and awk is great for operating over columns. Now, awk is not just able to match per line; it lets you do things like, let's say I want the number of these, right? I wanna know how many usernames match this pattern. Well, I can do wc -l, that works just fine, right? There are 31 such usernames. But awk is a programming language. This is something that you will probably never end up doing yourself, but it's important to know that you can, and every now and again it is actually useful. This might be hard to read on my screen, I just realized; let me try to fix that in a second. Let's do... yeah, apparently fish does not want me to do that. So here, BEGIN is a special pattern that only matches before the first line. END is a special pattern that only matches after the last line. And then this is gonna be a normal pattern that's matched against every line. So what I'm saying here is: before the first line, set the variable rows to zero. On every line that matches this pattern, increment rows. And after you have matched the last line, print the value of rows. And this will have the same effect as running wc -l, but all within awk. In this particular instance, wc -l is just fine, but sometimes you wanna do things like keep a dictionary or a map of some kind. You might wanna compute statistics. You might wanna do things like: I want the second match of this pattern, so you need a stateful matcher: ignore the first match, but then print everything following the second match. And for that, this kind of simple programming in awk can be useful to know about.
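The BEGIN/END version of the counter looks roughly like this, over the same kind of counts file:

```shell
printf '%s\n' '1 cruise' '2 cole' '1 core' '5 root' > counts.txt
# BEGIN runs before any input, END after all of it; the middle rule
# fires once per matching line, incrementing the counter.
n=$(awk 'BEGIN { rows = 0 } $1 == 1 && $2 ~ /^c.*e$/ { rows += 1 } END { print rows }' counts.txt)
echo "$n"
```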
In fact, we could get rid of the sed and sort and uniq and grep that we originally used to produce this data and do it all in awk, but you probably don't wanna do that; it would probably be too painful to be worth it. It's worth talking a little bit about the other kinds of tools that you might want to use on the command line. The first of these is a really handy program called bc. bc is the Berkeley calculator, I believe; I think the name originally comes from that. Anyway, it is a very simple command-line calculator, but instead of giving you a prompt, it reads from standard in. So I can do something like echo 1+2 and pipe it to bc -l, because many of these programs normally operate in a stupid mode where they're unhelpful. So here, it prints three; wow, very impressive. But it turns out this can be really handy. Imagine you have a file with a bunch of lines. Let's say something like, oh, I don't know, this file. And let's say I want to sum up the number of logins for the usernames that have not been used only once. So for the ones where the count is not equal to one, I want to print just the count. This will give me the counts for all the non-single-use usernames. And then I want to know the total. Notice that I can't just count the lines; that wouldn't work, because there's a number on each line and I want their sum. Well, I can use paste to join the lines with plus. So this pastes every line together into a plus expression, right? And this is now an arithmetic expression, so I can pipe it through bc -l. And now: there have been 191,000 logins that share a username with at least one other login. Again, probably not something you really care about, but this is just to show you that you can extract this data pretty easily. And there's all sorts of other stuff you can do with this. For example, there are tools that let you compute statistics over inputs.
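A minimal sketch of the paste-into-bc trick, using a fake counts column in place of the real log-derived numbers:

```shell
# Three fake counts, one per line; paste -s joins all lines into one,
# and -d+ uses '+' as the join delimiter ("-" means read stdin).
printf '3\n5\n2\n' | paste -sd+ -
# prints the single line: 3+5+2
# piping that on into `bc -l` evaluates the expression and prints the sum
```

The nice thing is that paste doesn't know or care that it's building arithmetic; it's just gluing lines together with a delimiter.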
So for example, for this list of numbers, let's say I just printed out the distribution of counts. I could do things like use R. R is a separate programming language that's specifically built for statistical analysis. And I can say, let's see if I got this right... This is, again, a different programming language that you would have to learn. But if you already know R, or another language you can pipe things through, you can do this. So this gives me summary statistics over that input stream of numbers. So the median number of login attempts per username is three. The max is 10,000; that was root, as we saw before. It'll tell me the average was eight. This might not matter in this particular instance; these might not be interesting numbers. But if you're looking at things like the output of your benchmarking script, or something else where you have some numerical distribution you wanna look at, these tools are really handy. We can even do some simple plotting if we wanted to. So this has a bunch of numbers. Let's go back to our sort -nk1,1 and look at only the top five. gnuplot is a plotter that can read from standard in. I'm not expecting you to know all of these programming languages, because they really are programming languages in their own right; it's just to show you what is possible. So this is now a histogram of how many times each of the top five usernames has been used on my server since January 1st. And it's just one command line. It's a somewhat complicated command line, but it's just one command-line thing that you can do. There are two sort of special types of data wrangling that I wanna talk to you about in the last little bit of time that we have. And the first one is command-line argument wrangling.
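You don't strictly need R for quick summary statistics, either; as a sketch, even awk alone can compute min, max, and mean over a stream of numbers (the inputs here are made up, not the real login counts):

```shell
# Track a running sum plus min/max; NR is the number of lines seen so far.
printf '1\n4\n3\n10\n' \
  | awk '{ s += $1; if (NR == 1 || $1 < min) min = $1; if ($1 > max) max = $1 }
         END { printf "min %d max %d mean %.1f\n", min, max, s / NR }'
# prints: min 1 max 10 mean 4.5
```

R earns its keep once you want medians, quartiles, or real plots, but for a quick sanity check this kind of one-liner is often enough.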
Sometimes you might have something like what we actually looked at in the last lecture: things like find that produce a list of files, or maybe something that produces a list of arguments for your benchmarking script. Like, you want to run it with a particular distribution of arguments. Let's say you had a script that printed the number of iterations to run for a particular project, with an exponential distribution or something, one number per line, and you wanted to run your benchmark script for each one. Well, there's a tool called xargs that's your friend. So xargs takes lines of input and turns them into arguments. And this might look a little weird; let's see if I can come up with a good example for this. So I program in Rust, and Rust lets you install multiple versions of the compiler. So in this case, you can see that I have stable and beta, I have a couple of earlier stable releases, and I have a bunch of different dated nightlies. And this is all very well, but over time, like, I don't really need the nightly version from March of last year anymore. I can probably delete that. Every now and again, maybe I want to clean these up a little. Well, this is a list of lines, so I can grep for nightly. And I can get rid of the current nightly with grep -v; -v inverts the match, and I don't want to match the current nightly. Okay, so this is a list of dated nightlies. Maybe I want only the ones from 2019. And now I want to remove each of these toolchains from my machine. I could copy-paste each one into... so there's a rustup toolchain remove, or uninstall, maybe. Toolchain uninstall, right? So I could manually type out the name of each one, or copy-paste them, but that gets annoying really quickly because I have the list right here. So instead, how about I sed away the suffix that it adds, right? So now it's just that. And then I use xargs. So xargs takes a list of inputs and turns them into arguments.
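A sketch of that xargs step, with made-up toolchain names standing in for the real output of `rustup toolchain list` after the grep and sed steps; the echo in front means nothing actually gets uninstalled, it just shows the command that would run:

```shell
# xargs collects the input lines and appends them as arguments
# to the given command (here: echo rustup toolchain uninstall).
printf 'nightly-2019-03-01\nnightly-2019-04-01\n' \
  | xargs echo rustup toolchain uninstall
# prints: rustup toolchain uninstall nightly-2019-03-01 nightly-2019-04-01
```

Dropping the echo would execute the real command with all the names as arguments in one invocation.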
So I want these to become arguments to rustup toolchain uninstall. And just for my own sanity's sake, I'm gonna prefix this with echo, just so it shows which command it's gonna run. Well, it's relatively unhelpful, or hard to read at least, but you see the command it's going to execute: if I remove the echo, it's rustup toolchain uninstall and then the list of nightlies as arguments to that program. So if I run this, it uninstalls every toolchain, instead of me having to copy-paste them. So this is one example where this kind of data wrangling can actually be useful for tasks other than just looking at data; it's just going from one format to another. You can also wrangle binary data. A good example of this is stuff like videos and images, where you might actually want to operate over them in some interesting way. So for example, there's a tool called ffmpeg. ffmpeg is for encoding and decoding video, and to some extent images. I'm gonna set its log level to panic, because otherwise it prints a bunch of stuff. I want it to read from /dev/video0, which is my webcam video device. And I want it to take the first frame, so I just want it to take a picture. And I want it to produce an image rather than a single-frame video file. And I want it to print its output, the image it captures, to standard output. Dash is usually the way you tell a program to use standard input or output rather than a given file. So here it expects a file name, and the file name dash means standard output in this context. And then I wanna pipe that through a program called convert. convert is an image manipulation program. I wanna tell convert to read from standard input, turn the image into the gray color space, and then write the resulting image to the file dash, which is standard output. And then I wanna pipe that into gzip, which is gonna compress this image file; it also just operates on standard input and standard output. And then I'm gonna pipe that to my remote server.
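As a runnable stand-in for the shape of this pipeline (no webcam, convert, or SSH host needed), here's the same compress, ship, decode, and tee-a-copy pattern done locally with plain text standing in for the image bytes; copy.txt is a made-up file name playing the role that copy.png plays in the lecture:

```shell
# Compress a byte stream, decompress it "on the other side" of the pipe,
# and tee a copy of the decoded bytes to a file while streaming onward.
printf 'pretend image bytes\n' | gzip | gunzip | tee copy.txt
# prints the original bytes, and copy.txt now holds the decoded copy
```

The point is that every stage only sees bytes on stdin and stdout, so swapping the text for an image, or a local pipe for an SSH connection, changes nothing about the structure.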
And on that server I'm gonna decode that image, and then I'm gonna store a copy of it. So remember, tee reads input and prints it both to standard out and to a file. This is gonna make a copy of the decoded image file, as copy.png, and then it's gonna continue to stream that out. So now I'm gonna bring that back into a local stream, and here I'm going to display it in an image viewer. Let's see if that works. Hey! So this now did a round trip to my server and then came back over pipes. And there's now a decompressed version of this file, at least in theory, on my server. Let's see if that's there. I'll scp copy.png from the server to here... hey, the same file ended up on the server, so our pipeline worked. Again, this is a sort of silly example, but it lets you see the power of building these pipelines, where it doesn't have to be textual data; it's just taking data from any format to any other. Like, for example, if I wanted to, I could cat /dev/video0 and then pipe that to a server that, say, Anish controls, and then he could watch that video stream by piping it into a video player on his machine, if we wanted to, right? You just need to know that these things exist. There are a bunch of exercises for this lab, and some of them rely on you having a data source that looks a little bit like a log. For macOS and Linux, we give you some commands you can try to experiment with, but keep in mind that it's not that important exactly what data source you use. It's more: find some data source where you think there might be an interesting signal, and then try to extract something interesting from it. That is what all of the exercises are about. We will not have class on Monday because it's MLK day, so the next lecture will be Tuesday, on command-line environments. Any questions about what we've covered so far, or the pipelines, or regular expressions? I really recommend that you look into regular expressions and try to learn them.
They are extremely handy, both for this and in programming in general. And if you have any questions, come to office hours and we'll help you out.