All right. Charging ahead, no rest for the wicked, except moving this to another screen. Great. All right, so data wrangling. Data wrangling is a kind of weird topic, but it is what you usually end up doing on the command line once you start getting used to it: you have a bunch of text, and you want a different bunch of text. Usually, you want less than what you have. For example, my system log, or rather my server's system log, has a lot of stuff in it. And it's really annoying to try to find anything in that log because it is just super long. You're going to have to condense it down, and the command line is actually really good at giving you tools for narrowing down the stuff that you're looking at. We did a little bit of this in the past lecture, talking about things like grep that help you do very basic filtering, and pipes that let you combine commands. But today, we're basically going to look at all the tools that help you massage data from one format to another, into the kind of form that you want it to be in. We're going to operate basically on my own server log. So in this case (we'll probably cover ssh at some point) the basic idea is that you can run a command on a remote machine. For me, this ssh connects to a machine I have in the Netherlands and runs the journalctl command, which just prints the entire system log. I happen to have done this before, although I could have written all these commands live, piping the output through whatever I write next. I'm going to just use the ssh.log file instead, because it has all the same stuff, and that way I don't have to do it all over the network. And if the network cuts out, everything still works. So let's imagine that we want to look at ssh stuff. Specifically, my server, because it's public, gets a lot of people trying to log into it. I run my server on an alternative ssh port, which is really nice. It means fewer people try to connect to it.
But I still get lots and lots and lots of visits to that machine. And I want to look at some statistics about that in particular. Let's figure out the kind of usernames that people are trying to log into my computer as. Now, starting with the log, we might imagine, first of all, that we're just going to grep for sshd. Actually, let me go to bash. Right, so we're going to grep for sshd. And that's still far too much text. This is a lot of noise, especially at this font size. And you don't really know what you're looking for. However, you'll notice that there are these "Disconnected from" messages. So this says a username. If we go further up, "Disconnected from invalid user". So it looks like those are lines that contain usernames. So let's actually grep for "Disconnected from" and see what we find. That looks a little bit better. So all of the lines that are printed now at least seem to contain usernames. OK, so that's a start. Now we sort of want to get rid of the other crap that's on this line. And the way you normally do this is using a tool called sed. sed is a stream editor that basically lets you write commands that edit a given line in a text file. In this case, we're going to use sed and we're going to say we're going to substitute, and I'm going to come back to what this means. So notice that what this did was cut off the entire beginning of that line. So if I give you an example, let's do tail -n5. So these are the last five lines from the thing above. Those turned into this, which is a lot shorter, right? So this command: I piped through sed with this argument. sed is a line editor, as I mentioned. And the way it works is you give it a bunch of commands that operate on every line of the input. In this case, every line is a login attempt. And what I'm trying to do is at least remove all the crap at the beginning of the line. And the s command for sed does that.
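To make this concrete, here is a runnable sketch of the pipeline so far; the log lines are made-up samples in the journalctl/sshd style, since the real lecture runs against a live server log:

```shell
# Two fake journalctl-style lines standing in for the real server log.
log='Jan 17 03:01:40 host sshd[1234]: Disconnected from invalid user admin 1.2.3.4 port 55555 [preauth]
Jan 17 03:02:01 host sshd[1235]: Accepted session for jon'

# Keep only the sshd "Disconnected from" lines, then trim the prefix with sed.
echo "$log" | grep sshd | grep "Disconnected from" \
  | sed 's/.*Disconnected from //'
# → invalid user admin 1.2.3.4 port 55555 [preauth]
```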
It substitutes whatever matches the pattern between the first and the second slash with the contents that are between the second and third slash. In this case, it's replacing it with nothing, thereby removing it. The stuff in between is what's known as a regular expression. Some of you may have heard of these in the past; some of you may have some experience with them. Regular expressions are pretty hairy beasts, but they are really, really useful. And so I'll go through some of the basics, and then I recommend you also look them up on your own, because they are really handy and you can do a lot of cool things with them. In particular, in a regular expression, there are a couple of basic symbols that you're going to end up using a lot. Dot means any single character. It matches any single character no matter what it is. So it's sort of like the question mark when we talked about globbing. Star means zero or more of the preceding thing, the preceding pattern. So if I write dot star, it means zero or more of any character, which is basically any amount of text. So in the pattern above, when we substituted like this, what it really meant is: substitute any string of characters followed by the string "disconnected from", which is what trims off the entire beginning of the line, including the date, the host name, the name of the process, the process ID. I didn't have to match those explicitly. I just said any string of characters preceding "disconnected from". You also have, so star is zero or more, plus is one or more. This is convenient if you explicitly want to talk about things that are non-empty. You also have square brackets. If you write something like [abc] in a pattern, it means a or b or c: any single character from that set. So if I wrote [abc]+, it would be any non-empty sequence of a's, b's, and c's, in any order. You also have patterns like this.
Where rx1 and rx2 are patterns, (rx1|rx2) means anything that matches either pattern one or pattern two in this location. So if either matches, then this pattern matches; otherwise it does not. There is also caret, which means start of line, and there is dollar sign, which means end of line. These are handy for anchoring matches, saying I want to remove from the end, for example. If I wrote a pattern like foo$, it would mean: substitute the foo if it's at the end of the line, and replace it with nothing. If there were other foos in that line, they would not be removed. And then you get to combine these in interesting ways that basically let you express really complicated patterns. So we'll see how that turns out by looking more at this log. sed, so there are a lot of different implementations of regular expressions. sed uses a particularly ancient one that's a little bit of a pain to use, but it is the tool that people most often end up using for this purpose. In general, in sed, whenever you use any of these special characters, except for dot, star, and square brackets, you need to put a backslash in front of them. So inside a sed command, if I wanted to do zero or more a's, one or more b's, and then c or d, I'd have to put a backslash before all of these to give them their special meaning. This is just something to be aware of; sed is a little bit annoying in this way. The other thing you can do, if you don't want to write all those backslashes, is pass the -E flag to sed, and then it will just assume that all special characters are special. This brings it more in line with the regular expressions you might experience in other programming languages, like Python, Ruby, whatever. Okay, so let's look back at what we had up here. So this worked pretty well, this sed expression with dot star and "disconnected from" to remove the beginning of the line.
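A quick way to try these symbols out on made-up strings; grep -E uses the modern syntax, and the last two lines contrast sed's backslash-heavy classic syntax with sed -E (the \+ and \| escapes assume GNU sed):

```shell
printf 'cat\ncot\ncart\n' | grep -E 'c.t'      # . matches any one char: cat, cot
printf 'b\nab\naab\n'     | grep -E '^a*b$'    # * is zero or more: all three lines
printf 'abc\nxyz\n'       | grep -E '^[abc]+$' # one or more of a, b, c: abc only

# Zero or more a's, one or more b's, then c or d, in both sed dialects:
echo 'aabbc' | sed 's/a*b\+\(c\|d\)/X/'    # classic (GNU) syntax → X
echo 'aabbc' | sed -E 's/a*b+(c|d)/X/'     # -E syntax → X
```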
However, it puts us in a little bit of an awkward spot, because what if someone tried to log into my server with the username "Disconnected from"? Right? So let's take an example line from this log. I'm just going to echo that line, and then I'm going to replace the username with "Disconnected from". So that's the username they tried to log into my server with. Let's see what happens if we now try to use the same thing. I'm going to ignore the greps because there's only a single line, and we're going to try to use the same pattern: "disconnected from", replaced with nothing. It removed the entire username from the line too. This is because star and plus in regexes are greedy by default. They match as much as possible. This means that even when they encounter something that matches, they will keep going and see if they can match something later in the stream instead. Normally, in most implementations of regular expressions, you can curb this by putting a question mark after a star or a plus. If you do this, you're saying: don't be greedy about this operator, stop when you hit the first match. Unfortunately, sed is not smart enough to understand that operator. We can switch to using perl instead. perl also has a line-by-line editor mode that supports things like substitutions. So if we do that, then you'll see it exhibits the same behavior as sed, but if I put a question mark after the star, then now it keeps the username. It does a non-greedy match, right? And so generally, using perl can be nice in these settings, but sed is usually the tool that you have to work with, because otherwise you require that the user has perl or Python or whatever installed, whereas sed is just always there. It comes with every Linux distribution and macOS; all of them have sed. So the question is, how can we make this better?
Well, we're gonna stick with sed and we're gonna see if we can just make the problem go away by ignoring it for long enough, which is often a solution in computer science problems. In particular, let's ignore that problem for now and look at the log that we got so far. It has a bunch of crap at the end of it, right? So following this sort of "invalid user" or "authenticating user" or "user", there's an IP address, a port, and some stuff at the end. We wanna get rid of that too. We just want the username; that's all we care about. In order to do that, let's try to remove all the stuff at the end too. So this sed expression we had, we're gonna extend it to try to match more of the stuff that we have. And here I'm gonna add the -E flag so I don't have to add all the backslashes. In particular, it's gonna say something like "disconnected from", then "invalid" or "authenticating". Question mark means zero or one of the preceding thing. So notice that this is saying: there's gonna be a bunch of stuff at the beginning, then it's gonna say "disconnected from", then it's gonna say either "invalid " or "authenticating ", if at all, that's what the question mark means, followed by "user". So this is gonna match anything that says "disconnected from invalid user" or "disconnected from authenticating user" or "disconnected from user". Any of those are gonna match this pattern. And in fact, we can test it. This removed all that stuff at the beginning. Now each line just starts with the username, although there's a space there that we can remove by adding a space to the pattern. And then we're gonna match on the username itself, which, we have no idea what it contains: any string of characters. We could use a plus if we want, because we know that it's not empty. And then after that, there's an IP address, this business here. And an IP address, we all know, is really just anything that doesn't have a space in it. So you can do this. Square brackets, remember, were a set of characters, right?
You can prefix that set with a caret, and that means any character that is not in this set. So what this is saying is any character that is not a space, and plus of that, so one or more non-space characters, which will match any IP address, because IP addresses can't have spaces in them. And similarly, if there happened to be an IPv6 address or a host name or something, those would also be matched, because they cannot have spaces in them either. Following that, it says "port", and then there's a port number, which we can do [0-9]+ for. So you can also have ranges inside of square brackets. A range is any character that falls between the two endpoints. So in this case, zero to nine means zero, one, two, three, four, five, six, seven, eight, nine, so any of the digits, and one or more of those will match all the port numbers that could possibly be there. And then we'll notice some of the lines have this [preauth] at the end, but not all of them. Like this one does not have [preauth] at the end, right? So we want to optionally match on [preauth], and that's gonna be the end of the line. Let me see if I can give that to you with more convenient line wrapping. All right, so this pattern looks pretty complicated, but you sort of see how we constructed it, right? We go piece by piece through what's in the line and ask how we want it to be different. So at the end here, what I ended up adding was an optional [preauth]; remember, optional is done with a question mark. So this is saying: either have the literal string space, open square bracket (notice the backslash is now saying don't treat this as special), preauth, close square bracket, either that or nothing, followed by the end of the line. And you'll see that if I run this, I get lots of empty lines. Now why is that? Well, it's because what we told sed to do was replace anything that matches with the empty string, right?
But we actually sort of want to keep the username. Otherwise, this is pretty useless. It turns out regular expressions support this too, using something called capture groups. Any pattern that's enclosed in parentheses is automatically sort of saved by the engine. So in this case, if I put parentheses around this dot plus, which was the username, notice that this doesn't change what the pattern matches at all, right? The pattern still matches the same thing, because parentheses don't match anything in and of themselves. But it causes the engine to remember whatever was matched by the stuff inside the parentheses. And you can refer to that by using backslash and the ordinal number of the parenthesis. So remember, there's one parenthesis here, that will be assigned to backslash one. There's one parenthesis here, that'll be backslash two. There's one parenthesis here, that'll be backslash three. So in this case, what we want to do is substitute with backslash two, right? Because backslash two is where the username is. Does that make sense? Right, because backslash one is gonna be whether it said invalid or authenticating or nothing, and backslash three is gonna be whether it said [preauth] or not. Backslash two is the one we just added, which is gonna bind the username. So if we run this, we get the actual usernames. And now let's reflect on what happens if someone chose the username "Disconnected from". Because we've told this that it has to match the entire line, if someone gave "Disconnected from" as the username, that "Disconnected from" would not match the "disconnected from" at the beginning of the pattern, because it has to be followed by invalid or authenticating, user, the IP address, the port, [preauth], and then the end of the line. So this pattern actually gets rid of the entire problem of that username; whether someone was using it accidentally or on purpose, I won't tell you.
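Putting all the pieces together, the finished substitution looks roughly like this (my reconstruction, tested against one made-up log line rather than the lecture's real log):

```shell
line='Jan 17 03:01:40 host sshd[1234]: Disconnected from invalid user admin 1.2.3.4 port 55555 [preauth]'

# Match the whole line; capture group 2 is the username, and \2 keeps only it.
echo "$line" | sed -E \
  's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
# → admin
```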
And it turns out that there are really good tools for dealing with this. So if I can, who knows how this is gonna work, .github.io, yeah. So I've linked to this tool called regex101. This is an online regular expression tester. Here I've put in this entire pattern, and it will tell you how the pattern works. It will sort of explain the pattern to you in plain English, and then you can give it test strings and see what matches what. So in this case, if I put the parens back around the username here, it will also highlight the capture groups, right? So "invalid" here is capture group one, the wp-user is capture group two, and [preauth] is capture group three. So this is a handy way to debug your regular expressions if they're not quite matching what you thought they would, right? And it will tell you where all the groups are, and which of these strings matched, and how. I've linked this in the lecture notes, so feel free to check that out there. All right, so back to this. Now we have this list of usernames. So that's pretty exciting. I'm not gonna cover too much more about regular expressions, except to say that they can get really, really hairy. There's one other thing I wanted to show you, which is this regex. It turns out you can write regular expressions to match nearly anything. Like, you can match any email address that's valid according to the RFC, but it's not entirely straightforward. So if this works, maybe, I can't zoom in anymore. Ah, apparently I could. Right, so look at this regular expression. This matches a valid email address according to the RFC. You probably don't wanna write it yourself, but regular expressions can do a lot of really cool things. Like, this regular expression right here will, if you take any number, print that many ones, and match them against this regular expression, tell you whether or not that number is a prime. It's amazing that this is possible, but it is.
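That prime trick is a classic regex party piece: write n as a string of n ones, and the pattern `^1?$|^(11+?)\1+$` matches exactly when n is not prime, because it finds a way to split the ones into two or more equal groups. A small sketch, using perl since the trick needs backreferences (the `is_prime` helper name is mine):

```shell
# Succeeds (exit 0) when $1 is prime: n ones match the pattern iff n is NOT prime.
is_prime() {
  printf "%${1}s" '' | tr ' ' '1' | perl -ne 'exit 1 if /^1?$|^(11+?)\1+$/'
}

is_prime 7 && echo '7 is prime'
is_prime 8 || echo '8 is not prime'
```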
Regular expressions can do a lot of stuff, but because of that flexibility, they're also sometimes a real pain to wield. And this is why I will not cover everything about regular expressions, because I don't think I physically could. So we're gonna continue from here; we now have the usernames. Yes? (Question from the audience: can you do this kind of search inside any file, for example if you didn't have this log but had JSON instead?) Yeah, I mean, I'm just cat-ing a file at the beginning here, right? So whatever file I have, I could totally run this on. If you're on the command line and you have files that aren't plain text, there are usually very good tools for extracting data from those files to text. If you have JSON, for example, there is a tool called jq. In one of the exercises, I basically ask you to try to figure out how to use jq; it lets you write patterns to extract data from JSON. There's also one called pup, which is great for extracting stuff from HTML. And there are tools for dealing with CSV files. For all of this, there are good tools. In general, the tools I'm teaching you here are more the sort of things that you compose once you have your data source in some meaningful format. All right, so we have usernames, great. Let's see if we can do anything interesting with them. I have these greps at the beginning, which are kind of bothering me. I mentioned that sed is a line editor, and it turns out sed is really powerful; it basically has a complete built-in programming language. So I don't actually need these greps. In fact, I don't even really need the cat. I can do this only with sed. I don't think normally you would want to, but you can. With sed you can write multiple expressions by giving -e. Each thing that follows a -e is a separate expression; they're run in sequence, and each one modifies the line before the next expression is run on it.
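A sketch of folding the greps into sed itself, using two -e expressions: the first deletes every line that doesn't match, the second does the substitution (the sample line is made up):

```shell
printf 'boot noise\nsshd[1]: Disconnected from invalid user admin 1.2.3.4\n' \
  | sed -e '/Disconnected from/!d' -e 's/.*Disconnected from //'
# → invalid user admin 1.2.3.4
```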
So one of the things I can do is say that any line that doesn't match sshd gets deleted. In fact, sed has a lot of really powerful commands. We've already seen substitutions, but it can also inject lines into the file, above or below the current one, and print lines that are nearby. Most people don't know all the things sed can do, and that's partially because if you look at the man page for sed, these are the kind of commands that you get: all single-character commands combined in really awkward ways. But you can do cool things if you work at it. If you want to give yourself a challenge, I suggest you try to do some of the exercises using only sed as your data wrangling tool. It's gonna feel painful, but it's possible. All right, so we have this command that gives us all the usernames, and that's all well and good. But let's see if we can get something more interesting out of this. In particular, let's try to look at the usernames that people are commonly using. If we pipe this through wc -l, which is word count, where -l means count the number of lines, it tells me that there have been 3,595 login attempts against my machine since its last reboot. But presumably all of those usernames aren't distinct. So let's try to find out which ones people are using, and which are being used more than once, or commonly. There's a tool called sort. If you pipe things through sort, it will give you the input lines in sorted order in the output. So now all the lines that are the same are next to each other. And then there's another tool called uniq, not fully spelled out; most Linux command line utilities have names that are four characters or shorter, for various historical reasons. uniq will, for any consecutive sequence of identical lines, print only one of them. So for example, it would only print zimbra once.
So it does not look everywhere in the input, just at consecutive lines; it will print only one if there are many. It will also, if you give it the -c flag, print a count for each one. So in this case, we'll see that zimbra was used 11 times and zabbix was used 23 times, whereas this last one was only used once, perhaps not surprisingly. This is still not very helpful though. It isn't really telling me what the most common one is. I sort of want the ones that have the highest count towards the bottom, or towards the top, just any way I can get at them. And so we're gonna pipe through sort again. sort has, perhaps unsurprisingly, a lot of really useful command line arguments. It has -n, which sorts in numerical instead of lexicographic order. It has -k, which lets you sort only by a given field of the input. Normally it will sort by the entire line, which in this case would be fine, because the number comes first; but you could totally imagine that the numbers aren't first, and I'm trying to teach you general tools. So we're gonna use -k. -k is really stupid in how it's organized. -k1,1 is the argument you have to give to sort by only the first field. The number that follows k is the 1-indexed number of the first column you want to sort by, and the ,1 says: stop when you reach column one. Otherwise it sorts by all the columns until the end of the line. This means you can do something like 1,2, which sorts by the first two columns. 1,3 does not sort by columns one and three; it sorts by the first three columns, which is stupid. There are ways to make it sort by only one and three using stable sorts, but that is a pain. 1,1 is generally what you want, or something like 3,3, which would sort by only the third column. So with -nk1,1, now we get this sorted by the count. And you'll see that, perhaps unsurprisingly, most of the login attempts are for my own username.
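The counting pipeline he builds can be sketched end to end on toy input:

```shell
# Count duplicates, then sort numerically by the count column only.
printf 'root\nadmin\nroot\nuser\nroot\nadmin\n' \
  | sort | uniq -c | sort -nk1,1
# Least common first: 1 user, then 2 admin, then 3 root.
```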
There are also a bunch of login attempts for root, admin, test, user, et cetera, various common usernames. As you scroll further up, you see less common usernames. And we probably only want to look at the most popular ones of these. So let's do something like tail -n10. This will print the last 10 lines of the output, in this case the 10 most popular. If I wanted the top of the list instead, I could do head. I could also reverse the sort, so the highest-valued things end up on top, and that gives me the same output, except in the opposite order. Or of course, instead of giving the -r flag, I could use tac, which prints its input in reverse order, and that gives me the same thing. But that's all unnecessary, so let's just stick with tail -n10. Okay, so now we have the tail, which is what we wanted. But let's try to do something more interesting. Let's, in this case, just show the usernames. We don't need the counts anymore; we just want the usernames. We're gonna use a separate tool called awk. awk is another line-oriented editor, but the reason you sometimes want to use awk instead of sed is that awk knows about fields. awk does not just operate on lines; it operates on the fields of each line. In particular, with awk, I can do this. But let's talk about paste first. paste takes its input lines and pastes them together, joined by the character that follows -d. So in this case, instead of getting a list of usernames, I'll get a comma-separated list of usernames. It's really handy, because you can do things like use plus as the delimiter and then pipe into a calculator, and it will compute the sum of them. Or you could delimit them by the multiplication sign and pipe that through a calculator, and now you're multiplying them. In this case, I just wanted it to not be as many lines. And this will indeed give me all the usernames, comma separated, right? So what does awk do?
What does this awk '{print $2}' business do? Well, awk is also line-oriented. An awk program is a sequence of units, each of which is an optional pattern followed by a block. The pattern says: if this pattern matches, then execute the code in this block. If you leave out the pattern, it matches every line. Within a block, and within the pattern, you have access to a number of variables. $0 is the entire line that you're currently on. $1 is the first field of that line, $2 is the second field, $3 is the third, et cetera. By default, awk will split by whitespace, any sequence of whitespace, and it will trim the whitespace at the beginning and end of the line. So that is why, in this case, $2 ends up being the second field in the output from uniq, which is the username instead of the count. If I made this print $1, it would print the count instead of the username. You can also change which delimiter is used, with -F and a different character. So for example, if you wanted to parse CSV files, you could pass a comma, and now awk operates with comma-separated fields. Or if it was tab-separated, you could pass a tab. Not that you can see that that's a tab, but that is a tab. And I think you can also use this $'\t' syntax in bash for writing a tab. I could have a newline-separated field list, I could have whatever, right? The reason this is really neat is, this was a pretty simple example, and awk is nice for just extracting a single column, but there are other tools that can do that better. The real point is that awk is really powerful, and I'll try to give you an example of why. Let's say that we want to find all of the usernames that were used only once and that start with the letter c and end with the letter e. In awk, this is trivial. So, we're now gonna take the entire list of usernames and counts, and now I'm gonna add a condition.
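Field splitting and -F can be sketched quickly on fake input:

```shell
# Default split is on whitespace; $2 is the second field (like uniq -c output).
printf '  3 root\n  2 admin\n' | awk '{ print $2 }'
# → root
#   admin

# -F changes the field separator, e.g. for comma-separated input:
printf 'alice,42\nbob,7\n' | awk -F, '{ print $1 }'
# → alice
#   bob
```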
And that condition is gonna be that the first field is one, so that is the count, and also that the second field matches the following regular expression: it starts with a c, then has any number of non-space characters, and then it ends with an e. And if that pattern matches, then print $2. And that gives me all the usernames whose count is one and whose name matches the pattern. And this could be any regular expression that I might want to write. This makes it really easy to work with columnar data, and usually what you end up with on the command line is columnar data in one way or another. For example, if I take the output of ps, which lists all the processes on my machine, that is also columnar data. So here, I could pipe this through awk and say that I want anything that is, say, run by root and where the process ID is greater than 2000, and print its run time, right? awk makes it really easy to express things in terms of columns. Back to this though. I can pipe this through wc -l, and it will tell me, wow, there are four such usernames, as if I couldn't tell that from the output. Of course, I can also ask for anything that's not single-use; it tells me there's only one. And I could have it print out, if I wanted to, both the count and the username for those. All right, so what other kind of stuff can you do with awk? Well, what I showed you previously was piping this through wc -l, right? But again, awk is a programming language; we don't need other tools. I could replace this entire command line with an awk script, but I'm not gonna do that, because I don't wanna do that to myself. But awk lets you write multiple patterns and blocks in a single expression. There's a special pattern, BEGIN, that matches before it starts running, where I can initialize variables. So I can say rows is zero. Notice there's nothing in between here; it's always just pattern and block. Okay, so then match just the things that start with c.
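The condition he describes looks roughly like this on toy data: count of 1, name starting with c and ending with e, with only non-space characters in between (the usernames are invented):

```shell
printf '1 curie\n2 cole\n1 carlyle\n1 bob\n' \
  | awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }'
# → curie
#   carlyle
```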
So what this is saying is: first, at the very beginning, set a variable called rows to zero. Then, for every line where the count is not one, so a username that's been used more than once, and where the username starts with c, add the number of occurrences to rows; $1 is the number of occurrences from uniq -c. And at the end, print rows. So this will give me the number of times anyone has logged in with a username that has been used more than once and starts with c. 64. Not that that's a stat anyone needs or cares about, but hopefully the way the program is structured kind of makes sense. This is an entirely arbitrary statistic that I chose to compute. Right. I've also linked a sort of introductory guide to awk, because again, awk is one of those languages that is a complete programming language, and I'm not gonna teach you all of it, but this has hopefully taught you the basics of how you might use it in some kind of data wrangling pipeline. Okay, so let's talk about things that are a little bit different. Just because it's handy, I already mentioned this idea of paste and using that to do math. So for example, remember how we found out that there were 3,595 login attempts against my machine? Let's try to compute that a different way that is more inconvenient, because why not? So, uniq -c, remember, gives us the count for every username that occurs, and so the sum of all the counts should be the same as the number of total occurrences, right? Let's check that that's the case. Just to, you know, check math. So we're gonna extract with awk just the count. That did indeed give me just the counts. Then we're gonna paste that with plus as the delimiter. So that's gonna look like this. And then we're gonna pipe that to bc, which for a stupid reason requires the -l flag. And hey, that gives us the same number.
I could also do something like, I wanna multiply that number by two. So I can use command substitution for this, right? You remember our friend, command substitution? So this is saying: I'm gonna print this string, which is gonna be two times, and then substitute in this entire command, which is gonna be all those pluses we saw above, and then take that entire thing and send it to the calculator. That's not what I wanted; oh, I'm missing an open parenthesis. I want two times the entire number. Great, okay, that's better. I don't care about multiplying just the first number. And this might not have been particularly interesting, but we can also do things like compute statistics about our data, which is arguably more important. So remember, this gives us all the counts, right? Let's say that I want some statistics about this. There are lots of really cool command line tools where you just pipe stuff into them and you get statistics out. But some of you might know there's a programming language called R, which is really good at doing statistical analysis. Well, you can pipe things into R, too. What we're gonna use is R's slave mode, which basically lets you run R, send stuff to it, and have it compute and print things for you. And I'm gonna run the following program in R: scan from standard in, remember standard in from last time, with quiet as true because I don't want it to print anything while it's scanning, and then print me a summary of the result. So x in this case is gonna be basically an array in R, or a single-column matrix, of all the data points that we have, in this case all the counts. And then it's gonna print me a statistical summary of that data. So it's telling me, well, the minimum and the median are both one; that means most of the usernames only have a count of one. The mean is four, the max is 575, that was me, you remember. And the third quartile is two.
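If R isn't installed, a rough awk equivalent of that summary can be sketched too; this computes only min, mean, and max, and is my own sketch rather than what the lecture ran:

```shell
printf '1\n1\n2\n4\n' | awk '
  NR == 1 { min = $1; max = $1 }
  { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
  END { print "min", min, "mean", sum / NR, "max", max }'
# → min 1 mean 2 max 4
```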
I can print other things too — like the standard deviation. Okay, so if you care, the standard deviation of that list of numbers is 28.36. In fact, you can also do plotting. So let's go back to sorting this again, because I don't want all the data points — we're gonna sort those and do tail -n10, so I get the ten most common usernames, and I want a bar plot of this. So, gnuplot is not the best plotting tool, but it is really convenient for command-line plotting: a bar plot of the ten most common logins, by username and count. This is useful for quick visualizations of data. It is not what you should use to actually plot things if you're, like, writing a paper or something. R has a really good plotting package called ggplot2, which I recommend you use if you ever end up doing that. But this is a really, really neat tool. As for the syntax of gnuplot: you can sort of ignore the thing at the beginning — that's just to make it a bit nicer. gnuplot is basically plot followed by a dash, which means standard input, like in the other command-line tools we've talked about. Then using, after which you write which column of the input data should be used as the x-axis and which column should be used as the y-axis. And then you can say with lines, with boxes, with points — however you'd like to plot that data. The -e flag says "execute this as the gnuplot program", and -p says "make it persistent on my screen" — don't close the window the moment the input finishes. So that gets us pretty far. We're also gonna talk a little bit about how the command line can be useful beyond just plotting and looking at data. So, one command I actually ended up running earlier today: there's a tool called rustup, which manages your Rust installations.
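The shape of that pipeline, with toy data standing in for the real counts (the gnuplot step is shown as a comment, since it opens a window and the exact script here is a reconstruction):

```shell
# Keep only the most common entries: sort numerically by the count
# in column 1, then take the last lines.
printf '5 bob\n1 amy\n9 root\n2 sue\n' | sort -nk1,1 | tail -n2
# prints:
#   5 bob
#   9 root

# Feeding that into gnuplot, roughly as in the lecture:
#   ... | gnuplot -p -e 'set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes'
```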
So you can have multiple versions of the compiler installed at the same time — there are similar tools for Python and Ruby and whatnot. And it has this notion of a toolchain: a toolchain is one version of the compiler, and I can list all the versions. I happened to have a bunch of different old nightlies lying around, and I wanted to get rid of some of the old ones. And I ended up doing basically data wrangling. So I grepped for nightly, because I wanted to only remove things that are nightlies. Then, I did not want to remove the current nightly, which is the one at the bottom that doesn't have a date in it — so I used grep -v to not match that. And then, in order to remove them, I have to remove them by giving just the name, without this entire suffix. So I want to substitute everything from -x86 onwards — substitute it with nothing. That gives me — currently there's only the one nightly here; when I last ran this, there were like 20 lines. And then xargs is a really, really handy tool for saying: take all the lines of my input and make them arguments to the following command. So for example, if I do xargs echo, that's gonna end up calling echo with the input as its arguments. How do I best demonstrate this... sure, ls. So in this case, ls was given the argument nightly-..., and it says "no such file or directory". And in my case, what I wanted to do with this was rustup toolchain uninstall. So now that's gonna run the command rustup toolchain uninstall with the argument of the particular compiler version that I want to remove. In fact, it's gonna run it with all of them — all of the different nightlies that I want to remove. So instead of going through and removing all of them one by one, copying and pasting and whatnot, this command line lets me have the machine do all the work for me.
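Put together, the cleanup pipeline might look like this — with a hypothetical toolchain list, and echo standing in for the real rustup toolchain uninstall:

```shell
# Hypothetical `rustup toolchain list` output: one dated nightly to
# remove, plus the current (undated) nightly to keep.
printf 'nightly-2019-01-01-x86_64-unknown-linux-gnu\nnightly-x86_64-unknown-linux-gnu\n' \
  | grep nightly \
  | grep -v 'nightly-x86' \
  | sed 's/-x86.*//' \
  | xargs echo
# prints: nightly-2019-01-01
# In the real command, `xargs echo` becomes `xargs rustup toolchain uninstall`.
```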
And this is what data wrangling gives you: these really convenient ways of expressing things that would otherwise be a lot of manual labor. Oftentimes, you might want to end up with data that you're just gonna take into Excel or some other program to do actual data analysis on. But shell data wrangling is handy for getting to that point in the first place, when all you have are some raw logs. I think that's all I wanted to cover at a high level. Part of the reason for this was that if you have more questions about regular expressions or awk or sed or data analysis, I wanted to have some time to go through those, or to play around with the regular expression debugger. I have more things I can tell you about regular expressions that are useful, but I figured I'd field some questions from you first, if you have any. Anything that you are wondering how you might be able to do? So, at the bottom of the data wrangling page there are a bunch of exercises too, and some of them are really neat. For example, they tell you to do particular things with the logs on your machine. There are some that do word analysis: most of your computers come with a word list at /usr/share/dict/words that has just words, because sometimes it's useful to have a list of words, and there are some data wrangling exercises on that. There's also — someone asked earlier about analyzing data that's not just plain text, but JSON, HTML, those kinds of things — an exercise in trying to extract particular information using tools like jq. And there are also some challenges in how you can do things with fewer commands: like, how can you extract both the maximum and the minimum using a single command? Often this will entail using some of the bash tricks we talked about in the shell lecture last time.
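For the max-and-min challenge, one possible approach (among several) is to sort once and let sed print just the first and last lines:

```shell
# After a numeric sort, the minimum is the first line and the maximum
# is the last; `sed -n '1p;$p'` prints exactly those two lines.
printf '3\n1\n7\n2\n' | sort -n | sed -n '1p;$p'
# prints:
#   1
#   7
```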
So pup, for example — pup basically lets you... okay, it goes together with curl. curl is a command that will access a website, download all of its source code — all of its HTML — and just print it to standard output. And this is really handy for doing things like data wrangling, or just for fetching a script online. You might, for example, have seen pages that tell you to just run, like, mywebsite.com/install and pipe it through, like, sudo sh. There are lots of install instructions that will tell you to do this: "just run this command." Some of you might at this point realize that this is not a great idea. What this is doing is downloading a script off of the internet and sending that script into your shell, telling the shell to just run whatever's in there. I'm sure it's fine. You may not want to do this, but at least now you know what that does, and probably why you shouldn't do it. In any case, curl is a really handy tool, because what it gives you is the source code — but usually that's just a bunch of HTML tags, right? So this is what the HTML source of this particular page looks like. This is a mess to try to parse out with regular expressions, or greps and seds — trying to parse this would be a pain. Instead, you can use pup, which takes HTML on standard input and then lets you write selectors. For those of you who know CSS, these are CSS selectors. So in this case, this is saying: find every table, and every table that's contained inside of one of those tables; find the table row that is the third row starting from the bottom; find the column of that row whose class name is title, and find the link inside; and then give me the contents of those tags. Trying to parse this using regular expressions would be a pain, whereas pup actually parses the HTML and gives you the appropriate stuff.
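To see why regexes get painful here: a naive sed can pull a link's text out of a trivial snippet, but it falls apart on real, nested HTML — which is exactly what pup avoids by actually parsing the document. (Toy HTML below; the pup line and its URL are illustrative, and pup is a third-party tool you'd need to install.)

```shell
html='<table><tr><td class="title"><a href="/x">Hello</a></td></tr></table>'
# Fragile regex extraction of the anchor text — works only because
# this snippet has exactly one anchor and no nesting surprises.
echo "$html" | sed 's/.*<a [^>]*>\([^<]*\)<\/a>.*/\1/'
# prints: Hello
# The pup equivalent would be something like:
#   curl -s https://example.com/ | pup 'td.title > a text{}'
```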
It can even extract things like attributes — in this case, extracting the target of all of the a's, all the anchors, all the links in the page. Similarly, jq is a tool for operating on JSON files. So again, the example here is that you curl something, and curl, in this case, is gonna give you some JSON back. JSON is the JavaScript Object Notation, the one with all the squiggly braces. And jq similarly lets you write selectors for extracting data from JSON. So in this case, it's saying: extract the following parts of objects. You can subset things, and you can also say "extract only the following attribute of every item in an array", so that what you end up with is not JSON but, for example, line-separated text. And once you get line-separated text, you can start to combine that with all of the tools that we've talked about today to do other interesting stuff with it, right? So for example — this is one of the things that I linked to on the website — there are various open data sets that you can find on the web. Many of them come as CSV or JSON or whatever, and this one downloads as flat CSV files. And CSV files, we now know, we can just operate on using awk: -F, is gonna split the fields on commas for you. So some of the exercises here are to download some of these data sets and try to do interesting stuff with them. It doesn't have to be information that you care about — it's just to get exercise in how to do this data manipulation. Because the way this ends up being convenient for you is by doing it over and over, so that you immediately remember what kinds of commands you end up writing. Very often, you're just combining the same types of things: you're grepping, then you're sedding to remove the stuff on either end, then you're awking to get fields, and then you're pasting it together in some useful fashion.
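Once the data is line- or comma-separated, the ordinary tools apply. For instance, awk with -F, splits CSV columns — a sketch with toy data, assuming no quoted fields that themselves contain commas:

```shell
# -F, sets the field separator to a comma, so $2 is the second column.
printf 'alice,3\nbob,5\n' | awk -F, '{ print $2 }'
# prints:
#   3
#   5

# jq does the analogous job for JSON (shown as a comment; jq may not
# be installed everywhere):
#   echo '[{"name":"alice"},{"name":"bob"}]' | jq -r '.[].name'
```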
In that case, if there are no more questions, I think we'll end it there, unless there are other things you want me to go through. No? All right, nice. Thanks for coming. Next time we're doing editors and version control, is that right? Yeah, nice. So we'll talk about Vim and Git.