All right, today I want to talk about command-line pandas killers. I have a collection of utilities that I turn to when I want to do some simple data manipulation. Now, none of these are going to completely replace pandas, dplyr, or data.table. But it is interesting to see that in 2021 there is not just one but many tools that can do data manipulation on CSVs, and they do it really well. So I'm reviewing some utilities that I think are very good at manipulating CSV files. I do have a little post on my website; I'm going to skip some of it, but if you're interested in what I found works well and what doesn't, you can take a look at these tables. One always means a feature is present, and zero means it's not present in that library. I also wanted to look specifically at some summary functions, just to get a sense of what these different packages can and can't do, and to get side-by-side comparisons. That said, I don't want to discount a package just because it has fewer features, because features aren't the only thing, and speed isn't the only thing either. I didn't even look at speed here, because I didn't want it to affect the results. Sometimes you're attracted to a tool because of more than its feature set or its speed; maybe there's something about the way it expresses operations that you like. So instead of ranking by features or by speed, I wanted to do something a little different and give these tools awards, or maybe they're more like personalities: how do I feel when I use the tool, and what does it make me think of?
So, first, really quick, the tools I'm comparing are xsv, csvtk, tsv-utils, mlr (also called Miller), and q. Again, all of these are great picks, and maybe you shouldn't just pick one; I think it'd be useful to get to know a handful of them and explore. All right, the first one: xsv. I gave this the Stonehenge award, because this tool just feels rock solid. The source code is clean, it's compiled, it's memory safe, and I think this tool will stand the test of time; it's not going anywhere. It can do the standard things I would normally look for, especially summary statistics, filters, selects, and joins. Really quick: I have a file in my GitHub repository, data/weather.csv, along with a shell script that pulls it, and I manipulate this file with each of these libraries so you can get a sense of how they each have their own options, flags, and grammars for working with a CSV file. For example, if you want summary statistics on a CSV file, which I'm calling df (so $df is a shell variable), all you do is run xsv stats with the -s option to select temperature and humidity, and you get back a handful of summary statistics. This is really concise, which I like, and which part of me kind of doesn't like. What I don't like is that it would be nice to ask for one specific summary function, not just this everything option, but that's what xsv gives you. So that's xsv for summary functions. There are a few other things I really liked with xsv. One thing I thought it did very well was joins; joins just felt better with xsv.
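As a minimal sketch of the xsv stats call just described (the file path is the hypothetical data/weather.csv from my repo, and the column names are assumptions):

```shell
# Point $df at the CSV (hypothetical path from the repo's data/ folder)
df="data/weather.csv"

# Summary statistics restricted to two columns;
# -s/--select limits the stats to the named columns
xsv stats -s temperature,humidity "$df"
```
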
So here I have a left join with xsv: all you do is use the join subcommand with the --left flag and say which column number you want to join on for the two CSVs. It's really straightforward. So I liked xsv because it's really solid, with really great source code, and when it comes to filters, joins, and selects, it does everything in a very clean, concise way. That's xsv; great tool. All right, the next one: csvtk. The word for this one is "the orator" — someone who likes to speak, to elaborate, to expound. The reason I gave it this word is that I think it gets a lot of things right in the verbs it uses. If you're familiar with the dplyr library, you'll be used to verbs like summary, rename, and mutate, and csvtk has these same verbs. For example, for a summary statistic you run csvtk summary with -f for the field you want to select, and for each column and each function you want to summarize, you can specify that. But it's not just the summary function that's cool. There's also — let me just do a search here — all right, mutate2. This lets you make a new column, and I think the way it does it is really clean. You run csvtk mutate2 with what I believe is an expression option; I'd have to check the documentation — I did this a few days ago and can't remember the exact name or purpose of the flag. But what I'm doing here is using $temp — you refer to columns with a dollar sign in csvtk — and saying: if $temp, which is a column, is greater than 32 degrees, call it liquid, else solid, and name this new column water phase. I thought that was done really well. There are other examples like that in csvtk; it also does joins well.
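Sketches of the commands just described, under my assumptions about file names, column names, and the exact flag spellings:

```shell
# xsv left join: join column 1 of weather.csv against column 1 of
# stations.csv (hypothetical second file), keeping all left rows
xsv join --left 1 weather.csv 1 stations.csv

# csvtk summary: per-origin mean and max of the temperature column
csvtk summary -g origin -f temperature:mean,temperature:max weather.csv

# csvtk mutate2: new "water_phase" column from an expression on $temp
csvtk mutate2 -n water_phase -e '$temp > 32 ? "liquid" : "solid"' weather.csv
```
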
The thing you have to watch out for with csvtk is how it handles NAs; this is the one thing I wasn't much of a fan of. csvtk does not remove NAs by default. It does have a flag that lets you do this, but it's not the default. In xsv, ignoring NAs was the default in every case I tested; with csvtk it's not, so you have to pass a flag to say, hey, if there are NAs, remove them. But yeah, I really liked the way csvtk expresses its operations. So that's csvtk. Next, tsv-utils. The award here is "the ranger," and I call it the ranger for a couple of reasons. One is that it's not just one binary. With csvtk, everything is part of csvtk: you call csvtk and say something like, I want to do a summary, or I want to do a mutate. With tsv-utils, you call something like tsv-select, which is its own binary, or tsv-filter, which is its own binary, and I actually really like this. I thought it was a good way to navigate the documentation: the man pages gave me a good sense of what I needed to do without filtering through too many lines of text in a man page or readme. So let me find some examples where tsv-utils seemed to shine. All right, here's one: tsv-summarize. This is a summary function. --header just says we're using a file that has a header; --group-by says what we want to group by, in this case origin; and what we want is a count, and that's it. So that's a count grouped by origin for this weather dataset. There are some others down here. tsv-select — I mean, this is very clear, and if I were to implement a similar package, this is how I would imagine implementing it. It's just so clean: tsv-select, with a header file, using these fields — origin, year, month, temperature — from the TSV file.
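A sketch of the tsv-utils calls just described (file and column names are assumptions; note tsv-utils wants TSV input, which I'll come back to):

```shell
# tsv-utils operates on TSVs, so convert the CSV first
# (csv2tsv ships with the package)
csv2tsv weather.csv > weather.tsv

# Row count grouped by the origin column
tsv-summarize --header --group-by origin --count weather.tsv

# Select a few columns by name
tsv-select --header -f origin,year,month,temp weather.tsv
```
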
And then we can pipe that to another utility they have, tsv-pretty, which just makes the file look nice. I should mention, looking at some of these other lines: a lot of the time I'm piping output to an application called tv, short for tidy-viewer, which is also a CSV pretty-printer. Since tsv-utils has its own, I used tsv-pretty when working with the tsv-utils functions, but with the other command-line utilities I was always piping to tv. Really nice package for CSV pretty-printing. So that's tsv-select and the tsv-utils package. I thought it was really unique that it groups different binaries into one package; I liked that a lot. The way I thought of it is like a ranger's quiver of arrows: these are the different binaries that tsv-utils carries. The thing I didn't like is that you cannot use CSVs with tsv-utils. The files have to be TSVs, and they have some good reasons for doing it, but this review is specifically looking at CSV data manipulators. So when I wanted to use tsv-utils, I used another arrow in their quiver, csv2tsv: you convert your CSV to a TSV, and then you can use tsv-utils on that file. So that's tsv-utils, the ranger, because it's got different binaries for its different functionality, and it veers off the standard path by having all of its operations work only on TSV files, which has its pros and cons. All right, Miller. Miller I call "the scientist." Miller is a huge tool. I didn't look at the source code, but if I had to guess, Miller is probably an order of magnitude more code than the others, because it does a lot of stuff, and it does a lot of stuff very well. So let me pull up Miller.
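The two pretty-printing paths just mentioned, sketched (assuming a weather.tsv converted via csv2tsv, and tv installed separately):

```shell
# tsv-utils output goes through its own pretty-printer, tsv-pretty
tsv-select --header -f origin,temp weather.tsv | tsv-pretty

# CSV output from the other tools gets piped to tv (tidy-viewer)
xsv select origin,temp weather.csv | tv
```
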
First I should say that I don't have examples here of all the functions that give Miller the scientist title, but for instance you can do linear regression with mlr, and you can also bootstrap with mlr, and that's just the beginning. mlr is in its own league. For many applications you may not need all that power, and maybe something like csvtk or xsv or tsv-utils is perfect for what you need. Miller is for when you want to do something a little heavier at the command line, and I think it fills that niche well. So let's find a good Miller example. OK, here's one. With Miller you add these flags saying the input is a CSV and the output is going to be a CSV, and then you filter. Most things in Miller are expressions, so you'll have things quoted, and you select columns with a dollar sign, a lot like what we saw with csvtk. So I filter where origin equals LGA, month equals 6, and day equals 1, and then I just pipe that to tv. That's one example. Let's see what else we have. This is a sort. Sorts are a little different with the command-line utilities — different from what you might expect if you're used to libraries. With the command-line utilities there are flags like -r and -nr, and these say whether you want things in reverse order, reverse order over numerical values, or reverse order over lexical values. This -nr style of flag is common across the CLI data-manipulation tools, but maybe something you aren't used to seeing with dplyr or pandas. So a little bit different there. Let's see, what else do we have?
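Sketches of the filter and sort just described (file name, column names, and the tv pretty-printer are assumptions):

```shell
# Filter: CSV in, CSV out, keep rows matching the quoted expression,
# then pipe to tv for viewing
mlr --icsv --ocsv filter '$origin == "LGA" && $month == 6 && $day == 1' weather.csv | tv

# Sort: -nr sorts the named field numerically in reverse (descending)
mlr --icsv --ocsv sort -nr temp weather.csv
```
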
I really liked how powerful Miller is at making new columns; that isn't something every command-line utility can do, and Miller does it very well. You name the new column, and you can use other columns as inputs or arguments to a function. Here I just have a feature equal to temperature over humidity. In my mind that's really clear, very concise; I think it does a great job at making new columns. So again, I didn't show the full power of Miller, just because I'm trying to compare similar features across all these command-line utilities. If I were to go into the features that make Miller the scientist, it's the only tool that has things like bootstrapping, linear regression, and covariance calculations. So if you're interested in doing more statistics-type analysis at the command line, definitely look into Miller. But even without the statistics portion, it does lots of other things really well — filters, new columns, things like that. Miller is great. Next, q. q is very popular, and I've seen q mentioned a number of times on my YouTube channel; people seem to really like it. I gave it the title "Defense of the Ancients" — DotA, right? It's a tool that uses a grammar that already exists, and I think that's really cool. I don't think it's anything to scoff at; it's unique in that way. No other tool here really uses SQL-like expressions to do data manipulation. So let me show what I mean. If you know SQL, q is going to be a breeze to use. OK, so here I'm doing a replace for NAs: this is just sed looking for NAs and replacing them with nothing. That's it; that's an easy way to take care of NAs. It's not unique to q — many of these command-line utilities need this search-and-replace for NAs.
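A sketch of the new-column example just described (file and column names are assumptions):

```shell
# Derive a new column from existing ones; columns are referenced with $
mlr --icsv --ocsv put '$feature = $temp / $humid' weather.csv
```
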
Anyway, once I do that search-and-replace for NAs, I can pipe the result to q. With q I'm just saying: hey, I'm using a file with a header, the delimiter is a comma, and I'm going to select these columns, do a sum, min, max, average, and count, and group by origin. You can see this is just like SQL; it's exactly how you'd express it in SQL. And the cool thing with q is that a lot of the time I didn't even look at the documentation; I just typed what I would write in SQL, and most of the time it worked. That's a really cool feature of q. Filters are just like they are in SQL: you use the WHERE clause. Sorting, too, is just like SQL. Joins q does well also — inner joins, left joins, it seemed to do great. Surprisingly, there wasn't a UNION clause. So in SQL, when you want to do something like append more rows onto a data frame, or append more columns — you can think of this as bind rows or bind columns — q didn't have a way of handling that. Frankly, most of the tools didn't. When it came to these table-to-table operations, it seemed like xsv was the one that could do this appending by rows or appending by columns. So that wasn't available in q. q doesn't have everything SQL has, but for lots of things you would normally do, q has a lot. I call it Defense of the Ancients because SQL is one of the first ways we had of expressing operations on datasets, and q sticks to that, and I think that's great. So that's all I've got for these different command-line utilities. Again, every single one of them is great. They're open source, and you can tell the developers spent a lot of time on each of them.
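A sketch of the NA strip plus the SQL query just described (file and column names are assumptions; q reads stdin when the table is given as -):

```shell
# Strip literal NA values with sed, then query the stream with SQL;
# -H says the input has a header, -d, sets the comma delimiter
sed 's/NA//g' weather.csv \
  | q -H -d, 'SELECT origin, SUM(temp), MIN(temp), MAX(temp), AVG(temp), COUNT(*) FROM - GROUP BY origin'
```
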
So I would say, if you're interested in doing command-line data manipulation, pick a handful — maybe don't just try one, but try three, or try all of them — and give them a shot. See how it goes, and if you have any feedback, or you feel like there's something I might have missed in these tables, leave it in the comments. And that's it. Thanks for watching.