All right, thank you. My name is John Kerl. It's pronounced like "curl." We do funny things with R's and L's in English, and I've got both of them in my name, so it's just to mess with you a little bit. And here we go. Lost my signal just now. OK, moving on. All right, so this is a talk about Miller, which, as you're going to find out, is a Swiss Army chainsaw. First of all, I want to thank my co-worker Dirk Eddelbuettel, who you may know from R fame. He invented the term "Swiss Army chainsaw," and it makes me very happy. This talk is in English, but if you'll permit me (said in Spanish): it is a lifelong dream to visit the beautiful city of Buenos Aires, so thank you very much for having me. I'm in love already. I am a software engineer. For my day job, I work at a company called TileDB, and this link here is the project that I work on. This is joint work with CZI. Kate was in the room earlier. We probably know people in common, and we are hiring. All right, so the meat of my talk is two slides. I want to go through an actual walkthrough in some detail, but first I want to tell you a little bit of the story of why this tool exists. Why, why, why one more tool? And the answer (said first in Spanish) is: this tool was born in a pool of tears. So I'm not a data scientist. I'm not a data analyst. I'm a software engineer. It's what I do during my day job. And we have data files, and I kept seeing people grepping their CSV. Every time they did, it hurt me a little bit, and I finally had enough. So even though that's my day job, in the evenings, in my spare time, drawing a little bit on Laura Acion's talk, I came up with a tool for this. So grep and cut and sort, if you're familiar with these. Anyone, yes? Awk, sed? Cool. If you're not, it's cool. There are two different audiences here. But these tools are line-aware. They've been around for 50 years, and they're awesome. And you can do so many things with them.
They're just everywhere. They're universal. They're really good for lines, and they're really good for integer-indexed fields. So you can pick out column seven, and you can get the ninth line. But CSV has column names. That's why they're there. And sometimes you just want the column named something. Likewise, in CSV files you can have a newline embedded inside double quotes. Grep is going to see that as two lines; for us, it's one record. So that's really what it's about. This started off in about 2015. I wanted a tool for this: if you grep for "purple" in this file, you're going to find the lines that have the word purple. They're data lines, but the header's gone, because the header doesn't have the word purple. Whereas using Miller with the --csv flag, if you look for purple, you're going to find the records that have the word purple. And for CSV, records are key-value pairs. The header gives the keys, and the data lines give the values. So you just get all the records. A little more detail: this one is using --icsv and --opprint. You say your input is CSV and your output is pretty-printed. And then you want to filter for color equals purple. That's actually looking at this color column and seeing if it is exactly equal to purple. And so that's really the high-level summary of why Miller exists: to be able to do that. If you were to just fall asleep now, that would be the primary payload of the why. The rest is kind of the how. I also want to say, in case you think that I think there are no other tools out there, please don't think that, because there are a lot. With xsv, you can build indices on your data, which Miller doesn't do; you can have an index next to your file. It is so fast; I just learned about it. There are tools that interoperate with R, and tools that do SQL, which Miller does not and will not. jq is amazing, if you don't know about it.
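To make the grep-versus-records point above concrete, here is a minimal sketch in Python rather than Miller itself. The sample data, column names, and helper function names are all hypothetical, invented for illustration; the only library used is Python's standard `csv` module.

```python
import csv
import io

# Hypothetical sample data; the quoted "note" field on the purple row
# contains an embedded newline, so one record spans two physical lines.
CSV_TEXT = (
    "color,shape,note\n"
    "red,square,plain\n"
    'purple,circle,"first line\nsecond line"\n'
    "yellow,triangle,not purple at all\n"
)


def grep_lines(text, word):
    """Line-oriented search: every physical line that contains the word."""
    return [line for line in text.splitlines() if word in line]


def filter_records(text, column, value):
    """Record-aware search: records whose named column equals the value."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if row[column] == value]


lines = grep_lines(CSV_TEXT, "purple")
records = filter_records(CSV_TEXT, "color", "purple")

print(lines)    # header is lost, one record is cut in half, one false match
print(records)  # one complete record, keyed by the header names
```

The line-oriented version loses the header, returns only the first half of the multi-line record, and also matches the yellow row because the word appears in its note. The record-aware version returns exactly the one record whose color column equals purple.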
Nushell I just found out about, and drawing on Karthik Ram's talk about life cycles, I think if today were the day I was inventing Miller, maybe I would just go with Nushell, because it's really cool. So you should check it out. And there are more tools: Datasette, Frictionless this morning. The thing Jenny was talking about is a super cool IDE. And then, as you're about to find out, there's Miller. So how does Miller fit in with these? How is it different? First of all, some of these, like csvkit and csvtk, handle CSV; it's in their name, right? Miller is multi-format from the beginning. I realized that you have these records in multiple formats, and it's really the same thing under the hood. If you can parse CSV into a list of records, do stuff to that list of records, the record stream, and print it out as some other format, or maybe the same format, you can reuse a lot of code. And so that's really what it's there for. It's got two parts. There are the things that are kind of the equivalents of sort and cut; they're just record-aware. Like I told you earlier, it knows about column names and things like this. And there's also an awk-like programming language, which I won't touch on much today, but it exists. So the heritage is the Unix toolkit and the Unix pipe. It runs on Windows, not to worry, and Mac, but it is basically a record-aware command-line tool. So in terms of barrier to entry, if you're not familiar with the command line, that's going to take some getting used to. But if you're familiar with the command line, then you'll just go, oh, it does that. It should be pretty much a lateral move. You don't need cloud infrastructure. It's just a single binary that goes on your laptop, and you don't need anything else. No dependency problems. It's free and open source. And it handles bigger-than-RAM data, so people use it to handle hundreds of gigabytes.
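The multi-format idea described above (parse any format into a stream of records, apply one format-agnostic operation, write any format back out) can be sketched as follows. This is an illustration in Python, not how Miller is implemented; the function names and the sample data are hypothetical.

```python
import csv
import io
import json


def read_csv(text):
    """Parse CSV text into a list of records (dicts keyed by header names)."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json(text):
    """Parse a JSON array of objects into the same list-of-records shape."""
    return json.loads(text)


def write_csv(records):
    """Write records as CSV, taking the header from the first record's keys."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


def write_json(records):
    """Write records as a JSON array of objects."""
    return json.dumps(records, indent=2)


def sort_by(records, field):
    """A format-agnostic operation: it only ever sees records."""
    return sorted(records, key=lambda rec: rec[field])


# Hypothetical data: the same two records in two formats.
CSV_IN = "name,country\nZeta,Argentina\nAlpha,Brazil\n"
JSON_IN = (
    '[{"name": "Zeta", "country": "Argentina"},'
    ' {"name": "Alpha", "country": "Brazil"}]'
)

from_csv = write_json(sort_by(read_csv(CSV_IN), "name"))
from_json = write_json(sort_by(read_json(JSON_IN), "name"))
back_to_csv = write_csv(sort_by(read_json(JSON_IN), "name"))
print(back_to_csv)
```

Because `sort_by` only touches records, it is written once and reused for every reader and writer pair, which is the code reuse the talk refers to.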
Miller does streaming when it can. If you want to sort a data set, it has to load it all into memory, then sort it and write it out. But if you're just doing something like filtering for color equals purple, it'll just go through with a small working set. So people definitely use it for out-of-core processing. That's one of the reasons I invented it. I should also admit there's a saying that telling a software engineer that there already exists a tool to do X is like telling a songwriter that you don't have to write a love song because there are already songs about that. But like the songwriter said, I want to write that one. I want to write this one. And Miller's just a lot of fun, I won't lie. Installation, just really quickly. Basically it's been ported to a bunch of different operating systems. You can install it on Windows or Mac or Linux using brew, choco, yum, the things you know how to do. I did want to mention one little caveat, which is that older distributions have older versions. So if you get some Red Hat something from 2020, it's going to have the version of Miller from 2020. If you click this link here, you can find out how to get the current version, which is 6.7. All right, so this slide is thanks to my co-worker Aaron Wolen at TileDB, who suggested it. This is a lot of words, so don't read them all. The point is, I'm going to drill in on a couple of things like cut and sort, and he said, why don't you let everybody know there are a lot of options? So basically just know that there's more than I'm telling you about. There's the ability to reshape your data, and do all sorts of bootstrap sampling and pivoting and things like this. And there's a bunch of functions for data cleaning, like stripping whitespace and removing commas and removing dollar signs and all that kind of stuff. Just know that you have the option. I'll let you get out a magnifying glass and read that later.
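The streaming point made earlier (a filter can emit results as it reads, while a sort has to see everything first) can be illustrated with a small Python sketch. This is not Miller's implementation; the `CountingLines` helper, the function names, and the data are hypothetical and exist only to show how much input each operation has to consume.

```python
import csv
import io

# Hypothetical data: one header line plus four data lines.
CSV_TEXT = "color,qty\nred,5\npurple,2\nyellow,9\npurple,1\n"


class CountingLines:
    """Iterator over lines of text that counts how many lines were pulled."""

    def __init__(self, text):
        self._lines = io.StringIO(text)
        self.consumed = 0

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self._lines)  # raises StopIteration at end of input
        self.consumed += 1
        return line


def stream_filter(lines, column, value):
    """Streaming: yield each matching record as soon as it has been read."""
    for rec in csv.DictReader(lines):
        if rec[column] == value:
            yield rec


def sort_records(records, column):
    """Non-streaming: every record must be in memory before any output."""
    return sorted(records, key=lambda rec: int(rec[column]))


# The filter produces its first result after reading only part of the input.
source = CountingLines(CSV_TEXT)
first_match = next(stream_filter(source, "color", "purple"))
lines_read_for_first_match = source.consumed

# The sort cannot produce anything until the whole input has been read.
source_for_sort = CountingLines(CSV_TEXT)
sorted_records = sort_records(csv.DictReader(source_for_sort), "qty")
lines_read_for_sort = source_for_sort.consumed

print(lines_read_for_first_match, lines_read_for_sort)
```

The filter holds one record at a time, so its memory use does not grow with file size; the sort's memory use does.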
All right, cool. So this is the meat of the talk: this slide and the next one. What I wanted to do is walk really quickly through looking at a CSV file, which should be news to no one in this room. This organizations-1million.csv is just a CSV file. And by the way, it's fake data. So when I'm looking at data and I have a new file, what is it? The first thing I want to do is see what it is. At the command line, you might do head -n 2. For Miller, you can do head -n 1. That'll be just the first record, so that's going to be the header line and the first data line. And of course it's scrolled off to the right; it actually goes out to about here. So that's one of the reasons I have multiple formats. JSON is a file format, but there are other ones that you might not recognize, like the pretty-print format and XTAB, which stands for transposed tabular. It's not my best naming ever. It's actually kind of a silly name, but now I'm stuck with it. All it means is this: input is CSV, and output is this transposed tabular format, where you have the column names going down like this and the data values next to them. And if there were another record, you'd have a blank line and then more of those paragraphs. So that's a really nice way to look at data. So now we know what the data looks like. There's organization ID, the company name (again, completely fictional), country, description, and so on and so forth. So what can I do with that? Just a couple of random things. Again, I'm saying mlr --csv. There's such a thing as a .mlrrc file, so if you get tired of typing --csv, you don't have to. You just put that file in your home directory and say, I'm doing CSV unless otherwise specified. You don't have to type it every time. And from this, I want a count grouped by country. So basically this is like the system uniq command. I just want to see how many countries there are. There are 243 countries.
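As a rough illustration of the two ideas just described, the sketch below prints records in a vertical, XTAB-like layout (one field per line, a blank line between records) and counts records by country. It is written in Python and only approximates Miller's output; the function names, column names, and data are hypothetical.

```python
from collections import Counter

# Hypothetical records; the real file in the talk has more columns.
RECORDS = [
    {"Organization Id": "A1", "Name": "Acme", "Country": "Argentina"},
    {"Organization Id": "B2", "Name": "Bolt", "Country": "Brazil"},
    {"Organization Id": "C3", "Name": "Cobre", "Country": "Argentina"},
]


def to_xtab(records):
    """Render each record vertically: key, padding, value; blank line between records."""
    blocks = []
    for rec in records:
        width = max(len(key) for key in rec)
        lines = [f"{key.ljust(width)} {value}" for key, value in rec.items()]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


def count_by(records, field):
    """Count how many records share each distinct value of the given field."""
    return Counter(rec[field] for rec in records)


first_record_view = to_xtab(RECORDS[:1])
country_counts = count_by(RECORDS, "Country")
number_of_countries = len(country_counts)

print(first_record_view)
print(dict(country_counts), number_of_countries)
```

A wide record that would scroll off the right edge as a CSV line becomes a short paragraph that fits on screen, which is the reason the talk gives for having the format.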
Or I want to see how many there are in Argentina. That's a little more interesting. There's a thing I didn't tell you about: with things like filter, there's another keyword, "then." It's kind of like a pipe. It chains these streams of data, so I can use several things within the same invocation. So I want to do uniq -c, to see how many there are in each one. So we have them, with Argentina, alphabetically. [unclear] So I'm going to filter for Argentina. Then I use cut, because I don't want to see the index or the organization ID, and I also want to remove the country column, because they're all the same. And then I want to sort numerically, and then I want to reorder the column names. Reorder is one of my favorite verbs, because what it does is let you move a column. I think this one is by the number-of-employees column. So what you can do with reorder is put some columns up front and leave everything else after them, or you can put them at the end, or you can move one column out of the way and bring up another one. It really just depends on what you're doing. [unclear] And then the last, somewhat more difficult example I want to show is going from CSV input to JSON output, and I want to do something else. So I want to filter for the country again. I've got the column name. And then I want to put something new in there. This is me using the DSL, in a small way. I want to put in a new field, and I'm going to make it a map.
[unclear: "The conference is 27"], and then I redirect it into a file. I do that, and then I want to look at it, so here I use --json, because it's a JSON file. I want to look at the first two records, and you can see that the data that was there is still in there, and the new stuff is in there too. So if you're a JSON person, you just go, oh, nice, you get the nested structure; that's a little part of it. And this is close to the end. So with that, thank you very much. I don't actually know if I'm supposed to go live, but we might actually be able to go live. So, about getting help. There's the online help; the help is interactive. There are also online docs at Read the Docs; there are a lot of docs. Nicholas, did I mention there's a glossary? Miller has been around for eight years, but it only got a glossary about a year ago. Also, you can go to GitHub Issues if you have a specific problem. Discussions are for something a bit different, something more open-ended. If you go to Discussions, there's a post about these slides there. So now you know how to get help. There are a lot of feature requests. That's what happens when a lot of people ask for features: it's open source, and time is the most precious resource. Also, anybody that does single-cell biology. And I guess that's it, and I'm happy to take any questions. Yes, we have time for about two to three questions. So thank you, it was really great.
But I have a question: do you have plans to make it work in the browser? Never, but. Because it's Unix, you know, command line, but as a JavaScript guy, I think it would be really great. Oh, yeah, yeah, okay, cool. So I do have one thing that's really cool, which is that you can do mlr and then an https:// URL, and it reads from the internet. That's cool. So that's part of what you want. That's a part. And another question, or suggestion, really. Do you have some place where people could put their examples of useful commands? Because I think that could be useful with some of these new NLP tools, to generate the output. Yeah, that would be a good topic for Discussions, I think. Are you talking about people just posting things that they found useful? Yes, for example, to put a column and group by some other column, and then post the commands so that others could review them. Oh, the commands. Oh yeah, okay, cool. We should chat a little bit. I think GitHub Discussions is the best place that I have right now. And there probably should be a little bit more code sharing, like Karthik was talking about with development, right? So yeah, there should be a place for things like that. Yes, yes, so thank you. Really great tool, thank you. More questions? For the other person: I think it's very useful to put useful commands on a command-line tool like cheat.sh; that's where I always find things like this. I want to point out that I was astonished by how well you and other people in the project answered a question I posted on Unix Stack Exchange. So I really want to point out the kindness. For example, I suggested moving the documentation to Read the Docs. It was me. It was me, and you were very kind in the answer and with the questions and everything, and it was a really, really good time.
It's a very different command from what I'm used to, so finding the answer was tough. So I want to ask two little questions. First, are you thinking of adding shell auto-complete? It's very useful to have auto-complete with such a lot of options and commands. And the other question: is there any performance difference between, for example, using "then" chaining here and doing it with Unix pipes in the traditional way? Yeah, yeah, so let me take those in order one, three, two. First of all, thank you for what you said about the community. It's always been my hope that it's a kind place, and I'm really glad to hear that. I also want to point out Andrea, because he's really all over Stack Overflow. So if you ran into Miller there, it was probably him. And he's a contributor there. The third one is about using pipes. It's fine. I mean, you can use mlr cut, pipe to mlr sort, pipe to mlr something else. If you want, it'll work. It just seems a little bit wasteful to be doing the string parsing and string writing at each step, or you can use "then." It's really the same, and you might find it works a little faster or not. It uses multiprocessing either way. And then zsh completion: yeah, that would be really cool. I should get on that. And if you know anyone that knows how to do this, I think it's something that's copy-pastable. Once you know how to do it once, it's easier to do it again. But yeah, the real underlying issue is that there's just a lot to Miller. There are just a lot of flags, and there's just a lot. That's its weakness, and this would help. All right, thank you everyone. Let's give John a round of applause.