How many — while we are waiting for people to show up — how many people have used shell scripting in anger? Okay, cool. This talk actually came out of deep anger and frustration at having to stitch together some tools and utilities at work, but it was motivated by a very early experience I had, about five years ago, with a data processing problem my team lead posed to me. We had NGINX logs on our servers, and he wanted me to figure out which IP addresses — via GeoIP — certain kinds of requests were coming from. At that time I was a pretty newbie programmer, so I spent some time on it and discovered that there is this thing called bash, and this thing called the shell, in which you can apparently unzip files, look at the contents, and cut, translate, and grep information. So I muddled around for a bit and came up with a literally very, very horrible two-liner, with cut and tr (translate) — I did not know sed, I did not know awk. It was basically cut, tr, grep, xargs, and maybe one or two other things — a handful of programs like that. And it turned out that the data size for a day's worth of NGINX logs at that time was something like 600 gigabytes compressed. So I tested the code on a subset of the data — obviously you can't test on 600 gigs — using zcat and zgrep, the gzip-aware versions of those commands. And then I thought: okay, let me run this on the whole thing, let me see what happens.
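A hedged sketch of what that kind of throwaway pipeline might have looked like. The log format, field positions, and the 404 filter here are invented for illustration (the real job involved GeoIP lookups); with real compressed logs the first stage would be something like `zcat access.log.*.gz`:

```shell
# Count requests per client IP for one kind of request.
# Sample lines stand in for zcat'd NGINX logs; the format is made up.
printf '%s\n' \
  '203.0.113.9 - - "GET /a" 404' \
  '198.51.100.7 - - "GET /b" 200' \
  '203.0.113.9 - - "GET /c" 404' |
  grep ' 404'   |   # keep only the requests of interest
  cut -d' ' -f1 |   # first field: the client IP
  sort | uniq -c | sort -rn   # requests per IP, highest count first
```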
So I SSHed into a jump host, started running the pipeline on the entire data set, and went away. Then somebody complained, saying, oh, you know, the jump host's CPU is fully consumed — and a few hours later, by the end of the day, I came back and saw that it had actually done the analysis; it hadn't failed. That was an eye-opener for me. And this was a use-and-throw situation, where we didn't really need to preserve that piece of code. It prompted me to start looking at the shell seriously for these one-off jobs, and eventually I found I would reach for the shell first — to prototype something, or to stitch together a bunch of software tools to solve a particular problem — because it lets me avoid taking on third-party dependencies and lets me use whatever is already on my system. In a team setting I would reasonably expect everybody to be on the same version of the operating system, and my local virtual machines for testing would probably have the same version of Linux as my servers. That's a reasonable expectation within a small team, and I can exploit the power already embedded in my system to do something useful — until we discover that our use cases have exceeded the solution, and we actually need either an off-the-shelf product or utility, or to build a serious application in a serious programming language. So that has been my background and experience. So, are we going to start — you wanted to give an introduction? Okay, I am just talking.
So anyway, my name is Aditya and I am currently a freelancer. The title of my talk, as many of you might have seen, is Doctor Strange Pipes — I need to fix this, sorry, just give me a second, I will fix the display; I want to turn off the sleep; okay, hopefully that does it; yeah, that's better, all right. It's Doctor Strange Pipes because pipes appear strange to people in the beginning, but they end up being a very powerful construct. My primary programming experience has been functional programming, so I wondered whether one could apply those principles and techniques in the shell — and this is, in fact, how I end up writing shell code. I don't know how people usually write shell programs, but this is how I do it, and that's what I wanted to show. All right, let's begin. These are my personal motivations. Text is everywhere, so I want to learn to process it. If there is a problem, the masters have already solved it better — I don't think I can write a better sort, or grep, or tr than what people have already written and provided in my Linux. They are stable tools with very well-known warts, and this well-known-warts bit is important to me: if I adopt the new shiny thing, I'll only discover its problems as I go along, and my job is not to discover problems, my job is to solve my problems. So I'd rather use a tool with known issues and known workarounds, because they are well understood. Then: personal and team productivity, and tool-building autonomy. This becomes very important in a team context — I would rather begin by building prototypes of our own solutions that solve our problems adequately, rather than go out and take four or five different off-the-shelf components, one specialized for each problem, and have to learn each system's own constructs and each system's own ways of interaction, okay.
So wherever possible I'd rather build custom tools — to learn about good and successful design. A lot of things that came out of the early Unix days are still around because they are successful: namespacing, process management, isolation of environments. These are all ideas that originated, as far as I know, in Unix history, in the good old days. Also, to learn about bad but successful design — because one of the problems people encounter when using Unix tools is that similarly named flags do different things. Sometimes this bad design is accidental: Unix being an open standard and an open system, eventually, after the proprietary Unixes became non-proprietary and the code became available more generally, everybody wanted to implement their own Unix, and the tools were also reimplemented. Standardization efforts took a while, and even now not all tools conform exactly to the POSIX standards. So there are variants; the idiosyncrasies of the tools are an artifact of the long history of Unix — if you live long enough, you will live to see yourself become the villain. Given enough time, there will be enough problems with any system to make it intimidating at first. And it's also an entry point into learning computer history, finding interesting tidbits of information just for the fun of it. I hope to convey some of these things in the rest of the talk. So, to begin with, some thought experiments: which of these are functional and which aren't? Just call out answers. Is bash functional? Haskell is functional, Erlang is functional, Clojure is functional, F# is functional, C++ can be functional. All are functional? You are not risking anything, because I gave away the answer. All? None? That's a more nuanced answer — anybody else? I would say that this conference itself is an
example of showing that you can do functional programming in Java — that you can use a set of functional techniques to write .NET code, or C++, or whatever it is. So then the question becomes: if these programming languages aren't inherently functional, what makes a functional language functional? Why do we say that this language is more functional than that one? Is it mathematical? Is it because the language is so-called declarative — what you say is what you get, you don't specify loops or the procedure to compute something? Is it because they are data-oriented? Stateless? Memory-managed? Is it because it's a system? Let's quickly cycle through these. Mathematical: the concept of the function is a mathematical concept; higher-order functions and types also go in the mathematical bucket. Declarative: map, filter, and reduce are examples of the declarative idiom, along with recursion and so on. By the way, map/filter/reduce, as we will quickly see, is automatic in Unix: if you are using pipes, then depending on how you use the utilities in between, it's just automatic — we will see that when we get to the demos. Essentially, you don't need a function called map or filter or reduce; the way you use the tools does it for you. Data-oriented: data in, data out, so-called referential transparency, message passing. And mind you, this is obviously not a complete list — it's based on my limited experience and understanding, so there is room for discussion and for correcting me; that's one of the reasons I am doing this talk, because I'd like to learn as well. Is it stateless — no shared state, no mutable state, no side effects? Is it memory-managed — garbage collection and immutability, which are ways to manage my access to memory? Is it a system?
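The "map/filter/reduce is automatic" claim can be sketched in one pipeline. This example is mine, not from the demo — the stage labels are the point: each program in the pipe plays the role of a higher-order function without ever being called one:

```shell
# map/filter/reduce "for free" in a pipeline:
seq 1 10                  |   # produce the lines 1..10
  awk '{ print $1 * $1 }' |   # map: square each line
  grep -v '^100$'         |   # filter: drop the line "100"
  awk '{ s += $1 } END { print s }'   # reduce: sum what remains  → 285
```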
Or is it a human discipline? For example, accounting is a functional discipline: whenever you make a mistake in an accounting entry, you never delete it — it's an append-only system. You make a correcting entry: if I've credited something wrongly, I make an entry debiting that exact value back. So this is the original log, in a way, because it's append-only; I would say it's a time series; and it relies on following a particular system — a human discipline, automated rules or guidelines. And this is where the programming language itself comes into the picture, because everything we've seen so far — all those other points — are but aspects of functional programming, and different languages make them default, or automatic, or idiomatic, to different extents. It's harder to do some of this stuff in some languages, so you'll see people complaining: oh, it's so hard — when you write traditional Java in a functional way, it looks bad. I would say it looks different, and one is not used to coding in that fashion; it looks different because the language design was not oriented towards programming that way. Managed environments are also part of the system. One of the reasons for using Java, for example, is that the JVM is awesome: it does garbage collection, it lets you tune garbage collection, it gives you a layer of safety from the underlying system, it gives you portability, and it gives you a certain set of invariants you can rely on. Likewise, the operating system does the same: it manages processes for us, it schedules — schedulers are amazing pieces of technology, and they are available for free over here — if only we could trust ourselves to rely on the operating system to do hard things for us. So, in the same vein: what makes a Unix a Unix? There is no
one canonical, standard definition of Unix, so I took this from a nice slide by Rob Pike. It's something like the combination of: a higher-level programming language; a hierarchical file system; uniform, unformatted files — which is basically text; a line-oriented interface; a separable shell; distinct tools; pipes; regular expressions; portability; and security. A separable shell means I could use any shell interface I want — it's not bound to the operating system itself. Distinct tools: each tool is a function, and each tool can be composed with other tools. At this point I'll say that the way I write code assumes it will be used in a limited environment — one that I, or my team, know and control. Once you start writing general-purpose shell scripts for a wide audience, you're going to be in serious trouble, because you have to consider a very large footprint of operating systems and operating environments, and differences in implementations — a certain flag will behave differently with a different version of awk, for example; I've been bitten by that bug once or twice. That's very hard. Also, secure shell scripting is hard; it's worth looking into why and how, but the 95% use case is: don't do it. Sometimes you will find programs that provide their installers as shell scripts — those scripts will be hairy and complicated, because they have to defend themselves from the wild, wild world out there. Any questions at this point, by the way? No? So that was a little bit of introductory background. In the conference proposal I had placed a little piece of code. It is a solution Doug McIlroy wrote in 1986, in a reply to Donald Knuth, to a problem posed saying that,
given a text file, compute the word frequencies from that file. Donald Knuth wanted to demonstrate literate programming and structured programming, so he wrote a solution implementing all the algorithms from scratch — the program was, I think, some fifty pages of Pascal, according to the records that I found. Doug McIlroy wrote this solution in answer to that. The properties of this solution: it uses the software-tools philosophy, reusing the tools already available on the system; it avoids writing extraneous code; and it is almost impossible to misunderstand, once you figure out how tr, sort, uniq, and sed work — four distinct programs. Just to run through what it does: the first step is a transliteration step. It takes the wall of text and converts it into a column of words, according to some definition of a word — the A-to-Z pattern, uppercase and lowercase; anything that fits that pattern is picked out as a word. The -c and -s are flags for translate; I'll come to what they mean later. Then, because you want to tokenize, the second step translates uppercase to lowercase everywhere. Then you sort, to bring identical words together. Then uniq -c collapses each run of duplicates: you have adjacent repeated words in your column, and uniq picks the first instance of each run and, with the -c flag, also spits out the count — how many instances of that word it found. Then, because you want a frequency distribution, you reverse-numeric sort — that's the -rn with sort. And the sed expression at the end simply says: print the given number of lines, or 10 by default. That's the positional argument $1, which this program expects; given no argument, it falls back to the default. That's the piece of code, and it turns out I can take it and go to my terminal. If you have not read the bash man page, it's a good man page to read — a nice piece of documentation about your shell. So I say man bash, paste the pipeline just like that, and it works: a program copied from 1986, working on a system in 2019, and that's what I like. So: there's a pipe, and another, and another — what does a pipeline actually mean? It essentially means that the output of one program is passed to the input of the next. Why it works is that the system provides all programs with certain defaults, and the defaults in Unix land are stdin, stdout, and stderr — file descriptors 0, 1, and 2. So the stdin of tr is connected to the stdout of man bash via the pipe. If I were cheeky I would suggest that this is the original monad, but I won't — because every program, man bash included, can potentially generate a second stream which goes to stderr, so it is a program with one input and two outputs. How do you chain a function with one input and potentially different types of outputs? You use monadic composition; this is how you do it in the shell. And the reason this works is that it's a line-oriented format — all these commands are line-oriented. We call them commands, but they are actual programs: if I say type -a tr, it says tr is /usr/bin/tr, which means it's a distinct software tool whose binary sits in that directory. So that was the motivating example. Now some demotivating examples — this is what scares people away in the beginning. I'm going to do some text processing, and I'll show three or four variants.
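For reference, the McIlroy pipeline just described. The function wrapper and its name `wordfreq` are mine, as is the explicit `:-10` fallback (McIlroy's original was simply `sed ${1}q`; the talk describes 10 as the default):

```shell
# Doug McIlroy's 1986 word-frequency pipeline, wrapped in a function.
wordfreq() {
  tr -cs 'A-Za-z' '[\n*]' | # tokenize: squeeze every non-letter run into a newline
  tr 'A-Z' 'a-z'          | # normalize case
  sort                    | # bring identical words together
  uniq -c                 | # collapse runs, prefixing each word with its count
  sort -rn                | # highest counts first
  sed "${1:-10}q"           # print the top N lines, then quit
}

# e.g.:  man bash | wordfreq 5
printf 'the quick the lazy the dog dog\n' | wordfreq 2   # "the" (3), then "dog" (2)
```

(The `[\n*]` form of the second tr argument is the historical System V spelling McIlroy used; plain `'\n'` also works with GNU tr.)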
Actually, variants on the same theme — except the last two, which are quite different, because one uses only sed and one uses only awk to achieve the same result. So, let's say I want to program something — a text game, say — that requires my terminal to be of a certain size. If it's below that size I want to fail, saying I won't start the game because the terminal window is too small. In that case I want to know the current size of the terminal, and it turns out that if I do stty -a — by the way, if you prefix a command with a space, it won't show up in your history; I want this one in my history, so I'm copying it from the first character — you see that it printed rows 17, columns 72. How did it do that? Let's rebuild the program from scratch. stty -a prints a whole bunch of information about my tty — tty is teletype, I believe, which used to be the original line-oriented input/output interface for time-sharing systems; a nice bit of history to explore. Now I need to figure out where the rows and the columns are, and it turns out they're on the same line, and I want to extract that data out of the line. stty offers me a stable format — none of these fields will change position however many times I evaluate stty — so I can rely on its output. So: -E is extended grep, or egrep, and I grep for "rows" and "columns" — there, we got it. Now I want to extract the second and third columns. Here is a system-provided default again: these things are fields, separated by a field separator, which happens to be a blank. This supposedly blank thing is the input field separator for any kind of record, and it is made up of — let's say, space is a space. I didn't prepare this before, so maybe I need to quote it — yes; quoting is important, there are interpreter rules one has to learn over time, and you will forget them, so you should keep a list handy. So: space is a space, and I'll translate that — \n is a newline, sorry, \t is a tab. Oh, and I should not echo, I should printf. So the input field separator through this demonstration is composed of three defaults: a blank can be a newline, a space, or a tab. Many programs respect these defaults; some don't; some assume tab-separated input and, if that's not what you have, let you provide a flag saying the delimiter is something else. tr is a program that translates or deletes characters — that's its only job — and notice that I incrementally built up this demonstration by using tr again and again. It's very limited that way: given one character, it will translate it, delete it, or squeeze it. So that's IFS, the input field separator. Now, cut only respects tabs by default, so you need to give it a delimiter. Let me show you another workflow here — a commenting workflow: I comment out the rest of the pipeline, so I've got grep; the next step is cut; let me uncomment cut and see what it's doing. If I look at this piece of text, it's semicolon-space separated, so let me use the semicolon as cut's field separator. I get rows 17 and columns 72 — everybody can see that — so I've extracted the data I want. Now I need to format it for output, because it's got an annoying leading space and an annoying semicolon in between, and I want to convert it into a format that conforms to IFS, the default input field separator. We've already seen tr, so I use tr to delete the semicolon — there you go, the semicolon is gone, but this annoying leading space is still here. The sed expression says: match any number of spaces at the start of the line and replace them with nothing. The caret means the pattern must start at the beginning of the line — and again, this is extended regular expression syntax, which is not portable; if you want portable scripts, you have to write standard basic regular expressions, you can't assume egrep everywhere. The plus means any number of the preceding item, the parentheses form a regex group capturing those spaces, the replacement is empty, and it's applied globally across the line. All right: rows 17, columns 72. Now, all these man pages are available, so, being a curious person, I read man tr, man cut, and man sed. I'd originally discovered these tools through Stack Overflow or some command snippet, and I was thoroughly confused, because there were fifty variants of every answer — you look at the comments and you go crazy. So I experimented, and it turns out that using cut's output control I can eliminate the tr stage. Same thing — grep, cut — but now I add an output delimiter. When I specify -d ';', cut uses that to split the input, but it also knows which delimiter it's supposed to drop, so with the output delimiter it can get rid of it. So the delimiter is gone, but the annoying prefix is still there, and I can keep using the same sed expression to trim it. That's one variant on the theme. The next variant: after reading grep's man page on regular expressions, I figured I could use fancier regular expressions and grep's output control to possibly drop cut — but then I discover I have to bring tr back in, and this, again, I discover by incremental development. The -o flag stands for "output only the matches", and it helpfully emits them as a stream, one match per line — if there were a thousand matches, it would output a thousand lines — conforming to this newer, fancier regular expression I learned. Now I need to make this conform to the output I had before, so I need to deal with the newlines. I can't just tr -d '\n', because then "rows 17 columns 72" wouldn't have a space in it anymore; instead I translate \n to a space, and that does it: rows 17 columns 72. But now I have another problem — a trailing space I don't want. I can use the same sed trick, except telling it to look for the space at the end: instead of the caret I use the dollar, and there we go, output formatted the same way. Now, this was kind of easy, because it was all variants on a theme: I didn't need to learn other programs. What if I went back to Stack Overflow and looked at what other people have written? It turns out that after messing around with sed, you discover sed can do a lot more, with fancier regular expression matching, and I can produce the same output with sed alone — entirely based on regex capture and printing the captured values. I could do this because I know for a fact that the input is structured text; if it were unstructured, or if the field order were not guaranteed, this would not work — the structured-text bit is key to making all of this work. All right, the last variant: I discover awk, I spend a day learning awk, and I say: oh, it's even simpler. Because awk is one of those programs that by default respects IFS-style blanks as the field separator, but lets me override it with a more complex separator — it can be multiple characters. So I say the field separator is semicolon-space, and I use a pattern match: for whatever line matches "rows" or "columns", print that line. If I print the whole record, $0, I get the full line; then I can print just the fields I want, $2 and $3. And here we run into a little problem: the line is not semicolon-space terminated at the end, so awk captures the whole remainder as the last field. These are the little gotchas — there's a joke that you have a problem, you say "I'll use regular expressions", and now you have two problems. But given relatively straightforward, structured data domains, you should be able to use regular expressions quite effectively. So I go back to printing the fields I actually wanted. And now you despair, because you wonder: how am I ever going to learn to construct programs if there are so many ways of doing the same thing? Fear not — there is hope. Before that, let's take a quick detour to software-tools principles, from a book I learned shell scripting from, ten or fifteen years ago.
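The variants walked through above, reconstructed as functions. The exact flags are my reading of the demo, and a canned stty-style line stands in for a live `stty -a` (which needs a terminal), so this sketch is self-contained; the function names are mine:

```shell
sample='speed 38400 baud; rows 17; columns 72; line = 0;'

extract_v1() {  # grep + cut + tr + sed
  grep -E 'rows|columns' | cut -d';' -f2,3 | tr -d ';' | sed -E 's/^ +//'
}
extract_v2() {  # grep -o does the extraction; tr and sed fix the formatting
  grep -oE '(rows|columns) [0-9]+' | tr '\n' ' ' | sed -E 's/ $//'
}
extract_v3() {  # sed alone: regex capture and reprint
  sed -nE 's/.*(rows [0-9]+); (columns [0-9]+).*/\1 \2/p'
}
extract_v4() {  # awk with a multi-character field separator
  awk -F'; ' '/rows|columns/ { print $2, $3 }'
}

printf '%s\n' "$sample" | extract_v1   # rows 17 columns 72
```

All four print `rows 17 columns 72` for this input — the point of the despair, and of what follows.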
And I could still use it today, because the tools haven't changed. Most importantly: do one thing well. Process lines of text, not binary. Use regular expressions. Default to standard I/O — file descriptors 0, 1, and 2. Don't be chatty. Generate the same output format you accept as input — if you think back to the tr example, which I built up incrementally to show you what IFS consists of, it worked because the input and output formats were the same, and the structure was the same. Let someone else do the hard part — I am not competent enough to implement an efficient sort that will scale to almost any size of data. And: detour to build specialized tools — if your prototype shell script doesn't really solve your problem, then you have a harder problem to solve, so go solve that problem, and hopefully you'll be able to use the resulting C, C++, or Rust program as part of your Unix toolkit. So, with that demotivating perspective out of the way, a re-motivating one. Eric Raymond has a very funny section of his website called the Unix koans, and I'm going to read one out to you. It's called "Master Foo and the Shell Tools". A Unix novice came to Master Foo and said: "I am confused. Is it not the Unix way that every program should concentrate on one thing and do it well?" Master Foo nodded. The novice continued: "Isn't it also the Unix way that the wheel should not be reinvented?" Master Foo nodded again. "Why, then, are there several tools with similar capabilities in text processing — sed, awk, and Perl? With which one can I best practice the Unix way?" Master Foo asked the novice: "If you have a text file, what tool would you use to produce a copy with a few words in it replaced by strings of your choosing?" The novice frowned and said: "Perl's regular expressions would be excessive for so simple a task. I do not know awk, and I have been writing sed scripts in the last few weeks, so as I have some experience with sed, at the moment I would prefer it. But if the job only needed to be done once rather than repeatedly, a text editor would suffice." Master Foo nodded and replied: "When you are hungry, eat. When you are thirsty, drink. When you are tired, sleep." Upon hearing this, the novice was enlightened. So, back to the same example. Master Foo and the shell tools: the original solution was grep, cut, tr, and sed — easy things I could pick up very quickly, nothing very complex; sed and awk proper require you to learn a lot more. Thus spake Master Foo: when you are hungry, eat. I had an immediate problem, I managed to solve it with simple patterns, and I learned a thing in the process — that I can write a function, and that function is now available as a command-line tool. So I do stty -a and pipe it to this function — it's a function now, and it was what I knew how to do at the time, so I just used it and added it to my library. Later on I got curious — I got thirsty — and wanted to drink more from the fountain of grep and sed, so I learned those, and now I have two ways of doing it, v1 and v2, and both work; they are the same. By the way, I can verify that they are indeed the same. When a program exits successfully, it exits with an exit code of zero — zero means success, non-zero codes mean failure. If that's the case, I can use short-circuiting: && is the logical Boolean AND, so I can say: diff the two, and echo yes. And this — this is a process substitution; I will come back to it later, but on the fly I thought I should show it. Essentially, it presents the result of executing the command inside as a file interface. One way I can show you why that's true: this is
automatic memory management, as far as I am concerned, because the system did it for me — I did not have to create a temporary file. The program diff expects either stdin or files as its input, and this is one of the techniques I tend to use to avoid creating temporary files, because otherwise you have to clean them up. And then, finally: when you are tired, sleep — because after studying awk you will definitely need some sleep. So now you have these three variants, and let me try to diff the third one as well: stty -a, pipe it to extract-three — oops, sorry, not this — diff — oh, okay, no. How many files does diff expect? It reports when two files are the same; it doesn't take three. I'll have to read the man page better. And by the way, this is my process of discovery: I expected this program to take multiple files, it doesn't work, so I look at the man page — another bit of hygiene you want. We have some computing to do, so I won't get into that; you get the point. I tried to do something I had never done before — diffing three files at one go — but the principle is the same, and the principle is the important bit: I can treat these variants as referentially transparent, because they do pretty much the same thing. That's what I tried to demonstrate with diff, which didn't quite work for three files; still, I can use whichever variant I am comfortable with today. The other reason I want all this is the same problem again: I am going to take Doug's program and run it on different sizes of files, and to do that, now that I know how to write functions, I'm going to use functions to delay evaluation and redo things. This is Doug's program — I just pasted it there as a function, and "Doug McIlroy's program" works, doing the same thing. Now I'm going to set up some test data.
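The process-substitution trick from a moment ago, sketched end to end. `extract_v1`/`extract_v2` are stand-ins for two of the variants, and a canned `$sample` line replaces a live `stty -a` so the sketch is self-contained (bash is assumed; `<(...)` is not POSIX sh):

```shell
# Each <(...) appears to diff as a file-like path, so two pipelines can be
# compared without creating (and cleaning up) temporary files.
sample='speed 38400 baud; rows 17; columns 72; line = 0;'
extract_v1() { grep -E 'rows|columns' | cut -d';' -f2,3 | tr -d ';' | sed -E 's/^ +//'; }
extract_v2() { sed -nE 's/.*(rows [0-9]+); (columns [0-9]+).*/\1 \2/p'; }

diff <(printf '%s\n' "$sample" | extract_v1) \
     <(printf '%s\n' "$sample" | extract_v2) \
  && echo 'same'   # diff exits 0 when there are no differences, so && fires
```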
What is this setup function doing? Let me walk through it. I've set up three files, call them 1x, 1kx, and 10kx, and the 1x file basically replicates the man page, just as an example. First, a way to rotate a file: there are cleverer ways, but I use this one because it's simple and I understand it. /dev/null is a special quote-unquote file available on your Unixes which is empty, so if you overwrite the contents of your target file with it, the target file gets truncated; then I recreate it from scratch. This is again the short-circuiting we saw: if the truncation fails for some reason, which is extremely unlikely, the file will not get regenerated. Likewise I rotate the 1k file. Next, the loop. Look at seq first: seq 5 generates the numbers one through five, one per line. So, for i in $(seq 5) (the semicolons are necessary when you type statements on a single line); do echo num $i; done. This performs command substitution: just as process substitution presented a command's output behind a file interface, command substitution splices the command's output inline, and for iterates over it: num 1, num 2, num 3, num 4, num 5. I write for-underscore when I don't care about the loop value, because all I want is to perform an effect some number of times, without counters: I take the 1x file and append it to the 1kx file a thousand times, and then append the 1kx file ten times to the 10kx file, because that's how the arithmetic works out. Here we go: set up test data.
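Reconstructed from that description, the setup function looks roughly like this; the file names and the five-line seed are illustrative, not the talk's exact code:

```shell
setup_test_data() {
    # "Rotate" by truncation: /dev/null is empty, so overwriting the target
    # with it empties the target; && short-circuits if that somehow fails.
    cat /dev/null > 1x.txt &&
        for i in $(seq 5); do echo "num $i"; done > 1x.txt

    # Perform an effect N times without caring about the loop variable:
    # append the seed a thousand times, then append that result ten times.
    cat /dev/null > 1kx.txt &&
        for _ in $(seq 1000); do cat 1x.txt >> 1kx.txt; done
    cat /dev/null > 10kx.txt &&
        for _ in $(seq 10); do cat 1kx.txt >> 10kx.txt; done
}
# setup_test_data &   # the ampersand backgrounds it; `jobs` shows it running
```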
By the way, you can inspect what this thing is: ask the shell for the definition of setup test data, and it helpfully prints out the function body too. Another comment: this "function" keyword syntax is somewhat frowned upon; the way you're supposed to write shell functions is with the name followed by parentheses and a brace. The two are equivalent in bash. In the wild you'll also see "function setup()" with the parentheses at the end; that's considered a no-no nowadays, I believe it has some portability issues. I use the function keyword because that's how Emacs auto-completes my function definitions, and it's accepted in the bash world; it works on bash 4.4 and up, at least. So: setup test data, and I'm going to run it as a job. The ampersand at the end backgrounds the task, and jobs tells me the program is still running. The reason I did that is that generating the 10k file takes sixty or seventy-odd seconds. This is another thing the system provides me: I can run jobs. It's running in the background, the system is managing the process for me, and meanwhile I'm free to go and code; I'm not blocked at all. So that's happening, and next I'm going to use a function that runs Doug's program with timing. In it I simply put the string that invokes Doug McIlroy's program and send it the function's first argument; helpfully, I also print some information first, the human-readable size of the file and how many lines it has, before running the timed portion. Then I redirect the output to /dev/null, because right now I don't care about the output, and with an ampersand I can run the whole thing as a background job.
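As a sketch (my_pipeline is a stand-in for Doug's program, and the function names are mine):

```shell
# Preferred, portable function syntax: name() { ... }
# (the `function name { ... }` keyword form also works in bash).
my_pipeline() { sort | uniq -c; }            # stand-in for Doug's program

run_with_timing() {
    local fname=$1                           # a local name instead of bare $1
    du -h "$fname"                           # human-readable size of the input
    wc -l < "$fname"                         # lines we are about to process
    time my_pipeline < "$fname" > /dev/null  # discard output; only timing matters
}
# run_with_timing 1kx.txt &   # '&' backgrounds the job; `jobs` lists it
```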
So this whole thing is now a function; and the setup job got done in the meantime, by the way. What does the timed runner need? A file, which is the data. I didn't bother to create a local name for it; I could have, local fname equals dollar-one, and then used fname instead of $1 throughout. I have three of these files, so I give it the 1x file first. This auto-completion, incidentally, is provided by a library called Readline, and it's very handy in interactive use; it's part of why my functions are usable on the command line: I can name them however I want and auto-complete them. The 1x file gets done in under a second. Now the 1000x file... still running... this takes about fifty-nine-odd seconds. It's now running on a 300-, 350-megabyte file, I haven't changed a single line of code, it hasn't blown up; it's just taking more time, and linearly more time, on this dataset: a minute or so. If I run the same thing on the much bigger file, now four orders of magnitude from where we started, even wc takes a while, because there are some 65 million lines in this file, and I still haven't changed a single line of Doug's program. 1986, right? It's running as a job, so I'm free to continue presenting; we'll come back to it later, because it will take linearly more time, about nine to ten minutes, and we don't have that long to wait. So that was motivating example one: the scaling potential of your shell tools is almost limitless.
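For reference, the 1986 pipeline being timed here, as it is commonly reproduced, wrapped in a function so the top-k count can be passed as an argument:

```shell
# Doug McIlroy's word-frequency pipeline: print the k most frequent words.
doug() {
    tr -cs A-Za-z '\n' |   # squeeze runs of non-letters into newlines: one word per line
    tr A-Z a-z |           # lowercase everything
    sort |                 # group identical words together...
    uniq -c |              # ...so uniq can count each one
    sort -rn |             # order by descending count
    sed "${1:-10}q"        # print the top k lines (default 10) and quit
}
# man bash | doug 5
```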
And the reason this works is that pipes are automatically buffered (there are default buffer sizes for pipes), and each of these tools is phenomenally optimized: they manage memory carefully and consume as little CPU as they can. Note, too, that this whole pipeline is a map-reduce pipeline, because at the end we're computing a frequency distribution: the tr steps and the sed steps are map operations, while uniq -c is a reduce operation; it needs its input grouped, which is why sort, which wants the whole data set, comes before it. And because the entire pipeline is eager, potentially blocking, I can use job control to run it in the background and let the operating system manage it. We can go one step further. Harking back to the running job: yes, it's consuming 88, 97 percent of one core, and this is a supercomputer, a five-core machine. It will mostly be sort and uniq: sort is an O(n log n) algorithm, and uniq should be more or less linear, so there's an O(n) and an O(n log n) in the frequency-distribution pipeline; that's its complexity. The tr stage will also be CPU-intensive, because it's doing a lot of text processing. Audience: [a question about pipes] Correct me if I'm wrong, but that's the facility the pipe provides us: pipes block, and they wait for input as long as there is something to deliver to an output, and the pipe is the thing that's buffered. One example is cat: if I give cat a file, it spits out that content, but if I run it bare, it blocks indefinitely, waiting for input. So it's the programs as well as the pipes; nothing special, it's just automatic.
So that's what I've been doing all along: man bash would ordinarily be connected to my tty, but with a pipe its behavior changes, and I can pipe it to whatever program I like. (Thirty minutes left? Okay.) Audience: Are there limits you run into? I haven't encountered one at the problem sizes I've dealt with, which go up to a few hundred gigabytes, so on a reasonable system, no. But there could be limits: if you're writing to a file and that file gets too big, you might suffer; if your system gets memory-constrained in some way, you might suffer; or you could run out of file descriptors because of other processes. Not in my experience, though: this is a heavily CPU-intensive workload, not a memory-intensive one. Audience: What about a very fast source and a very slow sink? You could try reading from /dev/random; that's effectively infinite. It will block: because the consumer is slow, the producer blocks until it can write again, and you just have to wait. The whole thing terminates after the last program terminates; the signal propagates and all those processes die. That's my understanding of it. So, as long as there is something to read: this finite file takes about ten minutes, but if I were feeding it an unbounded stream, it would run forever, and at some point I think sort or uniq would blow up, because as far as I know they can't work on endless streams. Audience: So is the pipe like a queue? You can think of it that way. I don't actually know how it's implemented, but in the abstract you can think of it as a blocking queue; it's still a stream as far as the programs on either side are concerned, with a buffer in between. I haven't actually tried that; worth exploring. And that's the nice thing: when you have these questions, try them out.
Now you've got the basic concept of how to construct these things. I'll take further questions later, sorry; we have only thirty minutes and there's an important bit I want to explain. Doug's program was a pipeline like this, but if you look closely, each stage is semantically different from the others. What if I turned each of them into a function? What use could that be? The first stage is specific to how Doug identifies words, so let's make it a function that flattens input into a word list, and feed it man bash. This is domain-specific: what if I were processing Hangul, or kanji? This representation of characters is incidental to his dataset, not essential to the problem, so I factor it out. The tokenization bit is potentially domain-specific too, because you could tokenize in any number of ways. Next, lowercase: everything gets lowercased. Then frequencies: the idiom sort, uniq -c, sort reverse-numeric is a frequency distribution, so I name it frequencies. And the final stage is really "take n", so I call it take. Oh, and see that output? The big job just finished, and it took ten times as long as the previous run: linear. And that's awesome: that tiny pipeline scales that far. So now we're breaking the whole thing into tools. In the definition of take I've given a default of ten, so even if I don't provide an argument it gives me ten; take 5 gives me five. That's a little bit of careful design, giving take a sane default, because a take left to stream forever would just block. What else can we do? Frequencies can work at other points in the pipeline too. Say I skip the lowercasing step.
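The decomposition looks roughly like this (the function names are my reconstruction of the demo):

```shell
# Doug's pipeline broken into named, single-purpose stages; each stage is
# still usable on its own at the prompt.
flatten_to_wordlist() { tr -cs A-Za-z '\n'; }          # domain-specific: Latin text
lowercase()           { tr A-Z a-z; }
frequencies()         { sort | uniq -c | sort -rn; }   # the reusable idiom
take()                { head -n "${1:-10}"; }          # sane default of 10

# The original program becomes a readable composition:
# man bash | flatten_to_wordlist | lowercase | frequencies | take 5
```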
The counts come out different, because somewhere there's an uppercase T in "The". So already you can see the mix and match: in my data-analysis workflow I'll want frequency distributions over different things, and frequencies is a standard idiom I can reuse. Next, a dictionary sort; look up the flags later, I'll share the source code. I take frequencies, take 5, and pipe that through a dictionary sort: it's no longer an ascending or descending frequency distribution, it's sorted dictionary-wise. What if I try to sort by rhyme, by word endings? That's another idiom: reverse, sort, reverse. (There's another bit of syntax in there that I'll skip for now.) So I swap the dictionary sort for the rhyming sort, and notice how fast that was. What if I want to eliminate stop words? I have some definition of stop words, I put them in a file, and now the chain reads: flatten to word list, drop stop words, tokenize, lowercase. To see it in action, I run it, and "the", "is", "and", "or" and so on disappear, per whatever my definition of a stop word is. And I can apply the stop-words stage here, or afterwards, because I might want to compute the original frequency distribution first, preserve it, and get rid of the stop words after that.
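The two idioms, reconstructed; stopwords.txt is an assumed file with one stop word per line:

```shell
# Sort by word endings ("rhymes"): reverse each word, sort, reverse back.
sort_rhyming() { rev | sort | rev; }

# Drop stop words: fixed-string (-F), whole-word (-w), inverted (-v) match
# against the assumed word-list file.
drop_stopwords() { grep -v -w -F -f stopwords.txt; }
```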
Now the original frequency distribution is preserved. This is caching, right? I've cached an intermediate output and I'm doing further processing on it later. The last two examples are the same thing again, but with bigrams and trigrams. What will this be? man bash, flatten to word list, that's single words; then: give me bigrams, and then frequencies of the bigrams, exactly as before. All I did was change the input data representation and treat each bigram as a record. If I needed to drop stop words in the bigram case, I could do it before the pairing, or after; it still works, because it's a pattern match, so I can throw away "of the", "and the" and so on, or modify my definition and do something more interesting. Likewise we can do trigrams. Where else can I get reuse? I have this little function called git committers. By now I've figured out how to read man pages, and I realized I can avoid writing code by using git's own interface: git log with the %an format prints the author name. I pipe that to frequencies, sorry, git log master, and there's the committer list, most active first. And turning that around a bit: if I drop the frequencies stage, it simply generates the list of committers, and I already have the frequencies idiom to apply when I want it. I've already shown you process substitution; the other thing this function did was run in a subshell, and a subshell environment is distinct from the parent shell. The cd could potentially have changed my working directory here, but it doesn't: the parent environment is exported to the child, and the child does the cd operation internally.
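Those n-gram and git examples can be sketched as follows; the awk pairing and the branch and path arguments are my reconstruction, not the talk's exact code:

```shell
# Bigrams/trigrams as a change of data representation: pair each word with
# its successors, one n-gram per line. The frequencies idiom keeps working
# because every n-gram is just another record.
bigrams()  { awk 'NR > 1 { print prev, $0 } { prev = $0 }'; }
trigrams() { awk 'NR > 2 { print p2, p1, $0 } { p2 = p1; p1 = $0 }'; }

# Reuse elsewhere: most-active git committers. %an prints the author name;
# the subshell keeps the cd from affecting the parent shell.
git_committers() { (cd "$1" && git log master --format='%an') | sort | uniq -c | sort -rn; }
```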
When the subshell exits, that doesn't affect me at all, but its output is presented as a file descriptor to the outside command. Oh, and head also has a default: it prints ten lines, so the sed 10q earlier was pretty much the same thing. At this point I've got a time check, so let me recap. Most importantly: do one thing well. We took Doug's program, broke it down into domain-specific pieces and general-purpose pieces, and realized we could use those at the terminal for a lot of interesting work. Plus zero mutation: there is no explicit mutation in any of the pipelines or functions I'm using. Let me run through the Unix shell tools and concepts seen so far, very quickly. The role of a software tool: be single-function, single-purpose; behave like a function as far as possible; be the fastest it can be at its task, and if it can't be fast in every case, be fast in the general case at least. For example, if your grep patterns, your regexes, aren't well chosen, grep can slow your system down a lot, but in the general case it is very fast. Be domain-agnostic as far as possible, and be useful by itself: tr has no concept of the domain of the data it receives; it only cares that it sees something that looks like text, and it does its work on that. And be streaming wherever possible; that's the idiomatic way of thinking about input, because then you can read things off a socket, you can pipe things anywhere. I can even send output to a named pipe, a pipe with a name; another thing the shell, or rather the Unix system, provides you. Mind you, I'm not always precise with the words.
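A named pipe in miniature; a sketch, not the talk's exact code:

```shell
# A named pipe (FIFO) is a pipe with a filesystem name: a writer and a reader
# rendezvous through it and the data never lands on disk.
pipe=$(mktemp -u)       # a fresh, unused path for the FIFO
mkfifo "$pipe"
seq 1 3 > "$pipe" &     # the writer blocks until a reader opens the pipe...
cat "$pipe"             # ...and the reader drains the stream: prints 1, 2, 3
rm "$pipe"
```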
Be composable, and interoperate with any other tool via the standard-I/O model. It turns out my own functions respect the standard-I/O model by default, so I can use them in combination with the existing Unix tools; that is amazing power. And if I write larger scripts that perform the job of a tool, and I'm careful to make the entire script respect standard I/O, it composes the same way. The role of the shell, then, is to provide a concise syntax for composing standard software tools into custom software tools. That's one way to think about the shell: you have this whole à la carte menu of tools, and a thing you can use to glue them together. And obviously it's a powerful user interface as well. Standard in, out, and error: rely on them. Pipes and pipeline composition: you've seen that pipes are amazing. You can work just as with function composition: little functions that each do one job well, abstract pipelines constructed out of them, and those pipelines placed inside function bodies in turn. Any of these, the dictionary-sort chain we built, say, could go into a function definition, and then you pass its output to the next stage of processing, because everything is routed through stdin and stdout. Redirection: I showed you examples; use it to your advantage. And tee is a very nice way to cache the intermediate output of a pipeline while letting the pipeline continue: tee forks the stream, so to speak, creating a copy and sending it to two destinations, one of which is stdout. So you send one copy to a file and the stream flows on.
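tee-based caching in miniature:

```shell
# tee forks the stream: one copy goes to a cache file, the other continues
# down the pipeline, so intermediate output is preserved without stopping anything.
cache=$(mktemp)
seq 1 5 | tee "$cache" | tail -n 2   # downstream still sees the full stream
wc -l < "$cache"                     # the cache holds all 5 intermediate lines
```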
Environment isolation: subshell bodies are isolated, so if I cd inside a subshell, as in the git committers example, the cd happens only within the scope of that execution. Scripts, subshell calls, functions: you can get isolation from all of these. Functions and command-line usability: we've seen this, and it's what I meant in my proposal about creating Swiss Army toolkits. With a Swiss Army knife you can open a tin can, maybe tighten one little screw, but you couldn't build a doghouse with it; to build a doghouse you reach for a hammer, screwdrivers, nails, independent specialized tools, and with those I could build a bigger house as well. There's a scaling advantage in composing specialized tools, and you get it here. System guarantees and environment defaults: again, I'm interested in learning more about my system and using it as best I can, and over a period of time I discover that I have stdin and stdout, a field separator, automatic buffering of pipes, and so on. So I use them; I don't try to reinvent the wheel. Text processing, pattern matching, and regexes: text is the lingua franca of your operating system, everything is text, so in this setup it's good to model your computation in terms of streaming text. Substitutions: command substitution you've seen; read-only globals I haven't had an opportunity to show you, but there's a way to declare them, declare -r, a name, equals, a value, and then any reassignment draws a complaint. I can't even evict that variable from my current terminal session until I close and restart it, because it lives for the life of that shell environment. Used inside a script, it lives only as long as the script runs, so the read-only won't escape into your shell even if you invoke the script from there.
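Both isolation mechanisms in one sketch (variable names are illustrative):

```shell
# Subshells isolate environment changes: the cd happens only inside the parens.
d=$(mktemp -d)
( cd "$d" && pwd )      # prints the temp dir's path
pwd                     # the parent shell never moved

# Read-only globals: reassignment fails for the life of this shell
# (or for the run of the script that declared it).
declare -r GREETING="hello"
# GREETING="bye"        # error: GREETING: readonly variable
```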
Process substitution, again: I treat it like automatic memory management; I don't need to care about temporary files at all. Expansions: there are various pattern expansions. One example I showed you with cat is the here-string format, and if you want a here-doc, you open one, type your content, and close it. A lot of substitutions are also possible within here-docs, so you can use them declaratively: instead of echo after echo after echo, I say "I want the output to look like this, please make it so", and that's the here-doc format. People familiar with templating engines will have used something similar in other settings. I wanted to give a demo of tic-tac-toe as well, but it's not going to be possible; I'll share the source code with everybody anyway. Finally, this is how I avoid bad situations in my scripts. These are really general programming techniques, and you can reuse them in your shell environment. Because many of my scripts have effects on the system, system administration, generating files, that kind of thing, I fail fast and fail early. I avoid manual state management, with some of the techniques I demoed. I avoid portability unless it's a design requirement, and in that case I say: please, somebody else do it, because I am not the expert. I avoid adding code: for example, using sed and grep patterns well eliminated three stages of a pipeline for me; by reading the man page I figured out how to avoid writing code in the first place. And I try to redesign my data better instead: in the frequency-distribution example we discovered that the frequencies function works the same on one-grams, two-grams, and three-grams.
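Both forms in a minimal example:

```shell
# A here-string feeds one string to stdin; a here-doc feeds a template with
# substitutions, so output is written declaratively instead of echo-by-echo.
cat <<< "one line via a here-string"

name="world"
cat <<EOF
Hello, $name.
The output looks the way it is written here.
EOF
```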
So I didn't have to redesign the frequencies function; I just laid my data out differently, in a way frequencies already understands. Use tool flags and options judiciously, but not over-cleverly; there's an element of practice there that you'll discover. As far as possible I avoid imperative control flow; I try to figure out how to pipeline the problem instead: compose functions, compose Unix tools, compose pipelines, compose programs. Compose, compose, compose. Your program should emit structured, line-oriented data. That's what I aim for, because it's amenable to structured text processing, and where I control the format I can get pretty good results out of it. Design each function as a command-line utility, and use standard input, output, and error to your advantage. You can even emulate higher-order functions; I won't demo that now, look for it in the code. Then there's the general design principle of an imperative shell, pun intended, around a functional core: push as much logic as you can into pipelines, and keep the imperative parts, temporary files, setup, teardown, separate. I like to use functions for that as well. I can enforce invariants with a function that checks that certain globals or parts of the environment are correct before anything else proceeds: put it at the head of a chain, and if it exits with a failure code, the pipeline won't execute, because my invariants are not satisfied. Functions also let me control the order of effects, which is what we use them for much of the time, and safely set or re-initialize global state; I used a function that way to generate the bigger and bigger test files from the man page. All of this improves maintainability and understandability, because if I can make things referentially transparent, I can obviously replace them, swap them out at will, and it improves testability.
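A minimal sketch of such an invariant-enforcing guard (the names are mine):

```shell
# Enforce an invariant at the head of a chain: if the check exits nonzero,
# && short-circuits and the pipeline never runs.
require_file() {
    [ -r "$1" ] || { echo "missing or unreadable: $1" >&2; return 1; }
}
# require_file data.txt && sort data.txt | uniq -c   # runs only if the check passes
```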
At this point I will say that shell is a terrible environment for writing very large programs; they get very, very hard to maintain. That's balanced by the fact that you can get a lot done with a very tiny shell script if you use it well. Bash has no module system and no namespacing, and there are many, many other reasons, so you end up doing things like prefixing function names with underscores and a utility name, which Python people are used to: private by convention rather than by design. And basically, I don't use shell when it's a bad idea. For any programming language or system there's a domain it's good for; shell is good for a lot of things but not for many others, and that's a judgment call one figures out. So, that's it. Questions? Audience: Hi. I just tweeted a link to a Stack Exchange answer on pipe buffer sizes to the fnconf ID and the fnconf19 hashtag. My question: you say don't use shell if it's a bad idea, and I totally agree, so for you, where is that boundary, and what do you use instead? Usually I reach for shell when I have a quick job and I don't want to fire up another tool; I don't want to install some third-party program and then figure out how to make use of it. For example, if I had to use curl to inspect some APIs and see what the results look like, for a one-off or intermittent use case in my own workflow, I would use shell. In a team setting I might want a purpose-built tool like Postman, because it lets you share your tests and has a lot more features. So the answer is that you develop an intuition for it over a period of time.
There are also some rules of thumb: if I end up spending more than five or ten minutes trying to solve a problem with shell, I'll probably google whether there's a tool that can do it for me. Audience: Do you use any other languages, like Python or Ruby? My first professional language happened to be Clojure, but yes, I've used Python and Ruby, and DSLs, if you count things like Terraform. I've set up log pipelines before, so you end up interacting with arcane DSLs; one logging product I used had to be scripted in something called Rhino script, which was interesting. Audience: I just want to say I wrote a tool called rexe, command-line Ruby, exactly for this kind of thing, if you want to check it out; google rexe github. Nice, thank you. Audience: Is it right to say that shell scripting is scalable, but vertically rather than horizontally? That when we're running on a cluster we need something else, like Spark? You can use a bigger machine, yes: vertical scalability. On AWS you'll find machines with terabytes of RAM, and if your big-data problem gets bigger than that, you've probably got hundreds of millions of dollars in funding and six- or seven-figure customers, so you can afford Spark at that point; until then you can just scale vertically. And there are ways to do distributed computing here too, by the way, because ultimately you're talking about streams of data, and a map-reduce job can be farmed out over SSH to multiple machines, with the results gathered back. It will be a brittle pipeline, and not a very capable one; the moment you want to modify it this way and that, you'll start getting into trouble. But it's possible, and if you look it up, there are people actually doing this.
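A single-machine sketch of that split/map/merge idea, in the spirit of the earlier word counts; everything here, the names and the chunk count, is illustrative:

```shell
# Poor man's parallel map-reduce on one box: split the input into chunks,
# run the same map stage on each chunk as a background job, then merge.
count_words_parallel() {
    local dir lines
    dir=$(mktemp -d)
    lines=$(( ($(wc -l < "$1") + 3) / 4 ))   # aim for four roughly equal chunks
    split -l "$lines" "$1" "$dir/chunk."
    for c in "$dir"/chunk.*; do              # map phase: one job per chunk
        tr -cs A-Za-z '\n' < "$c" | tr A-Z a-z | sort | uniq -c > "$c.counts" &
    done
    wait                                     # barrier: all maps must finish
    # reduce phase: merge the partial counts and re-sort
    awk '{ n[$2] += $1 } END { for (w in n) print n[w], w }' "$dir"/*.counts |
        sort -rn
    rm -r "$dir"
}
```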
There are distributed job runners as well. Audience: A lot of the time your log files are on multiple systems anyway, so you can use a tool like Ansible to distribute your shell-processing pipeline over all those systems; the job is already distributed in nature, because your data is already distributed. Yes, and the distributed systems I know of, like Hadoop, distribute the data first: the file system itself is distributed in the first place, so the problem they solved first was distributing the data, and then they do distributed computing on top of that. If you can distribute your data first, you can use the same pipelines on it. Sorry, I didn't get that. Audience: HDFS is implemented to be POSIX-compliant, so it gives you a similar interface. But what I'm trying to say is that you don't even need to distribute across multiple systems: as long as you split your file, you can run five pipelines over the chunks and then merge the results together. Right. The story I heard, and it's up on the internet somewhere, is that Amazon's A9 search engine was originally a collection of shell scripts: a search engine written in bash. And if you look at the bioinformatics industry, they use awk heavily, because their machines export output in structured text formats; if you take a bioinformatics course online, awk is part of the curriculum in many cases. We can take one last question. Audience: All of this seems useful when you're working with streams of data. Is there a way to use it when you're not? Where you need to do lookups, maybe some random access, joins and things like that? Joins can perhaps still be done in a streaming fashion, but what about when you have to build up state in memory?
Audience: Basically, would you recommend this kind of thing when your data is not streamed, or streamable? You're essentially talking about jobs with some intermediate state? One way to think of it is that the one-kB file is itself a stream, a stream that terminates. Beyond that I'm not sure I follow; can you give an example of what you mean? Audience: Not all programs can be thought of as streams, right? Something that keeps a cache, for example, is not a stream. Sure, though I did something like that with tee earlier: I cached the intermediate frequency distribution, and potentially some other program could be spawned in parallel to work on that output. I'll sit with you later and understand the question better, sorry. Okay, so that's it from me. Thank you so much; I hope you got something out of it.