Thank you. Good afternoon. This is the reason I love Unix: with five lines of code we can build a spell checker. We can take a document from the web, split it into one word per line, translate upper case to lower case, sort it while removing duplicate words, and then, with comm, find the words that are not in the dictionary. Those are the spelling errors. (A sketch of this pipeline appears below.) So pipelines are extremely powerful.

But one problem with pipelines is that they are linear. We can pipe from one process to the next, and that is the only thing we can do. Last year I showed you an example from the Unix history repository, where I extracted various metrics, such as how the length of identifiers and file names increases over time, and how the use of the register keyword and the goto keyword decreases. In order to do that, I had to pipeline many processes in a graph such as the following: find all the files, then pass them through a preprocessor and through various greps to find include names and identifiers such as register or goto, and so on. But as you see, this is not a linear pipeline; it is a graph. So how can we program such a graph with the Unix shell?

There are a number of alternatives, but I did not find any of them particularly satisfactory. One is to run each pipeline separately, but this is inefficient, because some parts of the processing get repeated. Another is to use temporary files: store an intermediate result and then reuse it. This is awful, because you touch the disk, and the performance hit is very damaging. Bash offers us a syntax for this, process substitution, where we can send output to processes through tee and have those processes appear as files, but this quickly gets ugly. Also, it fans out, but you cannot fan the results back in, so it is limiting.

To solve these problems and create a better alternative, I created dgsh, the best thing since sliced bread. It can process streams, it can process big data sets, it builds on bash, and it uses the existing Unix tools, so the tools you use and love you can still use; and it does this in an expressive and efficient way.

How does it work? Let me give you an example. Say you want to compare various compression programs. I do that on The Brothers Karamazov, a text file, and we get this output, without storing anything on disk: everything happens on the fly, fetching the text and passing it through the different compression programs. Here is the graph that we want to have: you run tee to send the output to wc and to xz, bzip2, and gzip; then you count the bytes of their outputs; you also print the labels; and you gather everything together. These are normal, or almost normal, Unix tools. And this is how you express it in dgsh: you curl, you tee, and then comes the special syntax, the multipipe block, where you asynchronously run all the commands that you want. At the end, you pipe the output of the multipipe block through cat to concatenate the results and get the output that I showed you. (The script appears below.)

So the mechanisms you have in dgsh are, first, multipipe blocks; second, Unix commands with multiple input or output channels, such as cat to gather results, tee to fan a stream out, or paste to merge results together; and third, values that can be stored and retrieved as they flow through the graph.
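Here is a sketch of that five-command spell checker. The Project Gutenberg URL and the dictionary path are illustrative, and comm assumes both of its inputs are sorted the same way:

    # Fetch a document and list the words that are not in the dictionary.
    curl -s https://www.gutenberg.org/files/74/74-0.txt |
    tr -cs A-Za-z '\n' |               # split into one word per line
    tr A-Z a-z |                       # translate upper case to lower case
    sort -u |                          # sort, removing duplicate words
    comm -23 - /usr/share/dict/words   # keep only words absent from the dictionary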
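And here is the compression comparison as a dgsh script, reconstructed from the talk; treat the URL, the labels, and the exact commands inside the multipipe block as assumptions:

    #!/usr/local/bin/dgsh
    # Compare compression programs on a file fetched on the fly.
    curl -s https://www.gutenberg.org/files/28054/28054-0.txt |
    tee |                              # dgsh's tee copies the stream to every branch
    {{
        printf 'Original size:\t' ; wc -c
        printf 'xz:\t\t'          ; xz -c    | wc -c
        printf 'bzip2:\t\t'       ; bzip2 -c | wc -c
        printf 'gzip:\t\t'        ; gzip -c  | wc -c
    }} |
    cat                                # dgsh's cat concatenates the branch outputs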
We have adapted a number of programs: for example, tee, which can have an arbitrary number of output channels, or paste and cat, which can merge many input channels together; and there are other programs that can do similar things. There is also an API: with a single call you can convert any Unix program to work with dgsh. What you do is call dgsh_negotiate, passing the tool name and the number of input and output file descriptors that you want, and as a result you get the input and output file descriptors that your tool can use. So you say, I want four input file descriptors and five output file descriptors; you get them, you play with them, and that is all. For backward compatibility, for existing tools that you do not want to adapt through the API, there is a command called dgsh-wrap that allows you to specify whether the command takes no input or produces no output, and you can also specify pseudo-arguments that get converted into file descriptors, so that the wrapped tool works correctly with any other dgsh command.

Let me give you some further examples; sketches of all of them follow below. First, a motivating one from a debugging perspective. Say you want to find the differences between two strace outputs. In bash, you would run awk over each trace, sort the results, and compare them, but the flow goes in this direction, then back to comm, and then forward to more; it runs in a crazy way, backward and forward. With dgsh, you have a multipipe block with the two awks, you pipe that to comm, and you pipe that to more, so the direction is exactly the way we are used to reading programs.

Another example: say you want to find duplicate files. An efficient way to do that is to run md5sum to create a checksum for each file and then find the checksums that appear multiple times. The problem is that once you do that with uniq, you have lost the names of the files. So what I am doing here is finding the duplicated checksums with uniq, while also keeping a copy of the full output, and then passing both to join and some awk magic to get the names of the duplicated files.

As another example, say you run git log and you want to find who has done the most commits and on which days those commits happened. You can answer each question with a sort, uniq -c, sort again sequence, but you want to answer both at once, without running git log multiple times. So here I have a function that prints the counts; I run git log once, asking for the authors and the dates; with tee I fan out to two processes, one for the authors and one for the days; and with cat I join them again. If I run this on the Linux kernel, I get a list of authors and a list of days on which most commits happen. It appears that, for some reason, Tuesday is the day when we feel most productive.

Getting back to the misspelled words: the pipeline I initially showed was maybe impressive, but it does not show us where the words occur. We can do that if we take the list of misspelled words and pipe it to fgrep, and, to make it more of a show, we also ask fgrep to color the words within the original text. So what I am doing here is having one multipipe block to find the errors and another one to pass the text through grep with fixed strings and colors, getting a result such as this one if I run it on Tom Sawyer.

Another interesting example: finding C and C++ symbols that should be static. You can do that very easily if you run nm on the compiled files and find the symbols that are exported from the various compiled files but not imported into any other file. Those symbols obviously should be static, because no one uses them outside their file. So again, a multipipe block on the output of nm finds those that are exported and those that are undefined; the exported ones that are never imported are the ones we want. If I run that on bash itself, I get this list, and the list goes on.
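A sketch of the strace comparison; cmd1 and cmd2 are placeholders, and the awk program that extracts the system-call names is an assumption:

    # Compare the system calls made by two commands, reading left to right.
    {{
        strace cmd1 2>&1 >/dev/null | awk -F'(' '{print $1}' | sort -u
        strace cmd2 2>&1 >/dev/null | awk -F'(' '{print $1}' | sort -u
    }} |
    comm |                 # dgsh's comm compares the block's two streams
    more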
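The duplicate-files example, again as a reconstruction; the join step relies on dgsh's adapted join pairing the block's two streams, and the details are assumptions:

    # Report the names of files that have identical content.
    find . -type f |
    xargs md5sum |                     # one "checksum  name" line per file
    sort |                             # bring identical checksums together
    tee |
    {{
        awk '{print $1}' | uniq -d     # checksums that occur more than once
        cat                            # a full copy of the checksum/name pairs
    }} |
    join |                             # keep the pairs whose checksum is duplicated
    awk '{print $2}'                   # the awk magic: keep just the names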
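The committer-activity example might look like this; the format string and field separator are assumptions, and --date=format: needs a reasonably recent Git:

    # One git log run; fan out to count authors and weekdays in parallel.
    git log --format='%an|%ad' --date=format:'%a' |
    tee |
    {{
        awk -F'|' '{print $1}' | sort | uniq -c | sort -rn | head  # top committers
        awk -F'|' '{print $2}' | sort | uniq -c | sort -rn         # commits per weekday
    }} |
    cat                                # gather the two reports again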
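The misspelled-word highlighter uses two multipipe blocks in dgsh; here is a plain-bash approximation of the same idea, with process substitution standing in for the first block (the dictionary path is illustrative and assumed to be sorted):

    # Color the misspelled words of file $1 within the original text.
    grep --color=always -i -F \
        -f <(tr -cs A-Za-z '\n' <"$1" |        # derive the misspelled-word list
             tr A-Z a-z | sort -u |
             comm -23 - /usr/share/dict/words) \
        -e '' "$1"                             # empty pattern keeps every line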
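And a sketch of the static-symbols pipeline; nm output fields vary between platforms, so the awk programs are assumptions:

    # Symbols exported from the object files but imported by none of them.
    nm -A *.o |
    tee |
    {{
        awk '$2 == "T" {print $3}' | sort -u   # exported (defined) symbols
        awk '$2 == "U" {print $3}' | sort -u   # imported (undefined) symbols
    }} |
    comm -23                                   # exported but never imported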
And the list of applications goes on: we can do web log reporting, two-dimensional Fourier transforms, NMR processing, combining tools that no one wrote for this purpose; parallel FFT computation; parallel word counting, so you can do MapReduce just by using this tool; you can chart git committer activity over time; and so on.

What I am asking you to do is, first of all, go there, download it, use it, profit from it; package it, so if you are associated with a distribution, create a package for it; adapt existing tools, or even develop new tools that are compatible with dgsh; and also contribute enhancements and bug fixes. As a small motivation, there is a treasure hunt: go to the dgsh page, find the word associated with this fine book on debugging, find the script there, and email me the output of that script. You need dgsh to run the script, and you will get a digital copy of the book in three or four months. Thank you very much.

We actually have another five minutes or more, so if you have any questions, let's take them. Okay, yes. Sorry, I will just press the button.

Thanks for the presentation. I was wondering, how do you create the graphs, the visualizations?

Right, I left out the implementation part, because this is supposed to be a lightning talk. There is a lot of magic happening behind the scenes. First of all, the processes get connected together in a way that allows them all to talk with each other: at the points where you fan out or fan in, that is, at the input and output sides of the multipipe blocks, there are concentrators that create the paths joining the processes together. Then messages start flowing around, and the negotiation happens, where each process says how many inputs it can accept and how many outputs it can provide. This packet goes around and around until everyone agrees; if there is no agreement, the processes exit with an error, and if there is an agreement, they get their file descriptors and the actual processing begins. As for how they get the descriptors: during negotiation the processes communicate over sockets instead of normal pipes, and at the end of the negotiation they pass file descriptors through those sockets, which they use from then on. Did that answer the question?

Partly, sure. But I specifically wanted to know about the beautiful graphics, because it looks like you are using a tool for that.

That is much simpler. Before calling dgsh you set an environment variable with the name of an output file, and instead of running the processes, dgsh just creates the dot graph for you.
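If I recall the dgsh documentation correctly, the variable is called DGSH_DOT_DRAW; take the exact name and usage, and the script name, as assumptions:

    # Write the process graph as a Graphviz file instead of executing it.
    DGSH_DOT_DRAW=graph.dot dgsh compress-compare.sh
    dot -Tpng graph.dot -o graph.png   # render the graph with Graphviz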
Do you have any numbers on the speed-up you can achieve on daily tasks, for instance?

Yes, thanks. This is the time for dgsh, the blue line; the red one is sh with temporary files. For some programs I also implemented them in Java or Perl, and their time was a lot longer. Of course dgsh uses more cores, so it can be a lot more efficient; you typically get a speed-up especially when you can use more cores.

How do you control the number of processes which are executed in parallel?

It is done statically: at the beginning, it launches all the processes you have specified. There is a tool called dgsh-parallel with which you specify how many processes to create; it generates a multipipe block homogeneously, with as many processes as you say, so you say -n 10 and it creates ten of them. But other than that, it spawns all the processes you have specified. (A sketch appears at the end.)

Another question: this seems like a really cool tool, but it would be awesome if it were integrated into something like Bash. Do you have plans for that?

Actually, it is integrated with Bash; I am using the source code of Bash, so if Bash wants to take it upstream, it is there on GitHub. I think we have another two minutes.

Hi, what happens when errors occur in some of the pipelines?

There are three things that can happen. The negotiation may fail; then the packet goes around and says to everyone, go away, we messed up. Or one command may fail at random; at that point there is also a message that goes around, but it may not be able to complete its round, because the command has exited prematurely. So there is also an alarm and a timeout that can take over, and if a command gets no message within a specific time, the processes again quit automatically. We had trouble sorting that out in the beginning. Any other questions?

Sorry, I missed the first part of the presentation, so my question is: what was the initial frustration that led you to write this beautiful program?

It was trying to implement a graph such as this one with Bash; it would come out all backwards if I wrote it with tee >( ... ).

Hi, sorry, there is something I did not understand. Did you re-implement the Spark directed acyclic graphs in Bash?

What we have done is take Bash and add to it the capability to express multipipe blocks: the double curly braces, which mean a block where things run in parallel. This syntax.

Is there any theoretical difference from the DAGs in Spark? It is just a directed acyclic graph of processes, right?

Where do these DAGs occur? Which tool?

In Spark.

We can talk about this later; let's take this offline. So thank you very much.
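A closing note on the dgsh-parallel question above: per the talk's description, the tool generates a homogeneous multipipe block with as many copies of the command as requested. Conceptually (the expansion shown is an illustration, not the tool's literal output):

    # What "dgsh-parallel -n 3 sort" conceptually expands to:
    {{
        sort
        sort
        sort
    }}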