Alright, so this one got a long title. I'm going to talk about how to be that bastard guy that Hillary just talked about, the one that tells you how to do stuff; that's what my talk is about. Well, I'm not here preaching an absolute truth, you make up your own mind, right? But this is how I do my data analysis.

So my name is Emil Bay, I go by emilbayes online. I'm currently working on a project called Commodity Trader. What we basically do is trade grain online. So if you want to buy, say, 500 tons of grain, soybeans or whatever, talk to me and I'll ship you a whole shipload of grain. So yeah, that's basically it. There's not too much data analysis in there, though; it's actually a lot more about cryptography. So, totally unrelated. Kind of. Still math, though. I used to study math at university, mostly because of this masochistic thing of trying to get at the absolute truth of the universe, and today I'm also here to talk about absolute truth.

I have a pretty varied background. First I worked at an advertisement agency as a software developer. Then I worked at a high-performance computing lab, a supercomputing lab at my university, where I built the Hadoop cluster. And then I worked at a daily newspaper as a data journalist, though the actual title was editorial developer, so the programmer helping the journalists get at what they're trying to get at. Other fun facts: I've been part of a TV show, as the hacker dude in a show where we had to catch people on the run. I've also done a project with some Syrian journalists, all in exile in Istanbul, trying to get back to a peaceful Syria. That was pretty interesting as well. And I'm a shark attack survivor. So, those were the fun facts.

Okay, first theme on the menu: decoupling. Decoupling is a software engineering term. It means taking a program and pulling it apart so it's easier to reason about. The contrast is pretty bad in here, but you can see we have a kind of short program here. It's pretty easy to see what's going on, to get an overview of what's actually happening, and if you came into a project that had this code, it would be pretty easy to figure out what was going on, because there's not that much to it. But then over time you expand your analysis, and suddenly you're doing more modeling inside your project, and it just builds and builds. And now it's at a place where there's so much stuff going on that you can't even see it all on the screen, right?

So what can we do about it? It gets messy, and it doesn't scale well for human memory. One thing Hillary also talked about is that human errors are going to happen, so we have to build a process around how to prevent these errors. And I know for one that you can come back to a project even just a month later and have no idea what's going on. So how can we work around this?

We can use something from software engineering called the Unix philosophy. It's kind of like the ten commandments of software engineering. It's a philosophy developed, I think in the sixties, at Bell Labs, by the first people building a time-shared, general-purpose computer, while they were developing an operating system called Unix. We're still living with the Unix legacy: there's Unix inside Mac OS X, or macOS as it's called now.
Linux is also a Unix derivative, and so is FreeBSD, and all of these still live by the principles of the Unix philosophy, in contrast to the way Windows has gone. So what is the Unix philosophy? It's: write programs that do one thing and do it well, so you get these small programs that are easy to reason about. It's about writing programs that work together, so you can take simple operations and compose them into more complicated operations. And the third tenet is to write programs that work with text streams, which is something we're going to ignore in the way I do things.

So we have this huge program, just one big R file called dataanalysis.R, and what we're going to do with it is build this graph. It's a directed graph. You have these small programs that each do one little task well, and you compose them together into this directed graph. The way these programs talk to each other is by exporting data, so you have a very clear contract between what's the output of one program and what's the input of another program.

This is the example project I put in the abstract. It's a project about indoor climate at my university, and it has a couple of parts to it. It has a crawler for gathering weather data. Then it has an import script, which is your stock ETL script: cleaning up the data and putting it into a format you can actually use. Then it has a domain model, which takes the weather data and the imported readings from the buildings and builds the model. And instead of this domain model file being one huge script that does the different types of models we're exploring, does all the plotting, and exports all the CSV and JSON files for other people to work with, we decompose it.

So how do we do that? In R we have these two functions called load and save, and they are basically going to be the input and the output of these small programs. The load function will read an RData file, and whatever variables you put into that RData file will appear in your environment when you run the command. And there's the save command, which is going to be the output. So here, this is from the domain model R file. You can see that we export the weather model and the energy model, which are data frames, along with an auxiliary buildings vector, and we write them to a file in the RData directory. So now the contract between the programs, how they communicate, how you build your big analysis, is going to be through these RData files.

You get a couple of nice things from this. Your programs become much smaller. Before, when you had one huge R program, you'd have to reason about the whole program at once, so you'd often have a lot of global variables that are all intermediate steps in your analysis. When you take your variables, pick out the important ones, and export them into RData files, you now only have to reason about a very small set of variables. That makes it much easier for new people to come into the project, or for yourself to come back to it after a month.

You also get some other nice things from this technique of having intermediate files that save the data. You can checkpoint your computation. For this analysis of the indoor climate, I was running a pretty hefty time series model that took something like eight hours to run on my MacBook, and you don't want to do that too often.
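Just to sketch what that contract looks like in code: the variable names below come from the talk, but the data, the stand-in model, and the file path are made up for illustration.

```r
## A sketch of the save()/load() contract. Variable names are from the talk;
## the data, the stand-in model, and the file path are made up.

## --- In something like domainmodel.R ---------------------------------------
weather.model <- lm(Ozone ~ Temp + Wind, data = airquality)   # stand-in model
energy.model  <- data.frame(building = c("A", "B"), kwh = c(120, 97))
aux.buildings <- c("A", "B", "C")

## Export only the variables that downstream scripts need.
dir.create("rdata", showWarnings = FALSE)
save(weather.model, energy.model, aux.buildings,
     file = "rdata/domainmodel.RData")

## --- In a later script, e.g. export.R --------------------------------------
## load() puts weather.model, energy.model and aux.buildings straight into
## the environment, so this script only has those three variables to reason about.
load("rdata/domainmodel.RData")
summary(weather.model)
```

The point is that the later script only ever sees those three exported variables, not all the intermediate ones the domain model script used along the way.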
So you run the analysis once, write it to the RData file, and now you have the results to continue your analysis from. Your analysis also becomes reproducible in some sense, with some caveats: it's reproducible in the way that you can share the intermediate results with your collaborators, and then you all have the same starting point. There's another property, it's a fancy word, how is it pronounced again, idempotence? It just means that this is how your program should behave: if you run the program again and again, it should always give the same RData file, unless of course you change the inputs or you change the code that transforms the input data.

One problem with this graph, though, is that a new person coming into the project has to figure out what the relationship is between all the R files, and they have to figure out to run the R files in the right sequence for the model to actually work, because the dependencies are no longer on your CSV files or JSON files or web services; the dependencies are now on the RData files. Turns out it's pretty easy to solve, or it's been solved for, what, 50 years now: we can use a tool called make, or makefiles. A makefile is a task runner, or a declarative way to say "these files come from these commands", and that way you can automate this whole graph.

So this is a sample makefile. You have something called targets. A target defines an output and the inputs it depends on, and then you can run whatever commands you want under that target. The cool thing is that you have total freedom here, so you can run whatever commands you want. This applies whether you're doing Python or R, or running bash scripts in between your analysis steps; it applies to all of those. It also has a couple of other nice properties I'm going to get to in a minute, but you can compose these targets, so you can start to build this graph of how things work together in your program. And it has the nice property that if any of the inputs to your targets change, then only the targets whose inputs changed are re-run and updated for you.

So this is an example makefile from the project. At the top we have the crawl RData file that we had on the graph before. It defines its input, and the input to this target is actually the R script, so the program that we're going to run, and then I have the command for running an R file from the terminal. Then we define the next target, which is like the first one, just the other step of the graph. And at the end we have the more advanced target, the model, and it depends on the two RData files and on the domain model script. The cool thing now is that if I ever change the import script, make will go and look at your file system and realize that import.R, which is a dependency of that RData file, has changed, so it can just go and re-run the pieces that are out of date. Or if I've never had this project on my computer before, I just cloned it from GitHub or got it from a colleague and I'm trying to get at the model RData file, make will look at the dependencies and figure out which other targets it needs to run in order to generate that file. That way you have a manifest of how to build your whole project, and no one needs to run the R files in a specific sequence by hand.

It's proven technology. Like I said, this was built before my mom was born, so it has a pretty good track record, and it's still state of the art.
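As a rough sketch, and with file names that are only guesses based on the graph described above, a makefile for those three steps could look something like this:

```make
# Each rule names its output file, then the inputs it depends on. If an input
# is newer than the output, make re-runs the commands (indented with a tab).
# File names here are guesses, not the project's actual layout.

rdata/crawl.RData: crawl.R
	Rscript crawl.R

rdata/import.RData: import.R
	Rscript import.R

rdata/domainmodel.RData: domainmodel.R rdata/crawl.RData rdata/import.RData
	Rscript domainmodel.R
```

On a fresh clone, running make rdata/domainmodel.RData builds the two upstream files first; if only import.R changes afterwards, only the import step and the model are re-run.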
So if you've ever done any scientific computing where you wrote something in C or C++, then you've probably run a makefile to build that project. And makefiles are totally underrated, because you can use them for any project. In my daily work I usually work on front-end JavaScript, and if any of you have worked with front-end JavaScript in the last couple of years, then you've probably had to install a whole slew of task runners to manage your project for you. Well, make has been around for 40 years, and those guys are just reinventing the wheel, in my opinion.

So here we have a more comprehensive makefile from the project. I have a convenience export target at the top. You can see it doesn't have any commands; it just defines what inputs are required to run the export target. That means I can type make export from my terminal, and it'll figure out how to fulfill those requirements by parsing the rest of the file. You can also see I have a clean target. The clean target is for deleting all of these intermediate RData files. It's very important that you can start over from any point in time, so you should be able to delete all these intermediate data files and reconstruct them again. I mean, that's the whole point of being reproducible, right? And then it has some other stuff in here as well. So that's what I'd usually run if I came into a project and wanted to build the whole thing: I'd just run make export, which is the top target. And if I get into a bad state, so some of my programs start to behave in an unexpected way, probably because I messed up one of the files, I'll just clean it all out, delete it all, and export it all over again.

How am I doing for time? Okay. There's just one assumption in all of this, something that from a computer science perspective you'd call pure functions. The thing with this whole graph I showed is that domainmodel.R can only depend on the inputs I actually give it. If it depends on randomness, or on the time of day, or it goes out and talks to a web service or something like that, then this way of computing your models doesn't work anymore, because then you have inputs that are not explicit. It's a bit different over here at the start of the pipeline; well, it does matter here too, but because this is the start of your pipeline, you have to get input from somewhere, otherwise your program doesn't really do anything.

So I'll just quickly show pure functions. Here we have a function from the project. The project collects data from these energy sensors around different buildings at the university, and you can see we have a lot of stuff going on here. The first couple of lines are building the URL for talking to a web service. Then we fetch that URL and parse the JSON that comes back. And then we format some of the data into the right types in R and return the data. But this function has a couple of issues, in my opinion, because it's not a pure function. We have the inputs shown at the top, but we actually have another input to this function that's not explicit, and that's this function here, because this function is what you'd call a side effect: it goes and talks to the web service. And it's all right to encapsulate this kind of logic.
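To make that concrete, here is a minimal sketch of roughly what such a function looks like. None of this is the project's actual code: the endpoint, the column names, and the helper names are all made up. The first version hides the web service call inside; the second is the split into pure pieces that the rewrite goes through below.

```r
## Not the project's actual code: endpoint, column names and helper names
## are made up to illustrate the two shapes.

## 1) Side-effecting version: the web service call is a hidden input, so the
##    result can change even when building/from/to stay the same.
fetch.readings <- function(building, from, to) {
  url <- sprintf("https://example.org/api/%s?from=%s&to=%s",   # made-up endpoint
                 building, from, to)
  raw <- jsonlite::fromJSON(url)              # side effect: talks to the network
  raw$timestamp <- as.POSIXct(raw$timestamp)
  raw$kwh       <- as.numeric(raw$kwh)
  raw
}

## 2) Split into pure pieces: one function builds the URL, one formats the
##    result, and the only side effect, the fetch itself, happens in the open.
readings.url <- function(building, from, to) {
  sprintf("https://example.org/api/%s?from=%s&to=%s", building, from, to)
}

format.readings <- function(raw) {
  raw$timestamp <- as.POSIXct(raw$timestamp)  # works on a local copy, so the
  raw$kwh       <- as.numeric(raw$kwh)        # caller's data is never mutated
  raw
}

url      <- readings.url("building-101", "2016-01-01", "2016-02-01")
raw      <- jsonlite::fromJSON(url)           # the one explicit side effect
readings <- format.readings(raw)              # the change is assigned outside
```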
But a function like the first one is no longer pure, and we can no longer depend on it to always give the same results given the same inputs. For a function to be pure, for a function to be reliable and reproducible, it has to fulfill these properties. It has to have no side effects: you can't go and change a data frame that was passed in, that has to be explicit, so that's doing no state mutations. No side effects also means that you cannot go and fetch a number from a random number generator; that is something you'd have to pass in as an explicit input. And the output should be directly derivable from the input.

So here's the rewrite of the function. It might seem very pedantic to take the whole thing and split it apart for these reasons, but the program is a lot easier to reason about. At the top we have a function to format the URL, the first part of the function I showed, so we build the URL there. Then we do the side effect: we go and fetch from the web service. And then we have a second function which doesn't actually change the data frame; we pass in a piece of the data, we're very explicit about which piece we pass in, it gets updated and returned, and we do the mutation, the change, outside the function. That way we always know where the state changes are. Of course, this would all live inside one of the initial R files in the dependency graph, but here we are very explicit about where we change stuff, and that way the program becomes easier to reason about, and you know that nothing is going on inside any of the functions that will modify your state.

And that was actually kind of what I had. So that leaves plenty of time for questions. I guess if anybody has questions, we have five minutes, right? Yeah, we've got several minutes.