Hello, and thank you for joining my talk. My name is Mark, and in this talk I would like to demonstrate a method for deriving information from running code. I will show you a few demos during this talk; you can find all the code for the demos in this GitLab repository, and the code will also be uploaded to the conference website.

Before I start, let me give you a small example that we will use as a running example during this talk. Here I have created a small production script, or at least a prototype of one. We read some data, we filter out some bad cases, we aggregate the data by computing the means over columns 3 and 4 grouped by the values in columns 1 and 2, we write this output to another CSV file, and we tell the user that we're all done. So this is a typical prototype production script: read data, process it, write it out.

Now, what I would like to do when I run such a script in production is to tap off some information. For example, it could be interesting to see how much memory is used after each statement, or you could measure how much time it takes to run each expression, or maybe you want to see exactly what happens to the data you read in and which expression does what to the data, so you can estimate the influence that every step has on your output. In general, there are all kinds of interesting ways in which you may want to follow a production script while it's running.

This talk is about a very general method that you can use to write software that can derive such information. So I want to derive this information, and I have a few demands here. There is a primary data flow: the script written by the user, the script that you just saw. It reads data, processes it step by step, and writes data in the end.
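To make this concrete, here is a minimal sketch of such a prototype script. The data, column names, and file names are made up for illustration, and the input file is generated on the fly so the example is self-contained.

```r
# A minimal sketch of the prototype production script.
# Data, column names, and file names are assumptions for illustration.
tmp_in  <- tempfile(fileext = ".csv")
tmp_out <- tempfile(fileext = ".csv")
write.csv(data.frame(species = c("a", "a", "b", "b"), island = "x",
                     bill = c(1, 2, 3, NA)),
          tmp_in, row.names = FALSE)

penguins <- read.csv(tmp_in)                   # read some data
penguins <- penguins[!is.na(penguins$bill), ]  # filter out bad cases
out <- aggregate(bill ~ species + island,      # means grouped by two columns
                 data = penguins, FUN = mean)
write.csv(out, tmp_out, row.names = FALSE)     # write the result
message("all done with the penguins")          # tell the user we're done
```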
Then I want to add a second layer, which I call a secondary data flow, where this information is tapped off, and I would like to separate these layers as much as possible. The person who wrote the production script should be aware as little as possible of the fact that information is being tapped off, and the tapping should not interfere with the process that is running. Separating these two flows as much as possible is the main challenge in this method.

So let me first give you an example of what you could expect from a method like this. Here is an R session, and one way to run a production script is to use source. source is a base R function: you give it a file name and it runs all the R code in that file. If I do that with the script, which is called MyScript, the file runs, in the end I indeed get the message "all done with the penguins", and I can read the output, which contains all the aggregated data. That's what you'd expect.

Now, I wrote another function, a variant of source called runFile, that also runs MyScript. It does exactly the same thing, except that it also counts the number of expressions that were run in the script. I'm going to use this as the "hello world" example, with the idea that if you can run a script, count expressions, and have some control over how they are counted, then you can do more complicated things as well.

So let's see how this runFile function works, because having a runFile-like function is the first step in the method. runFile is, I think, the easiest part of this methodology. It is a function that accepts a single argument called filename. It reads that file with a function called parse, and the output of parse is simply a list of the expressions in the file.
I create a new environment; an environment in R you can think of as a place, a box, a cabinet, where R can store and retrieve variables. So we create a new environment, I initialize a counter, and then for each expression in the list of parsed expressions, we evaluate the expression in our runtime environment. So when the data is read, for example using read.csv, the variable called penguins is created in this runtime environment. After evaluating the expression, we increase the counter, and when the loop is finished, we print the number of expressions that were evaluated and return the runtime environment.

So it's a very simple variant of source, but it does a little bit extra: it allows us to inject a little bit of code between running the expressions. And the important thing is, the user never sees these counter and expression variables; the user script only sees what is stored in runtime. So this really separates the two flows of data, if you will.

Okay, so this is the easy part. The harder part, or at least what took me the longest to figure out, is how to make it possible for the user who writes the script to influence how data is tapped while the script is running. For example, ideally I would have a situation where a user just writes a script, and maybe the user wants some control over when counting actually starts. So you could call a function start_counting, everything is counted up to the point where you call stop_counting, and the expressions after that are not counted anymore.

Then the question is: how do we get the information that the user gave the signal "I want to start counting" or "stop counting" into the file runner? If you already know some R programming, there are a few solutions that are easy to think of. One thing you could do is say: start_counting will set a global variable in the user environment.
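The runFile function just described might look something like this; this is a sketch reconstructed from the description, not necessarily the exact code from the slides.

```r
# A sketch of runFile, reconstructed from the description above
runFile <- function(filename) {
  parsed  <- parse(filename)   # list of expressions in the file
  runtime <- new.env()         # where the user's variables will live
  count   <- 0
  for (e in parsed) {
    eval(e, envir = runtime)   # evaluate one user expression
    count <- count + 1         # we can inject extra code between expressions here
  }
  cat(sprintf("Counted %d expressions\n", count))
  invisible(runtime)           # the user script never sees 'count' or 'e'
}
```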
And then the file runner can detect that. But that's not really desirable: we do not want a function to write something in user space, because it can overwrite variables that the user created for another purpose, for example. So we'd rather not do that.

Another option is to set it as an option or an environment variable, which is still in user space, but a little bit on the edge, let's say. That's also not desirable, because suppose the production script crashes halfway while an option was set; then this option remains set even after the crash, and you leave behind a state that's different from before the script was run. So that is also not desirable.

The third thing you could think of: runFile at the moment only reads and executes expressions, but it could also analyze the user's expressions and see whether start_counting is run, yes or no. But that makes it hard, especially if start_counting were to have arguments, to really analyze it in a way that is generally usable. So we would like to let R do the work as much as possible, so we don't have to think about it.

And there is a trick for that. The main idea is the following. We make very simple user-facing functions; in this case, our functions start_counting and stop_counting just return TRUE and FALSE. And we make sure that these results are captured by runFile. There are two techniques that I use for this. One I call a local side effect: a side effect is where a function alters something outside of its own scope, but I'm going to do it in a very controlled way, which is why I call it a local side effect. The other I call local masking: we make sure that the correct function is seen when we want it. It sounds a bit mysterious, but let me show you in a demo what I mean by a local side effect. What I'm going to do first is create some storage room.
An environment in R, again, is a place where I can create, store, and look up variables. So I create a new environment called store. Now I'm going to make a function called new_mean, and I'm going to make that function using another function called capture, which is a function that I wrote. capture accepts three arguments: a function that it is going to mimic, a storage room, and the name of the variable that will be created in the storage room.

Okay, so let's see: our store is empty (with ls we can see what's in the store). If I use mean, I just get the result we would expect. If I use new_mean, I get exactly the same result as mean. So for me, the user, new_mean behaves exactly like mean, except that if I now look in store, a variable has been created, and I can actually access that variable. You see that the output of mean has been copied to out as well as sent back to the user. This is what I call a local side effect: the function new_mean has a side effect, but in a place that I can specify precisely by using this capture function.

So let me show you what this capture function looks like; here it is in the bottom half of this slide. The capture function accepts a function, an environment, and the name of a variable that will be created in that environment, and it returns a function. The returned function executes the original function with all the arguments that the user supplies, stores the output, copies it into the environment under the given name, and returns the output. So it's a very short, very simple function. If you want, you can even do it with one line of code less. But it takes a bit of knowledge and trickery of R to come up with it, and it's one of the things that took me some time to figure out.

Okay, so the second trick is called local masking. The user sees the user-facing function; if a user inspects this function, they will just see start_counting.
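A version of capture along these lines would reproduce the demo; this is my reconstruction from the description, not necessarily the exact code on the slide.

```r
# A sketch of the capture() function: wrap 'fun' so that its result is
# also copied into 'env' under 'name' -- the local side effect
capture <- function(fun, env, name) {
  function(...) {
    out <- fun(...)                 # call the original function
    assign(name, out, envir = env)  # copy the result into the store
    out                             # and return it to the user as usual
  }
}

# Demo: new_mean behaves exactly like mean, but leaves a copy in 'store'
store    <- new.env()
new_mean <- capture(mean, store, "out")
new_mean(1:10)   # same result as mean(1:10)
store$out        # the result has also been captured in the store
```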
And it just returns TRUE. What we are going to do in our file runner is the following. We start the same: we parse the file and we create a new runtime. Then we have three extra lines of code. We create a store, and in this store we create a captured version of our start_counting function, and we also create a captured version of our stop_counting function. Now remember, when we start evaluating the user expressions that are stored in parsed, we loop over every expression and evaluate it in runtime. So now, when a user calls start_counting, it will not find the user-facing function; it will actually find the captured version of that function. And the captured version will not only return the value to the user, but also send that value to the store. And now, to count only when the user wants it, the only thing we have to do in the loop is to check whether store$count is TRUE, and if so, increase the counter; if not, we don't.

Let me give you a very short demo to convince you that it works. MyScript2 is the version of the script where start_counting and stop_counting are called, and now you see we counted only three expressions.

Okay, so some applications. The first place where I use this is in the lumberjack package, which allows you to run a file where the user's script can say: start logging on this R object using this logger. So that is the user-facing function. Then you run the file with run_file, and you can follow, for example, a data frame; afterwards you can read which expression did what to which cell in which data frame.

Another application is the tinytest package that I built. tinytest is a package for unit testing; it allows you to include unit tests, for example, in a package. Here the user-facing functions are the expectation functions. So you could say: expect the output of some calculation to be equal to a known value.
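Putting the two tricks together, the counting file runner might be sketched like this; the exact function and variable names are my assumptions, and whether the start_counting() call itself is counted depends on where you place the check.

```r
# Local side effect: wrap a function so its result is copied into a store
capture <- function(fun, env, name) {
  function(...) {
    out <- fun(...)
    assign(name, out, envir = env)
    out
  }
}

# User-facing functions: they simply return TRUE or FALSE
start_counting <- function() TRUE
stop_counting  <- function() FALSE

runFile <- function(filename) {
  parsed  <- parse(filename)
  runtime <- new.env()
  # Local masking: captured versions placed in the runtime environment mask
  # the user-facing functions, so the user's calls find these first
  store <- new.env()
  store$count <- FALSE
  assign("start_counting", capture(start_counting, store, "count"),
         envir = runtime)
  assign("stop_counting",  capture(stop_counting,  store, "count"),
         envir = runtime)
  n <- 0
  for (e in parsed) {
    eval(e, envir = runtime)
    if (isTRUE(store$count)) n <- n + 1  # only count while switched on
  }
  cat(sprintf("Counted %d expressions\n", n))
  invisible(runtime)
}
```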
And all these expect functions are the user-facing functions. What the function run_test_file does is simply run the script that contains these expectations, capture all the output of the expectations, and organize it in a nice way, so the user gets a nice overview of all the test results. So these are two very different applications, but the exact same technique is applied in both, with a very clean separation between what the user sees and what the information-deriving code sees.

Finally, if you would like to read more about this, I wrote a short article for the R Journal. It's not published in the R Journal yet, but you can find the paper on arXiv at the link shown here. Thank you for watching.
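The same trick in a test-runner setting could be sketched as follows. This is a much-simplified illustration, not the real tinytest code: the real package's expectation functions return rich result objects, while here expect_equal just returns a logical, and all names are assumptions.

```r
# Simplified sketch of a tinytest-style runner using the same technique.
# Here the store accumulates every expectation result instead of
# overwriting a single value.
capture_all <- function(fun, env, name) {
  function(...) {
    out <- fun(...)
    env[[name]] <- c(env[[name]], out)  # accumulate every result
    out
  }
}

# Simplified user-facing expectation (real tinytest returns richer objects)
expect_equal <- function(current, target) identical(current, target)

run_test_file <- function(filename) {
  runtime <- new.env()
  store   <- new.env()
  store$results <- logical(0)
  assign("expect_equal",
         capture_all(expect_equal, store, "results"), envir = runtime)
  for (e in parse(filename)) eval(e, envir = runtime)
  store$results  # the overview of all test results
}
```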