All right, it is now 4:45 UTC, so we'll start the session. I'd like to welcome everyone to session 1B, R Packages 1. This session is brought to you by RStudio, and the sponsor of the day is Appsilon. Today we have three talks on three different R packages, starting with RcppDeepState, followed by a talk on the riskmetric package, and finally a talk on the poorman R package.

Hello, everyone. I'm Akhila Chowdary Kolla. Today I'll be presenting RcppDeepState. This is joint work with Toby Dylan Hocking and Alex Groce, and the project is sponsored by the R Consortium. RcppDeepState is a simple way to fuzz-test Rcpp packages. Before moving on to RcppDeepState, I would like to discuss a simple problem. Let's look at an Rcpp function that declares, initializes, and returns a value. There are a few function calls where we try to access a value at a valid array index, whereas in the next two function calls we try to access invalid array indexes, and the function returns an undefined value. With these types of function calls, it's the developer's responsibility to identify and handle this kind of subtle bug in the code. The purpose of RcppDeepState is to make the developer's work easier by finding these kinds of subtle bugs.
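For illustration, here is a minimal sketch (hypothetical, not the exact code from the slides) of the kind of subtle bug being described: Rcpp vector indexing performs no bounds checking, so an out-of-bounds read compiles and runs, but silently returns an undefined value.

```r
# A minimal sketch (hypothetical, not the slide's exact code) of the bug
# class described above: Rcpp vector indexing does no bounds checks.
library(Rcpp)

cppFunction('
double read_index(Rcpp::NumericVector x, int i) {
  // x[i] does no bounds checking: if i >= x.size(), this silently
  // reads out-of-bounds memory and returns an undefined value.
  return x[i];
}
')

x <- c(1.5, 2.5, 3.5)
read_index(x, 1)   # valid index (0-based in C++): returns 2.5
read_index(x, 10)  # invalid index: no error raised, undefined value
```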
Before discussing RcppDeepState itself, I would like to discuss its building blocks, which are fuzzing and DeepState. Fuzzing is the process of feeding invalid, unexpected, and random data as inputs to a computer program, where the program is expected to crash, fail, or generate errors on these kinds of inputs. DeepState is a means to run unit tests with a lot of fuzzers. Unit tests in DeepState are called test harnesses; we will refer to these a lot in this presentation. Normally, DeepState test harnesses are written in C++, and they provide an interface to various symbolic-execution and fuzzing tools, like AFL, honggfuzz, libFuzzer, and Eclipser. Plain fuzzers provide a standard input stream of random data, without type-specific values. DeepState provides an advantage over plain fuzzers by providing C/C++ type-specific input functions, such as DeepState_Int and DeepState_Double, which generate int and double values respectively. These are useful for producing fuzzed type-specific data in RcppDeepState.

Moving on to the purpose: we have developed the RcppDeepState tool, an easy-to-use fuzzing system for R packages that use the Rcpp framework. It's used for memory debugging and memory-leak detection. Because it's C++ combined with R libraries, it brings the speed and efficiency of C++ to the test harnesses. It also generalizes and automates the test harness generation, and it provides easy interfacing with fuzzers like AFL and libFuzzer for input generation. RcppDeepState does the hard work of automatically matching each of a function's parameters to a type-specific data generator: integers, doubles, and integer and numeric vectors.

Moving on to the related-work section: our inspiration for developing RcppDeepState came from the vroom standalone package, which follows a do-it-by-hand approach to running its C++ code under the AFL fuzzer; there is no external framework in the vroom standalone package that provides an interface between an Rcpp package and fuzzers. There are other packages like RUnit, testthat, tinytest, and unitizer, which use predefined assertions on provided inputs. And whereas CRAN's checks provide automatic code analysis and run the tests under sanitizers, the fuzzr and autotest packages provide fuzzer-style inputs that are predefined or mutated, respectively.

To get a clearer understanding of RcppDeepState, we picked a package from our analysis, BNSL, and its function mi. This is the definition of the function mi, which takes x and y as numeric vectors and an integer proc value; the estimate it computes depends on the proc value (for instance a Jeffreys-prior, MDL, or empirical estimator), and if the proc argument is missing, zero is taken as the default. Let's look at the predefined examples of the BNSL package: the size of the numeric vectors is 100, and the proc value varies between zero and 10. These examples ship with the package, and we run them under Valgrind to see if they produce any subtle bugs, or if there are any unidentified issues in the code. When we run the code under Valgrind, we see an empty data table with no error messages and no address-trace messages, suggesting there is no error in the code for these examples. So can we say that this code is bug-free, since we didn't see any issues when running the predefined examples under Valgrind? No, we cannot. Is testing the code on predefined inputs enough? No, it's not. Predefined inputs are unable to find the subtle bugs in the code that we might encounter when running it on various unexpected or randomized inputs. So is there a way to automatically run the code on various unexpected or randomized inputs? Yes, there is: RcppDeepState is a solution to that problem.

Moving on to how RcppDeepState works, let's look at its building block. This is a basic test harness for the mi function we saw earlier, where we generate randomized input data using the functions RcppDeepState_NumericVector and RcppDeepState_int. The input values for the numeric vectors x and y are obtained from these RcppDeepState data-generation functions, and those inputs are then used to make a call to the mi function. We run this test harness to check for any issues, and we also include the inbuilt Rcpp exception handler, so there will be no spurious failures.

Before discussing how this test harness is compiled and run, there are two main functions we need to talk about for fuzzing a function under RcppDeepState. The first is deepstate_fuzz_fun. It looks at RcppExports and parses it for the prototype and the list of parameters of the function that you provide. It handles the test harness generation, and it also compiles and runs the test harness. Once we compile and run the test harness, we get files with .crash, .fail, or .pass extensions, depending on the type of response obtained by running the inputs on the executable. There is also a time-limit-in-seconds parameter that allows the user to specify the fuzzing time on the executable; I'll be using a three-second default timer for now. The next function is deepstate_fuzz_fun_analyze. It analyzes the saved binary input files of the provided Rcpp function in the presence of Valgrind, looking for errors or bugs if there are any. It allows users to specify an initial seed value, which serves as the starting point of the fuzzing, and a time limit in seconds for analyzing the executable. This function returns a data table with the inputs that were passed to the function, the error message that was generated when those inputs were run on the function, and the position where the error occurred, that is, the file and the line number.
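Putting the two functions together, a sketch of the workflow just described might look like the following. The function names are the ones given in the talk, but the argument names and paths here are approximations; consult the package documentation for the exact signatures.

```r
# A sketch of the two-step workflow just described, using the function
# names from the talk; argument names and paths are approximations.
library(RcppDeepState)

fun <- "~/BNSL/inst/testfiles/mi"  # hypothetical path to the fuzz target

# Step 1: generate the harness, compile it, and fuzz the function,
# writing the generated inputs out as .crash/.fail/.pass files.
deepstate_fuzz_fun(fun, seed = 1, time.limit.seconds = 3)

# Step 2: replay those saved inputs under Valgrind and collect any
# problems into a data table (inputs, message, file, and line number).
result <- deepstate_fuzz_fun_analyze(fun, seed = 1, time.limit.seconds = 10)
result
```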
Let's look at a real execution of the functions. Here, the inputs to the function are a package path, the function name, and the time limit in seconds. We provided the package BNSL and its function mi, and we would like to see if there are any subtle bugs in the mi function; as we didn't see any bugs with the predefined examples, we're testing it under RcppDeepState. This step of the code generates the test harness, compiles and runs it, and also generates the .crash, .fail, and .pass files holding the inputs that were run on the executable. During the analysis phase, we take the path to the mi function and run deepstate_fuzz_fun_analyze on it; we see the executable running and looking for any issues with the inputs that were run in the previous step. And we see that an issue occurred and a data table is returned. If we look at the data table, there is an invalid-read error with the message "Invalid read of size 8". It occurred in the file mi.cpp at line 55. There is also an address-trace message saying the address is 0 bytes after a block of size 184 alloc'd.

Let's get a deeper understanding of this message. Before that, let's look at the inputs that were passed to the function and caused the error. The proc value that was passed is very large, and the x and y vectors look something like this: the y vector includes NA values, and there is an infinite value passed as well, so the generator is adding missing values to the numeric vector. The error occurred at line 55 of the code; since the proc value is very large, the mi function falls through to the Jeffreys-prior branch, and line 55 is the highlighted part of the code. No issue occurred when we create the count table for x, that is c_x, and no issue occurred when we create the table c_y. But an issue occurs when we try to generate the joint table c_xy from both the x and y vectors: the sizes of the vectors x and y are not equal, and the code generates an invalid read. So we need to add a condition to check whether the sizes of the vectors x and y are equal.

Moving on to the results of RcppDeepState: this analysis was made on packages downloaded as of 2020-12-20. We fuzz-tested around 1,185 Rcpp packages, and RcppDeepState reported issues for more than 1,000 functions across nearly 412 packages. These were not detected by the standard CRAN checks. Those standard checks are run on manually specified tests or example inputs, whereas RcppDeepState runs its tests on unexpected, randomized inputs. The checks include the clang and GCC versions of the undefined-behavior and address sanitizers, and other checks like donttest, the M1 Mac checks, and rchk. We see that RcppDeepState performs better when compared to these prior additional checks. And this is the webpage for the RcppDeepState test results; if there is an issue with your package, you can check it over here.
And once you click on the package, you'll be able to see an HTML page like this, listing out the inputs that were passed and the message that was generated when running those inputs on the test executable. It also shows the file and the line number where the issue occurred. We also provide a Valgrind log; the Valgrind log is the one we got when we ran the executable on our local machine with those inputs. And this is the executable test file, to reproduce the same error on your local machine: you can run this test file to get these errors and replicate them.

Moving on to the experimental results section. RcppDeepState's default fuzzer identified issues in 478 packages, showing better performance compared to running the code under external fuzzers like AFL and libFuzzer. RcppDeepState identified issues in over 156 exported functions, which is around 74 packages, and in 755 unexported functions, which is around 406 packages.

Moving on to the conclusion and our future work: we would like to improve the fuzz testing with more realistic randomized inputs, and we would also like to extend the random generation functions to string and list data. We would also like to see RcppDeepState included as a part of package development projects. We would like to thank the R Consortium for funding RcppDeepState, and I would like to thank Toby Dylan Hocking and Alex Groce for mentoring this project. Thank you so much.

Thank you very much for your talk, Akhila. In the interest of staying on time, we'll go straight to the next talk, which is by Doug Kelkhoff, a data scientist at Roche.

Hi, I'm Doug Kelkhoff, and today I'm going to be presenting some of the work that we've been doing in the R Validation Hub with regard to assessing package risk in regulated industries. This is work that I'm presenting on behalf of the R Validation Hub, which itself is supported by a bunch of contributors, and I've mentioned a few of the contributors that have been quite active in developing the riskmetric package here: Yilong, Marly, Eli, Eric, Mark, and Juliane. Just to give a quick overview of what I'll be talking about: I want to give a quick intro to some of the unique challenges that we have within regulated industries in the use of R, and a brief overview of riskmetric's design goals to support this. And because this is more of a technical audience, I want to dive into some of the internals, which isn't something I get to talk about too much; usually we're talking to industry folks who are more interested in the application, so I want to use this opportunity to talk a little more about the implementation.

So, a little bit of background about regulated industries and some of the challenges they face. Within industries that are regulated, such as the pharmaceutical world, we have long histories of using licensed proprietary tools, especially for statistical analysis. More recently, R has become quite a forerunner in terms of new methods development and a preferred glue language for data science. To address this, we're looking for new ways to leverage that enthusiasm and all of the awesome software being produced in that world. But at the same time, we have a legacy expectation of how software is delivered, the types of expectations we can have of it, and how those are documented.
And so that leaves us with a bit of a gap in terms of how we provably show that software is reliable and robust and can be reproducibly installed. We need ways to document that, so that if we're ever faced with someone wanting to audit the work we've done, we can show proof of the decision-making process by which we chose to use that software.

So, enter the R Validation Hub. This is an organization that spans quite a few different industries at this point, but is predominantly focused on the pharmaceutical industry, trying to build tools, processes, and recommendations around how to use R specifically in a regulated setting. We've had representation from about 60 different companies on our mailing list, folks that are interested, and I would say about 10 or so are actively contributing to the discussion and the recommendations we're putting out. A lot of those go through community feedback, so that involves a lot of those 60 groups. We also have representation from finance, and agriculture has taken an interest as well, along with security-focused folks and quite a few others that have had an interest in the tooling for one use or another. This is really aiming to patch some of those gaps, so that we have a documented, robust audit trail through which we can provably show that we have high confidence in the software we're using. For this, we have a primary package, riskmetric, that we're using to do these assessments. For further details, I do want to plug our website, pharmaR.org; you're welcome to check that out if you want to learn more, especially about the application and the industry in general, because this talk is going to be a little more focused on the implementation details.

Just to describe the package a little before we jump into implementation: riskmetric aims to provide some of the tooling to assess packages and make informed, reproducible decisions, so that we can support regulated decision-making. Some of the use cases we're hoping to support: a statistician wants to use a package and wants to know that it's broadly used enough to justify using it for an exploratory analysis; maybe they're just doing this on a person-to-person basis, making sure they're using appropriate tools for their day-to-day tasks. Similarly, maybe an analyst wants to know that they can reproducibly share their work with a reviewer, which factors in this idea of development stability. But we also want to support the industry, or the infrastructure, around supporting R. Perhaps you're an R systems administrator and you want to know, if you install a new package, what impact that will have on the environment you manage and how users might be affected, and how you will have to manage the stability of a platform if you're going to accommodate the installation of an R package. And maybe if you're in quality assurance, or at some kind of regulatory interface to a health authority in the pharmaceutical case, you want to know that you've done due diligence to ensure the package isn't malicious. So we hit on a few of these and, more importantly, provide a foundation for building these types of assessments.
Some of the unique challenges we face in doing that: this covers a pretty wide range of unique needs, with varying levels of sensitivity, and they touch on different parts of the package lifecycle. Perhaps if you're a statistician who's just looking to do some day-to-day work, you just want to know, before you go out and install a package, that it's going to be the right choice: if there are multiple packages that implement a method, which one is widely used and currently maintained. If you're on the more infrastructure side, maybe you want to know the effects it's going to have in a broader environment. And if you're more involved in the quality assurance or regulatory aspects, you're more concerned with the security implications, the stability, and the reproducibility, to really show that audit trail. So there are quite a few different use cases.

From that, we have a few design objectives. First and foremost, we want to accommodate a bunch of different sources of package metadata, and that can include pulling information from the web or things you might compute locally. As well, when you're collecting that metadata, we want to avoid recalculating the same things over and over again. As a trivial case, imagine we wanted to find how many open issues a package has from its source-code repository: we might look at the DESCRIPTION file to get the URL of the source repository, from which we can query an API to get the number of open issues. A lot of things are going to make use of that DESCRIPTION file and the metadata in there, so we don't want to be recalculating it over and over again. We want to minimize this amount of recalculation, especially for computationally intensive tasks like running R CMD check, or rate-limited steps like querying APIs, so we do want to manage this dependency structure. And then we also want to encourage contribution, so we don't want this to be, at least on the surface, overly complex; we want it to be fairly easy to extend.

Before we jump into the details, let's quickly run through what we see as the core data flow. Really, what's happening is that we're taking these different sources and trying to funnel them into a central data-structure representation of this information, and from that we want to score it into a unified score that we can use for evaluating these packages against each other. We have a little bit of terminology around that process. We have package references, which are a reference to where you're going out and getting this information: that could be a local source-code directory, an installation into a library, or a remote repository, whether on CRAN, Bioconductor, GitHub, or something like that. Then we have what are called metrics, which are singular criteria pulled out of the metadata extracted from these different sources; that's where we're funneling into this central data structure. Then we have scores, which are numerics on the scale of zero to one, and that's our way of aligning these different packages. And then we provide a little bit of added tooling to summarize and aggregate those scores. We provide that as an extensible interface, so that institutions can derive their own scoring function that suits their needs.
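Concretely, the data flow just described maps onto a short pipeline; a minimal sketch, using the pipeable verbs named in this talk:

```r
# A minimal sketch of the reference -> metric -> score -> summary flow
# described above, using riskmetric's pipeable verbs.
library(riskmetric)
library(dplyr)

pkg_ref("riskmetric") %>%  # package reference (installed, source, or remote)
  pkg_assess() %>%         # pull singular metrics out of the metadata
  pkg_score() %>%          # convert each metric to a score in [0, 1]
  summarize_scores()       # aggregate into one overall score
```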
All right, so now, to jump into the details, I'm going to jump right into a pkg_ref object, because this is really the crux of our design philosophy, and it's a little unique. What we do is implement an S3 class called pkg_ref, and there's a subclass hierarchy to this: packages can be either installed, source, or remote, and even within remote packages they could be from a number of different sources. Each one of those might have different behaviors in terms of how it goes out and grabs information, but really we're trying to converge on a unified way of representing this information. For instance, say we were trying to assess the news of a package, whether that's up to date: the way we would pull that NEWS file might be different if we were pulling it off GitHub versus out of a source directory.

To make this accessible, we actually use an environment behind the scenes, within this S3 structure, and that gives us a little bit of statefulness, meaning we can go out and evaluate things and they show up in other parts of our code. Environments are one of the few stateful class structures we have within R, and we leverage that to present something that looks like a list you can index into, but is stateful. So we can allow one way of grabbing metadata to be reused over and over again, without having to go out and re-evaluate it multiple times. And it's lazily evaluated. If I just create a package reference here, in this case on the riskmetric package itself, we can see that I have an install, and it's a subclass of a general pkg_ref, and it comes with a bunch of different metadata fields. You can see that it's already picked up some of the most important things for finding a package: it knows where it's at on my local system, it knows the version and the name. So not a whole lot yet. And you can see there are a bunch of other fields that have these dot-dot-dots after them. The way our class structure is organized, that means we can try to query this, and when we do query that field, it will go out and fetch that information. That's one of the unique things about the way we've set this up.

Just to dig into that a little more: here I have that same object, and we have a few dot-dot-dot items down here. Just in the process of indexing into it, trying to get this field help_aliases, which is one of the fields cut off on the screen, down below, I don't actually assign it out to anything; I just leave it in this package object. And because it's this environment that's statefully managing these different fields, the next time I look at it, I've got this new field derived. That's not always desirable: sometimes, especially if you're trying to be absolutely pure in the way you organize your functions, you don't want this kind of state mucking things up. But in our case, it becomes a really helpful feature. One of the things this really helps us do is expose the way this can be extended, really seamlessly and without very much background information about how this class operates.
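The general pattern looks something like the following. This is an illustration of the idea, not riskmetric's actual internals: an S3 object backed by an environment, where indexing into a missing field computes it once and caches it statefully.

```r
# An illustration of the pattern described (hypothetical code, not
# riskmetric's real source): an S3 class backed by an environment, so a
# field is fetched lazily on first access and then cached statefully.
new_lazy_ref <- function(name) {
  e <- new.env()
  e$name <- name
  structure(e, class = "lazy_ref")
}

`$.lazy_ref` <- function(x, field) {
  if (!exists(field, envir = x, inherits = FALSE)) {
    # first access: compute the field and cache it in the environment
    assign(field, fetch_field(x, field), envir = x)
  }
  get(field, envir = x, inherits = FALSE)
}

# a hypothetical fetcher keyed by field name
fetch_field <- function(x, field) {
  switch(field,
    version = as.character(utils::packageVersion(get("name", envir = x))),
    stop("unknown field: ", field)
  )
}

ref <- new_lazy_ref("stats")
ref$version  # computed and cached on first access
ref$version  # served from the cache thereafter
```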
And that was a key design decision of ours, because we wanted to make it easy for people in a developing industry, one that doesn't have a long history of using R, to contribute to this, and to get people involved who aren't necessarily active, long-term R developers. Maybe they're from the infrastructure end of the lifecycle, or maybe from the quality end. We want to expose this as an interface for people to extend with new metrics. And what we can do is this: indexing into a field, just calling that basic operator, actually dispatches down into another function that can be easily extended. In this case, we're extending it for a field called example, and the function itself is cached, so the value that's fetched is cached too. This allows people to add new metrics. An interesting behavior that this allows for is that we can index into some of the fields that we've used, and if we hadn't already looked for this help_aliases field, it would go out and calculate it for us, just like it did when we assigned it out to alias. And this exposes a really elegant behavior that we quite like, because it means we don't have to manage the interdependence between a bunch of different metrics being evaluated: they can be executed in arbitrary order, and that dependency graph just resolves itself as fields need to be evaluated. So that makes it much easier to structure these things. A little familiarity with which fields are available is needed, but beyond that, very little detailed knowledge of the class structure is required to implement a new metric. And if someone needs to, they can even dispatch on the subclass of the package reference itself to handle a field differently. In this case, maybe the help aliases might need to be derived differently depending on whether the source is CRAN or local, and we could implement that uniquely if the data structure happened to look different.
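Extending the hypothetical sketch from a moment ago, the fetch step itself can be made an S3 generic, so the same field is derived differently per reference subclass. Again, this illustrates the idea rather than riskmetric's real internals; the class names are invented.

```r
# Building on the earlier hypothetical sketch: make the fetcher a
# generic so a field can be derived differently per reference subclass.
fetch_field <- function(x, field) UseMethod("fetch_field")

# installed package: read the field straight from the local library
fetch_field.pkg_install <- function(x, field) {
  switch(field,
    license = utils::packageDescription(get("name", envir = x))$License,
    stop("unknown field: ", field)
  )
}

# remote package: the same field could come from a repository API instead
fetch_field.pkg_remote <- function(x, field) {
  switch(field,
    license = {
      db <- utils::available.packages()
      db[get("name", envir = x), "License"]
    },
    stop("unknown field: ", field)
  )
}

ref <- new_lazy_ref("stats")
class(ref) <- c("pkg_install", class(ref))  # tag the subclass
ref$license  # dispatches to fetch_field.pkg_install, then caches
```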
From there, the process is fairly straightforward: the rest of the data pipeline is really just data handling and aggregation. We have an assess family of functions that tease out the atomic pieces of information from that metadata, and then a score family of functions that converts that atomic representation into a numeric score on the range from zero to one. And we provide that function I mentioned earlier, summarize_scores, which, if you happen to call the pipeline on a table (which is the case if you create multiple package refs, for example by passing a vector of package names), will get evaluated automatically. So we provide this pipeable interface just to make it exceedingly simple to execute. You can imagine, if you were an administrator and you wanted to execute this across an entire library of packages, you could just list all of the installed packages and go out and score all of them. So, in summary, we take a unique approach to managing the interdependence of a bunch of different related metadata, using the S3 dispatch system as well as environments for stateful data passing, to lower the order-dependence of our evaluation and the cognitive burden of managing a pretty interdependent execution pipeline.

We leverage that dispatch system to allow for interfacing with a bunch of different ways of assessing packages, without really needing a whole lot of specific implementations, or at least we can focus those specific implementations where they're most needed. And we double dispatch, you could say, on these S3 functions, so that we're dispatching on the field name as well as the subclass, and that gives us a quite extensible way of implementing very specific functionality. Overall, this has fostered some good engagement with our related organizations and industry partners. Some of the contributors are even relatively new to R and are very interested in discussing the metrics, and this has provided a fairly accessible way of starting to implement those, and even toy around with them in a local session, making for a more communicative channel. So it's been a really productive design for the needs that we had. With that, I want to give thanks to all the people that have participated in the development, Yilong, Marly, Eli, Eric, Mark, and Juliane, and I'm happy to field questions.

Thank you, that was a great talk. So remember, you can type your questions in the Slack channel or in the Q&A panel here on Zoom. We have one or two minutes for questions; otherwise, we'll move on to the next talk at 25 past the hour, which will be given by Nathan Eastwood, a freelance data scientist and R programmer. I didn't see any questions pop up in the Zoom chat or in the Slack during the talk, but I'm happy to answer if questions pop up throughout the day; I'll be monitoring the Slack channel throughout, so feel free to reach out. I was also keeping an eye on the Slack channel for that.

My name's Nathan Eastwood. I'm a freelance R developer based in Amsterdam, and today I'm going to be introducing my data manipulation package, poorman. When working in R, it's very common that we need to do some data manipulation, regardless of which field or discipline you might be working in. The great thing about R is that we've actually got quite a few different options to do this: we've got base R itself, we've got dplyr, we've got data.table. Like I said, today I'm going to be introducing my package, poorman, and hopefully by the end of this talk you'll have a good understanding of why I think there's a gap in this data manipulation market, why it's needed, and why it's useful.

Let's start off by taking a look at an example. Here I've got some code written in base R, and I'm working with the famous mtcars dataset. I'm filtering some data, selecting some columns, and creating some new columns; here I've got the kilometers per liter and the weight in kilograms. Then finally I'm selecting some columns in a particular order, and I get this resulting dataset. So this is great. What's the problem? Why write poorman? Well, the problem with this code, especially for new users: let's say that you're new to programming, or maybe you're coming from Excel or SPSS, something like that. Looking at code such as this can be particularly jarring. There are a lot of subtleties going on here that you need to understand, and it can be quite tricky when you're first looking at this kind of code. That makes the barrier to entry quite high, to the extent that some training courses I've seen these days actually skip over base R and just start with the tidyverse.
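A reconstruction of the kind of base R snippet being described (the exact columns and cut-offs on the slide may differ): filter rows, compute kilometers per liter and weight in kilograms, then reorder columns.

```r
# A reconstruction (approximate columns) of the base R example described:
# filter, derive new columns, then select columns in a particular order.
res <- mtcars[mtcars$mpg > 20, c("mpg", "wt", "cyl")]
res$kpl   <- res$mpg * 0.425          # miles per gallon -> km per liter
res$wt_kg <- res$wt * 1000 * 0.454    # wt is in 1000s of lbs -> kg
res <- res[, c("cyl", "kpl", "wt_kg")]
head(res)
```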
Speaking of which, let's take a look at the dplyr alternative, or equivalent, even. So here I have the exact same result: I start off with my dataset, mtcars, then I filter the rows, I select some columns, and finally I create some new columns, putting them in a particular order. And this is great. This offers us this human-readable API, and this is one of the key points that poorman tries to recreate. It's ultimately what makes dplyr so popular. It really breaks down that barrier to entry, especially when you have these really cool initiatives such as Tidy Tuesday as well, which offer examples of different datasets and give you the opportunity to work with these APIs and learn from other people: see how people work with this API, how they produce these analysis sets. They can be a really great way for a new user to be onboarded with R. So: human-readable code reduces the barrier to entry. This is the first key point that I want to make today.

Let's consider some scenarios now. Let's say, for example, you're doing some analysis and you're rerunning some code, and all works great. A few weeks go by, a few months, maybe even a year, and all of a sudden you have to go back and rerun that code. Maybe you work in the financial industry and you get audited, and you have to show how you got some particular financial results. So you go back to your script, you start running it, and oh no, something doesn't work. How annoying is that? Well, this is probably because you installed a newer version of a package and something in that package broke; it's not necessarily the code that you wrote. Another classic example: let's say you write some code and you share it with a colleague or a coworker, but oh no, it doesn't work on their machine. How annoying is that? It works perfectly fine on your machine, but it doesn't work on theirs: a classic programming issue. Well, this is probably because you've got different package versions. If you've got a dependency of a package, maybe you have a different version, and that's what's breaking the code. Finally, let's say that you're working towards a deadline and something isn't working in your code. You might want to start debugging that code. Okay, so what's going on? Is it my code? Is it the package that I'm working with? And if it is the package you're working with, what happens if you don't understand how that package works? It can really add to the time it takes to understand where the problem lies, and ultimately, if you're working towards a deadline, that can be quite problematic.

And this is where you get into something that I like to call dependency hell. Because ultimately, if you are adding these dependencies to your analysis, if you're adding a package, you really need to think: do I definitely need this package for my analysis to run? Can I use something else? Can I work with base? Because ultimately, dependencies are an open invitation for other people to break your code. If a package maintainer or developer changes something in that package and it breaks your code, then it's not really their fault; they're well within their rights to do that. But ultimately, this is going to be quite annoying for you. If something breaks, you have to take the time to figure out: okay, how is it broken? Where is it broken? And it can really be difficult to track down these bugs. So: dependencies are an open invitation for other people to break your code.
Please, when you're adding these packages to your analysis, to your project, really think: do you actually need it? But of course, dependencies aren't all that bad, and we want this human-readable API that we've been talking about. So, okay, maybe we want to manage these dependencies in some way, and there are great solutions out there for that. There are tools such as Docker; maybe you want to go the R package route, so you might set up a miniCRAN server; or maybe you want to work more locally, at the project level, so you might consider using renv or Packrat. But these solutions have problems themselves. Firstly, they are a dependency: you're adding another dependency to your solution. They require prior knowledge that they even exist. Again, if you're a new user, maybe a new programmer, you might not have even seen these before; you might not have heard of them. That means they require time to learn. And then again, if we go back to that example where we're sharing code with a coworker or a colleague, all of a sudden it's not just you that needs to know about these solutions and learn how to use them. You now need buy-in from your team members, your colleagues, your coworkers. They have to spend time learning it too. So these solutions come with their own problems.

Now, when we're talking about complex code bases, dplyr is one of them. If you've ever looked under the hood of dplyr, there is a lot of code there; there's a lot going on. It's a package that's written very, very well. It's a great package. But ultimately, there is a lot of complex dispatch and abstraction going on to take away a lot of that code and return things in a package-wide consistent manner. And this goes beyond dplyr itself; this goes into packages such as rlang and vctrs, which are dependencies of dplyr. But again, it offers this package-wide consistency, so it's understandable why this happens. There are other abstractions as well. For example, we have C and C++ code in dplyr, so there's a whole other language that you then need to learn to understand how these things work and how they fit together. So it can be really difficult to learn from the dplyr code base. As a side note, to my knowledge there are no publicly available discussions of how these designs came about; there are no meeting minutes, for example. And it can be particularly difficult to understand or reason about a decision that was made by the maintainers, especially if it's broken your code. If something's broken, you think: well, this worked before, why have you changed it? And of course, ultimately, generic APIs are very, very difficult to design, right? Nobody gets these things right the first time; nobody's going to. So the maintainers of dplyr and all of the tidyverse are well within their rights to make these changes. It's not a mark against them. But ultimately, if you want to learn how they did it, these complex code bases are difficult to learn from.

And this is where poorman comes in. Poorman is a dependency-free recreation of dplyr, and it does this in a completely unapologetic way. It really does just copy the dplyr API, well, not line for line, but function for function. And it does this all using base R. There's no C or C++. So you might be sitting there thinking: okay, well, what about speed? How does it compare?
Well, look, poorman is not trying to win any speed competitions. Poorman's focus is elsewhere: it's trying to recreate this human-readable API in a dependency-free manner, in a way that people can learn from. And what's great is that, because it's written in base R, we actually get a benefit in that it installs in seconds, whereas dplyr takes quite a while, for example. Now, within the poorman package there is almost the full suite of dplyr functionality. I think there are maybe some experimental features which aren't available; otherwise, you pretty much get the full whack. There are also a couple of other things brought in from the wider tidyverse: you've got tidyselect, you've got magrittr, you've got some tibble functionality. And to give you some confidence that this all works, there are over 700 tests written for poorman, a lot of which have actually been ported over from dplyr itself. So hopefully that gives you a bit of confidence that poorman is able to do the job it claims to do. Because ultimately, if you take a dplyr script and you swap out the library call at the top, library(dplyr) for library(poorman), you should still be able to run that script end to end; it should all work completely fine. And this ultimately makes poorman a great teaching tool. It's much easier to install than dplyr, so if you're teaching to a wide audience, it's much easier to get them to install one dependency versus dplyr and all of its dependencies. And you can still go ahead and use all the great tools, examples, tutorials, Tidy Tuesday, et cetera, that are developed for dplyr with poorman, because it works end to end: you take that script, you swap out that library call, and it will still work. And what I've tried to do when developing this package, because ultimately it is a teaching tool, is really try to explain the design decisions I've made, and how I wrote certain elements of the package, in my blog posts. So if you're interested after this talk, you can go and take a look at those and learn a little about how to develop such a generic API in base R.

All right, so let's take a look at some examples. This is the same example from the start of the talk. All I've done here is change that library call from library(dplyr) to library(poorman), and now I'm taking my data, mtcars, I'm filtering data, I'm selecting my columns, and I'm mutating. I just want to highlight a couple of things here as well, because ultimately we get the same result, right? This pipe is not the magrittr pipe; this pipe is actually what I call the "poor pipe". It's included in poorman and written in base R. And we've got features such as starts_with, so tidyselect features, again written in base R and included in poorman, and I've got a great blog post which explains how I implemented all of these things as well. So compare this to dplyr: you can see it's the exact same script. All I've done is change this library call, and it still works. Poorman also offers some group_by and summarize functionality. Here I've got the iris dataset; I'm going to group by the Species column and then summarize across the Sepal columns, calculating the means, and I get this nice aggregated output.
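Sketches of the two poorman examples just described (column choices approximate). The first is the same pipeline as the earlier base R snippet; note that only the library call differs from the dplyr version.

```r
# The same pipeline as before, swapping in poorman; only the library
# call differs from the dplyr version (columns are approximate).
library(poorman)

mtcars %>%                 # the "poor pipe", implemented in base R
  filter(mpg > 20) %>%
  select(mpg, wt, cyl) %>%
  mutate(
    kpl   = mpg * 0.425,          # miles per gallon -> km per liter
    wt_kg = wt * 1000 * 0.454     # 1000s of lbs -> kg
  ) %>%
  select(cyl, kpl, wt_kg)

# Grouped aggregation: mean of the Sepal columns per species,
# using the tidyselect-style helper starts_with().
iris %>%
  group_by(Species) %>%
  summarise(across(starts_with("Sepal"), mean))
```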
Poorman also offers all of the join functionality. Here I've got a couple of data frames, data frame one and data frame two. Here I'm performing a mutating join: I'm mutating data frame one by performing a left join, attaching the columns from data frame two. And then down here I've got what's called a filtering join: I'm taking all of the rows from data frame one that don't have a match in data frame two. And here's my resulting output.
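A sketch of the join examples described, with hypothetical toy data frames standing in for the ones on the slides:

```r
# Hypothetical toy data standing in for the slide's data frames.
library(poorman)

df1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

# mutating join: keep all rows of df1, attach matching columns from df2
left_join(df1, df2, by = "id")

# filtering join: rows of df1 that have no match in df2
anti_join(df1, df2, by = "id")
```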
So, take-home messages. Human-readable code reduces the barrier to entry for new users and new programmers. This human-readable API that dplyr affords is re-implemented via poorman, and it's done in a dependency-free way, and this is important because dependencies are an open invitation for people to break your code. We can manage these dependencies with other solutions, but ultimately, if we can reduce them, that's even better. And complex code bases can be quite difficult to learn from. Dplyr has a lot of abstraction; it has a lot of dispatch going on. Poorman does this using base R, and that makes poorman a great way to learn how to work with base. Now I just want to leave you with this quote: "I'd seen my father. He was a poor man, and I watched him do astonishing things." Ultimately, this is a bit of a tongue-in-cheek thing; when you do library(poorman), you'll actually see this quote. This is to say: yes, poorman is written in base R, but base is fantastic. It offers unrivaled, unparalleled levels of consistency. I can take a script that I wrote in base R ten years ago and it'll still run today perfectly fine, which might not necessarily be the case if you're working with one of these packages that is under constant development. So I'll leave it there. Thanks, everyone, for listening. I'd like to invite you to ask any questions which you may have now, but I'll leave you with a couple of links. Here you can see the links to install the package. It is available on CRAN, so you can go and install it from there, or you can install the development version from GitHub. There's also a Docker image that you can use as well, if that's your jam. I've also listed my blog here, where you can go and learn about how I implemented all these things. And finally, like I say, I am a freelance developer, so if you want to get in touch, here's my email address, my Twitter handle, and my LinkedIn profile. So again, thanks for listening. I'm now going to stop to take any questions.

Thanks again, Nathan, and thanks to all three speakers. So we do have time for some questions, and I did see some questions streaming in on the Q&A. I'm not sure if Nathan can actually read these questions. Yeah, can you hear me? Yeah. Hello? Yeah, okay, great. Okay, I guess I'll go through them one by one. Yeah, just take them in the order they came in. Cool, all right.

So Christopher Maronga, sorry if I'm pronouncing that wrong, asks: I'm just wondering, what is the difference between poorman and dplyr in terms of computation speed? So, I haven't actually done much in the way of benchmarking. I think this is a bit of a timeless question that's been asked quite a few times, comparisons between base and dplyr. Actually, there are a couple of great benchmarks that are already set up that you can find online; I'll try to find them and post them in the Slack channel. But yeah, I guess dplyr has a few bits which are written to be more performant, and that's probably going to work a little bit faster than base, in particular grouping operations. But my aim really wasn't to focus on speed with this. I think if you really are concerned with speed, maybe take a look at something like data.table. Obviously, my aim here was to produce this human-readable API for people using base and wanting to minimize their dependencies. I do plan eventually to extend the package such that I offer data.table S3 methods, but that's way in the future yet; it depends on whether I get time.

Okay, so the next question: how can poorman be fully backwards compatible while keeping compatibility with dplyr, when dplyr itself is not fully backwards compatible? What's the plan for poorman? Well, essentially my plan with poorman is to keep it lean, but at the same time, I'm not planning on deprecating any functionality. Right now I've aimed for the dplyr version 1.0 release, and I'm aware that functions such as mutate_at, or arrange_all, or select_if will be deprecated in a future version of dplyr, so I don't plan to add those; instead, the across functionality is there, and hopefully that will provide all of the functionality that you need. Moving forwards, I don't plan on deprecating anything, so if anything else gets deprecated from dplyr, it will still remain in poorman, which should hopefully help with backwards compatibility.

Okay, there's a question of whether the poor pipe is lazy. It is not; it just passes the data along. I'm not sure if the new base R pipe is lazy, but that should work just the same, and if you really want to, you can always use the magrittr pipe.

Would poorman be a good one-to-one replacement for dplyr in a production setting? Well, yes, I would like to argue so, in the sense that I've taken the majority of tests from dplyr and written them for poorman. There are over 700 tests now, I think, for poorman, which run on a CI/CD pipeline, so I'd like to say I'm pretty confident that it can be run in production. You might want to consider whether it is your best option, of course. If you're working in production, maybe you want something that's a little more robust, tested a little more in the wild, or maybe something that's better for speed, such as data.table, or dplyr itself, of course; but if you're looking to reduce dependencies, then maybe those aren't the way to go.

Okay: does it work only with data frames, or also with tibbles with list columns? There is some support for list columns, but I don't think I've got full support yet; I would need to double-check. Maybe I'll write a blog post about that. Sorry, I can't answer that one off the top of my head.

Will having both poorman and dplyr loaded cause issues requiring calls to poorman:: versus dplyr::? Yes, because the mapping of the exported poorman functionality is one-to-one with dplyr. That was a choice I made very early on. There are actually a couple of other really great packages out there which have tried to do something similar to poorman, but their functions either have something prefixed or appended to the function name.
So I think one package has, for example, select_data and filter_data, whereas poorman just goes for select and filter, because I really wanted you to be able to take a dplyr script and just run it with poorman.

Okay: will you have new versions of poorman, and if so, when someone uses it, may the new versions affect their work or code? Yes, this is true, but it would mostly be bug fixes. The API is not going to change, so the API that is available now should be available a year from now, two years from now. The only things that would potentially change behavior are fixes within the poorman package itself, and those are going to give you a more robust script.

Does poorman have the pivot functions? That's actually something I'm working on right now, so watch this space. I'm working very hard on that right now, and I plan on having pivot_wider and pivot_longer from tidyr included in poorman.

Does poorman also provide functions from the tidyverse? So yeah, it does. Like I say, I'm planning on bringing in the pivot functionality, but there's also some other functionality from tidyr, and there's some from tibble, I believe. Obviously, you've got the magrittr pipe and the tidyselect functionality. I'm planning to add some glue functionality in there at some point, sort of like a wrapper around it, but I haven't done that yet because I haven't decided what I want to do. I don't really want to just go ahead and recreate glue; I may end up having a poor man's version just using sprintf, but I haven't decided yet.

Does poorman have any functions to support integration of Python code in R easily? I haven't tried it with Python. Maybe with reticulate you might be able to get around it, but I haven't tried it. If anybody does, I'd be really interested to hear about it, actually, just to find out how well that works.

Have I answered everything? There was a question here... Okay, I think that's everything. I went through them as quickly as I could in the interest of time. Oh, there's one more: anything in poorman from janitor? Okay, not yet. Again, that's something I plan on adding. Does it work with SQL queries? No; maybe sometime in the future, but it depends on how much time I get. And that's everything. If there are any more questions, of course, I'll be hanging around in the Slack.

Yeah, so up next: from 6:15 UTC there's a 15-minute break. That's happening in the lobby channel on Slack. And after that, there are two sessions, one on data management and one on Shiny. So there we go. We can also stay here for a while, because this session has three talks rather than four. So if there are any more questions for any of the speakers, feel free to ask them here or on the Slack, because we do have all speakers here.

I see there's another question about poorman: does poorman improve Shiny app development? I mean, "improve" is a very vague term, I suppose. Yeah, you can use it within Shiny; there's no reason why not. It depends on what you mean by improve, whether you mean in terms of performance. I'm not quite sure.

So we do have one more question, and that's for Doug, about riskmetric. Doug, if you're there, can you hop in and answer? Yeah, I didn't see this question come through yet. Oh, I see it, from Philip Ifmore, sorry if I'm pronouncing it incorrectly. For riskmetric, what metrics do you use to score the packages? So, this is a growing cohort right now. They're fairly simple, I would say.
So we look at things like whether it has an identified maintainer, a lot of things that R CMD check also looks at, as well as R CMD check itself and whether that throws errors or warnings, as well as some community interactions. So we'll look at open issues, what percentage of the open issues have been closed in the last 30 days, things like that. There's a full list on our GitHub page, or you can dig through the package and just look at what metric functions we have available. But the purpose is really this foundation: it's meant to be easy to extend to incorporate new metrics. So as we get more engagement, more people start looking to this to facilitate that kind of risk assessment process, and the intention is that those can then be easily contributed back and we can keep growing this cohort. There are maybe 13 or so right now, ranging from things like package development and access practices all the way through to community engagement.

And maybe one thing I'll take the opportunity to emphasize, beyond the talk, is that one of the things that's kind of nice about this is that, depending on where you're installing the package from, or where the package reference points, the metrics are evaluated for that specific instance. So locally, I might be able to run R CMD check successfully, but someone on a different platform might encounter errors, and that's intended behavior: we're assessing risk where the package has been installed, or where the package reference points. So even the metrics themselves are kind of an amorphous concept. There's no independent testing, but we do run all the unit tests as well, and coverage itself is also a metric that we incorporate. But it is growing, so as these things come up, maybe rolling in RcppDeepState would be a cool way of looking at package testing; as these things come up, we can start incorporating them as well.

Is there a consortium white list? So that's something that's been a hot topic recently, and we just started putting together a work stream to look into whether we would do that, and what that would look like. Another thing that we have in the pipeline right now is to make it easy to incorporate a badge onto a GitHub page or a repo page. Then it would be easy to get this numeric representation, which would be a little loose, because it might be assumptive of the image where that risk score was derived, but it is at least a representation of risk. So yeah, if you're interested in thinking about this white list and helping to develop it, that's definitely an area of opportunity right now at the R Validation Hub.

Thanks for that. So there is one more question that came in for Nathan, but before that, I wanted to ask Akhila one question, if she's still here. Yes. Yeah, so I didn't have time to ask this while I was monitoring the Slack during your talk, but I did see that you mentioned that one of the outputs in your package is a data table. Why was that chosen? There's no specific reason, except that I find data tables clear and easy to understand; a lot of the outputs I'm working with are mostly data tables, which are just easy to understand. Am I not audible? Uh-oh. You're a little low, at least for me. Okay. Do I have to repeat it? Yeah, please. Yeah, I use data tables because they're easy to read and understand. There's no specific reason why I've chosen them.
I just use them; I do all the regex operations with them when checking over the Valgrind logs, so I found it to be easier. Okay, that makes sense. Thank you.

If Nathan's still here, there's one more question, from Kenneth. Sorry, yeah, okay, I see it; strange, the Q&A pings up and down. How do you call poorman if you also load the tidyverse? So that will depend on which order you actually load the packages in. If you did library(tidyverse) and then library(poorman), the poorman functionality should be first on the search path, so you won't need to prefix it with poorman::. But if you did it the other way around, library(poorman) and then library(tidyverse), then to get the poorman functionality you would have to use poorman::, because otherwise that functionality is going to be masked by the functionality of dplyr. Hopefully that answers the question. Oh, I see a follow-up to the Shiny question, about speeding up deployment of Shiny apps. Yes, that's very true, particularly when it comes to installation. I mean, poorman installs in seconds, whereas the tidyverse, because of all the compilation and all that jazz, takes quite a while to install. So in that sense, yeah, it would probably be faster to deploy a Shiny app. Yes, it's a good point.

So, while any other questions keep rolling in, I'll remind everyone once more that up next we have a 15-minute break, where you can all hang out in the lobby channel on Slack, and at 6:30 PM UTC we have the next sessions: session 2A is on data management, which has its own Slack channel, and session 2B is Shiny, which also has its own Slack channel. All right, so we will be wrapping this up, because we need this room for the next session to start warming up. So once again, I'll thank all three speakers, our Zoom hosts, and the sponsors for today. Thanks again, everyone, and feel free to keep in touch using the Slack channel.