All right, hi everyone. Thank you so much for having me. Today, I'm here to talk about how to build internal R packages that help us analyze data and bring more aspects of the open source community into internal teamwork.

But before we talk about packages, I first want you to think back to the last time you joined a new team. There was probably so much you needed to learn before you could start contributing: the frustration of trying to access data, the hours lost answering the wrong questions until you built up intuition, and the awkwardness of figuring out team norms. When we join a new organization, we fortunately only have to climb this learning curve once. But the off-the-shelf tools we use don't preserve this context, so for them, every single day is like their first day at work.

Unlike open source packages, internal packages can embrace institutional knowledge more like a colleague in two really important ways. First, they can target a far more domain-specific and concrete set of problems than open source packages can. Second, although narrower in problem definition, their insight into our organizations allows them to be more holistic in their solutions, incorporating more steps of the workflow needed to answer a question, covering everything from pulling data to communicating with our coworkers. Because of these two factors, internal packages can add a lot of value by doing things like interfacing with internal utilities, streamlining our analytical processes, and empowering data scientists and analysts to create high-quality outputs.

But more than the tasks a package can do, today I want to focus on how to design internal packages to be as rich in this institutional context as possible. To do this, I like to think about the popular Jobs to Be Done framework for product development. It asserts that we hire a product to do a job that helps us make progress towards a goal. To me, it's that notion of progress, and truly knowing the ins and outs of what sort of progress our users want to make because they're ourselves or our colleagues, that really sets internal packages apart. Additionally, these jobs can have functional, social, and emotional aspects. Through the rest of this discussion, I'll tweak this framework just slightly: let's think about building a team of packages to do jobs that help our organization answer impactful questions with efficient workflows.

So what sort of teammates should our packages be? Let's meet a few. First, let me introduce you to the IT guy. Think of a helpful, friendly IT colleague you may know. They're fantastic at handling the quirks of infrastructure and abstracting away the sorts of things that data analysts don't want to, or aren't particularly well-equipped to, think about, like maintaining secure servers. In that abstraction process, they also take on additional responsibilities to promote good practices, like credential management. Ideally, they help us save time and frustration by navigating organization-specific roadblocks that no amount of searching on Stack Overflow will resolve. So how can we encode these characteristics into an R package? Providing utility functions in a package can achieve the same abstraction, and in these, we can take an opinionated stance and craft helpful error messages. Let's look at an example. Suppose we wanted to write a function to connect ourselves to a database.
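As a minimal sketch of that starting point (the RPostgres backend and the host and database names here are hypothetical illustrations, not a real internal system):

```r
# A boilerplate first draft: connection details hard-coded,
# plain-text credentials passed straight through as arguments.
connect_db <- function(username, password) {
  DBI::dbConnect(
    RPostgres::Postgres(),
    host     = "db.internal.example.com",  # hypothetical host
    dbname   = "analytics",                # hypothetical database
    user     = username,
    password = password
  )
}
```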
We might first start out with a rather boilerplate piece of code using the DBI package, like the sketch above: we take the username and password, hard-code the connection information, and return a connection object. Now, let's suppose that our organization has strict rules against putting secure credentials in plain text, as well it probably should. In an open source package, I wouldn't want to presume to force users' hands in how they do credential management. In this case, however, I can make strong assumptions based on my knowledge of my organization's rules and norms, and this sort of function can actually incentivize users to do the right thing, like storing their credentials in environment variables, because that's literally the only way they can get the function to work.

Of course, opinionated design is only helpful if we provide a descriptive error message when the user does not have their credentials set this way. Otherwise, they'll get some cryptic error that `DB_PASS` is missing and find nothing online when they Google it, because that's a package-specific concept. So we can enhance this function with a custom, prescriptive error message explaining what went wrong and either how to fix it or where to get more information.

Even better than explaining errors is preventing them from occurring at all. We might also know that, at our specific organization, non-alphanumeric characters are required in passwords, but `dbConnect()` doesn't encode these correctly when passing them to the database. Instead of tripping up our users and asking them to change their passwords, we can proactively but silently re-encode that information.
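Putting those three ideas together, a hedged sketch might look like the following; the `DB_USER` and `DB_PASS` variable names, the wiki pointer, and the URL-encoding rule are all assumptions for illustration:

```r
# Opinionated version: credentials must come from environment
# variables, missing credentials get a prescriptive message, and
# passwords are quietly re-encoded before being passed along.
connect_db <- function() {
  username <- Sys.getenv("DB_USER")
  password <- Sys.getenv("DB_PASS")

  if (username == "" || password == "") {
    stop(
      "Database credentials not found.\n",
      "Please set the DB_USER and DB_PASS environment variables ",
      "in your .Renviron file. See the internal onboarding wiki ",
      "for step-by-step instructions.",
      call. = FALSE
    )
  }

  DBI::dbConnect(
    RPostgres::Postgres(),
    host     = "db.internal.example.com",  # hypothetical host
    dbname   = "analytics",
    user     = username,
    # silently re-encode special characters the driver mishandles
    password = utils::URLencode(password, reserved = TRUE)
  )
}
```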
However, strong opinions and this level of independence don't always make the best teammates. Other times, we might want someone more like a junior analyst or a trainee. They know a good bit about our organization, and we can trust them to execute calculations correctly and make reasonable assumptions. At the same time, we want them to be responsive to feedback and willing to test out multiple solutions. So how could we incorporate these jobs into an internal package? We can build in proactive yet responsive functions with default arguments, reserved keywords, and the ellipsis.

To illustrate, imagine a basic visualization function that wraps ggplot2 code but allows users to input their preferred x, y, and grouping variables to draw cohort curves. This function is fine, but we can probably draw on institutional knowledge to make our junior analyst a little more proactive. If we relied on the same opinionated design as the IT guy, we might consider hard-coding some of these variables inside our function. Here, though, that isn't a great approach. We might know what the desired x-axis will be 80% of the time, but hard-coding is too inflexible, even for internal use: it decreases the usefulness of the whole package for the other 20% of possible applications. Instead, we can put our best-guess, 80%-right names as the default arguments in the function header, ordered by decreasing likelihood of needing to be overridden. Then, when the user does not provide a value, the default is used. That was the junior analyst's best guess, but users retain complete control to change it as they see fit.

This approach becomes even more powerful if we can identify a small set of commonly occurring variable names or other values. Then we can define and document a set of reserved keywords that span all of our internal packages. If these are well-known and well-documented, users will get into the habit of shaping their data so it plays nicely with the package ecosystem and saves manual typing. This also incentivizes a level of standardization in naming, which can have other convenient consequences, like making data extracts more transparent and more shareable.

Finally, one other nice trick for making our functions responsive to feedback is the ellipsis, or "passing the dots". This allows users to provide any number of additional arbitrary arguments beyond what the developer specified and plugs them in at a designated place in the function body. This way, users can extend functions based on needs the developer never could have anticipated, like customizing color, size, and line type.
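Here's a hedged sketch of that junior-analyst function, combining best-guess defaults with the dots (the `viz_` prefix, the column names, and the cohort-curve framing are hypothetical):

```r
library(ggplot2)

# Cohort-curve wrapper. The defaults encode our 80%-right guesses,
# ordered by decreasing likelihood of being overridden, and the dots
# forward any extra styling arguments to geom_line().
viz_cohort_curve <- function(data, x = "month", y = "retention",
                             group = "cohort", ...) {
  ggplot(data, aes(x = .data[[x]], y = .data[[y]],
                   group = .data[[group]], color = .data[[group]])) +
    geom_line(...)
}

# Accept the junior analyst's best guesses...
# viz_cohort_curve(cohort_data)
# ...or override a default and extend via the dots:
# viz_cohort_curve(cohort_data, y = "revenue", linetype = "dashed")
```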
So far, we've focused mostly on the first dimension of internal packages that we talked about: making our package teammates targeted at solving specific internal problems. But there's just as much value in that second dimension. Internal packages are a way to share not just calculations but complete workflows, and our understanding of how the broader organization operates. To illustrate this, consider our intrepid tech lead or principal investigator. We value this type of teammate because they can draw from a breadth of rich past experience and institutional knowledge to help us weigh trade-offs, learn from collected wisdom, and inspire us to do our best work.

That's a pretty tall order for a package, but conscientious use of vignettes and templates can help us towards this goal. Vignettes often introduce a package's basic functionality with a toy example, as you can see in the dplyr vignette, or they may discuss a statistical method the package implements, as is done in the survival package. Vignettes of internal R packages can do more diverse and numerous jobs. They can accumulate hard-won experience and domain knowledge, like an experienced tech lead's memory, and use it to coach anyone currently working on a related analysis. As just a few examples, you can consider having completely code-free vignettes that conceptually introduce a common problem your package solves, explain the workflow and the key questions you as an analyst should be asking yourself as you go through the data, and even get into some of the procedural weeds, like what you need to do to add a feature or deploy your resulting model. Then, after aligning with your users on that conceptual framework, you may introduce the package's functionality and explain how the two overlap. When your package contains functions for many different ways to do a task, you can also compare pros and cons and explain the situations in which different options prove more effective. Vignettes can also include lessons learned, reflections from past users, and links and references to past examples to help analysts learn about similar projects. In fact, all of that context may be so helpful that even colleagues who are not direct users of your package may seek out its mentorship. In this case, you can use the pkgdown package to automatically create a package website and share these vignettes with anyone who needs to learn about a specific problem space. With a single function call, pkgdown::build_site(), your package can share its wisdom much more broadly. And unlike their human counterparts, package teammates can always find time on their calendar for another meeting.

Now, similar to vignettes, embedded templates take on a more important and distinctive role in internal packages. In open source packages, R Markdown templates provide a pre-populated file instead of the default one, and this is commonly used to demonstrate proper formatting syntax. For example, the flexdashboard package uses templates to show users how to set up the YAML metadata and the different section headers. Internal packages can use templates to coach users through workflows, because they understand the problems users are facing and the progress they hope to achieve. Templates can mentor users and structure their work in two different ways. Process walkthroughs can serve as interactive notebooks that coach users through common analyses. For example, if a type of analysis requires manual data cleaning and curation, a template notebook could guide users to ask the right questions of their data and generate the common views and summaries they really need to interrogate. We can also include full, end-to-end analysis outlines, with placeholder text, commentary, and code, if the type of analysis a package supports usually results in a specific report or output. Beyond R Markdown templates, our packages can also include project templates. These provide a boilerplate directory structure and a set of files for each new project, giving users a helping hand and driving the kind of consistency across projects that any tech lead would dream of when doing a code review.

Speaking of that collaboration piece brings us to the last teammate whose traits we want our packages to embody: the project manager. One of the biggest differences between task-doing and problem-solving packages is understanding the whole workflow and helping coordinate projects across many different components. When writing open source packages, we rightly tend to assume that our intended users are, in fact, R programmers. But on a true cross-functional team, not everyone will be, so we can intentionally modularize the workflow and think about how to augment the RStudio IDE to make sure our tools work well with all of our colleagues. One way to do this is by modularizing the parts of our workflow that do not really require R code. For example, in those templates we just discussed, we could make separate templates for components that do and do not require programming. Files that only need plain-text commentary could be vanilla Markdown files that a collaborator can edit in any word processor, and the main R Markdown file can incorporate these using child documents, as in the sketch below.
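As a small sketch of that handoff (the file names are hypothetical), the main R Markdown report can stitch a collaborator's plain-Markdown commentary in at render time using knitr's child option:

````markdown
<!-- report.Rmd: the analyst owns this file and its code chunks.  -->
<!-- "findings-commentary.md" is a vanilla Markdown file that a   -->
<!-- non-programmer collaborator edits in any word processor.     -->

## Findings

```{r, child = "findings-commentary.md"}
```
````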
This approach is made even easier by advances in the RStudio IDE. For one, the visual Markdown editor provides a great graphical user interface to support word processing in Markdown. We can also use RStudio add-ins to extend the interface and ship interactive widgets inside our package. As an example, here I show the esquisse package, a point-and-click plot-coding assistant. Add-ins require more investment upfront, but they're much easier to maintain than complete web apps and can help convert our teammates into users over time. Besides, a good project manager is willing to go that extra mile to support their team. Now, speaking of collaboration, we've talked a lot about what makes an individual package a good team member.

Another major opportunity when building a suite of internal tools is that we get to think about how multiple packages on our team can best work together. We always want teammates that are clear communicators, have defined responsibilities, and keep their promises. We can help our packages be good teammates with naming conventions, clearly defined scopes, and attention to dependencies and testing.

Clear function naming conventions and consistent method signatures help packages communicate effectively with both package and human collaborators. Internally, we can give our suite of packages consistent names that indicate how each function is used. One approach I really like is having each function's prefix denote the type of object it will eventually return, like viz_ functions always returning a ggplot2 object. This way, past experience working with any one of our internal packages gives users intuition for working with any other.

Another aspect of good team communication is having clearly defined roles and responsibilities. Since we own our whole internal stack, we have more freedom in choosing how to divide functionality between packages. Take, for example, the data science workflow as described in the R for Data Science book. Open source packages inevitably have overlapping functionality, which forces users to compare alternatives and decide which one is best, and that can be time-consuming. But internally, we can use some amount of centralized planning to ensure that each package teammate has a very clearly defined role, whether that's providing a horizontal utility or enabling progress on a specific workstream.

When assigning these roles and responsibilities to our team of packages, we can consider how to manage the dependencies between them when functionality needs to be shared. Packages often have direct dependencies, where a function in one package calls a function in another. This isn't bad, but especially with internal packages, which might have a short shelf life and few developers, it can create a domino effect of failure. If one package is deprecated, or decides to graduate, retire, or take a vacation, we don't want the rest of our ecosystem to fail. Alternatively, since packages A and B are both under our control, we can see if we can eliminate explicit dependencies by promoting a clean handoff: we might have a function in A produce an output that B can consume, instead of B calling A directly. Or we can find the shared needs of packages A and B and extract them into some building-block package C, which might contain something like visualization primitives. This way, we at least have a clear hierarchy of dependencies and can identify a small number of truly critical pieces of infrastructure that can't fail.

Regardless of the types of dependencies we end up with, we can use tests to make sure that our packages are reliable teammates who always do what they promised. Typically, if I write a package B that depends on package A, I can only control package B's tests, so I can write tests to check that A continues to produce the results B is expecting. This is a good safeguard, but it means I'll only detect a problem after it's already been introduced.
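For instance, a downstream test in package B might pin down exactly the behavior it relies on from upstream (testthat shown here; packageA::summarize_cohorts() and the column names are hypothetical):

```r
library(testthat)

# In package B's test suite: verify the upstream teammate still
# keeps the promises B depends on.
test_that("packageA output still has the shape B expects", {
  toy_data <- data.frame(
    cohort = rep(c("a", "b"), each = 3),
    month  = rep(1:3, times = 2),
    active = TRUE
  )
  result <- packageA::summarize_cohorts(toy_data)

  expect_s3_class(result, "data.frame")
  expect_true(all(c("cohort", "month", "retention") %in% names(result)))
})
```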
Instead, with internal packages, we'd prefer that both A and B be conscientious of the promises they've made and stay committed to their collaboration. We can formalize that shared vision with integration tests. That is, we can add tests to both the upstream and the downstream packages to ensure that they continue to check in with each other and ask whether any changes they're considering could disrupt the other. Now, just imagine having that rigorous and ongoing communication and prioritization with your actual teammates.

In summary, we all know the joy of working with a great team, and I suspect many of the people here today know the pleasure of cracking open a new open source tool. By taking advantage of the unique opportunities of designing internal packages, we can truly achieve the best of both worlds. We can share the fun of working with good tools with the teammates we care about, and we can elevate those same tools into full-fledged teammates by giving them the skills they need to succeed. Thank you.