Hi everyone, I'm Javier Luraschi, and in this useR 2020 talk we are going to learn how to use the pins package to make your data science workflow reproducible even when you're using data sets. A lot of the common anti-patterns I've seen when using data sets fall into two categories. The first one is that we overuse iris and mtcars, because it's just hard to get data, so we fall back to the easiest data sets to use. The other one is that a lot of times we don't include our data set in the code that we're trying to share or reproduce. You can see a couple of examples on the slides that I found from R bloggers where the data set is missing, or where it requires you to perform manual steps to actually download the data set, and that creates friction and ultimately leaves code that is broken. Now, in general we talk a lot about the core data science workflow, which was popularized by R for Data Science. In this workflow we start by importing data, then tidying it, understanding it, and communicating, which is all good; but a lot of times we also need to figure out where the data is. We need to search for data, we need to bring the data from a remote location onto our local computer, and a lot of times we also want to share these data sets with our colleagues, friends, or the world. This is what I call discover, cache, and share, which is something the pins package was designed to help you with. And in general, reproducibility is important; I don't even need slides for this. Tools like Jupyter, R Notebooks, and R Markdown were designed with reproducibility in mind, so we want to make sure that our data sets are part of this workflow and are not left behind with manual instructions, hard-to-configure permissions, dependencies, and so on. We want to be able to take code and rerun it, even if it has data sets attached to it. So what do we want from such a package and tool?
Well, I think we want a few characteristics. First, we want a single tool; we want to avoid relearning different tools for different storage locations. Second, we don't want to be locked in with a specific storage provider. Say we're using Amazon S3 and then want to move to Google Cloud: it should be easy to move from one cloud provider to the other without having to relearn or change our code. Third, the tool should allow us to create automated workflows that don't require authentication, because that way we can also automate our data workflows and make them more efficient. Fourth, it needs to work with any type of data, and by any type of data I mean any size, from small data sets to large data sets, ideally hosted in systems like HDFS, to really scale up the kind of data sets you can store. Fifth, it should help us find interesting data sets, because we are all tired of using iris and mtcars; wouldn't it be nice to use additional, more interesting data sets? Sixth, it needs to be fast; we don't want to be waiting for our data sets to load. And ideally, we also want to make sure it works offline, when we don't have an internet connection. These are all things we considered when designing the pins package. The pins package allows you to pin a remote or local resource to your local machine or to remote machines. It allows you to discover data sets, and it allows you to share resources with others on multiple services like Kaggle, RStudio Connect, GitHub, S3, Azure, Google Cloud, and DigitalOcean. The pins package is now on CRAN; you can download the latest version, which is 0.4.1. Using it is actually quite straightforward: after you install it from CRAN and run library(pins), all you have to do to save an object is call pin() with the R object you are interested in storing and a name. In this case, we're calling it "numbers".
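The save-and-name step just described can be sketched in a couple of lines, assuming the pins 0.4 API shown in the talk:

```r
# Minimal sketch of saving a local pin (pins 0.4 API).
library(pins)

# Cache an R object locally under the name "numbers".
pin(1:10, "numbers")

# Retrieve it later, in this or another R session.
numbers <- pin_get("numbers")
```

By default this writes to a local board in a per-user cache directory, so the object survives restarting R.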
Then you can retrieve the pin using its name: pin_get("numbers"). If you are retrieving a remote data set, you can also specify a remote URL; in this case, we're retrieving a smaller version of ImageNet from a zip file, all with one line of code, and then retrieving it from local storage using pin_get(). You can also use pin_find() to find data sets. And not only that, this is also nicely integrated with RStudio, for those RStudio users out there. So what is a board? Well, a board is a concept from the pins package that describes a storage location; it's basically anywhere that allows you to store data. Currently, the pins package supports seven different storage providers: you can store data sets in RStudio Connect, Kaggle, GitHub, S3, DigitalOcean, Google Cloud, and Azure. It's quite easy to store data sets in a remote location to share with others. All you have to do is include one line before you start using the pins package, which is board_register(); board_register() enables a particular storage provider for your data sets. Then, whenever you pin an object, like the numbers 1 to 10, you specify the board argument: GitHub, S3, RStudio Connect, or whatever board you registered. That's the way it works. Then you can call pin_get() to retrieve your data set, and you can optionally also specify the board name, if you want to make sure you're retrieving the data from one specific board. If you need to manage permissions or look at the contents, that is obviously different between storage providers: the user interface you get from, say, RStudio Connect differs from the other providers, and the defaults, permissions, and settings they allow you to change are a little different too.
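The remote-URL, search, and board workflow above can be sketched as follows, assuming the pins 0.4 API; the download URL and the GitHub repo name are hypothetical placeholders:

```r
# Sketch of remote pins and boards (pins 0.4 API).
library(pins)

# Cache a remote file locally with one line; later calls reuse the
# cached copy instead of downloading again. (Placeholder URL.)
path <- pin("https://example.com/small-imagenet.zip", "small-imagenet")

# Search for data sets across registered boards.
pin_find("imagenet")

# Register a board once, then pin to it and retrieve from it by name.
# (Placeholder repository.)
board_register("github", repo = "owner/datasets")
pin(1:10, "numbers", board = "github")
pin_get("numbers", board = "github")
```

The one board_register() call is the only provider-specific line; swapping "github" for "s3" or "rsconnect" (with that provider's credentials) leaves the pin() and pin_get() calls unchanged, which is the lock-in point made earlier.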
Let's take a look at a quick example. This is the pins website, where you can navigate to the use cases section to look at a few common ways of using the pins package. The first one is reusing data sets. As I was mentioning, we often spend a lot of time cleaning data sets: we find the data set, we clean it, and we want to share it with others. In this case, we're executing one line to share this data set on RStudio Connect, but you can run very similar code to share it on GitHub, cloud providers, and so on. You get a preview of the data set, but more importantly, you also get a line of code that you can use to retrieve this data set from an R session with ease. A few other use cases: you can automate data set updates. If you have a data set that many people are consuming, you can create a GitHub Action or an RStudio Connect scheduled report to update the data set automatically. Not only that, you can also create multi-step pipelines: two data sets that are automatically generated, and a third report that uses the previous two to update something else. And last but not least, a pretty great usage pattern for the pins package is to preload data in Plumber and Shiny applications. In this case, we have a deep learning model that trains overnight, takes a long time to compute, and requires GPUs. After we finish processing this data set, we create a pin, which a Shiny application can then make use of by simply loading the data from the pin. Not only that, the Shiny application uses what we call a pin reactive, defined in the pins package, which allows you to update the data set reactively without having to reload the entire data set each time; the pins package is smart enough to only bring data when it has changed.
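The preloading pattern can be sketched roughly like this, assuming the pins 0.4 API; the server URL and pin name are hypothetical placeholders:

```r
# Sketch of consuming a pin reactively in a Shiny app (pins 0.4 API).
library(shiny)
library(pins)

# Placeholder RStudio Connect server.
board_register("rsconnect", server = "https://connect.example.com")

ui <- fluidPage(tableOutput("predictions"))

server <- function(input, output, session) {
  # pin_reactive() polls the board and invalidates only when the
  # pin's contents actually change, so the app picks up the
  # overnight refresh without restarting.
  model_data <- pin_reactive("overnight-predictions", board = "rsconnect")
  output$predictions <- renderTable(head(model_data()))
}

shinyApp(ui, server)
```

The expensive training job and the app are fully decoupled: the job only needs to pin its output, and every app reading that pin updates on its own.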
And that's exactly what pin_reactive() is doing in your Shiny application. There are a few other features worth mentioning. The first one is that multiple versions are supported. If you're familiar with GitHub, you already expect the pins package to let you retrieve previous versions, but you can also use versioning with boards like S3, RStudio Connect, and Kaggle. Some of them, like the cloud boards, require you to opt in, since storing multiple versions comes with additional cost; but nevertheless, versioning is supported across all the boards that the pins package supports. You can also extend pins and its boards. Say, for instance, that you are dealing with a proprietary file system in your organization, or something that the pins package simply does not support. You can obviously send us a pull request, but you can also create a package that implements four operations: create, find, retrieve, and remove. If you can implement these four operations, you can have a fully functioning board for whatever storage provider that might be. In addition, you can also configure how each particular R object is stored. For instance, when we store a data frame, it gets stored as a native R object and also as a CSV file for easy interoperability; but you can choose to store a ggplot, or plot objects in general, with a screenshot as well, and this is something you can configure with an S3 method over the pin verb. Last but not least, here is a feature that I think is quite great and almost no one knows about. Whenever you store data in a board, the pins package uses what we call the data.txt specification. This specification is just a YAML file that describes the location of your data sets and their metadata. So it's quite simple: just a text file that describes your data sets.
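A custom board for those four operations would look roughly like the skeleton below. This is a sketch under the assumption that the pins 0.4 extension API dispatches on S3 generics named board_pin_create(), board_pin_find(), board_pin_get(), and board_pin_remove(); the "myboard" class and all the bodies are hypothetical placeholders for your storage system:

```r
# Skeleton of a custom board for a hypothetical "myboard" storage
# provider (pins 0.4 extension API; names assumed, bodies omitted).

board_pin_create.myboard <- function(board, path, name, metadata, ...) {
  # Upload the pin's local files at `path` into your storage system.
}

board_pin_find.myboard <- function(board, text, ...) {
  # Return a data frame describing pins whose names match `text`.
}

board_pin_get.myboard <- function(board, name, ...) {
  # Download the pin's files and return the local path to them.
}

board_pin_remove.myboard <- function(board, name, ...) {
  # Delete the named pin from your storage system.
}
```

Once these methods are on the search path and the board is registered, pin(), pin_get(), pin_find(), and pin_remove() all work against the new backend without any other changes.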
What you can do with this specification is use tools like the datatxt package, which allows you to generate a website for the data sets that are already described by a data.txt specification. So whenever you create a pin in an S3 bucket, you also get this data.txt file by default, which you can then point a custom domain name at, and basically get a ready-made website built from the definition of your data sets. And this is how it looks. This is the cellar subproject of the kasa.ai project, which contains a data set site with five different insurance-related data sets. You can see that the site shows the contents of each data set, along with some of the column descriptions and summaries you would expect. What is also really awesome is that, apart from being able to see this website, once you have made your data sets public, you can easily interoperate with them from RStudio, or from R in general, with a single call: board_register() with the path to the domain that contains the data.txt file. Then, in the RStudio connections pane, you can see all your data sets displayed, and you can view them, pin_get() them, and also, for instance, find them. All right, thank you so much for attending this talk. If you want to learn more about the pins package, the best resource to go to is pins.rstudio.com. If you want to keep up with updates on the pins package, we are posting all the updates our team is working on on our blog, blogs.rstudio.com/ai.
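Consuming such a public data.txt site from R can be sketched like this, assuming pins 0.4 provides a board_register_datatxt() helper; the domain, board name, and pin name are hypothetical placeholders:

```r
# Sketch of registering a public data.txt site as a board (pins 0.4 API).
library(pins)

# Point pins at the domain hosting the data.txt file. (Placeholder URL.)
board_register_datatxt(
  url  = "https://data.example.com/data.txt",
  name = "examplesite"
)

# The site now behaves like any other board: list its pins, then
# retrieve one by name. (Placeholder pin name.)
pin_find("", board = "examplesite")
claims <- pin_get("insurance-claims", board = "examplesite")
```

Because the site is just static files plus a data.txt manifest, anything that can serve HTTP (S3, GitHub Pages, a plain web server) can act as a read-only board for your whole team.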
We also have a YouTube channel where you can learn how to use big data technologies, deep learning, and data management technologies like MLflow and pins, at youtube.com/c/mlverse. Or you can contact us directly on Twitter: myself, @javierluraschi; Kevin Kuo for data.txt and insurance-related data solutions; Alex Gold, who has been advocating for using pins with RStudio Connect; and the rest of our team members on the mlverse team, Daniel Falbel, Sigrid Keydana, and Yitao Li. Thank you so much.