So this is the rdataretriever session. The R Data Retriever is a platform for downloading, cleaning, and installing publicly available datasets. I'm Henry Senyondo, a research software developer at the University of Florida, and specifically I work at the Ecology Lab, which is one of the best research units at the university.

As scientists, we do a lot of data processing, and we know the data processing workflow. It usually starts with acquiring data; then we clean and reformat the data, combine it, visualize it, and analyze it. All of these parts take different amounts of time. Scientists spend quite a lot of time acquiring data, and even more time cleaning it, when we could be using that time on the other steps, like analyzing the data and creating the cool science. The R Data Retriever is the tool that helps us spend less time acquiring and cleaning data and more time analyzing datasets and coming up with all this cool science. And this is what we actually see after using the R Data Retriever: less time spent acquiring and cleaning the data, and more time available for scientists to visualize and analyze their datasets and come up with solutions to the problems that we have.

So data processing in particular is time consuming. We know that. Why is that so? There is a large number of formats: we have data packaged in XML files, CSV files, and other tabular formats, and we have data provided as compressed files, whether that is .zip or tar or some other variant. And even where we have agreed-upon standards, there is a lack of adherence to those standards when these data packages are published. When it comes to actually cleaning the data, we as scientists know that everyone can clean data, and we all write our own data cleaning code, right? What happens is that every data package eventually has its own data cleaning code, and now we're spending time not only cleaning and acquiring the data, but also searching for these packages and debugging them, because software always breaks; I always say it breaks overnight.

Data processing is also unstable. Everyone writes their own code, but data updates can change the format, or the URL may have moved; for example, recently people have been moving their data packages on GitHub from master to main. So data processing code for regularly updated data frequently breaks, and when it comes to tooling that is really frustrating, because your whole pipeline breaks down because of these problems.

So how can the R Data Retriever help? The R Data Retriever can download your datasets, clean them, and install open datasets with one line of code into one of the supported data storage engines. Datasets in the R Data Retriever are described by JSON files. These are specifications: they capture properties of the data, like whether the dataset has a version, whether it has a license, the URL where the data can be found, and which files are inside, say, a zip folder.
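To give a sense of what one of those JSON specifications looks like, here is a rough, illustrative sketch for a made-up dataset. The field names follow the general shape of the retriever's JSON scripts, but the dataset name, URLs, and values are invented for illustration and this is not a real recipe from the script repository.

```json
{
  "name": "example-bird-counts",
  "title": "Example bird count dataset",
  "version": "1.2.0",
  "licenses": [{"name": "CC0-1.0"}],
  "keywords": ["birds", "surveys"],
  "homepage": "https://example.org/bird-counts",
  "resources": [
    {
      "name": "counts",
      "url": "https://example.org/bird-counts/counts.zip",
      "dialect": {"delimiter": ","},
      "schema": {"fields": [{"name": "site", "type": "char"},
                            {"name": "year", "type": "int"},
                            {"name": "count", "type": "int"}]}
    }
  ]
}
```

A recipe along these lines is all the retriever needs to download, clean, and load a dataset into any of the supported engines.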
We also provide custom scripts for arbitrarily complex cleaning or restructuring tasks, where the data is so complicated that we have to do some tailoring before it can be ingested into the platform. This is basically similar to a package manager for software, but this time it's for your data. How is that so? A single person figures out how to clean the data and creates a recipe. That recipe is shared with everyone else, and if something changes, the fix is automatically delivered to all users from one single recipe change.

So let's get an overview of what we're actually dealing with. The R Data Retriever collects datasets from all these public sources. We have many public sources: some are domain specific, some are general data science sources, for example Kaggle; USGS is basic science, the Forest Service provides all of those datasets, and there are many, many repositories where we can find publicly available datasets. The R Data Retriever comes in, cleans these datasets, restructures them in the best way possible for you to analyze, and puts them in R. If you want to use them as data frames you can go ahead, or you can store them in one of the data storage engines. We support several: SQLite, MySQL, Postgres, and flat-file formats like CSV, XML, and JSON. So there are many, many supported platforms with this tool.

Now let me show you how the processing works. This may be a little hard to show, but we have several datasets, and how do we use them? A simple run starts by loading the library reticulate, which is one of the underlying libraries that the R Data Retriever uses. So we load reticulate and we load rdataretriever, and it provides several functions; you can see which functions you want to use, such as installing to CSV or checking for updates. Right now we're running the datasets function: we want to see how many datasets we have, and currently we're approaching 300 unique datasets. Many people will be worried: okay, you have several datasets, but how do I find the dataset that I want to use? Datasets are usually categorized by keywords, and the retriever is the same. You can run the datasets function and give it a keyword, say I want datasets about birds or about plants, and it will give you all of those datasets. You can also go to the website, look at the citation and the description, and find the dataset you want to use.

A simple run to install data would be installing iris to CSV; it produces, I think, 151 rows. In another case we want to create a data frame, maybe from a dataset called portal, so we fetch portal, and it tells you it has installed the data and we have three tables, which are main, plots, and species. You can then go ahead and look at what you have as the data frame, and it will show you a little bit of what is in portal.
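In code, that demo looks roughly like the following minimal sketch in R. It assumes the underlying Python retriever is already installed and its recipes downloaded; the keyword and the exact structure of the returned list are illustrative, and argument names such as `keywords` may differ slightly between versions of the package.

```r
library(reticulate)        # rdataretriever drives the Python retriever through reticulate
library(rdataretriever)

rdataretriever::datasets()                      # list the available datasets (close to 300)
rdataretriever::datasets(keywords = "birds")    # narrow the list down by keyword

# Install a dataset as flat CSV files on disk
rdataretriever::install_csv("iris")

# Or pull a dataset straight into R as data frames
portal <- rdataretriever::fetch("portal")
names(portal)   # tables such as "main", "plots", "species" (structure may vary by version)
head(portal$main)
```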
It's a list, and you can see what kind of data is in there; it gives you basically what you have.

So, going back to where the R Data Retriever comes from: we provide it in three major data science languages. The core tool is written in Python. Then we have the R package, rdataretriever, which is a wrapper around the Python core that uses reticulate under the hood. And we have the Julia package, Retriever.jl, which uses PyCall to call into Python. So if you're using any of these languages, you have the same functionality.

When you're processing data and dealing with all these data science things, we always have the problem of trying to reproduce an analysis we did in the past, right? Well, the R Data Retriever has that functionality. It gives you a snapshot of an analysis you did, together with the scripts and recipes. It packages them all together, so it's kind of like how Git versioning works, but this time it's for your whole reproducible pipeline. You can get the data and commit it, get the scripts and commit them, and it keeps all the versions together. In the future, if the data has changed so much that you're trying to understand why your results have changed, you can go back to the day you actually committed this dataset and reproduce the same thing. This is very good for people who write papers: you publish a paper, somebody redoes the analysis and says, oh no, I think we get a different result from the data, and you can go back to that same place and rerun everything. So that is cool.

And this is a sample of how it runs. For example, we are committing portal, and we're saying: okay, I'm committing it and I'm at the CSV conference today, and maybe later on I'll want to find out what I committed that day. After a few runs it gives me the commit message, a hash, and the date, and those can be used to install the same snapshot in the future. So a few days pass and I want to redo the same analysis, because things always change; maybe the data package provider added a new column and I want to see what happened. You can come into the R Data Retriever and say, okay, install portal and use that hash number, and it will install the same data, and you can go through the files and see what is actually happening. And that is it, once it has finished installing the dataset.

Coming towards the end: the R Data Retriever brings robustness to data processing, right? How do we handle that? We know that updates always happen, and they still break the underlying recipes that we have, but we do weekly monitoring of those datasets to check for breaking changes. And because it's a community-based platform, people can come in and say, oh, I'm working on this dataset, it's iris and it's breaking, and we can figure out why it's breaking; maybe there is a change in version. So we keep updating the version and updating the data, so that somebody who installed 1.0 can understand that 2.0 has a difference in the data because of a change in a column.
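A hedged sketch of what that commit-and-reinstall workflow could look like from R is below. The message, path, and hash are made up, and the exact function and argument names for the provenance features (`commit`, `commit_log`, and how a hash is passed back to the install functions) are assumptions that may not match the released API exactly.

```r
library(rdataretriever)

# Snapshot the portal data together with the recipe that cleaned it
# (argument names are assumptions; check the package docs for the exact API)
rdataretriever::commit("portal",
                       commit_message = "Snapshot for the csv,conf analysis",
                       path = "provenance/")

# Later: list what was committed, with messages, dates, and hashes
rdataretriever::commit_log("portal")

# Reinstall exactly that snapshot using its hash (hash value is illustrative)
rdataretriever::install_csv("portal", hash_value = "a76e2b")
```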
So it allows people to quickly fix these things, and when a recipe is fixed, we automatically send the updates to people. You can literally go into your R Data Retriever tool and say, get updates, and it will tell you, okay, this has been updated. With that, your pipeline continues to run even with changes in the data format or the location. And that's something cool about this: everyone doesn't have to develop their own packages, we have one standard platform that can clean the datasets, with people maintaining it, and I think this is one of those great tools for reproducible science and getting the workflow to an optimal standard.

So with that being said, if you have any questions or you want to check out the tool, you can go to the home page. I also want to thank all the funding organizations that have helped us get to this point, and all the contributors who have put in time to get this to evolve from the small beginning that we had; now we have grown to cover many of the open, publicly available datasets. I think that's a great way to go. So thank you very much if you have any questions. Thank you so much.

I just want to check really quickly if there are questions from anyone in the audience. If not, can I ask: I wasn't sure whether I missed this earlier, about the datasets that are maintained and monitored for breaks. Is there a stable list, or does it apply to any dataset that someone is using?

Can you ask again?

Yeah, I'm sorry. I think I may have missed the part about the datasets that are tracked by the R Data Retriever. Is there a stable list?

Oh yeah. So what happens is that, because data always changes, right, we are running and providing a platform that people can look up: we have a dashboard called the retriever dashboard. What does it do? Basically it runs these datasets every day, like installing iris, and checks the hashes of those datasets: is the hash the same, or has it changed? A changed hash tells you that the dataset has changed, so it flags that dataset as one that has changed or is erroring. Then we come in and look: maybe someone has changed the URL, maybe new data has been introduced into the dataset, and is it data that we expect to accumulate every day? For example, we just added a dataset called New York Times COVID-19, and this dataset keeps changing every day. It has the same URL, which is the good thing, but data keeps being added on, so we know that change comes from the data itself. But there are some datasets, for example iris, a standard dataset that has been there forever, where if something fails it's not because new data was added; it's because maybe the location has changed. Those are the things we can track and solve, at least before people run into problems with their tools.

Thank you so much. Appreciate it. I think that's all the time we have for questions for this presentation, but I definitely want to invite people to continue the conversation in the Slack channels. Would you be willing to answer those questions in Slack?

Yeah, I'm going to jump on to Slack. I'll be able to answer most of the questions that pop up. All right.
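For reference, picking up those recipe fixes from R is the single "get updates" step mentioned above; a minimal sketch, assuming the package is already set up:

```r
library(rdataretriever)

rdataretriever::check_for_updates()   # see whether any dataset recipes have changed upstream
rdataretriever::get_updates()         # download the updated recipes

# The next install then uses the repaired recipe
rdataretriever::install_csv("iris")
```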
Thank you very much for the great presentation. Thank you very much, everyone, and thank you for attending this amazing conference.