Okay, hi everyone. My name is Monica Granados. I am one of the inaugural fellows of the Reproducible Research Fellowship from the Frictionless Data program, which is part of the Open Knowledge Foundation. I'm also a policy analyst for the Government of Canada, and I serve on the leadership team of PREreview. Lily, would you like to introduce yourself?

Hi everyone. I'm a fellow alongside Monica, and I'm a marine science PhD student at UC Santa Barbara in California. I study how coral reefs and coastal communities are affected by global environmental change, and I see reproducible research and open data science as tools that help scientists understand these issues faster and improve resilience in these systems.

Great, thanks Lily. So the two of us are Frictionless fellows, and we wanted to tell you a little bit about what our experience was like as part of the inaugural cohort, and about two really important tools we learned about during this fellowship and how we've applied them in our own work.

When we do science, Lily and I do a lot of data collection. I'm a trained food web ecologist; I go out and collect data on crayfish. Lily has collected data on the octopus trade. That data gets put into some form of data entry, and at least in ecology a lot of us just use spreadsheets like Excel that output CSV files. When we start to talk about our data, whether in a publication or maybe in a tweet, we might get a lot of interest in it. People will say, "I'd like to use your data," or, "Part of your data would make a great addition to a meta-analysis we're doing." So we get people who are interested in our data. How can we share it?

There have been many ways to share data over the history of science and scientific publishing. You could send it by carrier pigeon, by snail mail, by the Pony Express, or by email. But regardless of how you send that information, you often end up staring at the computer screaming, "I hate other people's data." That's because your data has implicit parts that you understand yourself, but that others may not. When they open that CSV file or spreadsheet, there will be columns they don't understand, blanks they don't understand, or sometimes NAs, or N/A, or #99, which has been one of my favorites. It leads to a lot of confusion and frustration.

So what if we could make it easier to share data? What if, instead of just sending CSV files, we put some context into the data we've collected? That's what we learned through the Reproducible Research fellowship. We did a series of journal clubs, seminars, and blogs, and we learned about two important tools that are part of the Frictionless Data program: data packages and data validation. We're going to give you a little sneak peek into both, and at the end of the presentation we'll give you a link where you can learn more about these tools and how you can implement them in your own research workflow.

So, starting with data packages. What are data packages?
I felt that, as someone who is pretty committed to the open science movement, I was doing a really good job of making my data available. The data for a manuscript I published last year is all available: the code is available, as well as the CSV files holding the raw data. You can go to GitHub and grab that information, and there's even a README file that gives you a little bit of information about the data. But the truth of the matter is, if you don't have any information about the column headings, you're not going to know what anything means. You don't know the units, and you don't know how I collected that data.

Through the program, we learned about data packages, and I'm going to tell you a little bit about the web tool. If you go to the Data Package Creator (the URL is at the bottom right of the screen), you can upload your raw data: either the CSV file itself or a resource path. So I can take the data that was already in my GitHub repository and upload it. What's really neat about this tool is that it lets me give context to all of my columns. I've got column headings like day, trophic, species, and mesocosm that may not mean a lot to you if you just open the CSV file, but through the Data Package Creator I can give you some context and make it easier for you to take that data and apply it in ways you may find useful.

Here's a screen capture: once you load the path to your resource, it finds your columns, and then you can add information in the title and description fields and specify the data type. All of this information is used to generate a table schema, which is basically a description of how the data is structured, along with information about the data itself. So I can explain what I meant by "day": oh, it's the experimental day. I can say how long the experiment ran, for example, and I can give you information about the data values themselves.

Once you've entered all of that information, you can download the data package as a JSON file and send that JSON file to your collaborators, or to any interested party, instead of just a CSV file with no context. You can also receive other people's data packages and upload them in the Data Package Creator to see all the information your collaborator has provided about the resource and the data.

There are also ways to use data packages in a more reproducible workflow: you can use the Python and R libraries that have been built around them, as in the sketch below. But I just wanted to give you a little taste of the power of sharing your data as a data package.
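As a rough illustration of what that programmatic route looks like, here is a minimal sketch using the datapackage Python library. The file name, field names, and descriptions are hypothetical stand-ins echoing the columns described above, not Monica's actual dataset:

```python
# A minimal sketch using the datapackage Python library
# (pip install datapackage). The file name and field details below are
# hypothetical stand-ins for the crayfish dataset described above.
from datapackage import Package

# Build a package and let it infer columns and types from the CSV
package = Package()
package.infer('crayfish_data.csv')

# Add the human context a bare CSV is missing: titles and descriptions
fields = package.descriptor['resources'][0]['schema']['fields']
for field in fields:
    if field['name'] == 'day':
        field['title'] = 'Experimental day'
        field['description'] = 'Day of the mesocosm experiment'
package.commit()  # apply the descriptor changes

# Save the datapackage.json to share alongside the raw data
package.save('datapackage.json')

# A collaborator can then load the package and read the annotated data
received = Package('datapackage.json')
rows = received.resources[0].read(keyed=True)  # list of row dicts
```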
I'm now going to turn it over to Lily, who is going to tell us a little bit about GoodTables.

Hi everyone. GoodTables is the second reproducible data science tool offered by the Frictionless Data program. The tool was developed specifically to help with data validation, and it's available both as a web tool and through the command line. Let's walk through the web tool version and try to validate Monica's crayfish, algae, and snail data frame.

First, we navigate to the Try GoodTables page in the web browser. Let's say you're starting with just the raw data. You can do this step before you create your data package, and even if you don't have a schema (which is made with the data package), this is still a great tool for checking for structural errors in the data frame itself. You upload your data and check for structural errors such as missing entries. You can either upload the file from your local directory or insert the URL where your data is stored; for example, up there we can add the raw version of Monica's GitHub data. Then you hit the validate button, the one in gray, and if there are no structural errors in your data frame, you'll get a pop-up that says "valid table." But in this example we see there is actually one error, a content error: the density column was marked in the schema file as an integer variable when it's actually a numeric variable.

Let me go back to that schema slide. The JSON file you make with the Data Package Creator includes the schema, and a schema makes it possible to run a more precise validation check on your data: you're not just looking at the structural level, but also at the content level. You copy out the part of the JSON file that is the schema, making sure you include the curly brackets (that's the problem I always accidentally run into). Then you go back to the GoodTables page, insert the schema, and hit validate, and that's where you'll be able to validate the content, as we showed before. To fix the error we saw on the density column, you can either make the change in the Data Package Creator or edit the JSON file directly. After you update the JSON file, re-upload it on the GoodTables web tool and hit validate, and you should get a notice saying that everything is valid. The same checks can be scripted, as in the sketch below.
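For anyone who prefers the scripted route Lily mentions, here is a minimal sketch of the same two checks using the goodtables Python library. The file name and the schema fields are hypothetical stand-ins based on the columns described above:

```python
# A minimal sketch using the goodtables Python library
# (pip install goodtables). The file name and schema below are
# hypothetical stand-ins for the crayfish dataset described above.
from goodtables import validate

# Structural check only: catches problems like blank rows,
# duplicate headers, and missing values
report = validate('crayfish_data.csv')
print(report['valid'], report['error-count'])

# Content check: supply the table schema from the data package.
# Note that density is declared as "number" (numeric) rather than
# "integer", which is the fix for the error described above.
schema = {
    'fields': [
        {'name': 'day', 'type': 'integer'},
        {'name': 'trophic', 'type': 'string'},
        {'name': 'species', 'type': 'string'},
        {'name': 'density', 'type': 'number'},
    ]
}
report = validate('crayfish_data.csv', schema=schema)
if not report['valid']:
    # Each table's errors carry a row number, column, and message
    for error in report['tables'][0]['errors']:
        print(error)
```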
Yeah, and I guess we just wanted to end by saying that this was a short introduction to these tools. If you want to learn more about them, we're hosting a 90-minute hands-on workshop on May 20th, and we'd love to see you all there. The Open Knowledge Foundation is also accepting applications for its next cohort of Frictionless Data Reproducible Research Fellows. So if you're an early career researcher and interested in applying, we would highly recommend it and would love to talk with you about that more.

Great. Oh, go ahead, sorry. No, I was just going to say thanks so much, Lily. You can look at those URLs: the top one is where you can follow the syllabus for the data package and schema material and learn a little more about Frictionless Data, and at the bottom you can learn a little more about what the fellowship is like and how to apply for the next round.

Great. There were a lot of comments and questions in the chat about how this all fits together and how it works, and I know there will be a lot of questions over in Slack for sure. Vicki actually just put a comment into the chat that I had in my mind as well, which is that it would be really great if we could mandate this type of analysis, or review of tabular data, before it goes into repositories. A lot of data repositories are full of tabular data with missing columns or poor documentation, and I wondered whether you've had any discussions with data repositories about integrations for ingest or QA in that sense.

Yeah, I haven't asked the fellows, but I can tell you a little about some of the work I've done through my day job, where we deal with a lot of environment and climate change data. We actually ran a hackathon to see if someone could build in checks, so that as data gets fed into a repository, it is checked for at least some really basic things, like empty cells or stray characters. Even in a weekend hackathon, the students working on it were able to come up with something. So it is certainly something that big organizations care about and would like to see happen, so that the researcher isn't necessarily the one doing it, but there's some kind of process to ensure those checks do happen.

Yeah, definitely. I agree.

So another question was: what advice do you have for scientists who are trying to integrate these tools into their workflows for the first time?

Yeah, I think there are a lot of video tutorials at the link Monica shared, and then there's coming to the workshop, where we're actually going to work through it all together. I think both of those would be really great.

Okay, great. Well, thank you both. We have a lot of questions here; we'll move over to Slack, where people can interact with you, and we appreciate you walking us through it. I had seen a presentation about Frictionless Data when it was first getting launched, and it's very slick. You've come so far; it seems like it's really hitting a lot of the original promise, so congratulations.

Yeah, it's been a pleasure. It's been really fun.

Great, okay. Thanks for joining, everyone. Thank you.