All right, I'm going to go ahead and start. My name is Lilly Winfree. Can people hear me? Okay, I'll try to be a little loud. I am going to be talking to you about Frictionless Data for reproducible research. I'm a product manager at the Open Knowledge Foundation, and my background is in neuroscience research, so I'm another neuroscientist today. But now I do open data with a focus on open science. This bitly link here is the link to my slides. There's a lot of content in here with links, and some content that I threw up so you can look at it later, after the talk is done, to practice and play on your own.

To start, I want to ask you: how many of you have heard about the reproducibility crisis in research? All right, great, cool. We're going to be talking about that today. For those of you who don't know, it's the idea that some experiments in science aren't reproducible. I'm going to show one recent example of this. Dr. Kate Laskowski recently had to get her paper retracted because of a data issue that was found years after it was published. Basically, they couldn't understand some of the data, and they couldn't understand how it was created. That is a horrible feeling for any scientist, having to retract a paper.

So why does this happen? Why are experiments not reproducible? There's a lot of debate about this right now, but there are certain things that we know. For instance, the methods for doing experiments are often not published, or not published openly or completely. The same goes for data: raw data especially is often not published, so it can be really difficult to understand what happened between the raw data and the analyzed, published results. Today we're going to be talking a lot about these data management issues in research, and the Frictionless Data project is focused on helping fix some of them. First I'm going to tell you about the Open Knowledge Foundation, where I work; then I'll get into the technical background of Frictionless Data; and then I'll get into a use case where we've been working with researchers.

The Open Knowledge Foundation is a nonprofit. We've been around for 15 years, and we're focused on creating a fair, free, and open future: a future where everyone has access to data, and where people know how to use that data to drive positive social change. The project I work on at OKF is the Frictionless Data for Reproducible Research project, where we are removing friction in research data to move from data to insight faster. This is an open source project and we're very community focused, and by that I mean that we really depend on our community to make this project successful. Right here I have pictured my colleagues on this specific project, but it's more than just the four of us who do this work. We really rely on our community to give us feedback and use our tools, and after this talk I hope I've convinced many of you to join our community.

Okay, so this project is overseen by the Open Knowledge Foundation and is funded by the Alfred P. Sloan Foundation. We have three main ways that we collaborate. We have the Fellows Programme, where we work with early career researchers to teach them about open science, data management, and using the Frictionless tools. We have the Tool Fund, where we work with developers to build new tooling for reproducible research based on Frictionless Data, and we're about to open up another round of this.
So stay tuned; we give funding to people. And then we also have the Pilots, which are very intensive, one-on-one collaborations between our developers and researcher teams to help solve those researchers' data workflow issues. We're actively looking for new pilot collaborations, so if you're interested, please come talk to me.

Now, I've said "frictionless data" a lot, but what does that mean? Basically, we're trying to remove the frictions in working with data. You can think of these as the data cleaning steps and questions such as: what's the license for this data? What does this data value mean? Can I even use this data? Can I take data that was created in Excel and run it through my Python code? Who created the data? And things like checking the quality of the data. Oftentimes these are thought of as the boring parts you have to get through before you can analyze your data and get a result, but they are very important. Anyone who has worked with data knows how important cleaning data is.

Frictionless Data is a set of specifications for data and metadata interoperability, a collection of open source software libraries, and a range of best practices for data management. Importantly, it's platform agnostic, meaning that it's very interoperable and purposefully generalizable.

So the main question I want to talk about today is: how can researchers and other data wranglers use Frictionless Data? To get into this question, I'm going to talk about one of our pilot use cases that's ongoing right now. It's with BCO-DMO, which stands for the Biological and Chemical Oceanography Data Management Office. This group is funded by the NSF, a major science funder in the US, and basically anyone doing NSF-funded oceanographic research in the US submits their data to BCO-DMO. BCO-DMO has data managers who go through and clean all of that data, and then they host it as well. Once it's clean and hosted, other people can access it, including the public and other researchers who can then build upon that research. I want to mention the team we're working with at BCO-DMO: Amber York, Conrad Schloer, Adam Shepherd, and Danie Kinkade. I also shamelessly stole a bunch of these slides from Amber York; she gave a talk at csv,conf, and this Zenodo link here is the link to the rest of her slides.

Okay, I love talking about BCO-DMO because their data is messy. As you can imagine, they have data about everything in the ocean: data on coral reefs, data on ocean salinity, data on jellyfish. I think it's very interesting data, but it's also messy. If you can see where I'm pointing here, there's a column showing dates, and it's a date range, so there are two different data points in one cell, and the dates are written how Americans write dates, which is how no one else writes a date. So it's confusing. The data managers get this messy data, and then they really have to wrangle it to make it clean so other people can use it. We're working with the data managers using a program called Data Package Pipelines, which I'll tell you about in a minute, to try and help them in their various tasks.

For example, the data managers add spatiotemporal context in standardized formats: things like date and time, or even time zones, because this data is from around the world. They record things like latitude and longitude and make them standard, even the depth at which a measurement is taken under the ocean. They also need to correct quality issues: inconsistent formatting, corrupt data characters, data gaps (is this value that says NA actually nothing, or does it connote something to that researcher?). They also have to fix invalid species names and things like typos. I'm sure many of you who have dealt with messy data recognize many of these steps. Basically, what they're doing is reformatting the data for reusability by others.
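To make the date-and-time side of this concrete, here is a minimal sketch in plain Python (not a Frictionless tool) of the kind of normalization a data manager might do. The date string, the assumed MM/DD/YYYY convention, and the time zone are all invented for illustration:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# An American-style date as it might appear in a submitted dataset
# (hypothetical value; is this April 7th or July 4th?)
raw = "04/07/2015 14:30"

# The data manager has to know the intended convention (assume MM/DD/YYYY)
# and the time zone the measurement was taken in (assumed here).
parsed = datetime.strptime(raw, "%m/%d/%Y %H:%M")
localized = parsed.replace(tzinfo=ZoneInfo("America/New_York"))

# ISO 8601 is unambiguous for the next person who reads the file.
print(localized.isoformat())  # -> 2015-04-07T14:30:00-04:00
```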
So this collaboration we have with BCO-DMO is one where we're going in and giving developer time to take their messy data, turn it into clean data, and then host it for others to use. And we're trying to make this entire process reproducible, so that other people can understand what we did to this data, or what BCO-DMO did to the data.

Okay, this is one of my favorite slides. The researchers are out there, working hard, collecting this data, and then we come over and say, "Oh hey, did you record the metadata? Did you remember to do that?" And the answer is usually no. So I want to talk a little bit more about metadata today, and also tell you how the Frictionless Data tools are useful.

First of all, they can be used to keep track of your metadata. We were just talking about metadata a little in the last talk, but for those of you who don't know, it's data about your data: things like what the license is and what the column names are. Using the Frictionless Data tooling, such as this browser tool here (and again, I posted these slides so you can click on these links later and play around with them), you can take raw data and insert it, and the tool will automatically create metadata for you that you can then go in and edit. The metadata is in JSON, so it's machine readable and interoperable. Why is this important for scientists, and for all data wranglers, really? It's important to keep track of your metadata so that you know what is in your data, future you knows, and anyone else who wants to use your data can know as well. For instance, I know what SEM (standard error of the mean) means because I was a scientist and I did statistics, but there's a good chance someone else might not know that.

Okay, another thing Frictionless Data can do is help you package your data. This is where you take your raw data and your metadata and package them together. We like to think of this with a shipping container analogy, where the container holds your raw data and your metadata. Optionally, you can also include a schema about your data, which describes the big picture: things like what type of data should be in a column, and how many rows and columns your data set has. We have two different kinds of tools to work with data packages: many software libraries, all open source again, and then, again, a browser tool where you can actually create a data package.
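As a rough sketch of what this looks like in code, here is the same infer-then-package flow using the Python datapackage library; the CSV file name and the license edit are hypothetical, and the browser tool does essentially the same inference:

```python
# pip install datapackage
from datapackage import Package

# Infer metadata (column names, types, and so on) from a raw CSV;
# the file name here is hypothetical.
package = Package()
package.infer('ocean_data.csv')

# The descriptor is plain JSON, so it is machine readable and interoperable.
print(package.descriptor)

# Edit the inferred metadata (for example, add a license), then save
# the data and metadata together as a single data package.
package.descriptor['licenses'] = [{'name': 'CC-BY-4.0'}]
package.commit()
package.save('datapackage.zip')
```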
And why is it important to package your data? Well, packaged data is useful data. I'm going to use a Lego analogy to talk about this; I'm assuming many of you have played with Legos before. One of the best things about Legos, in my opinion, is that you can take different blocks from different sets and they automatically work together. It's the same idea with a data package: it's in this nice standard package format, and then you can use different tools and just plug and play. It's very interoperable. Also, packaged data can be easily published, for example on data repositories such as Zenodo.

All right, another thing you can do with Frictionless Data is create a schema to describe your data and then validate your data against that schema. Why is this important? My favorite horror story about research data being invalid is that Excel will actually take certain gene names and convert them to dates without telling you; it does it silently. There are genes like DEC7 that Excel will convert to December 7th, and then that data value is no longer useful for you. There are several papers that had to be pulled because this happened, the researchers didn't realize it, and the analysis was incorrect.

One way to know that this has happened to your data is to create a schema, and here are the Frictionless Data tools that will help you with this. A schema would say that column A is supposed to contain strings, so if you validate against it and it detects a date format instead of a string, it will give you an error. That's what I'm showing you over here. This is the GoodTables client, and it will validate your data. Here it's showing that this is a valid data set, with zero errors; if it were invalid, it would tell you exactly where those errors are and what each error is. We have try.goodtables.io as a browser tool for this, and we have the GoodTables software libraries and Table Schema software libraries that will help write schemas.
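Here is a minimal sketch of that validation step using the goodtables Python library. The file name, column names, and schema are all made up for illustration; the point is that a type plus a pattern constraint catches a silently Excel-converted value:

```python
# pip install goodtables
from goodtables import validate

# A Table Schema: the 'gene' column must be a string matching an
# uppercase alphanumeric pattern, so a converted value like '7-Dec'
# gets flagged. (Schema and file name are hypothetical.)
schema = {
    'fields': [
        {'name': 'gene', 'type': 'string',
         'constraints': {'pattern': '[A-Z0-9]+'}},
        {'name': 'expression', 'type': 'number'},
    ]
}

report = validate('expression_data.csv', schema=schema)

print(report['valid'])  # False if any cell violates the schema
for table in report['tables']:
    for error in table['errors']:
        print(error['row-number'], error['message'])
```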
All right, the final piece of Frictionless Data software I'm going to tell you about, really quickly, is Data Package Pipelines, and this is what we're using in this pilot collaboration. Data Package Pipelines (DPP) is data processing pipeline software, again open source. It's a Python framework for declarative processing of tabular data. It has standardized data processing steps already built in, things like joins and find-and-replace, but in addition you can write custom processors in Python for things your specific data needs often. These pipelines are defined in a pipeline-spec YAML file, which includes the specific processors that were run on your data and any execution parameters, and having this information written down really helps with reproducibility. DPP produces a single data package as its output.

Okay, finally, we have all of this software on our website. This is just a screenshot, and I encourage you to go look at it. Python is the main language we write our software libraries in, but we also have JavaScript, Ruby, R, et cetera; we support a lot of languages.

All right, so now I'm going to go back to our use case and show you how we're using Data Package Pipelines. I like to think of Frictionless Data as coming in and trying to help make this research data really useful, so it lives up to its full potential. So how does Data Package Pipelines help the BCO-DMO users? First of all, it gives the data managers a more immersive experience: part of this pilot collaboration has included building a new UI for them. This has reduced data set processing time. It has removed the barrier of programmatic ability for these data managers, and it avoids having to hand-write things like a pipeline-spec file or Python scripts, which reduces errors and is faster. You can also add custom metadata to the pipeline. The BCO-DMO users have really rich metadata, and when we were working with their data managers, they wanted to make sure we were able to capture all of this metadata and keep track of it. And importantly, you can add capabilities that weren't already in the base Data Package Pipelines by adding custom processors.

Now I'm going to show you some of these custom processors that were added, and also an example of what the BCO-DMO pipeline looks like. Each one of these arrows is a different processor step; there are steps like loading the data, and then we're going to look at an example of a find-and-replace. Here you can see in the notes that this step is fixing an inconsistent time format: some times didn't have seconds. To do this, we've highlighted the field, which is time, and then entered the find pattern and the replace pattern. This particular piece of the processing pipeline is now shown in this pipeline-spec YAML file, and what you can see is that it's human readable, so that the next person who uses this pipeline knows exactly what happened to the data and how to reproduce it.

Here's another example, where we're changing the date format using Data Package Pipelines. Here we have this date column; again, it's written how Americans write dates. Not super useful. The output from running this processor step is two columns: one with the date in a nice ISO standard format, and another column preserving the date the way the researcher originally entered it, in case that connoted something important to the researcher, but in a more standardized form. The output of this Data Package Pipelines process is the pipeline-spec YAML file, the raw data, and the metadata, all captured together so you can repeat the pipeline, and then you could, say, host this data and metadata on the BCO-DMO site.

So, to sum up this collaboration so far: we're aiming to take BCO-DMO's messy data, run it through the pipeline, and get out the pipeline-spec YAML file and the data package, with the metadata and the raw data, which can then be used by other researchers or other data managers further down the road.
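To give a feel for what that pipeline-spec YAML file looks like, here is a hedged sketch of a DPP spec with a find-and-replace step like the time fix I just described. The pipeline name, resource name, and regex patterns are invented, not BCO-DMO's actual pipeline:

```yaml
# pipeline-spec.yaml (illustrative sketch, not the real BCO-DMO pipeline)
fix-time-format:
  pipeline:
    - run: load_resource
      parameters:
        url: datapackage.json
        resource: ocean-measurements   # hypothetical resource name
    - run: find_replace
      parameters:
        resources: ocean-measurements
        fields:
          - name: time
            patterns:
              - find: '^(\d{2}:\d{2})$'   # times missing seconds
                replace: '\g<1>:00'       # append :00
    - run: dump.to_path
      parameters:
        out-path: output
```

Every step that touched the data is written down, in order, which is what makes the pipeline reproducible by the next person.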
Okay, we just ended phase one of this collaboration and are going into phase two. Our next steps are the release of the open source community version of this pipeline; it's not quite done, so it's not available yet, but it will allow the public to rerun these pipelines or build upon them. We're also adding in validation with the GoodTables library, so we'll be able to check that the data remains valid throughout this process. And I can't say enough good things about the work BCO-DMO is doing; it's super interesting research. Here are links where you can find out more about them, and I encourage you to check it out if you're interested in oceanography in any way.

All right, now I'm getting into the slides that I'm not going to talk about, but that are up here so you can play with them later. If you're interested, a good place to start looking into Frictionless Data is our field guide. And then we have these links up here: links to play with our browser tools and some toy data, or you can run your own data. And GoodTables, this is the one that will validate. We also have continuous validation; this is GoodTables integrated into GitHub, so every time you push your data, it validates it automatically. And this is an example showing all of the errors that were found the last time someone pushed data.

All right, so with that, I'm going to end. Again, here's the link to the slides, and I want to end by asking you to join our community. If you use these tools, please let me know; we're always looking for feedback and new use cases. Here's our GitHub repository; a link to our Discuss forum, where you can ask questions about open science, but also open anything; and then our Gitter chat, where if you have a technical question you can post it and we, or our community, will answer. We also have a YouTube channel with a lot more tutorials, and then our Frictionless Data field guide. And with that, I will take questions.

Did you say you work in microscopy? Okay, the question was: he works in microscopy, and often the file software is proprietary, so it's very difficult to get the metadata. Is that correct? He was asking if I have experience trying to force companies to give up metadata. I do not; that's a great question.

Yeah, so the next question was: do I care about the metadata of the metadata, for example ontologies? We are purposefully general, so that's a tiny bit too specific. I think the last talk was saying they don't care how you document your metadata, and we also kind of feel that way. I personally care about ontologies, but we do not have any standards that we really hold on to.

One way to answer this, in my opinion, because I've used a little bit of this technology, is that the distinction between metadata and data is a blurry one: for instance, an ontology can be added into your data basically as a resource, but you can also document it in your metadata package, in a way. So my question, following on this idea, is in the pipeline context. You showed an example of cleaning data, and I have two questions about that. First, in your experience, when do you choose between putting the correction into the pipeline and correcting the raw data? And second, have you experienced using the pipeline not to correct, but to aggregate, to create new secondary data from the raw data?

Yeah, great questions. Okay, I'm going to try and repeat them. First of all, a good comment that you can document things like using a specific ontology in the metadata. And the first question was, sorry, I've lost it now. When do you decide to put the metadata into the data itself? Oh, yeah, absolutely: when do you decide to put metadata into the raw data? I think that's part of data management and teaching researchers best practices, and I think it's also personal. It depends on the lab; it depends on the experiment. I also don't think it matters a whole lot; I think it's just important that it's documented at all. And the second question was: have we used the pipeline to aggregate data? Was that it? I have not used it that way, and I don't know of a specific use case, but it's possible, and it could be that someone's done it and I just don't know about it.

You mentioned going from a tabular spreadsheet to a machine-readable JSON file.
Is that based on a standard? It is, and I'm going to redirect you to our website to look at all of our standards. We call them specifications because they're meant to be more flexible than a formal standard. But yeah, I'm going to look over here and see; I haven't looked. Okay, yes?

There was a talk about OpenRefine, and this seems like a similar approach, so what is the relationship between the two projects? That's a great question. I think OpenRefine had integrated data packages, but then one of our software libraries they were using had a license that is not recognized by the OSI, because it has a statement that it must be used for good. So OpenRefine had to drop it. We're working right now to rewrite that library with an OSI-compliant license so that hopefully we can get that functionality back. Yes?

What I find is that they don't know enough, or they are not comfortable using even very basic things, like, I don't know, the different environments in Python. So when they say, okay, I put my code on a new machine and have to reinstall all the dependencies, that's a struggle, because they don't remember what they needed, so they just run the code until they hit an error: "oh, now I need this", and so on. I find myself explaining these very basic things, and I wanted to know if somebody has already done this, and if I can find some slides to reuse.

Oh, sorry, I didn't repeat the question: it was about teaching specific programmatic environments. The fellows are below that level programmatically; usually they know some Python, but we aren't teaching them about specific programming environments. Resources like that do exist, though. I wonder if Emmy knows the answer to that; she's talking later, so you can ask her.

We have to stop, sorry, that's time. Thank you. Thank you. Thank you.