Thanks, Lily. Yes, the camera, that's right. Welcome, CSVConf, I'm very excited to be here. We had some great conversations during the Lightning Talks yesterday, and I'd love to continue that here today. I'm going to be talking about frictionless data processing in the wild. To better understand where we encounter wild data, I first need to explain a little bit about who we are. We are BCO-DMO, which stands for the Biological and Chemical Oceanography Data Management Office. We're part of the Woods Hole Oceanographic Institution, but we're primarily funded by the National Science Foundation. Scientists, primarily NSF-funded scientists, submit data to us, and we provide data access to the public.

And here is where we first encounter the wild data: when it moves from its native habitat, from the scientist, to us. The keen-eyed among you may already notice some things about these data, such as merged cells, date ranges in a single cell, multiple data tables that should be combined into one table, and special proprietary codes that aren't necessarily clear. So we work really closely with the scientists who collect and submit the data to us to get it into a format that is conducive to reuse: interoperable formats. There are a lot of reasons we receive the data the way we do. We ask that it all be fully processed and quality assessed and controlled. That doesn't always happen, or at least not to the extent that we need, so again, that's a really close conversation, working with the scientists to get it into good shape. Scientists are really focused on the collection and analysis of their data and on publishing, but they don't always think about everything that needs to happen to it to make it reusable. So that's where we help them with those pain points. We produce interoperable formats, and we present the metadata in a way that's easily understood on a dataset landing page, along with the data access. We also provide some other formats of the metadata, like ISO, JSON, and XML.

So why do we process data? It's to be FAIR. These are principles that we think are very important for making sure the data is findable, accessible, interoperable, and reusable. I'm introducing this slide early because I'm going to touch on all of these principles during my presentation; they're really the driving factors behind why we do a lot of what we do. So when we're processing, what do we need to do to be FAIR? A lot of it comes down to making sure the data have certain things in them, such as spatial and temporal context. When we get the data, we make sure it can correctly be placed in space and time and has the context of where it fits on a global scale. A lot of what we do during data processing is also correcting quality issues such as inconsistent formatting, like various date-time formats within a column, corrupted characters, data gaps, invalid species names, and sometimes just typos. And again, there's a big emphasis on reformatting for usability, and on making sure the data isn't in a proprietary format, so anyone can access and use it.

So back to the frictionless part. What is Frictionless Data? It's an initiative out of the Open Knowledge Foundation. Open Knowledge people, raise your hands. There are a few of them in the crowd, so yes, a round of applause. If you have specific, targeted questions, you should talk to them. Frictionless Data is a collection of software and specifications for the publication, transport, and consumption of data.
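To give a sense of what those specifications look like, here is a rough sketch of a Frictionless datapackage.json descriptor for a tabular dataset. The dataset name, file path, and fields are made up for illustration; this is not an actual BCO-DMO submission:

```json
{
  "name": "example-ctd-dataset",
  "resources": [
    {
      "name": "ctd",
      "path": "data/ctd.csv",
      "format": "csv",
      "schema": {
        "fields": [
          {"name": "date_time_utc", "type": "datetime"},
          {"name": "latitude", "type": "number"},
          {"name": "longitude", "type": "number"},
          {"name": "depth", "type": "number", "description": "depth in meters"}
        ]
      }
    }
  ]
}
```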
There's a big emphasis on being platform-agnostic for interoperability, and also a big emphasis on reducing the time to research, the time from data to insight. There are a lot of different components to Frictionless Data; I'm just going to be talking about one specific component that has to do with data processing, called data package pipelines.

Why were we so attracted to data package pipelines? For a few reasons. One is that it allowed us to standardize data processing steps. When we do operations like joins, modifying the values in a column, adding or removing columns, any number of things, it's a way to track those changes in a more structured way. And the provenance is actually generated automatically as we build these pipelines. On the right you see an example of the pipeline-spec.yaml, which is the actual encoding of the processing pipeline: the individual processing steps within a processing workflow. Again, this is extremely attractive to us because it also makes reproducibility very easy. You can run the pipelines as often as you want, they include the provenance information needed to fully reproduce the output from the input, and you can build on pipelines and modify them as you need to. One other key thing was that data package pipelines gave us the flexibility to add custom processors for specific tasks we needed in our data management office, and we did that by writing custom processors in Python.

So we made a web application that is a user interface on top of the pipelines. It's a React app that communicates with a server running a Python Flask app, which actually executes the pipelines that are built and returns the data. On the left is an example of a workflow, a pipeline with its individual processing steps; called out is just one of the steps, with the configuration shown for find-and-replace. At the bottom you see how that step actually looks in the pipeline specification YAML file.

So why did we feel the need to make this user interface, write our own custom processors, and build upon data package pipelines? One reason is that we wanted to give our data managers a more immersive experience. When you're constructing a pipeline, it's not done in isolation: you're able to visualize the pipeline as you're constructing it and see how the changes you're making affect the final output. We also wanted to calculate statistics, which is very important for us for looking at different quality control metrics, to make sure we don't mess anything up while we're processing the data. A key reason for making the user interface was to reduce the time it took to process datasets. We wanted to avoid hand-writing the pipeline specs and running them on the command line, or writing custom Python scripts around the pipelines. Because we've written custom processors, we're able to integrate them more easily into provenance capture in the steps, and it reduces dataset processing time because we don't have to hand-write them every time. It also reduces time lost to syntax errors, little mistakes that can happen to anyone but that a user interface helps you avoid, and it cuts down on repetitive tasks.
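To make that pipeline-spec.yaml a little more concrete, here is a rough sketch of what such a spec could look like using a couple of the standard data package pipelines processors. The pipeline name, paths, field names, and parameter values are illustrative only, not one of our actual pipelines:

```yaml
example-ctd-pipeline:
  pipeline:
    # Load the raw submitted table as a resource named "ctd"
    - run: load
      parameters:
        from: data/raw_ctd.csv
        name: ctd
    # Clean a sentinel value out of the depth column
    - run: find_replace
      parameters:
        resources: ctd
        fields:
          - name: depth
            patterns:
              - find: "-999"
                replace: ""
    # Declare proper types for numeric columns
    - run: set_types
      parameters:
        resources: ctd
        types:
          depth:
            type: number
          temperature:
            type: number
    # Write out the processed datapackage.json and CSV
    - run: dump_to_path
      parameters:
        out-path: output
```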
Another key reason we're building this tool is that we wanted to improve our provenance capture by adding custom metadata; you'll see later on that we have a way of capturing both statistics and specific processing notes along with each step. We also wanted to remove a barrier, programming ability: we wanted to open up data package pipeline creation to users who weren't necessarily Python whizzes or who didn't have as much facility with the command line. And the last key reason we wanted to build on data package pipelines was, as I mentioned, to include custom processors. In this pipeline here on the left there are many steps. Some of them, like find-and-replace, are right out of the box data package pipelines processors, but in our tool we also added a lot of custom processors to do specific things, such as converting date-time formats, dealing with the time zone issues that come up a lot of the time, converting to decimal degrees, and other things we do commonly. So we have our custom processors in our pocket for reuse.

This is an example of building a pipeline. Here I'm starting with just a load step, and I'm looking at a data table. This dataset is a good example of the level of communication we have with the scientists and how that helps our final end product. The dataset came with a date and a time, but the time zone was a question. We found out that it is local time, and what the UTC offset was, so at the end of this pipeline we were also able to add a date-time in ISO 8601, in UTC. Here I'm adding a second step called round fields; you can see the difference. Again, the visualization just helps us ensure that we're doing the right thing and everything is working properly. This is an example of how we configure that step, and of the time-saving benefits of a user interface in general, where you can select multiple fields and apply rounding to all of them at once, those kinds of things.

Then I have a lot of different steps here that we had to do to get this data into shape. But at the end of the pipeline, we have our date-time stamp in UTC. We preserve the local date and time in this case because that's how the scientists like to analyze their data, and again, we want to reduce the time to research so they don't have to back-calculate their local time. But we still want to be able to place the data in global space and time for reusability, so we compromise by having both in this case. You'll also see there's lat and lon here now; that was in the metadata for this submission, it wasn't actually in the data, but for reuse we added it as columns.

This is another example of configuring a pipeline step, in this case to set data types. The user selects a resource, quote-unquote, which is a tabular dataset, and the set types processor infers the types. The user then has the ability to either override a type or supply formats as needed. Again, on the bottom left you'll see how this is actually encoded in the pipeline specification. That's a huge time saver when you don't have to write it out by hand and can just click some buttons to override the inferred types. So, back to the full pipeline at the end, when it's run. You can actually run it at any individual step to see the state at that step.
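As a rough illustration of the custom-processor idea, here is a minimal sketch of what a processor that adds an ISO 8601 UTC column from a local time column could look like, using the datapackage-pipelines ingest/spew wrapper. The parameter names, field names, and conversion logic are hypothetical, not our production code:

```python
# convert_to_utc.py -- hypothetical custom processor sketch, not BCO-DMO's actual code
from datetime import datetime, timedelta, timezone

from datapackage_pipelines.wrapper import ingest, spew

# ingest() hands the processor its parameters, the datapackage descriptor,
# and an iterator over the resources' rows
parameters, datapackage, resource_iterator = ingest()

source_field = parameters.get('source_field', 'date_time_local')
target_field = parameters.get('target_field', 'date_time_utc')
offset_hours = parameters.get('utc_offset_hours', 0)

# Advertise the new UTC column in every resource's schema
for resource in datapackage.get('resources', []):
    resource.setdefault('schema', {}).setdefault('fields', []).append(
        {'name': target_field, 'type': 'string'}
    )

def add_utc(rows):
    local_tz = timezone(timedelta(hours=offset_hours))
    for row in rows:
        local = row.get(source_field)
        if isinstance(local, str):
            local = datetime.fromisoformat(local)
        # Attach the known local offset, convert to UTC, and serialize as ISO 8601
        row[target_field] = (
            local.replace(tzinfo=local_tz).astimezone(timezone.utc).isoformat()
            if local is not None
            else None
        )
        yield row

# spew() writes the (possibly modified) datapackage and rows to the next step
spew(datapackage, (add_utc(rows) for rows in resource_iterator))
```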
At the end, you get the ability to download the output, three critical output files: the pipeline-spec.yaml, which as I've mentioned is the encoding of the pipeline; the datapackage.json, which is the full description of the data itself, including all the types we've specified, the source, and all the provenance information; and finally the CSV file, in this case CTD data, which stands for conductivity, temperature, and depth, which is the result of the whole pipeline. So there's the YAML, the description of the data in datapackage.json, and the actual data file itself.

Going back to FAIR, which is our driver for all of this: the pipelines are really helping us with all the different components of FAIR, findability, accessibility, interoperability, and reusability. But I specifically wanted to highlight how the pipelines help us with reproducibility. As I mentioned, you can rerun the pipelines, and all the provenance is tracked with that. For interoperability, I'd highlight the datapackage.json, which provides the description of the data. And a key component that's going to help us with findability and interoperability is our statistics calculations. We have a package integrated into our web application that allows us to calculate basic statistics, plus some other specific things like lists of unique values, and actually add that into the datapackage.json. We hope to be able to use that for data discoverability, capturing date and time ranges, spatial bounding boxes, and things like that.

On the left is a screenshot of our current dataset landing page. As I mentioned, we already produce metadata in various consumable formats, but we plan to integrate these frictionless tools and outputs to better support the FAIR principles, specifically reproducibility and interoperability.

So where are we going with all this? We plan to release an open source community version of our web application; the user interface we have now is a prototype that we're still developing. We also plan to release our custom processors and statistics calculator. We want to integrate the pipelines with the data, so users can access them from our dataset landing pages and either run the pipelines on their own or build upon existing pipelines we've created. Another component of the frictionless universe that is very attractive to us, and that we plan to explore further, is validation of the data and QA/QC using a component of frictionless called goodtables. We have been working very closely with the Open Knowledge people, providing feedback on how the tools are working for us and on areas we think could be enhanced to better serve the needs of science. We've had good communication on GitHub and on conference calls, and it's good to see people's faces in person after talking with them digitally. We've actually already started submitting pull requests back to data package pipelines that have been merged, so that's a success for us. And it's all to better support good science; that's our primary objective. Of course, I'm going to end with a message: for scientists, making their data reusable isn't always the first thing they think of, but hopefully we can educate scientists through this process so that the next time they go collect data, they'll think about it. Thank you.
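For reference, here is a minimal sketch of what validating a pipeline's output with goodtables could look like in Python. The file path is hypothetical, and this is just an illustration of the kind of QA/QC check described above, not our actual setup:

```python
# Sketch: validate every tabular resource described by an output datapackage.json
from goodtables import validate

report = validate('output/datapackage.json', preset='datapackage')

if report['valid']:
    print('All tables passed validation')
else:
    # Print each table-level error found by goodtables
    for table in report['tables']:
        for error in table['errors']:
            print(error['code'], error.get('message', ''))
```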