Hi, my name is Sanya and I'm going to be talking about data ancestry, data lineage, and how we can reimagine tracking our data. My slides are going to be available after the talk as well as on Sonodo, and I'll be tweeting all the links. For those of you that don't know me, I am now a developer advocate at Microsoft, but I still consider myself a research software engineer at heart, because that's my background and something I deeply care about. In that same line, I'm still an open source nerd and advocate, and I'm always trying to convert people into open source advocates as well. I want to start with a bit of background on why I'm talking about these things. For a few years now there have been so many talks and so many discussions about the reproducibility crisis, so I'm not going to give you another talk about why we are facing this problem, but about what has changed in the last few years and decades. We now have much more access to data and data sources than we've ever had before, and this has dramatically changed the way we do research, the way we do science, and the way we do pretty much any data-intensive activity. With increased data knowledge and increased data access, we have also had to generate better collection tools and better approaches to make use of all of this. Something else that has driven a change in research and scientific computing is access to better and cheaper compute. We can access many, many resources, either on-prem or in the cloud or anywhere, for a fraction of the cost of five or ten years ago. And pretty much any area of research, call it digital social sciences, humanities, linguistics, physics or engineering, is now very, very dependent on compute, scientific computing, open source tools and all of these data-intensive things.
So I think it is this actual change in access to tools, and this actual change in how we collect and manage our research data, that has shed light not on a reproducibility crisis but on a chronic problem that we've been facing for a very, very long time. It's not something that just appeared ten or fifteen years ago; it's something that we've been carrying around because of the incentive structures we've been put under in academia and research institutions. But it's not all bad: having this awareness and this increased knowledge of the reproducibility crisis, or chronic problem, has actually led to some very good outcomes. It has brought all of the communities together to actually fight and fix this problem. Some disciplines, like psychology, have had a much worse problem than others, like engineering, and are trying to fight the reproducibility crisis while also building trust with their communities, their funders and the researchers themselves. Actually being able to identify where we're lacking better institutional support and better tools has made us a better, stronger community and helped identify researchers that actually want to see a change. And just to make another point, it's not only a matter of raising awareness and bringing us together; this has also led new institutions, new organizations and funders to try to change the way, and the culture in which, research is done. I just saw this from Lorena Barba yesterday and I found it so good that I just had to add it to these slides. This is very, very new, and it is the result of all of us pushing for change, speaking out, raising our voices. And finally, digital assets are starting to become second-class citizens, going from third class to second class. It's not always that bad; it takes time, because it has taken us decades to get to this point in research, so it's going to take a lot of time to redirect it.
Also in the last few years we've seen the emergence of movements and the recognition of career paths that have been there in academia for so long. The problem is that they've not been named as such; they've not been acknowledged. For a very long time we've had PhD students and postdocs building research software, software that actually catapults research into better things. The problem is that they were never acknowledged as such; they never had a name. They were hidden under hundreds of different acronyms and different job posts, so you never had a community brought together. And this is starting to change. We've also seen a change in the publishing paradigm. Yesterday we saw the demonstration of the eLife reproducible article, and this is all community driven. This is, again, because all of the researchers and funders and open source communities are calling for a change. But this change doesn't come easy. For us to trust research we have to nail the basics, and we have to take care of our code and software, our computational environments and our data. When it comes to code we've done a very good job. We not only have research software engineers, but we've built a lot of communities of practice; a very clear example is the Carpentries. They've not only made scientific computing and data analysis more approachable to researchers and scientists, but they've also brought people together and actually formed support groups of like-minded individuals. Organizations like the Software Sustainability Institute are driving grassroots efforts to actually recognize scientific software and push the boundaries of what funding agencies are recognizing in these areas. And across most disciplines we have a better understanding of what best practices are, and we're actually applying peer review processes.
We now have things like the Journal of Open Source Software, which is not only looking at the science and implementation of the software but actually making sure that the software meets certain standards: that it is correctly licensed, supported and maintained. When it comes to infrastructure, we're doing OK. I'm going to say we have enough institutional buy-in, because more and more institutions are putting big grants into setting up HPC or on-prem infrastructure that will support all of this scientific compute. It also means that we have PI-level support, because they have to put that in the grants so that they can have access to these compute resources. On the open source side we have amazing tools like Binder and Jupyter that make everything more accessible and make it easier for all of the researchers to share their stories, to become not just passive storytellers but active storytellers in their areas. There is just one problem. Again, the way that we're learning how to program, the way that younger generations are learning to program, is changing, so we're going to need to adapt to change. Many, many big HPC clusters are coming to the end of their lifetime, so they're going to need renewal, additional funding, additional grants. And we need to rethink what the new paradigm and the new environment of scientific computing looks like. Is it going to be the same in ten years as it is now? It doesn't look very likely. So we need to start these conversations. And when it comes to data, some things are mostly overlooked because we just don't have that education. We do have some really good data management and archiving plans. The librarians at institutions are probably the best people, and the most knowledgeable, when it comes to data management, archiving and how to tag your data properly. And more and more funding agencies are requesting that your datasets are open as and when possible.
But this is not enough. Although we have to attach metadata when we archive our data, and there are a lot of standards, there is no clear minimum requirement. And data without metadata is just as bad as no data at all, because your metadata should give you enough of an overview of your data to answer the questions of who collected the data, who's maintaining it, who's looking after it, and who can be contacted if there are any problems with it. What does this data represent? When was this data collected? If it represents a period of time, what is this period? Why was it collected? Why is it archived? And why is it in the format that it is? You always have to provide this information so that the data is actually usable. If you don't have all of this information, then basically you don't have usable data. Also, data has family. And I like this picture because it's a beautiful picture of the Addams Family, because everyone and everything has a history, even your data. We sometimes talk about how dirty our data is, but we never want to talk about the history of our data. And here's where the concept of data lineage comes in. It refers to the data's origin, what is happening to it across all the pipelines, and where it is moving over time. By having data lineage we are not only giving higher visibility to the data and the processes that we're applying to it, but also gaining the ability to track back errors and solve them in a more organic way. So if you have, for example, a database, and then you have something like this, three smaller packages of your data, that is data lineage: you can trace back from any of these data packages what the original source was, where it came from, and what transformations you applied. So we need to have a better overview of the data path, and this means every single process and every single transformation of your data has its own very basic metadata.
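The who/what/when/why checklist above can be sketched as a minimal metadata record. This is a hypothetical illustration, not a real standard; the field names are my own, chosen to mirror the questions in the talk.

```python
from dataclasses import dataclass, fields


# A hypothetical minimal metadata record covering the questions above:
# who collected and maintains the data, what it represents, when it was
# collected, why it exists, and why it is in the format that it is.
@dataclass
class DatasetMetadata:
    title: str
    collected_by: str     # who collected the data
    maintained_by: str    # who looks after it and can be contacted
    description: str      # what the data represents
    collected_on: str     # when it was collected (ISO date)
    covers_period: str    # the period of time the data represents, if any
    purpose: str          # why it was collected and archived
    file_format: str      # why it is in the format that it is

    def is_usable(self) -> bool:
        """Data with incomplete metadata is barely usable at all."""
        return all(getattr(self, f.name) for f in fields(self))
```

A record with any of these answers missing would fail `is_usable()`, which is the point: if you can't say who maintains the data or what period it covers, you don't really have usable data.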
So if you think about water and water processing plants, every process that is applied to the water to go from very, very dirty to filtered, usable, consumable water is similar to having this overview. Every process is explicit, so you can look at the water coming out of the first filtering, the second filtering or collection and know exactly where it has been before. This is where we should aim for our data to always be. From its very raw, rough state, we should know when it was collected, what is in there, and what we're expecting from it. And as it moves through our research pipelines, through processes, transformations and cleaning, until we obtain our fabulous research and dissemination article, we should always be able to track where it has been. And this means establishing a contract with the data at every single step. At every point we should know what we're expecting of the data: what kind of data structures it contains, what sort of information, and then we should be able to identify those data structures that don't conform to what we're expecting. And I'm sure that people are now wondering, so what is data lineage and how is it related to data provenance? The truth is, they overlap; one is the basis for the other. For us to be able to have provenance in our research pipelines and have a fully reproducible scientific output, we need data lineage. Provenance basically sits on top of it. Once we have all of our data identified, with its metadata and its processes, we can then add the inputs and outputs, and that will give us a truly reproducible pipeline. And I'm sure everybody has seen this picture, and this is one of the greatest examples of aggregating multiple, multiple datasets from very different origins and sources to finally get a final output. This can only be done if you have a good enough record of where your data has been and of how you're aggregating it and putting it together.
At this point I was expecting to have the pre-print out on what I'm working on, a real-life use case. I'm working on a joint project with Imperial College London, Microsoft and the NHS. For those that are not in the UK, the NHS is about the equivalent of the NIH; it's our public health service. The aim of this is building a data lake with multi-trust data sources that is fully indexable and that takes a first-class approach to provenance and data lineage. And the most important thing is having a full integration of a schema registry alongside the data. By having a fully indexable dataset and an integrated schema registry, we can then understand how we are actually establishing contracts with our data. The pre-print is soon to come, and I'm surely going to put it on Twitter. And why did we even bother putting this together? Why did we bother building this data lake, indexing it all, cleaning the data and looking at provenance? Because at the end, over 20 projects are going to be using these datasets, and these data lakes contain data for multiple purposes, from early sepsis diagnosis to better utilization of hospital resources, and that's not an easy task. Most of the people that are going to be using the data are not research software engineers; they're not even data scientists, or people that are highly technically literate. They are people that work in hospitals, they are nurses, they are decision makers in the NHS that are going to drive how we're going to allocate resources and A&E personnel. So it is worth it. It is worth taking all of this trouble to track where our data is coming from, so that the end user can actually rely on all of this data. So to finish off: contracts are good. Having a contract with our data is good. It's the only way in which we can actually trust it and be certain that our research is truly reproducible, or as reproducible as it gets.
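To make the "schema registry alongside the data" idea concrete, here is a toy in-memory sketch, not the project's actual implementation: each dataset in the lake is indexed by name and schema version, so any consumer can recover the exact contract a dataset was written under, even as schemas evolve. The dataset and column names are invented for illustration.

```python
# A toy in-memory schema registry: schemas are versioned per dataset, so
# every dataset in the lake stays indexable and its contract recoverable.
class SchemaRegistry:
    def __init__(self):
        self._schemas = {}  # (dataset_name, version) -> schema dict

    def register(self, dataset: str, schema: dict) -> int:
        """Store a new schema version for a dataset; return its version number."""
        version = self.latest_version(dataset) + 1
        self._schemas[(dataset, version)] = schema
        return version

    def latest_version(self, dataset: str) -> int:
        """Return the highest registered version, or 0 if none exists."""
        versions = [v for (name, v) in self._schemas if name == dataset]
        return max(versions, default=0)

    def lookup(self, dataset: str, version: int = 0) -> dict:
        """Fetch a specific schema version, defaulting to the latest."""
        return self._schemas[(dataset, version or self.latest_version(dataset))]
```

In a real deployment this role is usually played by a dedicated service rather than a dictionary, but the shape of the interface, register, look up by version, default to latest, is the part that lets 20-plus downstream projects establish contracts with shared data.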
So as much as I don't like contracts, and I'm sure you don't either, this one is not so bad. Thank you.