My name is Rok Roškar. Thanks very much for having me. I run the development of the Renku platform at the Swiss Data Science Center, and I'm based at ETH Zurich; we have a second office in Lausanne where half of our team works. I'll give you an introduction to Renku and what we're trying to do with this platform. A lot of what I'll talk about is not actually what you'll be using in the course, but given that you're all scientists, I'm hoping you'll find it interesting anyway, and maybe I can recruit some of you to be long-term Renku users. Let's see how that goes.

To motivate where we're coming from: some time ago I received an email. I used to do research in astrophysics, and a grad student wrote to me about a paper I'd written roughly ten years earlier. She was trying to reproduce a plot from that paper, working from the code I'd made publicly available on Bitbucket, code that I thought was enough to make my work reproducible by others. She was trying to decipher what was happening in it and to repeat one of my plots, and of course she couldn't get anywhere, because nothing was well documented, there was no sample data, and so on. I was very embarrassed: as the lead of a team developing a platform for reproducibility, this was an awkward email to receive and not have a proper answer for. So Renku is primarily about enabling what we refer to as practical reproducibility, and I'll tell you in a second what I mean by practical.
You've probably all heard the term "reproducibility crisis" in science. It has various sources, but the upside is that because people are now acutely aware of the issue, there has recently been what some call a "credibility revolution": a lot of people are working on making reproducibility easier and more accessible, and on raising awareness of the problems. There are many high-level reasons why you might care about reproducibility, but the most immediate one is that somebody will probably eventually ask you how to reproduce something from a paper you've written, and that can turn out badly even with the best intentions.

Here are some snippets from the code that this student was referring to. It's all nicely organized, but you can see there are very few comments and very little description of what anything means. Let's look at what's missing. The code is versioned and well structured, which is great, but it doesn't tell you what kind of data you need to run it, so somebody trying to reproduce or reuse this code has no idea what sort of data they should feed through it.
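Even without platform support, a script can record part of this missing context itself, for example the exact library versions it ran with. Here's a minimal sketch in Python; the `snapshot_environment` helper and its output format are my own invention for illustration, not something from Renku:

```python
import json
import platform
import sys
from importlib.metadata import distributions

def snapshot_environment(path="environment.json"):
    """Write the interpreter version, OS, and every installed package
    version next to the results, so a future reader knows what ran."""
    env = {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": sorted(
            f"{d.metadata['Name']}=={d.version}" for d in distributions()
        ),
    }
    with open(path, "w") as f:
        json.dump(env, f, indent=2)
    return env

snapshot_environment()
```

Ten years later, a file like this would at least tell a reader which library versions to install before attempting a rerun.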
There's no information about the runtime environment. A bunch of libraries are imported here, but this is from ten years ago, so you can imagine the versions are all very different now, and there's no indication of what they were; if you try to run this code today, there will probably be some hiccups. Most importantly, there's no indication of how code and data should be combined to produce the results: there's no workflow description. The takeaway is that sharing data and sharing code is not enough. You need a very clear description of how the two link together to actually reproduce results. What we're trying to get away from, in the end, is having our disciplines look like we're just stirring pots of black boxes and coming up with results we don't really understand. We want to make sure there's a head and a tail to what we're doing, and, especially in domains like machine learning, that there's some hope of understanding what's happening under the hood.

With Renku we're motivated by five simple questions in data-driven research. First, we want to be our own best future collaborators, so we want to be able to answer the question "how did I compute this result?" A few months from now I may not remember exactly what parameters I used or what functions I ran, so this matters even just for yourself. Second, once that's under control, you want to understand how new data might change a result: you want to be able to feed new data through and see how the downstream results change. Third, you probably work with other people, so you want to know how they computed their results and whether you can use their data and code to reproduce them; in many cases you actually also want to run things on their infrastructure, because the data is big or the computational complexity is high and you need something special. Fourth, if you work in a more theoretical capacity, you may be interested in algorithms: you want to know whether a particular algorithm has been applied to a particular dataset, how, whether it worked, and what people had to do to make it work. Finally, I think attribution for the work you do is extremely important for all academics. If you follow open-science principles, publish your data on Zenodo, and make your algorithms available, you want to know where these research artifacts of yours end up and in what context people are using them. Ideally you want to be able to point this out to whoever is interested, presumably a funding agency, and say: look, I've shared my data and it has resulted in fifty publications by people I've never actually talked to, so it's clearly a valuable product of my research. In the current mode of sharing on various platforms, this is probably the hardest question to answer, because the link between the producer and the consumer of code or data only goes one way. If I publish data on Zenodo and code on GitHub, the people using them know where they come from, because they have a link to my repository, but I never know where it
goes until they tell me.

These five questions boil down to just a few words for us, and these are the driving principles behind what we're trying to build with Renku: a focus on reproducibility, on the reusability of research artifacts, and on collaboration.

So what is Renku, really? It wears many hats. It's a complex piece of machinery, but we're trying to make it as easy as possible to use by integrating existing technologies so that the answers to these questions become more approachable. First of all, everything is based on git; you'll see very quickly that every project in Renku is actually a git repository, so everything is versioned: your code, your data, and, if you end up using workflows (which you won't be doing in this course, but perhaps in the future), those are versioned as well. We use existing tooling for interactive sessions: everything is based on Jupyter, and we optionally provide RStudio in our Docker images, which I think you'll be using in this course. Everything runs in containers, so you can more or less easily hand your runtime environment to someone else, and they have everything they need included to either repeat what you did or build on it. We also build on the idea of analysis workflows, which people working in bioinformatics are very familiar with, but rather than requiring people to write workflows, we try to create workflows for them on the fly as they work; I'll say a bit more about that in a minute. The defining feature of Renku is that we try to capture a lot of metadata about what people are doing and record it all in a knowledge graph, which to first order allows you to replay your analysis and see how you arrived at a particular result, but which in the end is intended to serve as the primary way of searching for and discovering data, workflows, and ultimately people on the platform.

I'll talk about the concept of a knowledge graph in a second, but before that I want to point out something that sometimes confuses people: there are basically two modes of working with Renku. One is RenkuLab, the online, in-browser web application, where you have project management, you can share your code and notebooks, collaborate with people, and launch interactive sessions. We also recently pushed a change that allows anybody to spawn interactive sessions from public projects: if your project is public on RenkuLab, it works a little like MyBinder, if you're familiar with that. You hand somebody a link, they click a button, and they can spin up your notebook sessions; they can't modify your project, of course, but they can run what you have there. The second mode is the command-line interface, which drives the knowledge graph and much of the metadata capture. The nice thing about the CLI is that it can run anywhere: it needs no server-side component, only a recent version of Python and installing the library. You can use it to manage data and automatically capture the provenance of the things you work on, and it doesn't need any hosted piece for use within a single project, so you can run it on your laptop without pushing your code anywhere and still use all of the reproducibility features for single projects.
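To make the idea of automatic provenance capture concrete, here is a toy sketch of what "recording a run" means. This is not how Renku implements it; `tracked_run` and its JSON log format are invented for illustration:

```python
import hashlib
import json
import subprocess
import time
from pathlib import Path

def sha256(path):
    """Content hash of a file, so we can tell exactly which version
    of an input produced which output."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def tracked_run(command, inputs, outputs, log="provenance.json"):
    """Run a shell command and append a record linking the exact
    input versions to the outputs they produced."""
    subprocess.run(command, shell=True, check=True)
    record = {
        "command": command,
        "timestamp": time.time(),
        "inputs": {p: sha256(p) for p in inputs},
        "outputs": {p: sha256(p) for p in outputs},
    }
    path = Path(log)
    history = json.loads(path.read_text()) if path.exists() else []
    history.append(record)
    path.write_text(json.dumps(history, indent=2))
    return record
```

With records like these chained together, "replaying" an analysis amounts to re-running exactly those commands whose recorded inputs have changed.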
As soon as you send your data to a server, we can start to establish cross-project links based on the data you're using; if you're sharing datasets, for example, this becomes visible in the online platform. You probably won't be using the Renku CLI in this course, but if you're interested later on, we have a tutorial that walks you through the functionality, and I encourage you to look at it. The point I want to drive home is that in both of these modes, every project is still a git repository. Even in the web app there is a fully fledged GitLab server behind each project, so if you eventually need to do something more complex, you can go there and use that functionality as well.

Now I want to walk you quickly through this idea of a knowledge graph and why we think it's important to maintain one. I'll use a scenario with two projects working in the same domain with different tools, plus a third project that comes at it from a different perspective. If you don't know what a knowledge graph is, don't worry, I'm about to explain. All the term means is that you express facts about every entity you care about. In our knowledge graph we would have a project, the project would contain, say, a dataset, and the dataset would contain a file; each of these links is an edge with a particular name, so we would say that the project "has a part" which is a dataset, and the dataset "has a part" which is a file. In a real knowledge graph all of these nodes would carry additional properties: a project would have a type, a creator, a date, and all kinds of other metadata. I left those out here because it would be too cluttered, but that's essentially what a knowledge graph looks like.

When we execute some code in Renku with the command-line interface, we automatically record additional facts. If our first researcher runs some code on the platform, we record that this file was used by a particular piece of code and produced a particular output, say some pre-processed data. So this researcher has already recorded a pipeline, along with metadata about what the pipeline consists of: there's a dataset inside with a DOI attached, and it contains some data. Now a second researcher comes along who is interested in the same dataset. They import it, so their project contains the same dataset, and when they look at the knowledge graph they see that there is already some use of this data on the platform. They look at what the first person has done and find the pre-processing pipeline established for the data they're interested in. Instead of repeating the pre-processing, they can simply use the pre-processed data, with full knowledge of how it was produced, because it's all recorded. If they don't like something in the pre-processing pipeline, they can modify it and re-run it themselves, or they can use the data as is; and if the original author changes something in the pipeline, the consumer of the data can see this and update their inputs accordingly. But they don't want to stop at pre-processing. They want to do something with the data, so they build an additional pipeline on top, where they train a model or what have you, and this produces their own results. The second researcher has saved a lot of time by not having to pre-process the data, but they also have full knowledge of the entire chain, which they can modify and re-run as they like. In addition, the original author in project one can see not only the pipeline they recorded themselves, but also that somebody has used their data, how, and exactly what results it led to. If they wondered whether their expensive pre-processed data was useful, they immediately have the answer: people in their discipline are indeed using it for additional science.

Because this is a knowledge graph, we can attach arbitrary metadata to these executions. We're now building an extension specifically for machine learning, where instead of only the abstract notions of "this is an output" and "this is code that ran", you have additional information: this is a particular type of model, it had a particular runtime, and so on. That allows somebody from an entirely different domain, say a person working specifically on machine learning, to come in and ask what people on the platform are doing with a given type of model. They discover a whole field that relies on these models for its results, and they realize that the models they work on and care about deeply have a much wider audience and application than they knew. So this connects them to a whole field they may not have known about before.
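The cartoon above can be sketched as actual triples. Everything here, the names, predicates, and files, is illustrative; Renku's real graph uses a much richer schema:

```python
# A toy knowledge graph: facts as (subject, predicate, object) triples.
triples = {
    ("project-1", "hasPart", "dataset-A"),
    ("dataset-A", "hasPart", "raw.csv"),
    ("raw.csv", "usedBy", "preprocess.py"),
    ("preprocess.py", "produced", "clean.csv"),
    ("project-2", "hasPart", "dataset-A"),      # second project imports the dataset
    ("clean.csv", "usedBy", "train.py"),        # second researcher builds on it
    ("train.py", "produced", "model.pkl"),
}

def objects(subject, predicate):
    """All objects linked from `subject` by `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def downstream(entity):
    """Everything derived from `entity`, found by walking usedBy and
    produced edges: how a producer sees where their data ends up."""
    found, frontier = set(), {entity}
    while frontier:
        nxt = set()
        for e in frontier:
            nxt |= objects(e, "usedBy") | objects(e, "produced")
        frontier = nxt - found
        found |= frontier
    return found

print(downstream("raw.csv"))
# the preprocessing step, the cleaned data, the training step, the model
```

The one-way-link problem from earlier disappears: the author of `raw.csv` can follow the edges forward all the way to `model.pkl` without anyone telling them.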
That's the idea of why we bother with a concept like a knowledge graph in a platform like Renku: we hope there's real potential for these bi-directional links to take shape. And once we have the knowledge graph, it becomes easy to answer the questions I described at the beginning. You can see who is using data and how. You can see which algorithms are used to do what. You can replay pipelines to regenerate results or respond to new data, and you can notify someone if data that was previously available no longer is. You know who to credit when you use a particular dataset or algorithm. And you can see how datasets can be transformed and used, which makes them vastly more usable: data repositories are great, but if all you have is data plus a README file, most of the time it's still very difficult for anybody to use; a dataset with an actual pipeline attached, showing how it can be used, is much more useful.

Our hope is to convince scientists that if you follow these best practices at a practical level, a level where the reproducibility concepts are actually useful day to day, you'll be more productive. You'll spend less time remembering how to do something or thinking about how to organize your deeply nested directory structures, and rely a bit more on concepts like the knowledge graph instead. We hope this brings better visibility for projects and results, and better impact, whether within an organization, a lab, or a community. And if you share your work publicly, it also helps to build trust, because you're laying it all out for everybody to see: you're not hiding behind the abstractions in the paper, you're making it available for anybody to scrutinize.

So how is Renku used today? We're not at this full picture yet; Renku is still a relatively young project, but we have a very good team in place and we hope to make big strides this year toward making the vision a reality. Today Renku is already being used in various ways by various people. We have individual researchers who, as I said at the beginning, track the progress of their work and communicate results; they don't use the interactive-session functionality much, but they use the command-line interface to track their work and make it more reproducible. We have lots of groups that work with the Swiss Data Science Center, where our data scientists use Renku to collaborate on projects; there, the live environments are the core aspect, because a data scientist will come up with a result, create a Jupyter notebook that illustrates it, and share it with the PI, and the PI can go in, rerun everything any way they want, look at what's been done, and get their hands a bit dirty. We have a few cases where Renku is used in more of a data-management capacity: people use it to make a dataset public and self-documenting, so they'll have a dataset in there with a description, a README file, and
then, in addition, a notebook or maybe even a pipeline attached to it. Similarly, some projects provide pre-processed datasets that others can work off of, which is the example I described in the cartoon knowledge-graph discussion a second ago. And an exciting development, which to be honest we didn't anticipate at the beginning, is that people are teaching courses with Renku, with all the batteries included. It's being used the way you'll see today: the instructors set up the environment, and sometimes these environments are complex or take a long time to set up, so a lot of work goes into creating them, but once the environment is there, all the students have to do is click a button and they're in a session with all the software and data available to them. There's very little overhead in getting the course off the ground.

A quick word on where we're going, because, as I said, Renku is still very much a work in progress. We are constantly improving the ease of use of the dataset functionality; I think this course actually makes use of the dataset interface already, so I was happy to see that getting used, but we're still making it easier, and in particular, easier reuse of datasets in other projects is being developed now. If you look at existing Renku projects that have pipelines recorded, you'll notice that the knowledge graph is not very nice to look at for the time being, so we're working on improving the knowledge-graph model to make its presentation more useful. And, maybe most interesting for this audience, we are in the process of completely rebuilding how workflows are handled so that we can serialize them to various formats. At the moment we rely almost exclusively on the Common Workflow Language, but our plan is to be able to translate not just to CWL but to other workflow languages as well, so that if your infrastructure requires a specific workflow system, you'll be able to take the abstract representation of what happened in your Renku project and put it into a workflow language you can run on that infrastructure. This is a really exciting possibility we're working on right now, and it will also let us run workflows in the cloud: if you recorded something on your laptop with small data, you'll be able to rerun the workflow on bigger data in the cloud or on your HPC cluster, from Renku itself.

We're also very much looking for input on what people want in a platform like Renku, so if you have questions or ideas, please reach out. We have the open public deployment that you'll be using for the course, and if you have general questions about how to do something, or you're not finding what you're looking for, we have a Discourse forum at renku.discourse.group; that's the place for open-ended questions. All of our code is publicly available on GitHub, with the main repository at SwissDataScienceCenter/renku, and if you find something that's not just a question but more of a bug report or a specific feature request, please go there and open an issue with us. And once you've used Renku for a while, we would really appreciate it if you could fill out our survey.
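As a sketch of what such a translation might look like, here is an abstract step record serialized into a CWL CommandLineTool document. The record's shape is hypothetical, not Renku's internal representation; the point is that one abstract description can target different workflow languages:

```python
import json

# A made-up abstract record of one captured step.
step = {
    "command": ["python", "preprocess.py"],
    "inputs": [{"id": "raw_data", "type": "File", "position": 1}],
    "outputs": [{"id": "clean_data", "type": "File", "glob": "clean.csv"}],
}

def to_cwl(step):
    """Serialize the abstract step as a CWL CommandLineTool document.
    A different backend could emit, say, Snakemake rules instead."""
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": step["command"],
        "inputs": {
            i["id"]: {"type": i["type"],
                      "inputBinding": {"position": i["position"]}}
            for i in step["inputs"]
        },
        "outputs": {
            o["id"]: {"type": o["type"],
                      "outputBinding": {"glob": o["glob"]}}
            for o in step["outputs"]
        },
    }

print(json.dumps(to_cwl(step), indent=2))
```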
We're trying to collect some information about where our audience comes from, what their technical backgrounds are, and so on.

I'll stop there. I won't do the demo that I normally do after this presentation, because you'll get that in the class anyway. I just want to thank you for having me, and if you find Renku interesting beyond the course, please reach out; we'd love to hear from you and learn more about what you're doing and what you'd like to do. Don't be shy, get in touch, and I hope you'll find it useful. Thanks very much.