And our next speaker is Sarah Gibson from the Alan Turing Institute. She is a research software engineer there, and she solves real-world problems using cutting-edge academic research. And she will talk about reproducible Python environments with Binder. Welcome, Sarah.

Hi, thank you very much for that introduction. I'm Sarah Gibson. If you don't know who I am, I am a research software engineer at the Alan Turing Institute in London. I've been using Python now for about five years, since I started my PhD in astrophysics. And I'm also a maintainer and an operator for Project Binder, which we're going to talk about today. The primary purpose of Binder is that it makes it really super easy to share reproducible computational environments, and I'm going to show you what that means for Python today.

But I just very quickly want to start by defining what I mean by reproducible. Pretty much whatever field you go into, whether it's data science or a specific domain, reproducibility has a different meaning. So I'm just going to quickly run through this slide so it's absolutely clear what I mean when I say reproducible. I'm looking at this top right quadrant of reproducible, which is: if you gave me the same data and the same analysis pipeline that you were running, I should be able to get the same answer as you. This is the lowest bar of reproducibility. But if we can conquer that, we can open up these really interesting other dimensions of reproducibility. If I then run a different data set through the same analysis pipeline and it gives me qualitatively the same answer, this would be a replicable analysis. If I ran the same data through a different analysis pipeline and that qualitatively gave me the same insights, this would be a robust analysis. That's a really interesting dimension because it's looking for methods that should do the same thing.
And if we combine all of that together, we get a generalizable analysis. And this is not "generalized" in the sense that it applies to everybody doing the analysis, but it is two steps towards being able to make inferences from the data about the broader population, and the results of this type of analysis are not specific to one data set or one particular methodology. But we can't achieve any of that unless we can get the same answer out from the same data and the same analysis pipeline.

Actually, I'm being even more specific than that, and I'm going to talk about what is called a repeatable analysis. This is literally getting the exact same answer from the same data and the same analysis pipeline. And this is not the same as reproducible, because repeatable brings in the concept of the computational environment, whereas reproducible only relates to the data and the analysis steps. And this is why I'm talking about Binder: Binder captures this repeatable aspect by adding in the dimension of what's going on in the background when you're running your analysis on a data set.

I just want to acknowledge that there are many reasons and barriers why reproducible research isn't happening all of the time across all of the domains, and I'm sure many people on this call will relate to at least one reason on the slide right now. The two I like to focus on are the fact that it takes time and that it requires additional skills. You need time to learn these skills and implement them. But if you can get over that initial barrier, research in general can be much more efficient and successful in the long run, because you've set up your workflows in such a way that you can repeat things much more easily. I like to do a little bit of market research during my talks, but I'm aware that I can't see any of you.
So hopefully you'll still at least be able to relate to the questions I'm about to ask, but I want you to think about scenarios when you've been collaborating on software. I'm sure all of us have heard the complaints: oh, but it worked on my computer. Or: oh, but it worked yesterday. These are super frustrating things when you're trying to collaborate on software and you can't get the same environment going, or something different is happening on your machine versus someone else's machine. And this is really frustrating because we actually have tools that solve these problems.

For the "it worked on my computer" scenario, we have Docker. If you're not familiar with it, what Docker does is create things called containers. These are portable environments that capture not just code and dependencies such as software or data, but also lower levels of the stack, such as the operating system. They become very portable, from your Mac laptop to your Linux desktop to even running on the cloud, and it becomes very easy to share those kinds of environments.

And for the "oh, it worked yesterday" scenario, we can use version control. Version control is the magic time machine I wish I'd had during my PhD, where I can just rewind everything back to last Tuesday when everything last made sense. And if you combine that with continuous integration and testing, you can actually catch the kinds of bugs that might create discrepancies in your code before they reach your production, research-ready code base.

But these are exactly the additional skills that, as I just said, form a barrier to reproducible research, and we have to sit down and learn them. So this is where Binder comes in. Project Binder is a global community of researchers, software engineers, and data scientists who are dedicated to the open and transparent communication of reproducible research.
And the mybinder.org service allows anyone to launch a complete and interactive computing environment from their browser. By complete, I mean the code, the programming language, any software dependencies and packages, and any assets you have like prose or media. You can very easily share a full research environment through a browser without actually needing the person you're sharing with to install anything.

So I am going to now attempt to demonstrate this, so please bear with me. I am just clicking on a link here, and this will take us to the Gravitational Wave Open Science Center website. This is a website hosted by the LIGO and Virgo collaborations; these are astrophysical research collaborations that are looking into the detection of gravitational waves, and they have example notebooks here. You'll see this third one here, binary black hole events, has this mybinder link next to it. And if we click that link, we get redirected to the Binder website, where we have the Binder spinner of reproducibility. And if I click on the build logs, it says "found built image", "launching", and we're launching a server. If it hadn't found a built image at this point, it would be building one for us. And this server that it's launching is not actually on my local machine, it's in the cloud. What's going to happen is my browser will be redirected to a Jupyter notebook environment, as we can see here. It's being hosted in the cloud, all running through my browser. And if I go down and show you the requirements.txt file, we'll see that we require h5py, matplotlib, scipy; none of those have I had to install. And then if I open up the notebook, we will be able to see this tutorial notebook that the LIGO and Virgo collaborations have written up, which basically goes through all of the analysis steps that they do when they are detecting gravitational waves. And I can just click "run all cells", and this will just run.
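To make that concrete: a requirements.txt for a notebook with those dependencies might look like the sketch below. The version pins here are illustrative, not the exact ones the LIGO/Virgo repository uses.

```text
# requirements.txt -- illustrative sketch, not the exact file from the demo
h5py==2.10.0
matplotlib==3.1.3
scipy==1.4.1
```

Pinning exact versions like this is what makes the environment repeatable: anyone building from the same file gets the same package versions.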
And I can then begin to work my way through this notebook and learn all about signal processing and such, and we can see that various different things are running. We can create plots, and we can see we have the power spectrum of the gravitational waves, and we can even see the chirp signal as the two binary black holes merge into one another. And this graph shows the ringdown as the resulting object comes to rest. And that's great: I can now learn all about black holes and gravitational waves, and I haven't had to install anything or make sure I have the right version of Python and packages. It's just launched in a browser there for me to use. So I'll just go back to my slides.

So what did they have to do to create that? This is a comic that shows the workflow a researcher will go through in order to publish their work on Binder. We have Jane here. She's written a paper based on her experiments, and she would like anyone, anywhere to be able to reproduce, check, and improve her calculations. So the first step she takes is to describe the experiment as a Jupyter notebook. The reason we like Jupyter notebooks is that you can mix together prose, code, and visualization, and much like the example we just saw, you can walk people through the steps you are taking. Although you do not have to use Jupyter notebooks in order to use Binder: it also supports JupyterLab and a terminal, with a little bit of time you can get it to work with VS Code, and it even works with RStudio as well. It's not just for Python environments. So when she's written her notebook, she publishes this on a hosted repository such as GitHub, but many of the public repository websites are supported, such as GitLab, Bitbucket, and Zenodo. Then she makes this repository Binder-ready so it's shareable with mybinder.org.

You just glitched, there was maybe 30 seconds lost in the video, so can you please go back again? Oh yeah, sorry. So I'm just talking through this comic here.
Jane has written a notebook based on her experiments. She's published it on GitHub. And now she needs to make that repository Binder-ready. So what does that mean for Python developers? How do we make sure that it's ready? If you are a Python person used to installing your packages using pip, then you will have a requirements.txt file. This is just a plain text file that lists your packages and your package versions, and it is a valid configuration file compatible with mybinder.org. But you might use Conda and therefore have an environment.yml file, which is again just a list of all of your dependent packages and the channels from which you download them. And this too is a compatible configuration file for working with mybinder.org. By providing those kinds of configuration files, mybinder.org will allow Jane to share her notebook with everyone, so that they can run it and reproduce her computations using compute power from the cloud.

So all mybinder.org needs to work is a version control repository on a public server and a description of the software dependencies. And importantly, we've configured it such that it recognizes the typical configuration files that those communities are already using to define their software dependencies.

So here is a little bit of background on mybinder.org, or the Binder project. It was originally launched by Jeremy Freeman back in 2015. At the beginning of 2017, Project Jupyter started having meetings with Binder as we began to take over stewardship and bring Binder into the Jupyter ecosystem. The first half of 2017 was spent redeveloping the back end of Binder into what is now called BinderHub, and that's the technology that powers this service. In September of that year, Binder was awarded a Moore Foundation grant to run its operations. Since these humble beginnings, we've now grown to hosting over 140,000 user sessions per week.
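For the Conda route mentioned above, a minimal environment.yml is equally small. This is an illustrative sketch; the package list and versions are made up for the example:

```yaml
# environment.yml -- illustrative sketch of a Binder-compatible Conda file
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy
  - matplotlib
```

The file is exactly what Conda itself consumes, which is the point: Binder recognizes the configuration files communities already use rather than inventing its own format.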
So in roughly three years, Project Jupyter has brought this tool from a prototype to a staple of the open source and open science communities. And Binder itself is open source and is built modularly using other open source tools, which means anyone can deploy their own service, for example in an institution. And it's completely configurable as well, so you could configure it to only allow sharing between specific teams, which I've represented here by changing the globe that meant everyone into a little house that might be your institution or your team. And that's totally fine.

So what is the technology, this BinderHub that we use to provide this service? Here is a little screen grab of the form you see on the mybinder.org web page. All you do is paste in the URL of your version control repository, so we have an example from GitHub here, and then you just click launch. And everything that happens afterwards is happening in the background automatically. The first thing that happens is that your repository is cloned. Then we have a tool called repo2docker, and if you were at Tania Allard's talk about Docker and Python yesterday, you will have heard a little bit about this. What repo2docker does is basically read the repository and look for configuration files, and it builds a Docker image without the need for a Dockerfile. So it copies across all of the code and any data, it installs all of your software dependencies, and it makes that image compatible with Jupyter by installing all of the necessary Jupyter servers and such. These configuration files are your requirements.txt or your environment.yml that I showed you earlier. This image is then executed using Docker, and we then host the running Docker container on a JupyterHub.
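That detection step can be pictured with a small sketch. This is not repo2docker's actual code, just a loose Python imitation of the idea that the build is chosen from whichever known configuration file the repository contains (the real tool's detection order and list of supported files are longer):

```python
from pathlib import Path

def detect_build(repo_dir):
    """Very loose imitation of repo2docker's config-file detection.

    A Dockerfile takes over the whole build; otherwise a Conda
    environment.yml wins over a pip requirements.txt; with neither,
    a plain default base image is used.
    """
    repo = Path(repo_dir)
    if (repo / "Dockerfile").exists():
        return "dockerfile"   # user supplies the whole image themselves
    if (repo / "environment.yml").exists():
        return "conda"        # create a Conda environment in the image
    if (repo / "requirements.txt").exists():
        return "pip"          # pip-install into a base environment
    return "default"          # base image with no extra packages
```

In the real tool, each of these "build packs" also knows how to translate its file into Docker build instructions; this sketch only shows the dispatch.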
And if you're not familiar with JupyterHub: JupyterHub is one solution to the problem where you have access to some computers and you would like to give some humans access to them, and Jupyter notebooks are a good user interface for this scenario. Here, computational resources and humans can take many forms, such as researchers on HPC, anonymous users on the cloud, students at a university, etc. By combining different custom authenticators and spawners, you can create any kind of mapping of any collection of humans onto any kind of computational resources you like. That is JupyterHub's job: it's just providing computational resources to your people. So that's what it's doing here, it's giving computational resources to our running Docker container. In the case of mybinder.org, these computational resources are a Kubernetes cluster running in the cloud. JupyterHub then makes the running container accessible at some URL. And then Binder is this thin layer running across the top of all of these tools that handles the URL redirection, and it just redirects the user's browser to that running container. And lo and behold, as we saw in the example, you get your Jupyter notebook with the environment already installed, and you're ready to roll.

So how did we manage to scale this up to 140,000 user sessions per week? The answer is we created a federation. mybinder.org is supported by four clusters around the world, including one that I myself manage at the Turing Institute. Federating the service in this way means we can be resilient against both cluster outages and changes in funding streams. We can be sustainable: by sharing the workload and knowledge like this, the service can persist in spite of changes amongst resources and people. And we become robust as well: we can reliably support users present and future.
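The whole pipeline is kicked off from a predictable launch URL: mybinder.org's scheme for GitHub repositories is `https://mybinder.org/v2/gh/<owner>/<repo>/<ref>`, optionally with a `filepath` query parameter to open a specific notebook once the server is up. A small helper to assemble such links (the repository named in the example is hypothetical):

```python
from urllib.parse import quote

def binder_url(owner, repo, ref="HEAD", filepath=None):
    """Build a mybinder.org launch URL for a GitHub repository.

    Follows the public scheme mybinder.org uses for GitHub repos:
    https://mybinder.org/v2/gh/<owner>/<repo>/<ref>[?filepath=<path>]
    """
    url = f"https://mybinder.org/v2/gh/{owner}/{repo}/{ref}"
    if filepath:
        # filepath opens a specific notebook after the server launches
        url += "?filepath=" + quote(filepath, safe="")
    return url

# Hypothetical repository, for illustration only:
print(binder_url("jane", "my-paper", "main", "analysis.ipynb"))
# https://mybinder.org/v2/gh/jane/my-paper/main?filepath=analysis.ipynb
```

This is the link Jane would put in her README badge; everything after the click (clone, repo2docker build, JupyterHub spawn, redirect) happens server-side.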
And because BinderHub and mybinder.org are built using modular open source tooling, we're actually in a really super cool position in that running mybinder.org does not require cloud vendor lock-in. We're in the really cool position of hosting one website with four different flavors of Kubernetes, and our user base is for the large part unaffected by the redirection to our different clusters. In fact, we're not even completely cloud based: the GESIS cluster, which is the second largest cluster in the federation, is actually an on-premise facility hosted at the Leibniz Institute for the Social Sciences.

Another thing the Binder team like to do is run user surveys, and we do this to try to gauge how our users are using mybinder.org and what they like or dislike about it. The good news is that around 80% of our user base, or at least 80% of the 346 responses we got to the survey, would recommend the service to a friend. So that's good. And we found that there's actually a wide range of use cases for mybinder.org, but the top three, roughly equally distributed, use cases are university teaching, hosting documentation and examples, and running workshops and training courses. The inference you can make from that is that Binder really shines in scenarios where installing an environment would be distracting or a waste of time, such as running a workshop at a conference. You don't want to spend 20 minutes making sure everybody has the correct version of Python and all of the packages installed, and then downloading all of the necessary materials. Instead, you could just send them a URL to click, and they get the environment, everything is there, and they can explore it interactively.

But one thing that we could do better on is speed. This is a word cloud generated from the free-form responses, and it shows that one area we could improve is the time it takes to launch a Binder. Tackling this problem, however, isn't as simple as it seems.
Part of this is that we're relying on upstream improvements to speed up our launches, for example pulling the Docker images onto our computational nodes, or allocating computational resources with Kubernetes. We are always going to be limited by how quickly Kubernetes can requisition a user session. But one way we've tried to tackle this is to write up some community guidance, which explains what's happening during the launch process, where the time is spent, and which steps could be taking longer and why. And it provides pathways our users can follow to speed up their launch times. We found that once we've explained what's going on, many people actually go: yeah, actually, 10 to 15 seconds to launch a Binder is reasonable considering how much is going on in the background. So we've probably just been too good at streamlining this up to now.

And this is ultimately what this project is all about: we are about community, and we exist in a larger ecosystem of languages as well. As I mentioned earlier, we don't just support Python; we include Julia and R. And we want to meet those communities where they are, to ensure we're providing the best and most useful service we can. That being said, Binder is a Python project and it is run by Python devs. And this is problematic because, when it comes to developing the service for Julia and R, we just don't have the knowledge and the expertise from those communities. However, this year I've been awarded a fellowship by the Software Sustainability Institute, and my goal for this fellowship is to help diversify the skill set of the teams maintaining and operating Binder and mybinder.org. This is so that we can ensure that these communities outside Python are represented when we are innovating Binder for the future. So that's all from me. Here is a whole bunch of links if you're at all interested in this project, getting involved, or asking questions. Thank you very much.
I see there are many, many questions. I doubt we have time to answer them all, but let's start. There's a question from Francesco: what are the pros and cons of Binder with respect to Google Colab? For instance, do you also allow users to access GPUs, which are essential for several machine learning applications?

So, pros and cons. Google Colab is a very different kind of project to Binder. Colab is about developing quickly with other people, so it installs a kitchen-sink environment that has everything you could possibly dream of, but that you might not actually be using. Whereas mybinder.org provides a bespoke environment containing specifically the requirements you need to run an analysis. So they're different beasts targeting different use cases.

mybinder.org does not offer GPU services. This is because we are providing the service completely free of charge, and it's expensive enough to run without adding GPUs to it. We did run a GPU service for the NeurIPS conference a few years ago, and it's a very obvious spike in our billing graph. But because it's open source, you can deploy BinderHub onto your own infrastructure that includes GPUs, and you can do really clever things like, if you authenticate a user and they're on an allow list to use GPUs, redirect them to a different node pool, for example.

What I would say about analyses that rely on GPUs: mybinder.org is not a tool to do your analysis on; it's a tool to communicate the results of your analysis. So you probably wouldn't be running your simulation that might take X amount of hours on a GPU; there's no reason to run that in a browser. But the tidy-up script that creates your plots, that is something you are likely to share, and that is where Binder becomes powerful.

Okay, there are many more questions. One hot question is: who is paying for the compute resources?
So the Google cluster, which is the largest cluster, was originally paid for by the Moore Foundation, but we now get a donation of credits directly from the Google Cloud Platform. Another cluster is run by OVH, which is a cloud provider in Europe. The GESIS cluster is paid for by the Leibniz Institute, who host it, and the Azure cluster that I manage is paid for via a donation that the Turing received from Microsoft. So we're like this community of BinderHub managers who are each getting funding through our own streams, but the traffic comes in at one point and is then, under the hood, distributed across the network. It's quite a cool system.

Another very important question for the project: this is awesome, how would one contribute to this? I've left the links up here; the BinderHub repository is at the top. My Twitter handle is in the corner if you'd like to send me a direct message. One good way is to jump onto our Discourse at discourse.jupyter.org and introduce yourself, and just be like: hey, I think this is awesome, I want to contribute. And then we can get into a conversation. I always think the best place to start is by reading the documentation and seeing how you can improve it; a fresh pair of eyes that doesn't suffer from expert blindness is so useful for these kinds of things. And then we can build up to more cool technical ideas.

Okay, and one more technical question, from Alexander: does, or could, Binder be configured with PEP 518? I looked it up; this is specifying minimal build system requirements for Python projects. If not, will it be supported in the future? I have no idea what that means. I think this is something you can discuss on Discord later. Yeah, I had no clue either, so I looked it up quickly. If you can drop the PEP in the Discourse, I will try and read it and make sense of it. Okay, I think we have unfortunately run out of time.
So there are three more questions; please take those to the Discord chat. I think the channel name is talk-binder. So yeah, please join us there in the Binder channel on Discord, and I'm sure Sarah will answer all your hot questions. So thanks again, Sarah. Thank you.