Next — we have started, it's past the hour. So please welcome Paul Amazoner, a volunteer with DataKind — that's his sort of Superman job; in the daytime he works in investment. Paul is going to present — I forgot the exact title — your Docker pipeline for reproducible research, in a workshop you can follow along with here or even after the session. Thank you, Paul.

Thanks. Hi everyone, thanks for coming. This is a workshop, so if you haven't done so yet, we need accounts for the various platforms — GitHub, Quay.io, and Play with Docker — which we'll use later. While I give the overview, you can set them up if you don't have them yet: go to the tiny URL "workshop-accounts", or take a snapshot of this slide.

So: Docker pipeline for reproducible research. I'm Paul, as introduced a moment ago. I'm a developer and also a core volunteer with DataKind. DataKind is a nonprofit: we connect data science volunteers with nonprofit organizations to help them analyze their data for good. You can learn more about DataKind in tomorrow's session at FOSSASIA — Raymond and Wei Young will be there to explain more.

What we're going to do today is Docker for reproducibility. I'll do a quick recap of what I shared in last year's video — it's also linked in the abstract of this talk — and after that we'll dive straight into the workshop so we have more time. We only have around 55 minutes for this workshop. There are two waiting periods during the workshop while we build our images; we'll use the first to discuss what a DataDive is, and the second to discuss the DataDive workflows we're planning for this year's DataDive.

So this is the recap from last year's video. What do we mean by reproducibility?
In the DataKind context, we expect the result of a volunteer's analysis on their machine to be reproducible on another volunteer's machine, or on a nonprofit representative's machine — so that at least we don't have issues like "hey, it only works on this volunteer's machine," where a notebook or script transferred to another volunteer's machine no longer works because of library dependencies, package dependencies, versions, and so on. We don't want that to happen. That's why we want to promote reproducibility among our DataKind volunteers as much as possible.

Previously, DataKind tried to promote reproducibility using GitHub, by versioning our code, scripts, and notebooks. Then last year we realized that having versions for the scripts and code is not enough, because a number of the issues you face when reproducing another volunteer's analysis are also due to the environment: when you try to run the script, you notice it depends on a different version of a package, a different version of a library, and so on and so forth. So we realized that apart from versioning the code and scripts, we also want to version the environment. That way, when we run the script or code, it runs smoothly, because the environment is also versioned and tested.

Last year I made a video to explain more about Docker. If you're interested in Docker basics and more discussion of reproducibility, check out that link — the tiny URL "Docker for volunteers". Here we'll focus more on the continuous integration pipeline we're using at DataKind, so you get to know how we use Quay.io for reproducibility and continuous integration.
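One lightweight way to "version the environment" in the sense described here is to snapshot installed package versions into pinned, pip-style lines that can be committed alongside the code. A minimal Python sketch — the function name is my own, not part of the workshop material:

```python
import importlib.metadata

def pinned_requirements(packages):
    """Return pip-style pinned lines (e.g. 'pandas==0.24.1') for the
    given package names, so the same environment can be rebuilt later."""
    lines = []
    for name in packages:
        try:
            version = importlib.metadata.version(name)
            lines.append(f"{name}=={version}")
        except importlib.metadata.PackageNotFoundError:
            # Record missing packages too, so gaps are visible.
            lines.append(f"# {name}: not installed")
    return lines

# A package that does not exist still produces a visible entry.
print(pinned_requirements(["definitely-not-a-real-package"]))
```

In practice `pip freeze` does the same job; the point is that the pinned output goes into version control together with the scripts it supports.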
This is the link to the workshop material, so if you want to follow along as I do the demo, you can. All the instructions you need are in the document, so you can follow along with me, work on it at your own pace, or work on it after this session. Everything runs in the cloud, so you don't have to install anything on your machine.

We'll start with the first task: setting up our Dockerfile. I'm just going to follow the workshop material — everyone has the link already, right? It's tinyurl.com/docker-lab, and you'll find the material there. We'll do the first task. The overview, which gives the context of this workshop, you can read on your own, and for the prerequisites I assume you already have GitHub, Quay.io, and the corresponding Play with Docker account for this lab.

First we need to set up the Dockerfile. You can fork this repository — go to your GitHub account, and I'll just copy this. By the way, if anyone gets stuck anywhere, just raise your hand; we have mentors around — Yoke and Raymond, at the back — who will help you in case you're stuck.

So we've now forked the repository, "contain yourself", into our GitHub account. I just want to show you something: click on the workshop folder, then the demo Docker setup folder, and open the Dockerfile. In this Dockerfile you see the base image we're using: jupyter/minimal-notebook. (The font size — just a moment. Let me just — better, yeah.) As much as possible we want to keep the image small, which is why we use the minimal notebook, so that at least we don't carry the overhead of extra packages in the base image.
And apart from that, you could already use the image as-is, but since we are promoting reproducibility, even the base image should be pinned to a specific version, so that when we replicate it later we can replicate the exact environment we used. The highlighted portion is the version tag of that base image.

Now, since we're going to use a Python notebook for this, all the packages we need for this particular analysis are stored in a requirements.txt file. I'll just go one folder up — the demo Docker setup folder — and open requirements.txt. In this requirements.txt you'll see the libraries we need to install for Python, and note that we also include the versions. Versioning at DataKind is really important for reproducibility, so that we know which versions work with which analysis. So that's our requirements.txt: the Python libraries we need, each with its corresponding version.

We're currently in task one. Step one is done, and we've done step two, which is examining the Dockerfile. Now that our Dockerfile is ready, we can proceed with syncing our GitHub repo to Quay.io, which we're going to use for continuous integration.

Task two — I'll just update the slide — is building the Docker image using continuous integration via Quay.io. So I'll go over to Quay.io. Here we need to create a new repository, as explained in the document. I'll just click this. For the repository name, just type "workshop-notebook", and then we need to make it public. Quay.io is like GitHub: if you make your repository public, you get to use it for free — that's what I like about it.
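To recap task one before we configure the build: the two files examined above might look roughly like this. This is a hypothetical sketch, not the workshop repo's actual contents — the base-image tag is a placeholder and the library versions are illustrative, not the workshop's actual pins:

```dockerfile
# Dockerfile — pin the base image to an exact tag, never ":latest",
# so rebuilds reproduce the same environment.
FROM jupyter/minimal-notebook:<pinned-tag>

# Install the pinned Python dependencies.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```

```text
# requirements.txt — every library pinned to an exact version
pandas==0.24.1
plotly==3.6.1
```

The pinned base tag plus pinned library versions together are what make the image, and therefore the analysis environment, reproducible.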
We're also going to link it to a GitHub repository push, so that any change to the repository we forked a while ago automatically triggers a build and gives us a new image. Okay, then create the public repository. The "Authorize CoreOS" option doesn't appear on my side because I already set it up a while ago, but if you're setting this up fresh, you just need to authorize CoreOS so that there's a connection between Quay.io and your GitHub account.

At this point we're on step six. We select our organization — in this case our account — and click continue. Then we select the repository we worked on a while ago: contain yourself. Next you have the option to configure the trigger. There are two options: trigger for all branches and tags, or trigger only for a specific branch. Previously — in the early days last year, I think — we used "trigger for all branches and tags", and we had a problem with that: in a DataKind repository with multiple branches, any change submitted to GitHub on any of those branches would unnecessarily build an image, which is not really good. Ideally we only want to trigger the build when there's a change on the master branch. So we're going to limit this to check-ins on the master branch only: put "heads/master", and you'll see it matching master, so it will only build on a change to the master branch. Hit continue.

Then we select the Dockerfile — the one we intend to use, the workshop demo Dockerfile — and continue. Then the context: the folder where our Dockerfile is located. Hit continue. And then you'll notice there's an optional robot account.
You only need the robot account if you have a base repository that is private. Since we don't use a private base repository — the jupyter/minimal-notebook image from a while ago is public — we don't need this and can skip it. Just hit continue and we're ready to go.

So we're on step 12. Click continue, then click the link to go back to the builds page. Click "Start New Build", then "Run Trigger Now", and select a branch — here we'll just choose master — and start the build. The expectation is that it should be building right now. In case you don't get feedback like I have here, just try refreshing the page. You should see three moving dots, like this one — that means our image is being built. What I like about Quay.io is that if you click the build ID, it also shows you what's currently going on — the logging. What it does is pull the base image, the Jupyter notebook base image we set up a while ago; you see the progress as it assembles the image and then installs the necessary Python packages from earlier. The build usually takes around six to seven minutes, so we'll just let it run, and while it's building let's discuss DataDives.

So while the image is being built, I'll share more about DataDives. What is a DataDive? A DataDive is a weekend-long event where a nonprofit organization works alongside data science volunteers, developers, and designers to analyze data and gain insight into their programs, the communities they serve, and more. Last year we organized two DataDives. The first was held at Expedia, where we partnered with three nonprofit organizations — I think that was Red Cross, O'Joy, and Children's Society. Then we had a second one, I think in October if I remember correctly.
The second DataDive was organized at NVPC — the National Volunteer and Philanthropy Centre — and we worked with NVPC's data during that time. It's a fun event; if you get a chance to join one of these, I encourage you to do so. There are blog posts for those events — we have tiny URLs for them, "data dive one" and "data dive two" — so you can read about those.

During these DataDives we also learned some things on the Docker side. Last year we did encourage our volunteers to use Docker for reproducibility, so that we could easily reproduce the analyses. However, here's what we learned. First, not all volunteers use Docker, and there were some issues — maybe because of installation, maybe because of tool preference. We've tried to support multiple operating systems — macOS, Linux, Windows. For a number of volunteers we were able to set up Docker properly; for some we had issues. So not all volunteers were able to use the Dockerized tools we provided. The second thing we learned is that doing the reproducibility exercise after the event is quite painful. For the second DataDive, the one with NVPC, we didn't really plan to use Docker at first, but after the event we wanted to reproduce the results. By then the author might not be around anymore, so we tried to work out the author's intent and which version of each library they had used for a particular analysis. It was very, very tedious — trial and error to check whether a library works on a new machine is quite laborious. But that's okay, those are lessons learned, and we'll use them to improve our workflow at this year's DataDive. We'll talk more about that later.

So let's check whether our build has completed. Here you can see "Docker build completed and pushed". I'll just go back to our builds page — here we have the build.
We can see the white check mark on the green background. Let's do the next thing, which is the labeling — this is to promote versioning of our environment. We can see "latest" here. What we need to do is click the settings button and "Add New Tag". At DataKind we try to use semantic versioning, so we have three numbers: the first is for major changes, the second is for minor, backwards-compatible changes, and the third is for any fixes we need to make. Since this is our first image, I'll just put 1.0.0 and create the tag. Then you'll see the version number here for this workshop-notebook repository.

Let's see where we are in the documentation. We've added the new tag — done — and we've checked that the repository has the tag and version. Now that we have the versioned notebook image, we can proceed to curating volunteers' deliverables. A deliverable could be a data visualization, a Python script, a Python notebook, or whatever the volunteer delivers. That's our next task: curating deliverables. I'll just update the slide.

In this context, this is usually what the Docker captain — the DataDive Docker captain — does during DataDives: they help curate the deliverables produced by the volunteers. At a normal DataDive, Docker captains have their own Docker setup on their machines so they can easily replicate analyses. But for this workshop, so that we don't install anything on our machines, we'll use Play with Docker and do everything in the cloud — and we save a lot of bandwidth by doing that. So just log into your Play with Docker account, click start, and we'll create a new instance. Here I have my Play with Docker account; I'll just start. Oops, let me refresh it — maybe a lot of people are using Play with Docker right now. Okay, good.
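Going back to the tagging scheme for a moment: the three-number semantic versioning just described can be made concrete with a small helper — a hypothetical function of my own, not part of the workshop material:

```python
def bump(version, part):
    """Increment one component of a 'major.minor.patch' tag and
    zero out everything to its right."""
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "major":   # breaking changes
        return f"{major + 1}.0.0"
    if part == "minor":   # backwards-compatible additions
        return f"{major}.{minor + 1}.0"
    if part == "patch":   # fixes
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")

print(bump("1.0.0", "patch"))  # → 1.0.1
```

A fix like the missing-package repair later in this workshop is a patch bump; adding a new, compatible tool for volunteers would be a minor bump.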
So, add a new instance. Yep, that's a good thing to know. Okay, it shows you a console. What we need to do is clone our contain yourself repository — I'll just click here — and be sure to put your own GitHub account in the path (mine is "the real data scientist"). Okay, done. We just need to ensure the repo is there: ls — contain yourself — okay, we're good.

Back to the documentation: we'll pull the Docker image and then run it accordingly in our console. Just paste the command here, and don't forget to substitute your own Quay.io account — the one where we built the image a while ago. Okay, and hit enter.

While it's downloading the image, let me explain the Docker command to you. The 80 here is on the Play with Docker side — that's the port — and it's mapped to port 8888 of Jupyter, which is inside the container. We've also mapped a volume: the first part, on the left, is the path inside Play with Docker, which is mapped to the folder inside the container, jovyan/work. And this version, if you notice, is the version we built recently, a while ago, with Quay.io.

So let's check whether the pull is complete. Currently it's pulling the image layers, and once that's done we'll be able to run our Jupyter notebook. Download complete — we still have a few more to go. I think it got stuck; I'll just pull it again. Are you able to pull the image from your build repository? I'll try one more time; otherwise I'll proceed with the recording of this. Yeah, I think it's having issues pulling the image. So what I'll do is just play the pre-recorded curation of the deliverables. As you can see on screen, we're in task three, where we're going to curate the deliverables. This part I've already explained, so let me just fast-forward a bit — opening the Docker command, copying it... okay, so it's running.
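The run command shown in the recording has roughly this shape — the account and repository names are placeholders, not the exact command from the workshop material, and it needs a Docker daemon to actually run:

```shell
# Port 80 on the Play with Docker host maps to Jupyter's 8888 inside
# the container; the current directory is mounted at /home/jovyan/work;
# the tag pins the exact environment version built on Quay.io.
docker run -p 80:8888 \
  -v "$(pwd)":/home/jovyan/work \
  quay.io/<your-account>/workshop-notebook:1.0.0
```

Because the tag is explicit, anyone running this later gets the same environment that was tested, not whatever "latest" happens to point at.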
Okay, so once it's up, you'll see a token for your Jupyter notebook. Just click on that and copy the token. Then, back in Play with Docker, click the port 80 link at the top and paste the token into the token text box. You'll then see this screen with both notebooks, as explained in the documentation.

We're going to check volunteer one's notebook first, run all cells in the notebook, and we should see this visualization at the end. Clicking that notebook opens a new tab. In this notebook we're just fetching the data, then using pandas and JSON to slice and dice — that's the table we're getting — then demoing the group-by function, and then doing the visualization with Plotly, which is one of the libraries we're using for this. So just click Cell, then Run All, and we should see the visualization at the bottom once the cells have run. And this is what we get — that means we're able to reproduce volunteer one's analysis. We just created a stacked graph using Plotly.

Now that we've reproduced volunteer one's analysis, we're also going to reproduce volunteer two's. Note that once an analysis has been successfully reproduced, the Docker captain usually records which version of the environment was used for it. So we'll proceed with volunteer two's notebook: close this, go to Home, and open volunteer two's notebook. For volunteer two it's the same thing — we still fetch the data, the same data volunteer one used, and we still use the same packages like pandas and JSON. The only differences are that here we calculate a percentage for each year of the Red Cross data, and we use Bokeh for the data visualization instead of Plotly. So just click Cell, then Run All.
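As an aside, the kind of aggregation the two notebooks perform — a group-by total, then a per-year percentage — can be sketched without pandas, on made-up data; none of these names or numbers come from the actual notebooks:

```python
from collections import defaultdict

def totals_by_year(records):
    """Sum donation amounts per year — the group-by step."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["year"]] += rec["amount"]
    return dict(totals)

def percentage_by_year(records):
    """Each year's share of the overall total, like volunteer two's
    percentage-per-year calculation."""
    totals = totals_by_year(records)
    grand_total = sum(totals.values())
    return {year: 100 * amount / grand_total
            for year, amount in totals.items()}

sample = [{"year": 2017, "amount": 100.0},
          {"year": 2018, "amount": 300.0}]
print(percentage_by_year(sample))  # → {2017: 25.0, 2018: 75.0}
```

The notebooks do the same thing with `DataFrame.groupby`, then hand the result to Plotly or Bokeh for plotting.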
And while that is running — the expectation is that you'll see it fail, because we don't have the Bokeh package in the image. So we'll do what's in the documentation: we have an environment issue, and we'll proceed with how to solve it. I'll just go back to the documentation again. That is task four: resolve the environment issue. The issue we face here is that we cannot run, replicate, or reproduce volunteer two's notebook because of a missing package called Bokeh. But note that in actual DataDive scenarios the issues we face are not as simple as this. Sometimes it's not only missing libraries; sometimes there are missing binaries too. We need to investigate, talk to the author, and try to resolve the issues from there. But to keep the workshop simple, we have a missing Bokeh library.

So what we need to do is go back to our GitHub account — I'll just go here — and in GitHub, in our requirements file, we update the file: click here, "Edit this file". Notice that we don't have Bokeh here yet, so we need to add the Bokeh package for volunteer two's data visualization, with the corresponding version. Essentially, we found the version out by talking to the author: "hey, what version of Bokeh did you use for your data visualization?" Once we have it, we commit the changes: add the message "add Bokeh" and commit. Once we've committed the changes, the expectation is that the change gets picked up and a new image is built accordingly. Let me expand that: on our builds page, we should see a new build running — "added Bokeh". So it's currently building a new version of the image for our workshop-notebook repository.
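The committed change amounts to a single added line in requirements.txt — hypothetically, with illustrative versions rather than the workshop's actual pins:

```text
pandas==0.24.1
plotly==3.6.1
bokeh==1.0.4   # added for volunteer two's visualization
```

Because the Quay.io trigger watches the master branch, committing this one line is all it takes to kick off the rebuild.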
So while that's building, let's explore a bit of the new stuff we're looking at for this year: the DataDive workflows we're planning. Because of the issues we discussed a while ago, this year we're going to experiment with a new approach. Instead of volunteers checking directly into the master branch, we'll introduce a new item in the workflow: curating deliverables. Here, the volunteer has already finished a deliverable — it could be a notebook, a script, or a visualization. Instead of checking in directly, or opening a pull request against the master branch, he or she opens a PR against an integration branch. On the integration branch we have another role, the Docker captain, who pulls those deliverables and tries to reproduce the results or analysis from that branch, using the current image or environment we have. If there's an issue — say, missing libraries and so on — the Docker captain updates the image definition, the Dockerfile, adds the corresponding missing library, and rebuilds. Then, hopefully, the new version of the image can reproduce the volunteer's analysis or deliverable. Once everything is good, the next step is to push the approved deliverable to the GitHub main branch via a PR there, where a GitHub admin assesses the changes. If everything looks good, it goes into our main branch. That way we can more or less ensure that whatever is in our master branch is reproducible.

Before we check the build: the Docker captain also has a side workflow of their own. Apart from creating and updating the Docker image and curating the deliverables, we have this thing called Docker Overflow.
It's a sort of Stack Overflow kind of thing for DataDive events: if there are frequently asked questions about Docker — say a volunteer chooses to use Docker for their analysis and hits a problem — he or she can go to the Docker Overflow and check whether the issue has been faced before and whether there are known answers.

So let's revisit our build. Here, we have added Bokeh. What we do now is go to the tags again, because we need to add a new label. And here — this is very important — notice that "latest" says 20 minutes ago. That's a lie, I think: you need to refresh the page to ensure the latest really is the latest, right? So I'll refresh it, and now "latest" says a minute ago — that's good. What we'll do is add a new tag, 1.0.1, and create it. So we now have a new tag called 1.0.1, and what happens next is we pull that version in our Play with Docker and check whether we can replicate or reproduce volunteer two's analysis.

Let's check where we are in the documentation. Add a new tag: 1.0.1, because we're fixing the image. We've seen this version. Go back to your Play with Docker console, kill the one currently running, and run the new version. So let's go back and do exactly that. Back in the Play with Docker console: I'll kill this, Ctrl-L to clear the screen, and... here it's pointing to 1.0.0, the old version — the version that doesn't work with volunteer two's notebook. We'll increment that by one to 1.0.1, the new image we've built and versioned, and hit enter. Hopefully this works. "Already exists" — because we already pulled the shared layers a while ago. Let's give it a few moments; otherwise I'll restart the pull. Let me refresh this — I'm a bit impatient.
I'll just try it again — one last try; if it doesn't work, I'll move to the recording. I think it's having problems pulling the image again. Anyone successfully pulled the image? Mine has some problem, so I'll just proceed with the recording of resolving the Docker issue with the new version. I'll just play this. Basically we pull it again with the corresponding version, 1.0.1, and hit enter. In the Docker playground we just clean up the tabs we used a while ago, kill the old container, and ensure the correct version is applied — 1.0.1, which contains our Bokeh package. Once it has completed downloading the image, we can use Jupyter accordingly. It needs 252 MB; we're nearly there.

Once that's complete, we need the token: just click that link, copy the token accordingly, click the port, and paste the copied token into the screen. We're not going to run volunteer one's notebook again, since that already worked in the previous image. What we do now is see whether this new image version resolves the issue we faced a while ago with volunteer two's notebook. So we click volunteer two's analysis — the IPython notebook from volunteer two — and basically Run All; this is as per the documentation you have. So this is volunteer two's notebook: we click Cell, then Run All. It runs all the cells of volunteer two's notebook, and as you can see, we're now able to replicate what volunteer two did. Adding the Bokeh package resolved the issue, and this new image, version 1.0.1, is now able to replicate volunteer two's analysis — that's the donation percentage per year. Back to our documentation: this is what we've done so far. We had four tasks in this workshop.
In task one, we set up the Dockerfile in GitHub — a Dockerfile with a versioned base image — and we placed all the Python libraries and packages we need for the analysis in GitHub as well. In task two, we connected the Dockerfile in GitHub to Quay.io for continuous integration. After that we put on our Docker captain hat, which is to curate volunteers' deliverables. We tried volunteer one's notebook and it worked; unfortunately, the first try with volunteer two's didn't. Why? Because we had a missing package. So what we did was resolve the environment issue, which is task four: we went back to GitHub and added the missing package, and since we have continuous integration, committing to GitHub automatically triggered the build in Quay.io. After the new image was built, we tagged the version — 1.0.1, since we're using semantic versioning. Once that was done, we were able to pull the new version in Play with Docker and reproduce volunteer two's analysis. So that's good.

If you've done all this, congratulations — you've just completed the Docker pipeline workshop. If you want to take it further — say you're keen to volunteer as a Docker captain at one of our DataDives — check out our Meetup page. We haven't announced this year's DataDive yet, I think, but it's somewhere around April, so subscribe to the Meetup page and you'll be notified once we have it.

Any questions so far? That's basically the end of the workshop. We're a few minutes earlier than expected — any questions? If you haven't finished during this session, the materials are still available, so you can try it afterwards too. Thanks, everybody, thanks for attending. Thank you.