PyOhio. This is a session about Docker, and it's going to be given by Aly Sivji. So please welcome Aly Sivji.

Thank you for coming out. My name is Aly Sivji, and I will be talking to you about data science workflows using Docker containers. I tried to get as many buzzwords as I could into the title, so I think I did a pretty good job. Just a little bit about me before we get started: I work at IBM Watson Health on the client value optimization team, and I'm currently studying medical informatics at Northwestern. I like technology, Python quite a bit, data, and Star Trek; that's Unicode, and that character should be a Vulcan salute.

So we're going to be talking about data science, Docker, and data science with Docker. Pretty simple. So what is data science? Well, obviously it's the center of this Venn diagram by Drew Conway. But really, data science is about extracting value from data; it's about turning your data into actionable insights. You can do this many different ways. We can visually explore data. We can use machine learning to build predictive models. We can classify observations into groups based on similar characteristics. And we can use the results of all the other things we've done to build data-driven applications.

We have to remember that data science is a science. We have a question we're trying to answer and a hypothesis about why it's happening, and our output is not just our answer; it's our findings plus the steps we took to get there. This means that reproducibility is very important, and our analysis needs to be repeatable. Reproducible workflows in data science make it easier to communicate our results. Usually at the end of a data study we have to share how we got there, so we can use our methodology to tell a story about exactly what happened and how we reached our final results. We can use reproducible workflows to defend decision-making: if we can generate the same conclusions over and over again, it's not just gut instinct, it's data-driven insight. And if we have reproducibility, other people can go in and audit our results to ensure they're correct and that they comply with regulations. I work in healthcare, and HIPAA is very important, so we have to make sure that patient data does not leak into any of our models.

This is the process of how data science works. We start by asking a question, we get the data, we explore the data, we model the data, and we communicate our results. That last step can lead to further iteration, where what we find means we have to ask a deeper question in the next cycle. There's a huge ecosystem of data science tools in Python that enables us to do a lot of analysis, and we can use the Jupyter Notebook as our data science front end to capture our process as well as our end results.

Jupyter Notebooks are a great tool that lets us create and share documents containing live code, equations, visualizations, and explanatory text. It's perfect for data science because it makes collaboration a lot easier. Here's a picture I got from the Project Jupyter website: you can see we have some expository text, some formulas, live code, sliders to make it interactive, and finally a visualization. We can use Jupyter to walk through our data study and explain the steps we took to get there, and we can pass these notebooks around. But not so fast: Jupyter still suffers from one problem, and that's the "it works on my machine" problem.
In order to replicate calculations, users need access to our code, our libraries, and the data we used to create the original notebook. And this is where Docker comes in: Docker helps us solve the "it works on my machine" problem. Before I start talking about Docker, a bit of history: it was introduced to the world as a lightning talk at PyCon 2013. I included a link to the video in my slides, which I've also tweeted out, so check it out. It's pretty cool, a historical moment for Python and Docker.

So I've been saying "Docker"; what is Docker? Docker is a platform that allows us to package and run applications in loosely isolated environments that we call containers. You're going to see the shipping container analogy in a lot of places. Shipping containers standardized the logistics industry: there's a standard format for containers, and it doesn't really matter what's inside. We can send them by boat, by rail, or by truck, and we have the infrastructure to handle all of them at the shipyards, the rail yards, and the truck depots. With software containers, it's a little different but pretty similar. We package code, plus everything required to make that code run, into an isolated container, and since the container is standardized, it can be shipped off into production without having to worry about whether it's going to run. Docker also runs on Windows, Linux, and Mac, so you can pass these images around and it doesn't really matter what OS you're using.

Now you might be thinking this sounds very similar to a virtual machine, and you're totally right, it's very similar. The difference is that containers run natively on the host machine's kernel; they share the kernel of the host machine. In a virtual machine environment, we have a hypervisor, which provides each VM with virtual access to the host's resources. So in container land we don't have full-scale operating systems, which means containers are a lot more lightweight and have better performance characteristics. We can use Docker to streamline our development workflows for continuous integration and continuous deployment, we can do microservices, and of course we can do reproducible data science.

Here's a great image I found that describes the Docker architecture. The Docker client is where we enter commands to interact with Docker, so that's our command line. Those commands go to the Docker host, which can be either your local machine or a remote machine; all it has to be doing is running the Docker daemon. The Docker daemon listens for requests from our client, manages objects like containers and images, and can communicate with other Docker daemons. We also have a registry where we store all of our images, but I'll get to that in a little bit.

So what's an image? An image is a frozen snapshot of a container. Each image consists of a set of read-only layers stacked on top of each other, and each layer is a set of differences from the layer below it. A container is a runtime instance of an image: when we create the container, we add a thin read-write layer, called the container layer, on top of the image's layer stack.
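To make the pieces above concrete, here's a rough shell sketch of the client-to-daemon flow and the layer model; the python:3.6-alpine image is just an example:

```sh
# Pull an image from the Docker Hub registry; the client sends the
# request to the Docker daemon, which downloads each read-only layer.
docker pull python:3.6-alpine

# Inspect the stack of read-only layers that make up the image.
docker history python:3.6-alpine

# Create a container (a runtime instance of the image) with a thin
# read-write container layer on top, and open a shell inside it.
docker run -it python:3.6-alpine /bin/sh
```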
All the changes we make to the running container, like adding or removing files, happen in this top read-write layer. It's really similar to object-oriented programming: images are like classes, layers could be akin to inheritance, and containers are runtime instances, just like objects are runtime instances of a class.

The Docker registry is where we store Docker images. Docker Hub is like GitHub; it's the public registry. This is where we'll find official Docker images for Linux distributions and databases, and Python has a bunch of images up there as well. Anybody can create an image and put it on Docker Hub. If you have a really cool workflow you want to share, make an image and store it in the cloud; it's that easy.

So how do we create Docker images? We can freeze containers using the docker commit command. What that does is take the read-write container layer and make it read-only, and then we can use this new image to initialize new containers. Or we can use a Dockerfile and the docker build command, which is the preferred method. A Dockerfile is a file containing all the commands a user would enter to create an image, and the docker build command automates the build process.

Here's a set of common Dockerfile instructions. FROM lets us set the base image, which we'll get from the Docker registry. We can use LABEL to add metadata to our image, COPY to copy files into our image, ENV to set environment variables, and WORKDIR to set the working directory. EXPOSE and VOLUME I'll talk about in a little bit. There's also the RUN instruction, which executes a command inside of our container, creates a new layer, and commits the results. Each instruction in the Dockerfile creates a new layer. So say we want to install Jupyter and pandas: if we do RUN pip install jupyter, and then have another line that says RUN pip install pandas, that's actually going to create two layers. One of the things we want to do is reduce the number of layers inside of our images, and we can do that by chaining commands together like we would on the command line: we use the double ampersand, with a backslash to continue on the next line, to combine them into one command.

There are also ways to configure our runtime, that is, what we're going to do when we get inside of the container. There are two instructions for this. There's ENTRYPOINT, which we can use to configure our container as an executable, and there's CMD, which is what we run when we first get into the container. If you're just starting off, I'd suggest using CMD. ENTRYPOINT and CMD interact really well together, but that's a little more advanced; I have some links for further reading. There are also two ways to write these instructions: there's the exec form, which is like a list of strings, and there's the shell form, where we can just write CMD python hello_world.py like we would at the command line.

Some best practices: containers should be stateless. Use a .dockerignore file; if you're copying, say, the contents of a directory into your image, there may be some things you want to ignore, so it's very similar to a .gitignore file. And make sure you're only installing packages that you're actually using.
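Putting those instructions together, here's a minimal Dockerfile sketch along the lines described above; the label value is a placeholder, and note how the chained RUN creates one layer instead of two:

```dockerfile
# Base image, pulled from the Docker registry
FROM python:3.6.2-alpine

# Metadata (MAINTAINER is deprecated; use LABEL instead)
LABEL maintainer="you@example.com"

# Environment variable and working directory
ENV APP_DIR=/app
WORKDIR /app

# Copy files from the build context into the image
COPY . /app

# One chained RUN: a single layer instead of two
RUN pip install jupyter && \
    pip install pandas

# CMD in exec form: what runs when the container starts
CMD ["python", "hello_world.py"]
```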
We want to make sure these containers are small and that they're pretty flexible as well, so each container should have only one purpose, and we want to minimize the number of layers. One thing I'll mention is that you'll see MAINTAINER in some older Dockerfiles, but that's been deprecated, so try to use LABEL going forward.

So we have a hello world Dockerfile; I'm going to go in and check that out. Here I have my Dockerfile and my hello_world.py. Let's just take a look at what hello_world.py is: it's just printing "Hello World". Now let's go into our Dockerfile to see exactly what's going on. As I said before, each instruction creates a new layer. We start off with the base image; I'm using the python:3.6.2-alpine image. Alpine is a Linux distribution that's really small, I believe around five megabytes, so I use it a lot for my container workflows. The next two steps set the working directory and copy the contents of the current directory (where the Dockerfile is, and where we execute the docker build command) into the image at /app. And then we specify that when we start the container, we want to run python hello_world.py.

So let's go ahead and actually do this. We run docker build; -t is for tag, so I'm just going to call this hello-world, and I want to build from my current directory, so I put a dot at the end. And there we go, we built our image. Let's run docker images to take a look; you can see it right up here, built four seconds ago. So let's take this image and create a container out of it. We'll do docker run, and there you go, it printed "Hello World". We can see all the running containers using the docker ps command, but since this container is not running anymore (our process exited), we have to use -a to see what ran before. Here you'll see there's a container ID, it was based off the hello-world image, you can see the command it executed, and it also has a name, practical_austin. If we want to run this container again, we can just do docker start with -a to attach the container's output to the terminal, and we can use either the container name or the container ID. All right, so this slide is just walking through the steps we did: the build, then the run.

So let's talk a little bit about data inside of containers. We should be aware that anytime we delete a container, we delete all the data inside the container layer, that top read-write layer; it just goes away. So we have to think about how we manage data inside of containers. We can use the docker cp command to copy files in and out of containers, but that seems a little tedious. The mechanism Docker created is called a data volume: we take a folder on our local machine and mount it inside our container. When we execute the docker run statement, we specify with the -v flag the full local path and the path to the shared directory where we want this information to appear inside the container. We can also add a VOLUME line to our Dockerfile, but that's more about removing ambiguity; it doesn't really do anything by itself. You still have to use -v to connect the folder to the container.
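Reconstructing that terminal session, the walkthrough above amounts to roughly these commands; the container name practical_austin comes from the demo, and the volume path is a placeholder:

```sh
# Build an image from the Dockerfile in the current directory (-t tags it)
docker build -t hello-world .

# List local images
docker images

# Create and run a container from the image; it prints "Hello World" and exits
docker run hello-world

# Show all containers, including exited ones
docker ps -a

# Re-run the stopped container, attaching its output to the terminal
docker start -a practical_austin

# Mount a local folder into a container as a data volume
docker run -v /full/local/path:/app/data hello-world
```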
So containers can connect to the outside world, but we need to set up port forwarding to connect to the inside of containers. Like with volumes, we use the -p flag to specify the port we want on our host machine and the internal container port. And as before, we can EXPOSE the port in the Dockerfile, but that doesn't actually publish the port; we still have to use the -p flag to specify what we want to hook it up to. Yes, the question was whether -p is for mapping; that's right, you need -p to map from the outside of the container to the inside. I'll walk through an example in a bit, and it'll be clearer that way.

Before this slide I introduced some of the Docker client commands I've been using, so here's a list of links to the relevant sections of the documentation; I've starred the ones I use more than others. These are the commands for containers, and this is the list for images. Looks like there's a slide out of place.

All right, so let's cover the container lifecycle really briefly. We can build an image from a Dockerfile. We can run a container. We can commit. We can also start a container. We haven't talked about kill: if you have a process running in a container and you want to kill it, you can do docker kill with the container name. We can also delete a container with the docker rm command, and if we want to remove the image completely, we can use docker rmi.

To talk about the docker run command in a little more detail: we can override a lot of the container defaults with this command. With the -d flag we run in detached mode, where the container runs in the background. With -a we can attach to STDIN, STDOUT, or STDERR; I showed that a little earlier. -i makes it interactive, and -t allocates a pseudo-TTY. We can also name the container we're creating from our image, and I already talked about -v and -p. And we can pass commands to our containers: after the image name, if we want to get into the shell, we can append /bin/sh to get into the container's shell.

A few tips and tricks: smaller images are better, so only install the things you need and make sure you clear your cache after installing. You can also mount symbolic links. And make sure that any time you're running a process, you bind it to the IP address 0.0.0.0. If you use 127.0.0.1, that's the loopback interface, and you won't be able to connect to your process inside the container from outside. I found that out the hard way.

So let's talk about how we can incorporate Docker into data science. I just want to be clear: there are millions of ways to do this, and I'm just sharing the ways I do it. If you take anything out of this talk, it's that you have tools available to you; use them however they best fit into your workflow. Think of my workflows as suggestions, if you will. Workflow number one is what I call the self-contained container. As we talked about earlier, Jupyter notebooks are the perfect vehicle to share the results of an academic paper or a data study, but they suffer from the problem of "it only works on my machine". With Docker, we can create a completely isolated environment and reproduce every single calculation.
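As a quick sketch of those run flags in action (my-image and notebook are placeholder names):

```sh
# Map host port 9999 to container port 8888; EXPOSE alone does not publish
docker run -p 9999:8888 my-image

# -d runs the container detached in the background; --name names it
docker run -d -p 9999:8888 --name notebook my-image

# Override the default command: drop into the container's shell
docker run -it my-image /bin/sh

# Kill a running container, remove a container, remove an image
docker kill notebook
docker rm notebook
docker rmi my-image
```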
So specifically, we're going to create a Docker image with our notebooks, libraries, and data, and we'll push this image to Docker Hub. Let's go back to the terminal. You see here I have my data folder, my iris analysis Jupyter notebook, and a Dockerfile. Let's examine that Dockerfile in a bit more detail. Here I'm building from the python:3.6.2-slim image; I think that's based on the Debian image. I'm adding some metadata, setting my working directory, and copying my data and my notebook into the working directory. Then I'm pip installing the required libraries: numpy, pandas, scikit-learn, and Jupyter. I'm exposing port 8888, and this is the command I'll run when I launch a container: jupyter notebook, setting the port, making sure it's --no-browser, and since we're running as root, we have to set the --allow-root flag. I'm not going to build this one live, just because it might take some time to download some of these libraries, but I did this ahead of time.

So let's take a look. Here I have my workflow-number-one image. What I'm going to do is take this image and create a container from it. We'll do a docker run command, and since we want to map the ports, I'm going to use 9999 on the host and map it to the running Jupyter instance on 8888, and that's the image we want to run. We can see right here that it's running inside a container. So let's go to this URL. Pretty cool, right? We've got our notebook running inside a container. Let's make this a little bigger and try importing libraries; looks like everything's good to go. And just notice that on my local machine I have Python 3.6.1; I'll refresh this so you can see that inside the container we have 3.6.2, as we were talking about earlier. And then users can just continue on, running through loading the iris data set, exploring it, and visualizing it. We also have scikit-learn, so let's make sure that ran as well. I really like the iris data set, and I'm just doing an SVM fit here, so you can see it all flows through.

So we can close that down. If we do docker ps -a and we want to restart that container, we can just do docker start -a and then the container name, and it will give you another token that you can copy and paste into your web browser. These slides are just going over the commands I used before. So now we have our image, and we want to upload it to Docker Hub. We do that by going to the terminal and typing docker login to get into our account, then docker push with the full image name. Then you can have instructions on your website or blog telling your users to docker pull your repository, and they can use the docker run and docker start instructions to get up and going with your container.

Workflow number two is what I call the data science team workflow. This is great for consulting or project-based workflows where we have to keep analyses separate, or where we have a team and want to standardize the development environment across the whole team. In this workflow, we create a project or team image that contains our development environment, then we mount a shared folder containing the notebooks and data and access it that way.
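Based on the description above, the workflow-one Dockerfile would look roughly like this; the notebook filename and label are assumptions:

```dockerfile
FROM python:3.6.2-slim

LABEL maintainer="you@example.com"

WORKDIR /app

# Copy the notebook and the data it depends on into the image
COPY iris_analysis.ipynb /app
COPY data/ /app/data

# Install the libraries the notebook needs
RUN pip install numpy pandas scikit-learn jupyter

EXPOSE 8888

# Bind to 0.0.0.0 so the notebook is reachable from outside the container;
# running as root inside the container, so --allow-root is required
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
```

A user would then run something like docker run -p 9999:8888 followed by the image name, and docker login plus docker push to share it, as described.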
So the benefits of this workflow: we can keep projects separate. If, say, pandas comes out with a new version and you want to make sure it runs in your development environment, you can easily spin up a new container and test it against your automated test scripts. And if your development environment lives in a container, it's pretty trivial to onboard a new employee. The way we do it at my company is we have one person who keeps the image updated, sort of a Docker image manager, and they update it using a Dockerfile or docker commit. Since it's me, I use a Dockerfile.

So let's go back to the terminal. Here we only have a Dockerfile, so let's take a look at it. Unlike before, I'm going to use the Miniconda image as my base image. Miniconda is just Python with the conda installer, and conda is really popular for data science, so I thought it would be a good example of what we can do. I'm setting the metadata as before, setting my working directory, installing the required libraries with conda install (make sure you clean up afterward), making the port available, creating a mount point, and then running Jupyter when the container launches. As before, I created the image already, so let's build off of it. We'll do docker run, connect the ports, and add a mount point; I have a folder right here, and I think we call it /app. And we want the name of our image (sorry, not container). Close the window, and let's do this. You can see here we have access to that mounted directory. We'll just confirm: this is the directory we had before, so we'll do an ls, and you see the same contents in here as well.

So the question was: where exactly is the data? The data is on my local computer, and I have a mount inside the container; the data is actually in this local directory right here. The next question was whether the notebook can access anything else on the disk. So if you have, say, a symbolic link, you actually have to specify and mount that symbolic link as well. I had the data inside of that directory before; it's a directory on my local machine, and I just created a window inside my container that links to that directory. Any other questions? Right, this would be read-write, because any change you make on your local machine will show up in the container, and any change in the container will show up on the local machine. I just closed my notes, so let me bring this back. Sure, so when you're running Jupyter as root, you need to have that --allow-root flag there in order to run it; otherwise it won't start. No worries. And the question was whether the operating system matters, and no, it does not. You can use anything and pass images around from Windows to macOS, macOS to Linux; you can build a container anywhere you want. And with Docker Hub, you can build your image and upload it. I just built an image, and you can actually go to my Docker Hub repo and download that same image, whether you have Windows or, like I have right here, a Mac. Okay, my slide is a little further back here, sorry about that. So workflow number three is a data-driven application.
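Here's a rough sketch of what that team Dockerfile could look like; the continuumio/miniconda3 base image and the package list are assumptions, not necessarily what was on screen:

```dockerfile
FROM continuumio/miniconda3

LABEL maintainer="you@example.com"

WORKDIR /app

# Install the team's libraries with conda, then clean the package cache
RUN conda install -y numpy pandas scikit-learn jupyter && \
    conda clean -a -y

EXPOSE 8888

# Mount point for the shared notebooks-and-data folder
VOLUME /app

CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
```

A container would then be started with something like docker run -p 9999:8888 -v $(pwd):/app team-image, so the shared project folder on the host shows up at /app inside the container.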
And so here, once we've completed our data study, we might have an app, probably a Python app, that contains a model or a dashboard to show our results. If we want to monitor specific metrics that we've identified as having business importance, we can easily do that using this workflow. The number of ways we can set this up is infinite: we can have containers connecting to other containers, we can have one container holding just the machine learning part and another container running the serving part. There are so many ways to set these up; I've included links for further reading, so check those out at your leisure.

What I'm going to show is a dashboard I created using Plotly's Dash library. All the data is stored on my local machine, and I'm going to put the code and all the libraries required to run it inside my container. Then I'm going to make this container into an executable that sits on top of the data, so every time we start a container, we can go in and view our dashboard.

So let's go back to the terminal. Here I have my Python file, plot_time_series.py; I have a requirements file, which lists all the libraries I need to run it; and my Dockerfile. Let's take a look at that Dockerfile a little more closely. I'm building off the Alpine Linux image; it's a really small image, and I really like using it. I'm setting my working directory and copying all the contents of my directory into that working directory, so that's just the plot file and the requirements file. Then I'm doing a pip install with -r pointing to the requirements file I copied over, exposing the port, and creating a mount point. And I'm using ENTRYPOINT because I want Python to be my default executable, and I can pass in plot_time_series.py as the command to plot that time series; right now I'm just setting that as the default.

Just like before, I created this image earlier, so let's go ahead and set it up. We'll create a container using docker run, mapping the ports and the mount directory. I have a directory on my local machine where I'm publishing a number between one and five every two seconds; you can see the results of that, and I'm mounting it to /app/data. Let's give this a name, a flag we haven't seen before; we can just call it dashboard, and I want to use the image (sorry, not container) I built earlier. Oh, sorry, we already have one with that name, so let's call it dashboard2. All right, it's running on port 8050, so let's go take a look. Here I have a live dashboard that's updating every two seconds with a number between one and five. Let's just make sure it's actually working; I'm generating data right now, so let's generate a little faster, I think this does it every half second. You can see the data points are already a lot closer together. Pretty awesome, right? All right, those are just the commands I went through.

I have another workflow I've been working on, which is actually developing inside containers using test-driven development.
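A sketch of that dashboard Dockerfile as described (the label is a placeholder; the file names follow the demo):

```dockerfile
FROM python:3.6.2-alpine

LABEL maintainer="you@example.com"

WORKDIR /app

# Copy the plot script and its requirements file into the image
COPY . /app

# Install the dashboard's dependencies (dash, plotly, and so on)
RUN pip install -r requirements.txt

EXPOSE 8050

# Mount point for the locally generated data the dashboard reads
VOLUME /app/data

# Python as the default executable; the script is the default argument
ENTRYPOINT ["python"]
CMD ["plot_time_series.py"]
```

The run command would then look something like docker run -p 8050:8050 -v /full/local/path:/app/data --name dashboard2 followed by the image name.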
I haven't really gotten too far with that. July was sort of my month of learning Docker really, really well, and I guess I just didn't plan that properly. But I will be posting something on my blog in the next two or three months about the workflow I find works best. There's also a post I found on the Docker website about live debugging with Docker, so you should check that out if you're interested in learning about that.

So what are the next steps? Go to the Docker website and install Docker: Windows, macOS, Linux, what have you. I don't know if you've noticed, but I've been referencing the Docker documentation a lot as my source; the documentation is fantastic, so check out Getting Started. If you're into Pluralsight, Nigel Poulton has a course called Docker Deep Dive, which is fantastic. And CenturyLink has a great resource for Docker as well. And that's it for me. Little Easter egg. Also, thank you for your time. I'll answer your questions.

Yes: what work have I done using Docker Compose with a data science project? Docker Compose is a way we can run multi-container apps. I haven't really used Docker Compose for a data science project, but I have done a Flask website with a Redis back end, where we're just connecting containers. If you're interested, I can put together some Compose files for the examples I have right now and throw them on my GitHub.

Yes, so the question was: can you make Docker into an executable? Not really. You can use the exec form with ENTRYPOINT to make a container behave like an executable, and you can write some scripts to automate that workflow, so users can just double-click a script and it will pull up everything without having to worry about the underlying complexity. Yeah: throw everything in a zip file, zip it up, and just have instructions on how to run it.

The point raised was that when we're running Docker on Windows or Docker on Mac, we're not really running it on that operating system; we're running it in a Linux virtual machine. Yes, that is correct, and there are some limitations that come with that. I've run into problems, I can't really remember them offhand, but there are pages in the Docker documentation that point out the limitations.

Anybody else? So the question: what data source can you use? You can use any data source you want. If you want to spin up, say, a SQL container and have things in there, you can do that, or a MongoDB container; you can do whatever you want. The workflows are endless.

Yes. Okay, so the question was: have I worked with workflows where the data scientist has built a model and has to deploy it? That was sort of the data-driven app workflow I was showing. Here, instead of building a predictive model, I just have a dashboard, but you can run any kind of application you want inside a container. There's also an open source project called Pachyderm that allows you to containerize the different stages of your machine learning pipeline, so I highly recommend checking that out. [Inaudible question.] I've never really dealt with anything like that; just a shout-out for Kubernetes from the audience. Anybody else? All right. Thank you so much.
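For the Compose question, here's a minimal sketch of a docker-compose.yml for the kind of Flask-plus-Redis setup mentioned; the service names, build context, and ports are all assumptions:

```yaml
version: "3"
services:
  web:
    build: .           # Flask app built from a Dockerfile in this directory
    ports:
      - "5000:5000"    # host:container port mapping, same idea as -p
    depends_on:
      - redis
  redis:
    image: redis       # official Redis image from Docker Hub
```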