All right, welcome everyone to the Parrot track of EuroPython 2020. I hope you all had a good time in the initial part of the conference, where Mark introduced you to all the platforms — this being the first time we're having the conference online. This is going to be the first talk of the data science session here on the Parrot track, which is exciting. So without further ado, I'm going to introduce you to Tania, who's joining us from the UK.

Oh, hey. — You can unmute yourself. — Of course. Hello everyone. Of course, my dog decided that this is the perfect time to start barking at whatever she's barking at, because that's how it works. But thank you very much, everyone, for joining my talk. I'm going to share my slides, if that's okay.

For those of you that don't know me, my name is Tania Allard, and I'm a senior developer advocate at Microsoft. This is where you can find me: on Twitter, on GitHub, and on my personal site. I run a lot of side projects, things like mentored sprints, and I'm also starting a podcast called Python 101 — I do a lot, lot of stuff here and there. After the talk you'll be able to find these slides at this URL; they're not available yet, but they will be made available soon after I finish the presentation. If you have any questions, feel free to put them on this board. I'll answer some right after my talk, and for the others I'll jump into the chat myself. I'm also going to be available later on in the Microsoft sponsor room, so we can carry on the conversation or talk about anything else you're interested in.

Just to set expectations about what this presentation covers: I'm going to explain why you'd want to use Docker, especially in a data science and machine learning context, because it can be a bit different from developing web apps or other sorts of applications with Python. I'll give you some tips for improving your security and performance when working with Docker, and some ways to automate your workflow so you don't have to do everything from scratch. And I'll finish with a summary of tips and tricks for using Docker, so you can jump straight into it.

So let's start with why you'd want to use Docker. I'm sure a few of you have been in this situation: you're developing an application — it can be anything: a model, a web app, an API that returns a prediction. When you try to share it with somebody else, if you're not sharing your environment as well as its specification, you'll see that your colleagues run into problems: a module is not found, or the data or the environment variables are not set. If you're not sharing everything folks need to rerun or reproduce your analysis or your app, that's a massive blocker right there.

Docker is an amazing tool that helps you create, deploy, and run your applications using containers, and it gets rid of the problem that your laptop is not a production environment: it allows you to mimic the environments where you'll actually deploy your apps or the products your customers use. And this is how I'm going to depict a container throughout my presentation. This is just a mental exercise.
It's just a mental representation, but it will help us understand where in the workflow a container fits. As I said, containers are a great way of solving the problem of "my laptop is not a production environment". They allow you to move your software — your application, your model — from one computing environment to another: your laptop, your test environment, your staging environment, and your production environment, seamlessly, with all dependencies and requirements already bundled, so that it basically runs out of the box — or out of the container. So when you're working with containers, you're not only developing your application or your model; you're also bundling together all the libraries, the dependencies, the runtime environment, and the configuration files within the container, so that folks can reproduce, reuse, and build on your application in a seamless manner.

If you've ever used virtual machines to develop on a different platform or operating system, this abstraction might sound a bit familiar, but the nice thing about Docker and containers is that the abstraction happens at the app level. You have your infrastructure — whatever it is: cloud, local, an HPC cluster — you have your host operating system, Docker sits on top of that, and you can have multiple apps containerized. It takes very little overhead, and you can have multiple runtime environments — multiple containers — running on the same infrastructure as isolated processes. The difference with a virtual machine is that the abstraction happens at the hardware level: you have your infrastructure, a hypervisor, and then full guest operating systems. So you can imagine having a full Windows operating system, or a full Linux — Debian or Red Hat, for example — and it's very, very bloated, because it includes all of the native packages and dependencies that said operating system needs to operate, plus the binaries. That makes it much more bloated than an actual container.

Now, if you're not very familiar with the lingo of containers and images — and this happened to me at the very beginning, when I was just getting started — you're going to read tutorials and find so many words being thrown around. So let's start with an image, which is an archive with all the data needed to run your application: basically a snapshot of all the libraries, the dependencies, and the code that you need. You then have a tag, as you would normally do with your software when it's in version control and you're making a release: every time you update, for example, the binaries or the libraries, you make a tag. You can then pull that image from registries like Docker Hub, and you run it — and every time you run the image, it creates a container. That's where you do your development work: you can mount volumes, you can persist some data, but it all spins up from the Docker image. You can actually have multiple containers spun up from the same image.
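A minimal sketch of that image-to-container flow, using the official Python image as an example (the tag and mount paths here are just illustrative, not from the talk):

```bash
# Pull a specific, tagged image from a registry (Docker Hub here)
docker pull python:3.8-slim-buster

# Each `docker run` spins up a fresh container from that same image
docker run --rm python:3.8-slim-buster python -c "print('hello from a container')"

# Mount a local directory as a volume so your work persists outside the container
docker run -it --rm -v "$(pwd)":/home/work python:3.8-slim-buster bash
```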
Now again, if you've tried to learn how to use Docker, more than likely you've gone to your web search engine and typed "Python and Docker tutorial", and you'll find that most folks have developed tutorials based on web applications. But things aren't exactly the same when we're doing data science and machine learning. We can have very complex setups and dependencies — especially if you work with things like Arrow or TensorFlow or Kubeflow — and getting things right can sometimes take a long time. We also have a high reliance on data: we have data ingress and egress all the time, and we often work with data that is not publicly accessible, or is sensitive, so making sure that our development environment is secure is critical. And the iterative research and development process of building a new model, algorithm, or application evolves very fast: sometimes you're installing dependencies just to try something out, it doesn't work, then you try another dependency, and so on. Getting things right in Docker and orchestrating containers can take a lot of time and can be a bit complex.

So how is it different from web apps, for example? We all know that the Python packaging ecosystem is a bit complex, to say the least, so sometimes we need to figure out where "good enough" starts with packaging. We have very complex dependencies. If you have a team where some folks work in Python, some others in R, and some others in Julia, where do you draw the line? We need R, we need Julia, we need all of these dependencies, and we need to optimize builds for all of this. It is sometimes very hard to reconcile all that and get robust, but also lightweight, containers for data science work.

Also, in our case, not every deliverable is an app. Not everything you do with machine learning, and not every product built on machine learning, is going to end up as an app or as an API — we have multiple types of deliverables. And there's a lot of rhetoric out there saying that machine learning is all about models; that's not the truth either. Not every deliverable is a model. As I said before, we also rely on data — data is our primary material — and we deal with it in many, many ways.

Because of how our scientific Python ecosystem works, we're going to have a mixture of files and a mixture of compiled packages that we need. Probably some of us will use Conda, other folks will use things like Poetry and Pipenv, and if you're using Conda there are a lot of different channels as well. And if you're working in a multidisciplinary team, or across multiple machine learning projects, people are going to need different security access levels for both data and software. For example, if you're working with highly confidential data, you might want to block internet access from the container altogether, while other folks working with public datasets don't need such high security access levels. So again, how do you reconcile all of these apparently conflicting requirements? It can be very, very tricky.
And finally, especially when you're working to create products based on machine learning, you have a lot of folks that use Python and use data in very different ways: folks in the data science team, software engineers, and some teams have machine learning engineers who take care of getting your machine learning deliverables into production, out there in the world.

So again, when I was learning to use Docker, I experienced a lot of frustration, because I would go and search "how do I build a Docker image for Python?", and this is what I would find everywhere — and it's a bad example; I'm going to tell you why. You always start with a Dockerfile, a specification file where you provide a set of instructions for what software to install and how to configure your image. You normally start from a base image, which defines what operating system or kernel you're going to be working on — in this case we'd want something starting with Python — and then you provide the main instructions. If you're familiar with Bash, it's very similar: the same syntax, the same kind of instructions you would type on the command line. Now, you have to be very careful, because everything stacks on top of everything else: Docker builds images in layers. So you can imagine that if you follow the traditional workflows, you can end up with a very big, bloated image.

So how do you even choose the base image? It depends a lot on your requirements. A lot of tutorials out there tell you to use Alpine because it's very lightweight and doesn't have a lot of unnecessary binaries. Don't do it. It's an absolute pain to get things working there. If you need to build from scratch, use the official Python images — I recommend the slim variants, which, as the name says, are slimmer and thinner, and I'd point you at the Buster-based versions for 3.7 and 3.8; I can talk more about that in detail. If you don't need to build everything from scratch — and I absolutely recommend this for most cases — go to the Jupyter Docker stacks. The folks there have done an amazing job of understanding the most common requirements data scientists have, and they've already pre-baked a lot of Docker images for you. If you've ever used things like Binder or repo2docker, this is a very similar stack, and it saves you a lot of time, a lot of fiddling, and a lot of headaches.

So, we've found our base image — what do we do next? You always have to know what you're expecting. If you remember, the first example just says "python:3"; we need specific tags so we always know what we're pulling — avoid using things like "latest". Provide context with labels, especially if you're sharing the image. And something a lot of folks forget, and a lot of tutorials don't tell you about, is adding a security context, because this will make your Docker container much more secure, and you can start using tools like Snyk for vulnerability assessment. If you need to run very complex statements, split them and sort them. And as a general rule of thumb, between COPY statements and ADD statements, I always prefer COPY: it's explicit and does exactly what you expect.
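To make those points concrete, here's a hedged sketch of a better starting point than the bare `FROM python:3` example — the label values and file names are placeholders, not from the talk:

```Dockerfile
# Pin a specific tag instead of `python:3` or `latest`
FROM python:3.8-slim-buster

# Labels give whoever pulls the image some context about provenance
LABEL maintainer="you@example.com" \
      org.opencontainers.image.source="https://github.com/your-org/your-project"

# Prefer COPY over ADD: it is explicit and does nothing surprising
COPY requirements.txt /tmp/requirements.txt

# Split complex statements and clean up within the same layer
RUN python -m pip install --no-cache-dir -r /tmp/requirements.txt
```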
Also, make sure to use the cache. Whenever you're building your images and you change something in your software, then depending on where you put things — the COPY statements, the RUN statements — everything after that point gets rebuilt. So try to leverage the build cache. Normally, install the requirements first, unless you're updating your libraries; do a clean-up of your installs, whether you're using Conda or pip, so your image is not bloated; and separate your instructions by scope. This ensures that your cache is hit as often as possible. Again, only install what you need, and pin concrete versions of everything.

And something we forget a lot when working with data: explicitly ignore files. If you're familiar with Git or version control, you might remember the .gitignore file. We have something similar for Docker, called .dockerignore, and it follows the same process: you can exclude certain files or directories so they're never passed into your build context or used in your container. This is especially good for settings, environment variables, or super-secret keys that you don't want out there.

To access data, there are lots of ways to do it, and it depends on where your data lives. If you're using local data, create mounts to directories instead of moving the data over, because you always want your data up to date instead of baking it into the container. And also: create a non-root user. This takes us directly to security and performance. By default, Docker allows you to do everything a root user can do, but you don't want that — you don't want to introduce vulnerabilities; you want to favour the least-privileged user. If you go, for example, to any of the data science stacks from the Jupyter Docker stacks, or to repo2docker, you're going to see that they create a non-privileged user — called jovyan, for example — and that's where all of the work takes place. That allows your container to be locked down: folks are not going to have access to the kernel, and are not going to be able to take potentially damaging actions, so you're minimizing capabilities. It also prevents a lot of issues. I don't know if you've ever been working in a container and you get an error saying you don't have access to a certain directory, or you can't mount a volume — normally that's because of an incompatibility between root user privileges and whichever user you're trying to work as inside the container. Having this tightened up is essential, especially if you have security access level restrictions because you're working with confidential data.

Again, I said that Docker images are like onions: everything is contained in different layers. Sometimes you think, oh well, if I copy this key in one layer and then delete it afterwards and clear my cache, it won't be there. Everything stays there. Everything stays in an intermediate layer; it might not be visible in the outermost layer, but there are tools with which you can see how your whole Docker image was built, inspect the layers, and extract all of your sensitive information. So again: keep secrets out of your Dockerfile.
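Pulling those tips together — cache-friendly ordering, an explicit ignore file, and a non-root user — here's a minimal sketch; the file names, user name, and UID are illustrative:

```
# .dockerignore — keep secrets, data, and clutter out of the build context
.git
.env
secrets/
data/
__pycache__/
```

```Dockerfile
FROM python:3.8-slim-buster
WORKDIR /app

# Dependencies first: this layer stays cached until requirements.txt changes
COPY requirements.txt .
RUN python -m pip install --no-cache-dir -r requirements.txt

# Source code last, since it changes most often
COPY src/ ./src/

# Least privilege: create and switch to a non-root user
RUN useradd --create-home --uid 1000 appuser && chown -R appuser /app
USER appuser
```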
There are different ways to keep sensitive information out, and something I use a lot, that I recommend, and that is probably the most robust way to do it, is multi-stage builds. Basically, you have a base image — in this massive Dockerfile I'm using slim-buster — and that's my compile image, so you can fetch and manage secrets there. And not everything comes pre-compiled, so if you also need to compile packages — for example, if you need GCC or gfortran for something — you can do the compiling in that first stage and then carry the results over to a second stage. Using this approach also gives you much smaller images overall.

So let me go over how you would do it. You have your Dockerfile, and you use the same docker build command: you specify your Dockerfile and the build context, and provide a name and a tag. It starts by creating the first image, the compile image. In this particular example I'm compiling some packages: I'm providing options for my compiler and installing some requirements. Then, in the second image — the actual runtime image, the one where I'm going to do my development work — I carry over all of these pre-compiled packages and install them directly in a virtual environment that I create. This virtual environment is what holds all of my final compiled, installed dependencies. And if I had secrets in the first image — say, special compile flags or special access tokens — those are not passed over to the runtime image, which makes things much more secure for me and whoever else uses it. My final image has the name and tag I provided as part of my docker build command, and it contains everything I need, but it tends to be much, much smaller: I don't need to carry over GCC and gfortran, which by then are unnecessary.

And I know that at this point this probably all sounds like a lot — because it is a lot — and it can be very overwhelming, especially when you're not an experienced Docker user or you're just getting started. So my best advice is: automate, and try not to reinvent the wheel. Most of the time you don't need to build everything from scratch, unless you need very specific setups, permissions, libraries, or access levels.

Something I always recommend to anyone working in data science — for reproducibility and portability — is to always know what you're expecting, everywhere. The best way to automate and optimize your Docker builds is to also have a consistent project structure and template. I like using cookiecutter-data-science, and there's a Docker-ready version, cookiecutter-docker-science, which already gives you a very good baseline for how your project should be laid out. This makes it much easier when you're building your Dockerfiles and carrying files over, because it's easier to know where things live and to debug. And unless you have very specific requirements, leverage tools like repo2docker, because it gives you configured and optimized Docker images.
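Before going deeper into repo2docker, here's a minimal sketch of the multi-stage pattern described above — a compile stage that holds the compilers (and any build-time secrets), and a runtime stage that only copies the virtual environment across; the tag and file names are placeholders:

```Dockerfile
# ---- Stage 1: compile image (compilers and build-time secrets stay here) ----
FROM python:3.8-slim-buster AS compile-image

RUN apt-get update && \
    apt-get install -y --no-install-recommends gcc gfortran && \
    rm -rf /var/lib/apt/lists/*

# Install everything into a virtual environment so it can be copied over
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# ---- Stage 2: runtime image (no compilers, no build-time secrets) ----
FROM python:3.8-slim-buster AS runtime-image

COPY --from=compile-image /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
```

Built the usual way, specifying the Dockerfile, context, name, and tag:

```bash
docker build -f Dockerfile -t my-image:1.0 .
```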
The folks working on Jupyter, Binder, and repo2docker in general have put in a lot of work and effort. You can install repo2docker through Conda, and then, if you already have a repository with an environment.yml or a requirements.txt, you run jupyter-repo2docker and you don't have to write any Dockerfiles: it creates your Docker image, ready to use. And if you want to use something like Binder, this is the same Docker image that Binder would create, so you ensure your project is ready for use there too. Instead of having to write a massive Dockerfile, everything is done for you. I absolutely love repo2docker — I'm a massive fan. And the beautiful thing about repo2docker is that it works with pretty much any package specification you're already using, whether that's an environment.yml or a pip requirements file; if you're using Julia, you can use Julia environment specifications, and it installs R for R users; you can even use Nix package manager specification files. And if you're already a heavy Docker user and need very, very specific environments, you can also provide your own Dockerfile.

Also, if you're using Docker containers to do your dev work, you probably use them daily, or every time you work on a project — that's good, but also make sure that your image is rebuilt frequently. If there's a Docker container I use every day, I probably want to rebuild the image every week or every other week, because that ensures not only that all my dependencies are up to date, but also the binaries of the operating system I'm using. If you're using something like python:3.7-slim-buster, you also get the latest security patches and updates, so your container keeps receiving the latest security fixes.

But you don't have to do this manually. If you already have version control and are using things like Travis or GitHub Actions for testing and continuous integration and continuous delivery, you can delegate the image build to those tools as well. For example, in GitHub Actions, not only can you build your image when there's a pull request or a release, you can also set scheduled builds. In this example I build weekly, on Sundays at two o'clock in the morning — a totally arbitrary time; it could be every Monday at five, or every Friday at five when the week is finished. Then you have concrete tags, and the image can be pushed directly to whatever container registry you use — Docker Hub, Azure Container Registry, Google Cloud, or your own in-house registry. This makes your workflow much, much easier: you have your code in version control, whatever you use to build your images — a Dockerfile or repo2docker — your triggers on tags, your scheduled triggers; you build your image, push it, and everybody can use it straight from your registry.
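For reference, the repo2docker workflow described here boils down to something like this (the repository URL is a placeholder):

```bash
# Install repo2docker, e.g. from conda-forge (pip also works)
conda install -c conda-forge jupyter-repo2docker

# Build and run an image straight from a repo that has an environment.yml
# or requirements.txt — no Dockerfile needed
jupyter-repo2docker https://github.com/your-user/your-project
```

And a hedged sketch of a scheduled GitHub Actions build like the one described — the registry, image name, and secret names are assumptions, not from the talk:

```yaml
name: build-image
on:
  release:
    types: [published]
  schedule:
    - cron: "0 2 * * 0"   # weekly: Sundays at 02:00, as in the talk's example

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build and tag the image
        run: docker build -t myregistry.example.com/my-image:${{ github.sha }} .
      - name: Push to the container registry
        run: |
          echo "${{ secrets.REGISTRY_PASSWORD }}" | \
            docker login myregistry.example.com -u "${{ secrets.REGISTRY_USER }}" --password-stdin
          docker push myregistry.example.com/my-image:${{ github.sha }}
```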
So, just to summarize — and I know I've given you a lot of information, and your brain is probably full at this point — I'm going to give you the top tips, the minimal, baseline requirements you should try to get into your Docker and data science workflow.

First, rebuild your images frequently, and make sure you're getting the security updates for system packages. This is especially important for avoiding vulnerabilities in any of your images.

Second, never work as root — minimize the privileges. If you're building your own images, make sure that right before your entrypoint, after you've installed all the binaries and system dependencies, you switch to a non-privileged user with access to the working directory.

Don't use Alpine Linux. It's very good for a lot of things, but for data science and machine learning it's much more trouble than it's worth. Yes, it's a very small image, but you're paying the price for that small size. My advice: go for Buster — it's probably the best distribution at the moment, and it has long-term support — or use the Jupyter Docker stacks.

If you can't use the Jupyter stacks, always know what you're expecting. Pin all the versions — pin everything. And, this is very opinionated, but instead of just doing a traditional "pip install -r requirements.txt", use pip-tools for dependency resolution, or Conda, or Poetry, or Pipenv (there's a short sketch of this right after the summary). Choose whatever tool you prefer, stick to it, and make sure you always know what to expect from your base image, from all your dependencies, and even from your databases.

Leverage the build cache: be smart, and separate your RUN commands based on context. This ensures your image doesn't get completely rebuilt every time there's a minimal change in your code, and makes the most of the build cache.

Use one Dockerfile per project. Sometimes folks have a single kitchen-sink container or Dockerfile with all the dependencies they need for every single project they, or their company, could be working on. It's very, very troublesome to do it that way. So: one project, one Dockerfile, one image.

Use multi-stage builds if you need to compile code, if you need to reduce your image size, or if there's no way you can use build args or environment variables when orchestrating your containers. And make your images identifiable: sometimes you need to provide different environment flags or build flags to differentiate between test, production, and research and development environments, because you need access to different databases, or different ingress or egress rules. So make sure all of your images are identifiable and that you're providing the right variables.

Do not reinvent the wheel: if none of these advanced requirements apply to you or your project, use repo2docker. It is amazing, I love it, and I use it all the time.

And finally, automate. There's no need to build your image yourself every week and push it manually. Delegate as much of these tasks as possible — building, tagging, pushing — to whatever platform you use for continuous integration or continuous delivery. I showed an example with GitHub Actions because it lets me do scheduled runs, and that works for me.
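As promised above, here's the pinning tip sketched with pip-tools; the file names follow that tool's conventions:

```bash
# requirements.in lists only your direct dependencies, loosely constrained
pip install pip-tools
pip-compile requirements.in   # resolves and writes a fully pinned requirements.txt
pip-sync requirements.txt     # installs exactly what is pinned, nothing else
```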
But again, choose whatever works for you and your team — just don't do it manually, because it's boring. It is boring, and you don't want to be rebuilding your image manually and pushing it every week.

And use a linter. I didn't mention this before, but my editor of choice is VS Code — I've been using it for a very, very long time — and there's a Docker extension that I absolutely love, because it provides linting capabilities. I can write my Dockerfile and make sure that I'm using the correct commands and that everything is written correctly, and it also helps with a lot of the tasks in my Docker development workflow. So, especially when you're starting to use Docker, I highly recommend using a linter, just to make sure your syntax and your structure are correct. Also, if you're working with multi-stage builds, it can sometimes be quite hard if you have everything in one Dockerfile — I sometimes split them into separate Dockerfiles — but generally, use a linter and it will make your life so much easier, in a similar way to how linters for Python work.

So, I hope you find these tips and the content of this presentation valuable, and that it has convinced you to try to optimize or improve your Docker and data science workflow. As I said, I'll take some questions now — I have probably five minutes or so — and I'll also be in the Microsoft and VS Code room later on, so you can come chat with me about Docker, machine learning, VS Code (I love VS Code), Jupyter, Netflix, whatever it is you want to talk about. Thank you very much, everyone. I think I'm going to stop sharing my screen now.

Hey. All right, thank you for that amazing talk — I think all of us learned something new from it. So this applause is for you, and also for your dog. — Oh, my dog is the worst. — No, it was awesome, we were having fun. So, we have two questions; I think you have five minutes left, and we can take them right now.

The first question is from Ignasi: why not use environment variables or volume mounting for sensitive information, instead of increasing Dockerfile complexity with multi-stage builds? — Well, it depends. If you're setting your environment variables and providing them through whatever orchestrates your container — for example, if you're using Azure or AWS or Kubernetes, you can provide them as you run your container — that's fine. But a lot of folks actually use them as build environment variables, and those are persisted into your final image. Those are the cases where you should avoid providing them directly. If you can provide them at runtime, that's fine. I hope that answers your question; you can also type follow-ups in the questions board. We actually have time left for more questions.

So, the next question is from Diego, about users and mounted volumes: "I want to share a Docker image with multiple users, with a mounted volume in read-write mode. The process will have the UID and GID of the Docker user, not the host user. This can lead to all sorts of permission errors, because the Docker user and the host user are different. How do you solve this issue without rebuilding the image with the right user?" — Oh, so to avoid these permission issues, that's exactly why I set up the non-privileged user within the Dockerfile.
That's the easiest way, because that way you can set your UID and your GID — and I just forgot that command — so you can create your directories and make sure the permissions are correct; I forgot the command, if someone can help me. — You can post it in the breakout room later. — Yeah, I'll do that later, but I can share an example of how I normally do it in my Dockerfiles to prevent this. It's very hard if you don't do... oh, chown! You have to do a chown on whatever directory, for the relevant user ID and group ID. Otherwise you're always going to have problems between the Docker host's and the Docker user's permissions, and that's normally because of Docker's default behaviour of always running as root.

Okay, so the next question is from Johannes: could you say something about env vars and build vars, especially about database access and what one should avoid? — I think I missed that, can you repeat the question? — Oh yeah, got it: could you say something about environment variables and build variables, especially about database access and what one should avoid? — Oh yes. For example, sometimes you have a production database and a separate R&D database, because that's how a company or a project is designed to work. So when you build your Docker image, you can pass a build variable: you can have a variable that takes one value for production and another for R&D, and you can imagine that, if it's production, your environment variable points to your production database, and to the R&D one when you're in R&D. This is very helpful, because that way you ensure folks are working within the domain they need. Also, I've seen cases where folks working in an R&D environment share, for example, the same password or the same user for database access — and that's okay as long as you can't, say, completely wipe out your production database. You want to be very, very careful and avoid any destructive operations against your production database, and that's where setting these test, development, and production environments through a build flag can be very useful.

All right, that's awesome. I think we are out of time; we can take the rest of the questions to the breakout room, or you can reach out to Tania later. Thank you for your talk, and thank you everyone for coming. I think we have a short coffee break, after which we'll come back to the Parrot track again. Awesome, thank you very much. Bye.
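To close the loop on those last two answers, here's a hedged sketch of both fixes — the user name, UID, image name, and database URL are illustrative, not from the talk:

```Dockerfile
FROM python:3.8-slim-buster

# Match the host user's UID so mounted volumes stay writable, and chown
# the working directory — the fix discussed above (override at build time
# with --build-arg NB_UID=<your host UID>)
ARG NB_UID=1000
RUN useradd --create-home --uid ${NB_UID} appuser && \
    chown -R appuser:appuser /home/appuser
USER appuser
WORKDIR /home/appuser
```

```bash
# Prefer passing secrets and config at runtime...
docker run -e DATABASE_URL="postgres://rnd-db.example.com/mydb" my-image:1.0

# ...build args are fine for non-secret switches (production vs R&D),
# but they can be recovered from the image layers, so never use them for secrets
docker build --build-arg ENVIRONMENT=rnd -t my-image:1.0-rnd .
```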