I had a great time too, and by the way, this is my first EuroPython. So I'm going to talk about Docker and Python. Docker is a popular way to package and run applications; however, when you're packaging Python applications in Docker, there are some caveats, so I'm going to share with you my lessons learned when I was trying to optimize Docker builds. I hope you will find something useful, something valuable, something that you can take away and apply in your environment. My name is Dmitri, and I'm a systems engineer at Cisco. I create Python applications for internal use, and I focus mostly on network automation. You can find the slides here if you'd like to follow along. Now, before we proceed, let's do a quick show of hands. Who has been using Docker already? Wow. Okay. Keep your hand raised if you are using Python and Docker together. Okay. It's the same amount. That's awesome. So in this talk, I'm not going to talk to you about why Docker, what the benefits are, and whether you should use it. If you haven't already, you should check it out. I started using it around two years ago, and it completely changed the way I deploy applications. To make sure that everyone is on the same page, let's start with some Docker terminology. First, the container. It's a lightweight way to package your application with its dependencies. Different containers have some isolation: they have separate user space, but they share the kernel of the host. Next is the Docker image. A Docker image is a template to create Docker containers. It's built using a Dockerfile, and it consists of read-only layers. We are going to talk about layers later. We can upload the image to a registry and share it with others. A Dockerfile is a set of instructions to build an image. You start with the base image from which you are going to inherit.
And then every instruction does something, and it creates a new layer, which is cached for future builds. So in this case, I have a very silly example where I inherit from the Debian image. I copy a file from my host system to the image. Then I run some command inside of this image. And then I say that, okay, my default command when I start this container is going to be this. So what is a Docker container? It's a container created from a Docker image. We add a writable layer on top. We allocate resources. And then when we start the container, we execute the entrypoint and CMD commands. And the last one is the registry. It's a place where we store and share tagged images. So here is a very simple diagram, again, just to summarize. We have a Dockerfile. We use the docker build command to build an image from it. We can change the tag using docker tag. We can push and pull this image to the registry. And when we want to run a container, we do docker run on the image tag. We apply the entrypoint and CMD, and here we go, we have a container. Now, in this talk, we are going to focus mostly on the left part, on the build process. Okay, so Python and Docker. On the left-hand side, you can see my sample project. I have my_project, which is a directory containing a Python module, and then I have main.py, which I run when I want to run this application, and it calls some function from my_project. The details of the Python code itself are not really important here; the structure is more important. Another thing that I have is requirements.txt, which contains the Python dependencies. In this case, I have only two: requests and cryptography. And you will understand why I chose something like cryptography for this talk. Now, in the middle, you can see a sample Dockerfile where I inherit from the Python 3.7 image. I create a directory called app. And then I copy everything from the current directory on the host system to the image.
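The sample project described above might look roughly like the sketch below. The function name `greet` and its behavior are my assumptions; the talk only says that main.py calls some function from my_project, and the real contents don't matter for the Docker discussion:

```python
# Sketch of the two pieces of the sample project (names are assumptions).
#
# my_project/__init__.py -- the package with the actual logic:
def greet(name: str) -> str:
    """Return a greeting; stands in for whatever my_project really does."""
    return f"Hello, {name}!"

# main.py -- the entry point the Dockerfile runs with `python main.py`:
def main() -> str:
    # In the real project this would be `from my_project import greet`.
    return greet("EuroPython")

if __name__ == "__main__":
    print(main())
```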
I install my dependencies. And then I define that I'm going to run python main.py. Now, so far so good. It's very simple. However, the size of this container is around one gigabyte, and in this talk we are going to see how we can improve on that. So, before we go further, let's define our optimization objectives, what we are going to optimize. There are two things: image size, but also build time. And that can be the initial build time as well as subsequent build times. Now, let's also define priorities. These are the ones that I defined for myself, for my projects. During development, I would like to have fast builds. I care less about image size during development, because whenever I change my code, I like to see results much faster. But for production, I prefer small image size. In your case, the priorities could be different. Okay. So, the first and most important thing is selecting the base image. Here is a comparison table. I have python:3.7, which corresponds to the tag python:3.7-stretch, where the base image for that is Debian Stretch. Its size is around 900 megs. We have slim-stretch, which is a much smaller version, but still using Debian as the base image. It's 150 megs; you can see it's a six-fold difference already. We also have Alpine. Alpine Linux is a very popular base image, especially in the container world, because of its small image size. You can see it's almost half the size of slim-stretch. Now, what are the differences between these base images in terms of Python applications? Well, Debian uses glibc, and what that allows it to do is support manylinux wheels. Now, when we are talking about manylinux wheels, we have to talk briefly about Python native extensions. Python native extensions usually have to be compiled to make them work with Python. Libraries that some of you may know, like cryptography, lxml, and some others, use Python native extensions.
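Put together, the naive Dockerfile described above would look roughly like this; the exact instructions are a reconstruction from the description, with paths following the sample project:

```dockerfile
FROM python:3.7

# Create the application directory and copy everything from the build context.
WORKDIR /app
COPY . .

# Install the Python dependencies (requests, cryptography).
RUN pip install -r requirements.txt

CMD ["python", "main.py"]
```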
However, there is this way of creating a manylinux wheel where we don't have to compile it. It's pre-compiled for us; we just download the wheel and extract it. However, Alpine uses musl, and it doesn't support manylinux wheels. The consequence of that is that Python native extensions must be compiled. And if you have ever tried installing something like lxml on Alpine, it takes around 15 minutes just for that one dependency. Now, if you want to know more about manylinux wheels and Python native extensions, there was an amazing talk this year at PyCon US, The Black Magic of Python Wheels. I strongly recommend watching that. Also, what I noticed is that in Alpine, some of the well-known packages take much less space. For example, for Git, I think there was a three-fold difference in size. So, in general, the footprint of Alpine images is much smaller. So, here is my recommendation. When you care about build time, I would select slim-stretch as the base. Whenever you care about image size, I would recommend selecting Alpine as the base. The main reason is that with slim-stretch you can use manylinux wheels. So, let's do that. I changed my previous Dockerfile, and now I'm using slim-stretch. You can see the size went from one gig to around 200 megs now. With Alpine it's not so simple anymore. Remember, we have that cryptography dependency, and it needs to be compiled, because there is no manylinux wheel, well, no pre-built wheel for Alpine for cryptography. So, in this case, I have to install tools like GCC, but also some packages with the headers, like openssl-dev. And when I do all of that, you can see the size is 300 megs, which is actually bigger than slim-stretch. So, you may wonder: you just told us that we should use Alpine if we care about image size, right? So, what's going on here? Well, let's first define the problem.
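An Alpine version along the lines described above might look like this sketch; the exact apk package list is my assumption, a typical set needed to compile cryptography:

```dockerfile
FROM python:3.7-alpine

WORKDIR /app
COPY . .

# cryptography has no pre-built wheel for Alpine/musl, so we need a compiler
# and development headers (package names are a typical guess, not from the talk).
RUN apk add gcc musl-dev libffi-dev openssl-dev \
 && pip install -r requirements.txt

CMD ["python", "main.py"]
```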
So, the problem is that the build dependencies which are contributing to the image size here are needed only for compilation, not at runtime. This is the main issue. So, the general solution is to include in your image only the files necessary at runtime. How to achieve that? Well, first, let's take a look at how we're copying the source code to the image. The recommendation here is to use more specific COPY statements instead of a broad COPY . . statement. And you can also use a .dockerignore file to exclude some of the files when you are doing that copy. Here is the example .dockerignore file I use; I just copy it into every project. Something like __pycache__ or .pyc files or the venv directory, I usually ignore, to make sure they never end up in my image. Okay. So, let's apply this technique: instead of having the broad COPY . . statement, convert it to more specific COPY statements. In this case, I copy my Python module, my_project, and main.py. You can see that the size decreased. The reason is that I had a venv on my host, and previously it got copied into the image. Now it's not there. And the same with Alpine; we also gain around 20 megs from doing that. Okay. Now, this next one is very important: removing unnecessary files. And it's not as easy as it may sound. So, let's try it, using that Alpine Dockerfile. It's exactly the same; however, at the very end I added an additional RUN instruction where I'm trying to delete GCC, openssl-dev, and some other packages, because I don't need them at runtime. And if you do that, you can see that the size of the image hasn't changed at all. So, what's going on here? Well, to answer this question, we have to understand how Docker layers work. Every instruction creates a new layer, and a new layer can never make the image smaller than the previous layer. Those layers are reused for subsequent builds, and layers themselves introduce some overhead.
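A .dockerignore file along the lines mentioned, with the specific COPY statements replacing the broad one, might look like this (the .git entry is a common addition, not from the talk):

```
# .dockerignore -- files that should never be copied into the image
__pycache__/
*.pyc
venv/
.git/
```

And instead of COPY . ., the Dockerfile then uses specific statements such as `COPY my_project/ ./my_project/` and `COPY main.py .`.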
But the first two points are the most important ones. So, again, a new layer can never make the image smaller than the previous layer. What's the consequence of this, what is the takeaway? First, combine multiple RUN statements into a single one, so that they form the same layer. If you need to delete files, you have to make sure that you delete them in the same layer where they were added, because if you do it later, it has no effect whatsoever on the image size. If you want to benefit from caching, you have to arrange your statements in order from the least changing to the most changing. Usually, the order will be: system-level dependencies and tools, Python dependencies, and then the source code. Another tip would be not to save anything to cache. For example, with pip, you can use the --no-cache-dir option so it doesn't save builds, and with apk, you can use the --no-cache option as well. So, let's try to apply these principles to our problem. Now, in this case, for slim-stretch, the only thing I added was --no-cache-dir, so I got an additional four megs. And in the case of Alpine, I defined my build dependencies and runtime dependencies, and then I combined all of my RUN statements into a single one. So first I install my build dependencies, then I install my Python dependencies, then I delete the build dependencies, and then I install the runtime dependencies, and all of that is in a single RUN statement. And the result, you can see, is already three times smaller: it's 100 megs. Now, from this point on, I will no longer consider slim-stretch, because we can already see that the image size of Alpine is much smaller, so we are going to continue optimizing that. But slim-stretch is already good. I personally use slim-stretch for my local development builds; I don't care about those 20 or 30 extra megs there. But in the case of my production image size, sometimes I do. So, we will keep decreasing that size. Here is an optional thing that you can do.
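The single-RUN Alpine Dockerfile described above could be sketched like this; the package names and the use of apk's --virtual group are my assumptions about how the slide implements it:

```dockerfile
FROM python:3.7-alpine

WORKDIR /app
COPY requirements.txt .

# Everything in one layer: the build dependencies are deleted in the same
# layer where they were added, so they never contribute to the image size.
RUN apk add --no-cache --virtual .build-deps gcc musl-dev libffi-dev openssl-dev \
 && pip install --no-cache-dir -r requirements.txt \
 && apk del .build-deps \
 && apk add --no-cache libffi openssl

COPY my_project/ ./my_project/
COPY main.py .

CMD ["python", "main.py"]
```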
You can delete the .pyc files and tests from your dependencies. If you do, then it becomes even more complex: you have to find those files under /usr/local, the .pyc files and test files, and delete them. In this case, you get an additional 10 megs; sometimes you get around 30 megs from it. It really depends. Okay. What are the disadvantages of the approach that I just showed? Well, the Dockerfile becomes really complex. You always have to remember to install the build dependencies, then install everything else that you need, and then delete everything that you don't need, everything in a single statement. The consequence is that not only is it complex, but you can also no longer benefit from caching. You can't cache your build dependencies in this case; you will always have to rebuild the container. Okay. So, Docker multi-stage builds. The idea behind Docker multi-stage is that you build an intermediary image where you have all of your build dependencies and you install your application. Then you copy the result, for example a binary if it's Golang, or whatever the result from your programming language is, to a fresh image, and then you label that as your final image. So, you have these two separate images, if you will. In one image, you do all of your build process, and in the second one, you actually package your application for future use. Why would you want to do that? The resulting image size is smaller, because you will have no build dependencies. It can also be faster, because you can now cache all of those build dependencies; you no longer have to delete them anywhere. Okay. However, Python is an interpreted language, and the question is: are multi-stage builds relevant to Python apps? My answer is: somewhat. The main thing is that even though Python is interpreted, you may still have those dependencies which use native Python extensions. Not only that, but you may also have some other tooling that you need as part of your build.
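The optional .pyc and tests cleanup could be sketched as a continuation of that same single RUN chain; the exact paths are my assumptions for a python:3.7 image:

```dockerfile
# ...continuing the same RUN chain as before. It must stay in the same
# layer where pip installed the files, or the deletion won't shrink the image.
 && find /usr/local/lib/python3.7 -name '*.pyc' -delete \
 && find /usr/local/lib/python3.7 -type d -name tests -exec rm -rf {} +
```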
For example, I am a big fan of a tool called Poetry, which allows you to manage Python dependencies. If you think about it, you only need that tool to install your dependencies from the lock file, but you don't really need it to run your app. So, all of that can go into the build stage, and then you copy only the result to your final image. Okay. So, here is the idea, and thanks to Hynek for sharing it on Twitter. In order to simplify copying from one stage to another, the easy solution is to use virtual environments. The idea is that you have the code of your application, you create a virtual environment in the same folder, for example, you install all the dependencies that you need, and then between the build stage and your final stage, you just copy the whole project directory, including your source code and your virtual environment. And it works out pretty well. So, let's take a look. This is the example of a Python Docker multi-stage build. It may seem a little bit complex, but it's really easier than our previous examples. On the left-hand side, I have my builder stage, where I still have my build dependencies and install them; I create the virtual environment, and then I also upgrade pip in this case. I copy my requirements.txt, and then I install my dependencies. And then in this case, I also delete the .pyc files and tests, but you don't really have to do that step. And then on the right-hand side: the result from the left-hand side is that in /app/.venv we have our virtual environment, and in /app/my_project we have our module. So, in the second stage, I inherit from python:3.7-alpine again, I install my runtime dependencies only, and then I copy my /app directory, and that's pretty much it. So, you no longer have to care about how you delete those build dependencies.
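A reconstruction of the multi-stage Dockerfile described above: the /app/.venv and /app/my_project paths come from the talk, while the package names and exact commands are my assumptions:

```dockerfile
# ---- build stage ----
FROM python:3.7-alpine AS builder

# Build dependencies live only in this stage, so they can be cached freely.
RUN apk add --no-cache gcc musl-dev libffi-dev openssl-dev

WORKDIR /app
RUN python -m venv .venv && .venv/bin/pip install --upgrade pip

COPY requirements.txt .
RUN .venv/bin/pip install --no-cache-dir -r requirements.txt \
 && find /app/.venv -name '*.pyc' -delete

COPY my_project/ ./my_project/
COPY main.py .

# ---- final stage ----
FROM python:3.7-alpine

# Runtime-only libraries (an assumption for cryptography's shared libs).
RUN apk add --no-cache libffi openssl

# Copy the whole project directory: source code plus the virtual environment.
COPY --from=builder /app /app
WORKDIR /app

CMD [".venv/bin/python", "main.py"]
```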
The size is a little bit bigger because you use a virtual environment, but I think that's fine. One additional benefit that you get in this case is that your build dependencies are now cached. So, depending on whether you change your Python dependencies often or not, you can cache up to line 14 or maybe even further, maybe even the whole build stage. It really depends, because you don't have to delete anything. In case you don't change your Python dependencies, you cache the whole layer. In case you change them, but your system-level build dependencies are still the same, you can cache up until line 12. So, that's pretty nice, because previously we couldn't do caching at all. Okay. Now that we have that, you can also create a custom image with the common build dependencies across your multiple projects. For example, as I told you, I like using Poetry. In some cases, I need curl to download something; sometimes I need Git. I may also need, you know, a bunch of build dependencies. So, I just built a custom image for that, and I store it in the registry. Then the multi-stage build is simplified even further: you just have to inherit from your custom image, and everything else is the same. Okay. A couple of quick pieces of advice here, or suggestions. What I found for my local dev, where I use slim-stretch: sometimes bind mounting your source code instead of copying it really pays off, especially if you have a web app with some reload capability. That's pretty nice: you just have to change the code and you don't have to rebuild the container. It really depends, but it may be useful for you. Another one is adding the environment variables PYTHONUNBUFFERED, so everything printed to stdout is not buffered, and PYTHONDONTWRITEBYTECODE=1, if you don't want to generate .pyc files, which I think are not really needed in your Docker image. Okay. Now, this is my example. I'm not going to go into details.
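The two environment variables can be set in the Dockerfile like this; the bind-mount command in the comment is a sketch of the local-dev workflow mentioned, not a command from the slides:

```dockerfile
# Don't buffer stdout/stderr, and don't write .pyc files inside the image.
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1

# For local development, a bind mount replaces rebuilding on every change,
# e.g. something like:  docker run -v "$PWD":/app my-image
```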
You can download the slides later; here I'm using Poetry, so it's a little bit more complicated to build. If you're interested in that, you can check it out in the slides. So, let's do a summary. First and foremost, you have to select the base image carefully: Alpine for smaller image size, slim-stretch when you need faster builds. You have to take layer caching into account. So, combine multiple statements into one. If you want to delete something, you have to make sure you delete it in the same statement where you added it. You have to order statements from the least to the most changing to benefit from caching. And the last one: Docker multi-stage builds can help you avoid some complex removal procedures and benefit from caching. And if you go down this path, I recommend using a Python virtual environment; it's really nice in this case, even though I'm not normally in favor of using virtual environments in Docker containers. And that's all I have for you today. Thank you very much. Thank you, Dmitri. We have a few minutes for questions. One, two, three. Okay. Hi. Thank you for this great talk. I have just a very simple question. Have you evaluated using other base images, like Clear Linux or minideb? Because they both have glibc and may be much smaller than stretch-slim. Thank you for the question. I haven't. Now that you mention it, I probably should check them out. Yeah, you should really check them out. Thank you. Thank you. If you write unit tests, how do you run them? Okay, that's an amazing question. It really depends. In my case, I build a development Docker container for that, and I run them there. So it's not the same as my production container; I just have a separate container to run them, and I include my development dependencies there. Thanks. Thank you. Any more questions? I'll go ahead and ask a question.
Do you think if, in the build stage, instead of installing into a virtual environment and copying it in the second stage, you would actually build the wheels and then use those wheels to install in the second stage, do you think that would lower the size a lot? Or have you tried using this technique instead of building the virtual environment? I haven't tried it, even though it was one of the suggestions and one of the things that I wanted to explore. I don't have the data to confirm it, but I don't really see much benefit from doing that. But that is just my personal opinion, because, as we saw here, by adding a virtual environment I only added five megs to the container. That was acceptable for my case. Okay. Thank you for your talk. Just out of curiosity, what does your local development environment look like? What kind of tools do you use, and do you use Docker when you're developing? So, my local development machine is my Mac, and sometimes I use Docker, sometimes I don't. It really depends on how complex the application is. If there are a lot of things outside of Python, you know, for example, if it's a web app with some database, your front end, and stuff like that, then I do use Docker to make sure that everything is working, and I would like to see and touch the result. If it's only Python and nothing else, then I usually don't use Docker locally, but most of the time I do. And adding onto that, I do use tools like Poetry to manage dependencies. And I think that's pretty much it. We have time for one more question. Also concerning testing: you said you have another environment for local development that you also use for testing. But what about integration tests? Do you run them in that same container? Because I would be a bit afraid of running them in a completely different container than production. So, there are different approaches to this.
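For reference, the wheel-based alternative the questioner describes, building wheels in the first stage and installing from them in the second, might look like this sketch; this is my reconstruction of the idea, not something shown in the talk:

```dockerfile
# ---- build stage: compile all dependencies into wheels ----
FROM python:3.7-alpine AS builder
RUN apk add --no-cache gcc musl-dev libffi-dev openssl-dev
COPY requirements.txt .
RUN pip wheel --no-cache-dir -r requirements.txt -w /wheels

# ---- final stage: install from the pre-built wheels, no compiler needed ----
FROM python:3.7-alpine
COPY --from=builder /wheels /wheels
COPY requirements.txt .
RUN pip install --no-cache-dir --no-index --find-links=/wheels -r requirements.txt \
 && rm -rf /wheels
COPY my_project/ ./my_project/
COPY main.py .
CMD ["python", "main.py"]
```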
I do actually run them in a separate container. But that's just me; according to my requirements, that's okay. However, I do understand your concern: yeah, it may make sense, if you have integration tests, to run them in your production container. Thank you. Unfortunately, we don't have time for any more questions. You can find Dmitri around the conference. Thank you. Thank you.