Hello, Itamar. Itamar Turner-Trauring is a trainer and software developer and writes about Python at pythonspeed.com. And you've been doing Python for over 20 years, is that true? Yeah, since 1999. That's amazing. And you still enjoy it a lot. I do, yes. Today you plan to show us something about Docker. Yeah, it's basically the process I use for Docker packaging to make it ready for production, because there are a lot of details; it's complicated enough that you need a whole process for it. Okay, let's see if your screen share works, and then I'd say we should start the talk.

So yeah, today I'll be talking about production-ready Docker packaging for Python. Again, my name is Itamar. The first thing you need to understand is that Docker packaging is really complicated; there are a lot of details you need to get right. The reason is that it builds on basically 50 years of technology, from Unix in the 70s to Docker and modern Python packaging in the 2010s (I don't really want to talk about the 2020s). Each of the technologies that accumulated over those years has its own assumptions about how things work and its own design mistakes; however useful these technologies are, they're not perfect. They all have certain defaults which may or may not be correct when running within Docker. And the accumulation of all these technologies intersects in the Docker packaging for your Python application, so you just end up having to get a lot of details right in order to make something that's truly production-ready. This is not simple.

The result is that I cannot cover all this material in one half-hour talk; we only have 30 minutes, and that includes questions. My own personal list of Docker packaging best practices has at least 60 items on it, and it keeps growing. When I do this as a training class these days, it takes about a day and a half, because there's so much material, and even then I don't quite cover all the details, though I can go into some depth. So within a single talk we can't actually learn all the best practices. What we can do is learn a process for how you do this packaging.

The reason you want a process is partly just this complexity: there are a lot of details, a lot of things are easy to miss, and it's easy to get sidetracked by one aspect of the problem, like "my image is huge", and then forget about other aspects, like security. But it's also because Docker packaging is probably something you're doing as part of your job, and if you're working a job, there are usually lots of other things you need to be doing: some critical bug interrupts you, you have to go to a meeting. So this isn't the sort of thing where you spend half an hour, finish it, and it's done and perfect; you're going to have to put more time into it over time. What you need is a process that helps you do iterative development, so that you can stop at any point and come back later; that helps you focus on doing the important parts first and reminds you what the important parts are; and where each step builds on the previous one, so you get a good cycle of continuous improvement. So I'm going to go through this process and the steps in it, and for each step I'm going to give an example of one of the best practices and list a few more, because I don't have time to go through all of them.
What I'm going to do is, at the end of the talk, there will be a link to the free guide I have on my website; it's at least 30 articles, and it covers a lot of these best practices in far more detail. So you don't have to try to remember all of this: there will be a link to the slides and a link to a much more detailed guide covering the best practices I don't get to here, or at least most of them.

This is an overview of the process, and it's what will structure the rest of the talk. You start out by getting something working, then move on to security, and eventually the last thing you do is optimize your image, so you can build faster and make it smaller. The idea is to start with the most important parts: security is fairly critical, and for most applications you probably want to do it first, while having a small image is something you want eventually, but probably not immediately; it's lower down the list. This is a generic list, and for your particular application, in your particular situation, the order might be different, so treat it as a starting point. Maybe reproducible builds, for example, are really critical for what you're doing, and so you do them first. It's also a process that mostly guides you the first two or three times you do this; eventually you'll be applying a lot of these best practices automatically, and you won't need to think so much in terms of this exact order. You might just automatically do a whole bunch of step four right from the beginning. But even then, it's useful to have a checklist of all the things you have to do, because there are so many details to get right.

Step one of packaging your Docker image is just getting something working. It doesn't matter how good your packaging is, how secure and efficient and small and correct, if it doesn't actually run your application, doesn't run your server. This is the bare minimum for being able to say you've done something useful. So the first step is to get your application working, even if it's done in a less-than-ideal way, because it's just your starting point. In this example Dockerfile, I'm using the Python 3.8 slim-buster Docker image as the base image, copying in all the files in the current directory, running pip install to install the code, and then using ENTRYPOINT to say: when you run this image, run this script to start up the container. As you'll see from the later examples, this has a bunch of flaws, but it's a starting point; you have to start somewhere.
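A rough sketch of that kind of starting-point Dockerfile; the exact base image tag and the name of the startup script are assumptions, since the slide itself isn't reproduced here:

    # Step 1: just get it working; deliberately simple, flaws and all.
    FROM python:3.8-slim-buster

    # Copy everything in the current directory into the image.
    COPY . .

    # Install the application and its dependencies.
    RUN pip install .

    # Run the startup script ("server.py" is a placeholder) when the
    # container starts.
    ENTRYPOINT ["python", "server.py"]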
The next step is security. Before you can feel comfortable deploying something publicly, where anyone on the internet can access it, you should probably make sure that the application is as secure as you can make it. Otherwise you basically always have this worry that someone will break into it and get access to your private data, modify your website, or take it down. Since security is a sort of minimal prerequisite for running or deploying anything, it's probably the right first step in terms of which best practices to implement. We can't cover all the security practices, but for each of these steps I'm giving one example, and for security, one best practice is: don't run as root.

Containers are a way to isolate processes from each other and from the host operating system, but they're only isolated in a limited way; a virtual machine is much more isolated. When you run a Docker image and create a container, by default most Docker images will run as root: if you run nginx, or whatever it is you're really running, the official Python image runs as root, and so on. The problem with running as root is that it gives rather more access than one would like to various capabilities of the operating system. So if someone manages to take over your process remotely, and that process is running as root, even in a container, it is much easier for the attacker to escape the container, escalate their access, and take over your whole computer. So a good security practice is: don't run as root.

In this example, I've updated the Dockerfile so that after choosing the base image, we run a command that creates a new user called appuser, and we use the Dockerfile USER command to say that all later commands should run as this new user. So, for example, when you run pip install it runs as that user, and when you start up the container, it runs as this non-root user (files you COPY in are still owned by root unless you use --chown, but that is easy to add). With two or three extra lines in your Dockerfile, you now have a much more secure Docker image. Again, there are plenty of other security best practices; I won't go into them here, but the guide I'll link to has more details about many of them.

So now you have a working image that is hopefully somewhat secure, and you can start thinking about automation. Up to this point you just have this Dockerfile: you build it manually, and you could deploy it manually if you want. But over time you don't want to have to rebuild your application by hand every time someone merges a pull request, and you might have other team members who use this code base and want things built automatically; they don't care about the details. So a good next step is to automate the builds, so that your build or CI system automatically builds Docker images and pushes them to an image registry, the server that stores your Docker images. Here's a sample script that does that for you. We do set -euo pipefail, which is a line you should have in every bash script, just so it stops running when there's an error; we run our tests; we do docker build to build the image; and then docker push to push the image. If you put this in your build or CI system, every time you trigger a build it will also build and push your Docker image.

Once you start doing this, you have to start thinking about the way you structure development and the way your development process integrates with your build system. For example, a common development process is to use feature branches. If you have an issue 123, the developer working on that issue creates a branch for it, does the work, the feature or the bug fix, in that branch, and then opens a pull request, and your build system automatically runs the tests and builds from that pull request. You might want to build a Docker image for every pull request, because you might want to test that image manually or automatically, maybe with an integration test. And if you use the script I showed on the last slide, which is fairly simplistic, what's going to happen is: you have this pull request on a branch, you build that image, you push it, and it overwrites your stable release Docker image, because you're always giving the images you push the same name.
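Two quick sketches of what was just described, not the exact slides. First, the non-root additions to the Dockerfile; the user name "appuser" is just a conventional choice, and pip install --user plus COPY --chown is one reasonable way to make things work for a non-root user:

    FROM python:3.8-slim-buster

    # Create an unprivileged user; later RUN commands and the running
    # container itself will use it instead of root.
    RUN useradd --create-home appuser
    USER appuser
    WORKDIR /home/appuser

    # --chown makes the copied files owned by the new user.
    COPY --chown=appuser . .
    RUN pip install --user .

    # "server.py" is a placeholder for whatever starts your application.
    ENTRYPOINT ["python", "server.py"]

And the simplistic build-and-push script; the test command and the registry and image names are placeholders:

    #!/bin/bash
    # Stop immediately on errors, unset variables, or failed pipes.
    set -euo pipefail

    # Run the test suite before building anything.
    python -m pytest

    # Build the image and push it to the registry. Note that it always
    # pushes the same tag, which is exactly the problem discussed next.
    docker build -t registry.example.com/myimage:latest .
    docker push registry.example.com/myimage:latest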
You're always pushing to something like yourimage:latest, which means random pull requests are going to overwrite your official release Docker image. So you want to make sure that Docker images built from pull requests don't stomp on your release Docker image, and one easy way to do this is to name your images based on your git branch. In this variant of the build script, I'm getting the current git branch using the git rev-parse command; just for the record, I never remember these commands, I always use Stack Overflow or look at my notes, because they're impossible to remember. So you get the git branch, and then you name the new image so that the part after the colon, the tag, is the same as the git branch. If you have a branch 123-more-cowbell, the image will be yourimage:123-more-cowbell, and it won't overwrite your production or stable image. Again, there are plenty of other best practices for your CI system, which we're not going to get into today.

So now you have an image that runs your application, it's hopefully secure, and it has automated builds. Once you have automated builds, you start to accumulate multiple images: the image from last week, the image built today, the image from a pull request, the image your teammate created. You're running it in production, and maybe people are running different versions of it. So now you're more likely to see errors and more likely to have to debug them. A good next step at this point is to work on making your Docker image easier to identify and easier to debug.

Here's an example best practice for making your Docker image, or really your Python code in general, easier to debug. If you have a bug in your Python code, something wrong, bad input, an unexpected issue somewhere, you'll get an exception thrown. If it's something that doesn't get handled, what usually happens is it gets converted to a traceback, and the traceback gets stored in the logs. Then if your server crashes, you go look in the logs; if it doesn't crash, you get a bug report and go look in the logs. You look in the logs and say: oh, it was this function calling this function, and this line of code threw a ZeroDivisionError. That's the starting point for figuring out what went wrong, because you know where in your code the error originated.

If you have a bug in C code, that's not what's going to happen: your program is going to crash silently. The Python interpreter is written in C, and chances are many of the third-party extensions you install also use C code, whether it's a database adapter or matplotlib or NumPy; most projects end up using some C code. If you have a crash in C code, you might get a core dump, but the filesystem for your Docker container is ephemeral and typically just gets thrown away once the process exits. So the process crashes, maybe you get a core dump, maybe not, and then the filesystem disappears, and now you have nothing: no logs, no core dump. All you know is that your program crashed, and it's extremely difficult to debug code in that situation.

There's a solution for this: a module called faulthandler. What it does is add some hooks so that if your program crashes in C code, it makes a best effort to print a traceback of the Python code where you crashed.
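As a side note, faulthandler can also be turned on directly from code; a minimal sketch, though the environment-variable approach described next is usually more convenient in Docker:

    import faulthandler

    # On a fatal signal (for example a segfault inside C code), print the
    # Python tracebacks of all threads to stderr before the process dies.
    faulthandler.enable()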
That means crashes in C code give you the same information that crashes in Python code do, a nice Python traceback, which lets you figure out where the bug came from. You can say: oh, this came out of the database adapter, there's a bug in the database adapter; instead of having no idea where the problem came from. The easiest way to use faulthandler is to set an environment variable called PYTHONFAULTHANDLER to 1. You can do this in your shell when running code locally, and in your Dockerfile you can use the ENV command: ENV PYTHONFAULTHANDLER=1. It's one extra line in your Dockerfile, and from then on, any time you have a crash in your C code, you'll have a much easier time debugging it. Again, there are other best practices you can use to make your image easier to identify and easier to debug.

So now we're at step five. You have a working container, it runs your application, it's secure, it gets built automatically, and you've made it easier to identify and debug. The next step is to ask how we can make it run better: run faster, and be less likely to have issues in the first place. That means things like making it start up faster, which in certain applications can make a real difference; making it shut down faster, because if you're deploying new versions of your code, fast shutdown makes it easier to deploy bug fixes; allowing your runtime environment to detect whether your process is frozen; that sort of thing.

One example of a best practice for operational correctness is compiling your bytecode ahead of time. The Python interpreter doesn't actually run your source code, the text you wrote in that .py file. What actually happens is that it parses the source code and creates bytecode, which is what the CPython interpreter's virtual machine actually runs. It then takes that bytecode, writes it to a .pyc file, and stores it on disk. The next time you run your Python application, instead of having to parse the source code again, it can load the .pyc directly, and that speeds up startup. So if your Docker image doesn't have .pyc files for all of your source code, that means slower startup, because every time you start the container it has to parse the source code.

When you're running on your local computer, this isn't really something you think about, because your filesystem is persistent: you run a program once and it creates the .pyc files; you run the program a second time, the .pyc files are there, they get used, and startup is faster. When you're running from a Docker image, every time you start a new container it starts from a pristine copy of whatever was in the Docker image's filesystem. So every time you start a new container, if there are no .pyc files, it creates them, and then when the process exits and the container exits, that filesystem gets thrown away, and the next time you start a container it will again start without .pyc files. So if you're packaging Python for Docker, you may have to explicitly create those .pyc files to get the faster startup. There are a couple of example lines here that you can add to your Dockerfile. One of them compiles the code that you've installed, and typically pip does this for you, so you may not have to do it yourself.
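Roughly, in Dockerfile terms, the two ideas from this step look like this; the /app path is an assumption and should be wherever your code actually ends up:

    # Enable faulthandler for every Python process started from this image.
    ENV PYTHONFAULTHANDLER=1

    # Pre-compile the copied-in source code to .pyc bytecode at build time,
    # so containers don't have to re-parse it on every startup.
    RUN python -m compileall /app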
But a lot of the time, if you just copy some code into a directory and run it from there, pip doesn't know it exists and didn't compile it, so you have to compile it yourself. You can use the compileall module that comes with Python to compile that code to bytecode as part of your Docker packaging, and then startup will be faster. Again, there are plenty of other best practices here, from signal handling for shutdown to health checks; if you want to learn about signal handling for shutdown, Hynek Schlawack has a nice article about it.

At this point you have a Docker image that is correct in terms of how it runs, but not necessarily correct in how you build it. Say you've spent the past day or two doing this Docker packaging, in between fixing bugs in your code and going to meetings. Over the course of those two days, the things you depend on, like the Linux distribution you're using for your base image, Python, Django, NumPy, whatever libraries you use, probably haven't had a major release. So if you just say "install the latest version of everything", that's fine: if you did it yesterday and do it again today, you'll get the same image, more or less, most of the time at least. But six months from now, if you try to rebuild an image that installs the latest version of everything, some of those dependencies will have changed, and if you try to rebuild it two years later, all of them will have changed. The problem is that if you're always installing the latest version of everything, you might come back to something that hasn't changed in six months, wanting only to make a minor bug fix and rebuild the image, and suddenly three major dependencies have changed and broken everything, even though all you wanted was a minor bug fix.

So over time, once you have a Dockerfile you're going to keep using, you want to make sure it's reproducible: you want to be installing specific versions of specific packages, on a specific Linux distribution, so that when you rebuild the image you get the exact same image. That's not to say you shouldn't be doing updates; you should, but you should do them in a controlled manner, not as a side effect of a minor bug fix.

One example of how to make your image reproducible is choosing a good base image. Docker images are typically based on some other Docker image; you use the FROM command at the beginning to say "use this as my base image", and typically that base is some Linux distribution. You want a Linux distribution that guarantees things like security updates, and that also guarantees stability for some period of time: two or three years of bug fixes without changing the major versions of libraries, that sort of thing. You want the Linux distribution to be stable and not change out from under you unexpectedly. Ubuntu long-term support, Debian stable, and CentOS are all Linux distributions that make that kind of guarantee. The official Python Docker images are based on Debian stable by default, but they also give you access to different versions of Python, not just the version Debian stable happens to ship. So when Python 3.9 comes out, Debian stable won't have it, but the official Python image will just take Debian stable and add Python 3.9 to it. So I like using the official Python images.
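In Dockerfile terms, that just means pinning the FROM line to a specific, stable tag, for example:

    # A specific Python version on Debian stable ("buster"), slim variant.
    FROM python:3.8-slim-buster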
So, for example, python:3.8-slim-buster is Python 3.8, the latest point release (3.8.4 if that's the latest, 3.8.5 if that's the latest point release), on Debian Buster, which is the current version of Debian stable. And "slim" means a smaller variant; the bigger variant just takes more disk space but has more debugging utilities. If you use a stable base image like this, you'll have more reproducible builds. Again, there are lots of other things you need to do, like pinning your Python packages.

Once you have reproducible builds, your builds are correct and your runtime is correct, so in some sense you're done. But at that point you might want to start thinking about optimizations: it's correct, but you might be able to make things more efficient. A good starting point is faster builds, because your time is expensive, and if every build takes 30 minutes to produce your Docker image, you can't see whether your tests are passing until the build finishes; it's just slowing you down, wasting your time and your teammates' time. So it's worth spending some time optimizing build times.

One best practice for faster builds is to avoid using Alpine Linux. Alpine Linux is a small Linux distribution; it makes for smaller images, and so it's often recommended as a base image for Docker images. If you're a Go programmer, that's fine advice. If you're a Python programmer, you should not use it as your base image. The issue is that if you're a package maintainer who uploads packages to PyPI, you can upload pre-compiled binaries (wheels) for Linux, macOS, and Windows, and then someone who downloads that binary doesn't have to compile the C code in the package. Lots of Python packages contain lots of C code, so not having to compile it saves a lot of time when installing them. Alpine cannot use those binary wheels from PyPI, these days at least; that might change in the future. Just to compare: if you install pandas and matplotlib on my computer using the Debian-based official image, python:3.8-slim-buster, it installs in 30 seconds; it just downloads the packages, unpacks them, and it's done. If you use the Alpine variant, it takes 1500 seconds. It's 50 times slower, because it has to compile a whole pile of C code. So if you want fast builds, don't use Alpine Linux. And there are plenty of other best practices for faster builds.

The final step is making your image smaller. Having 2 GB images wastes bandwidth and wastes time, so it might be worth optimizing that part too. One example of a best practice here, out of many: typically when you pip install something, say pip install pandas or Django, it downloads the package, unzips or untars it, and then keeps the downloaded package around in a cache, so that if you pip install it again later it won't have to download it again. In a Docker image you're never going to run pip install again, so keeping that extra copy of the package around just wastes space. If you add the --no-cache-dir option to pip install, you'll end up with a smaller Docker image, with no harm done, because you're never going to run pip install again. Again, there are plenty of other best practices.

To recap: you start by getting something working, make it secure, automate the builds, make the image easier to identify and debug, make it run better, make the builds reproducible, and then optimize with faster builds and smaller images. The goal here is to have good stopping points. If you do security first and you stop right after doing security, at least you have a secure image.
If you mix up security with making your images smaller, you might end up with a half-secure image, which isn't ideal if you're forced to stop. Again, your particular application and environment might result in different priorities, so this is just a suggested starting point for the order in which to work on your Docker image. Your situation might be different, but I think it's a reasonable starting point and a reasonable way to remember all the different things that go into this.

So thanks for coming to my talk. As I said, many of these best practices are covered in much more detail in a free guide on my website. There are links to that guide and other resources for Python Docker packaging, as well as these slides, at pythonspeed.com/europython2020. These are my email and Twitter account, and if you have any questions, I believe there's a talk channel on Discord for this talk, #talk-docker-packaging. We might have time for a question or two.

Yeah, thank you very much for the talk, first of all. There are a few questions, and the first one is: can you give an example of "install dependencies separately from your code", as in your best practices?

Yes. The way Docker packaging works is that things get installed in layers; each line in your Dockerfile, to a first approximation, is a layer. Docker has a caching system where, when you rebuild an image, it says: if this layer hasn't changed, I don't have to rebuild it. The way it decides whether a layer has changed is based on either the text of the command or the files you copied in. So if you install your code and your dependencies together, that means you copy in everything and then install your dependencies and your code at the same time. If you then change your source code, that invalidates the cache, you have to rebuild, and you have to reinstall all your dependencies. Even though you're still installing the exact same packages, the exact same version of Django, the exact same version of your Postgres adapter, and so on, you can't use the cache; you have to redo it all from scratch. If instead of copying in all the files and installing everything together you first copy in just the requirements file and run pip install -r requirements.txt, then the caching layer can say: requirements.txt hasn't changed, so I can just reuse this layer. Your build will be faster because you won't have to reinstall those packages every time your source code changes, only when requirements.txt changes.

Then there's a question about the compileall idea: how does python -m compileall interact with Python code that uses a __main__.py? That's from Goose, who says: we often have tools that are run like python -m mytool arguments, so how will compileall work there? I believe, and I could be wrong, that the way compileall works is that it just finds all the .py files, parses them, and writes out the bytecode. It's not running them, just parsing them; it's purely a filesystem operation that finds all the .py files. So it doesn't matter how you run the code; it only matters what files you have in the filesystem. And if you have pip-installed your code, you typically don't need it at all, because pip compiles things for you by default.
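To make the earlier answer about installing dependencies separately concrete, here is a rough sketch of that Dockerfile structure; the file names are just the usual conventions, not the exact slide:

    # Copy in only the dependency list first, so this layer and the pip
    # install below stay cached as long as requirements.txt is unchanged.
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Copy the application code afterwards; changing it only invalidates
    # the layers from here on, not the dependency install above.
    COPY . .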
Okay, thank you. And the final question: what kind of security testing would you recommend for Docker images? Any good tools or packages?

There are a bunch of security scanners. There's Bandit, which is a security scanner for Python source code; it will find things like SQL injection and use of pickle. There's a command-line tool called Safety, which scans for insecure Python dependencies; it's a commercial tool, so by default you only get a vulnerability database that can be as much as a month out of date, and you have to pay them if you want more timely security updates. And there's a tool called Trivy, T-R-I-V-Y, which scans your system packages. Again, if you go to my website, the Docker packaging guide has an article about security scanners for Docker images.

Thank you very much. This was very useful, and we have to thank you for all these tips that we can use in real life. So here's a round of virtual applause for you, and I hope you'll find people applauding in the Discord talk channel as well.