Good evening, Montréal! I'm practicing my Québécois a little. So, this is a deep dive into Docker's storage drivers. It's usually a one-hour talk, but nobody's got time for that, so this will be a not-so-deep dive into storage drivers; I've compressed the talk a little so we can fit into 20 minutes.

OK, so first, who am I? I'm Jérôme Petazzoni. I'm French, as you can guess from the accent. I have a really funny title on my business cards; not that one, it's that one. And I do some stuff with containers.

The outline: I will quickly talk about Docker, then I will talk about copy-on-write, then about how this applies to Docker, and I'll explain the different storage drivers that we have in Docker.

First, a really quick intro to Docker, and I have some exclusive content for those of you who have already seen a variant of this talk. I would like to bounce off the talk about technical debt that we had just a few minutes ago, because Docker was born out of technical debt. At some point, when we had dotCloud, we said: OK, everybody is burning out, stuff is breaking all the time, we need to rewrite the core of our engine because we can't stand it anymore. And the rewrite of that core was Docker. We started to show it to a few garage startups in the Bay Area, like Twitter and GitHub, and they said: oh, that's great, we want more of this. So we put more engineers on it and made a new iteration, and the feedback was: oh, this is awesome, we want even more of it. And so on and so on, until there was almost nobody left working on dotCloud and everybody was working on Docker. So that's the story of Docker.

So what's Docker? It's a platform to build containers and run them, and blah blah blah. Let's cut straight to the chase. If you've never seen Docker in action: docker run -ti python bash, for instance, means: hey Docker, take the python image, create a container out of it, run it, and give me a bash shell in it. So I have a bash, and I do pip install ipython; I guess you all know about pip, the Python package manager, and I tell it: hey, install IPython, which we just heard about, how convenient. This runs pip to download and install IPython, and then I have IPython and that kind of stuff in it. Great. (There's a rough sketch of this session at the end of this intro.)

So this is what happened: I got a new container, a new filesystem which is a copy of the python image, my own network stack with an IP address, my own process space, and everything. But what did not happen is a full copy of the python image. We did not take that 100, 200, 300 meg image and copy it entirely just to run pip and then IPython in it. We used a mechanism called copy-on-write, and copy-on-write is important because that's what makes Docker really cool. That's what makes the quick demo go "look, docker run python, boom, you have a python container" instead of "docker run python... oh wait, I am doing apt-get install, debootstrap, copy, whatever, and it takes an hour until I can finally do some stuff in my container." Huge disk space savings, because if I run that 10 times, I end up using a few megs of disk space instead of a few hundred megs or a few gigs; and huge time savings, because each container starts in a fraction of a second.

Short intro to copy-on-write. I am not a computer historian, so if you want a real deep intro to copy-on-write, look it up: Google, Wikipedia, everything. But I want to talk about copy-on-write for memory.
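Before that, for reference, here is roughly what that demo session looks like (the container ID in the prompt is illustrative, and the pip output is abridged):

    $ docker run -ti python bash
    root@3f4a9c2b1d0e:/# pip install ipython
    Collecting ipython
      ...
    root@3f4a9c2b1d0e:/# ipython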
If you know some stuff about Unix systems: on Unix, the only way to create a new process is to copy an existing one. When you run ls, you are not exactly starting the ls process. You have your shell; your shell makes a copy of itself (I'm not going to step away, because the cameraman is going to kill me otherwise). So you are not exactly starting ls: you make a copy of the shell, and that copy uses exec to replace itself with ls. The parent process is still bash; it watches over this child process while it transforms itself into ls, and then it waits until it's done. At that point, we are not making a full copy of the bash process, because that would be kind of useless since we are going to replace it anyway with ls in a fraction of a second. We are using copy-on-write.

Copy-on-write is a kind of magical mechanism using a little piece of hardware called the MMU, the memory management unit. The idea is as if you went to a library and said: hey, I want that book about things, and I want to be able to take notes, scribble with my pen and everything. OK, here is the book. Except it's not exactly the book: it's a kind of magic book where each page is a shadow, ghost copy of the actual page, so it's instant to make. And when you get your pen ready to write, at the very moment where your pen is about to strike the page, boom: the page is replaced by an actual paper page, and you can write on it, and it's your own private copy of that page.

That's exactly what happens with copy-on-write. You get pages of memory (they're actually called that; a page is 4 kilobytes of memory on most computers) that are really references to a physical page, marked read-only. When you try to write to one, there is something called a page fault. It's like the MMU says: hey, you don't have write access to that page, so stop, I'm going to tell the kernel, the big boss here on the computer, that you did something wrong. The kernel goes: OK, you did a write somewhere you didn't have permission to, so what shall I do? If you try to write to a location that doesn't exist, you get... who here has written C code? Some people. Those people, you have already had segfaults, right? If not, you're a liar. Segfaults are when you try to write to a location where you don't have permission, because it doesn't exist at all: you write somewhere, but you forgot to allocate the memory. So in that case the kernel says: you tried to write somewhere you didn't have permission, and that place doesn't exist; segfault for you. However: oh, you tried to write to a ghost page, which is a copy of somewhere else in memory? OK, in that case I'm going to make an actual copy just for your private personal use, and that's it; you can write to it and everything is fine. That's how copy-on-write works for memory.
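If you want to watch the fork-and-exec dance happen, strace can show it; on Linux the fork shows up as a clone call. A minimal sketch (the second command avoids the shell's exec-the-last-command optimization; output abridged, PIDs illustrative):

    $ strace -f -e trace=clone,execve sh -c 'ls; true'
    execve("/bin/sh", ["sh", "-c", "ls; true"], ...) = 0
    clone(...)                                      = 12345
    [pid 12345] execve("/bin/ls", ["ls"], ...)      = 0
    ...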
Copy-on-write also exists for disks, which is great when you want to do snapshots. Say you have a database; it's running, doing financial transactions, taking some money here and putting it there, and at the same time I want to back up this database, and I want to do that at the low level, at the disk level. Without copy-on-write, here's the thing: disks are slow, so I could start my backup, and then I take the money from this account (so the backup still has the old version of that account), and then I put the money in the other account, and then the backup reaches it. Result: in my backup I have the old balance of one account and the new balance of the other. I just created money! Economic problem solved, but also a little banking problem. So you need consistency, and a good way to get it (not the best, but one of the ways) is to use snapshots. Suddenly I take all this storage space and make a copy-on-write version of it, and now my backup reads from that copy-on-write version, while the actual database runs on top of it, on the copied pages. So in my backup I still have the database exactly as it was at the moment I started the backup, and at the same time the database continues to work. (I'll show a rough LVM version of this at the end of this part.)

This became super important for the cloud. Let's assume I am a web hosting provider: I have thousands of VMs, and as it happens, 90% of my customers all run WordPress. They all end up following the same tutorial: they take the Debian image, install Apache and PHP and MySQL, then download WordPress and run it. So I end up with thousands of VMs containing exactly the same thing; that's a big waste. Instead, I could make one original install, and when my customers pick their image, instead of choosing between Debian, Fedora, Ubuntu, CentOS, they have "WordPress", which is my install of WordPress. First, it's easier for them; and it also means those 90% of customers who use WordPress end up sharing exactly the same disk space, with Apache, MySQL and everything pre-installed. Only when they make changes do they start to use more disk space. So if the typical server is 10 gigs, instead of needing 1000 times 10 gigs because I have 1000 servers, I only need 1 times 10 gigs, plus whatever people actually write into their MySQL databases and their WordPress files. I make huge savings, and more profit. Great.

As it happens, copy-on-write used to be something implemented only in big storage systems (think NetApp: big boxes full of disks, very expensive), but over the last 5-10 years it started to become available on everyday systems. There is LVM on Linux, ZFS on Solaris and later the BSDs, and on Linux there are Btrfs, AUFS, and so on and so on. That was extremely important for Docker: without copy-on-write, when I do docker run blah blah blah, I would end up making a full copy, that would be slow, nobody would have been really excited about Docker, and we wouldn't be talking about Docker today because nobody would ever have used it. Thanks to copy-on-write, creating containers can be super easy; creating tens, hundreds, thousands of containers on the same machine can be super easy.

This is the thank-you slide for the super awesome, amazing people who created those copy-on-write systems. I probably forgot some of them, but still: thank you.
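Here's the rough LVM version of that snapshot-based backup I promised (volume group, volume, and mount point names are hypothetical):

    $ lvcreate --snapshot --name dbsnap --size 1G /dev/vg0/dbvol
    $ mount -o ro /dev/vg0/dbsnap /mnt/snap
    $ tar czf /backup/db.tar.gz -C /mnt/snap .    # the backup sees a frozen view
    $ umount /mnt/snap
    $ lvremove /dev/vg0/dbsnap

The database keeps writing to /dev/vg0/dbvol the whole time; only the blocks it changes get copied, so the snapshot stays consistent.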
And now I can start talking about Docker storage drivers. So where do they fit in?

At first, when we released Docker, we only had support for one specific copy-on-write system, called AUFS. That's because it's what we were using at dotCloud; we were pretty happy with it and had a lot of experience with it. Yeah, this stuff works, it's great, we had been running hundreds of containers on a single machine with it. The problem is that AUFS is not in the mainline kernel: if you go to ftp.kernel.org, get the source tree, and compile it, there is no AUFS. You have manual patches to download and apply, and on top of that we would also apply other security patches and some cluster patches and everything, so maintaining the dotCloud kernel was complicated. But still, AUFS worked; we were using it, it was great. Debian and Ubuntu were using it for live CDs. You probably know those live CDs or USB keys: you put one in your computer and boom, it boots Linux. It doesn't touch your hard disk, doesn't change your data, doesn't write anything to disk, but you can still edit documents and install packages. It works with copy-on-write: everything is read from the CD or the USB key, but when you want to write (which is not possible on the CD, because it's a CD), it writes to memory instead.

So we decided: let's build Docker on top of AUFS. And it works, and everybody is happy. Everybody except the people who don't have AUFS, because they don't run Debian or Ubuntu kernels. That means all the folks using the Red Hat distros: Fedora, CentOS, RHEL. They got out their pitchforks and torches and went to their salespeople saying: we want Docker. Those people in turn came to us and said: hey, how can we make Docker happen on Fedora and CentOS and RHEL? That's going to be complicated. Get AUFS merged in the kernel, maybe? No; it looked like the kernel maintainers didn't want AUFS in the kernel. But maybe instead we could use other copy-on-write systems. And so Red Hat contributed the support for device mapper, which is another copy-on-write system, then Btrfs, and more recently OverlayFS. Special thanks to those two guys from Red Hat, and to Alexander Larsson, who initiated this whole thing: not only writing the drivers, but also writing the whole pluggable interface so that you can easily replace AUFS with device mapper or Btrfs and so on. Without them, only the folks running Ubuntu and Debian would be running Docker today.

OK, let's now compare those different storage drivers and see their pros and cons.

AUFS. That's the legacy one. The idea with AUFS is that you have a bunch of directories (we call them branches, or layers) and you mount them together as one. Now, when you try to read a file, or rather when you try to open a file, AUFS looks in the first layer: is the file here? Nope. Second layer: is the file here? Nope. And so on and so on. Two things can happen: either you don't find the file, and you return "file not found", or you find the file, and you open it from that layer. If you try to open for writing, there are three scenarios. Either you don't find the file at all, and then you create it on the top layer. Or you find it already on the top layer, and you just write to it. Or you find it somewhere in the middle, and then you make a copy; we call that a "copy-up", because we copy the file to the top. All those layers are read-only; only the one on top is read-write. So when you find the file somewhere in the middle, you copy it up to the top, because that's where you can write to it. Which means that when you open a big file that lives in the read-only layers, it will take some time, because before you can do anything at all, you have to wait for the whole file to be copied up.
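Outside of Docker, this kind of layered mount looks roughly as follows on an AUFS-enabled kernel (paths hypothetical; the top branch is writable, the others read-only):

    $ mount -t aufs -o br=/layers/top=rw:/layers/mid=ro:/layers/bottom=ro none /mnt/union
    $ touch /mnt/union/hello
    $ ls /layers/top     # the write landed in the top branch only
    hello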
And there is a special case when you delete a file: we place a special file called a whiteout, which is like the little white fluid you use when you want to erase something on paper. It means: there used to be a file here, but it is no more. So when you try to open a file and there is a whiteout, AUFS says: no, there is no file. The whiteout is there to hide the fact that in a lower layer there might still be one.

I will skip the practical details. So, what are the ups and downs of AUFS? The ups: it's the legacy thing, so it's been battle-tested. We at Docker have been using it for years, so we know it works. It's also very memory-efficient: if you start 1000 containers from the same image, the page cache only holds one copy of each file, so that's great. The downsides: as I said, if you open a big file that lives in the read-only layers, it takes a while to copy it up. And there is something I call the stat explosion. Say you have a huge Python path, 100 directories because you installed lots of packages, and you have tons of layers. Each time you do "import something", Python looks for something.py, something.pyc and something.pyo in every directory of the Python path, and for each directory, AUFS looks in the first layer, second layer, third layer, fourth layer, and so on. So each time you try to import something that doesn't exist, you do 100 directories times 3 file names, each one walking through every layer; with enough layers that's something like 30,000 stat system calls. That's bad; that's slow. That's why in some cases people say: hey, AUFS is slow with Python or Ruby or Java, when I have a huge Python path, Ruby load path, or Java classpath.

Next, device mapper. Device mapper is actually a very complex subsystem that can do RAID, encryption, that can simulate latency, things like that. In the context of Docker, when we say "device mapper", we mean one specific subsystem in device mapper: the thin provisioning target, which is something allowing you to take arbitrary snapshots of disks. In that case the copy-on-write no longer happens at the file level but at the block level, which means device mapper sits at the very bottom of the stack. Above it, you have something like ext4, the stuff that, when you open /home/jerome/.profile, translates that path to a location somewhere on disk where the content of that file lives. That translation happens, and it maps to a block on disk, sorry, and that's where device mapper does its copy-on-write magic. So when I try to write a block, that's when device mapper says: wait, that's block number 12,056, and it's actually read-only because it's part of a copy-on-write story, so I'm going to make an actual copy of that block, so that this container can have its own private copy of it while everybody else keeps looking at the shared one.

Ups and downs of the device mapper system. The upside is that each container gets its own virtual disk, so if you want to move between VMs and containers, that's a little bit easier. It's also better if you want to limit the disk space used by your container: it has a virtual disk, so when that disk is full, it can't go beyond it. The downside is that by default, device mapper uses two big files, "data" and "metadata", to store the blocks, and by default those are sparse files.
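If you haven't run into sparse files before: a sparse file looks big, but it only allocates real disk blocks when you actually write to them. A quick way to see the idea (file name hypothetical):

    $ truncate -s 100G data.img
    $ ls -lh data.img     # apparent size: 100G
    -rw-r--r-- 1 root root 100G Jan 1 00:00 data.img
    $ du -h data.img      # blocks actually allocated: none yet
    0       data.img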
So sparse files are a kind of lazy allocation mechanism; not exactly copy-on-write, but related. What happens is: the first time you write to a file in a container with this copy-on-write system, it goes: OK, you're trying to write to block number 1000; actually, I have to make a copy of that block first; now OK, we write here... oh, it's in a sparse file, and to write into a sparse file I actually have to allocate a real disk block from somewhere. So again, instead of writing directly, I go somewhere into the disk pool and find some free blocks. Which is why, by default, Docker on device mapper is extremely slow: a performance problem. You can tune that; the keyword to remember is --storage-opt. Just google that and you will find the documentation about how to make Docker not slow if you're using device mapper.

OK, Btrfs. Btrfs is kind of in between, because the snapshot happens at the filesystem level. Btrfs is super smart: you can say, OK, this directory is now a subvolume, something I will be able to take snapshots of. When you make a snapshot of a subvolume, it's exactly as if you made a full copy of the directory. Imagine you're working on some code and you say: hey, I'm going to break all the things. Imagine you don't know about version control. You could make a copy of that directory; but with Btrfs, you can take a snapshot, which is instant even if your code is, like, 5 gigs, and then you can immediately work on it while the original doesn't change. That sounds stupid with code, but if you're talking about a big geographic database with 60 gigs of data, suddenly it becomes very interesting. (There's a rough sketch of this at the end of this part.) In practice with Docker, well, that's a few commands. There are some shortcomings with Btrfs, but I won't have time to discuss them, unfortunately; I will be down there afterwards and I can tell you more about the horrible things that Btrfs does to your disks.

OverlayFS is just like AUFS, except it's in the kernel; well, in recent kernels, 3.18 and later. It means that everybody will be able to get the advantages of AUFS (the fact that you can start many copies and be memory-efficient) with any recent kernel, not only the Debian and Ubuntu ones.

OK, VFS, last but not least. VFS is not really copy-on-write; it's copy, period. When you use the vfs driver and you do docker run python, you make a full copy of the python image. So what's the point? The point is when you have to work with some old grumpy sysadmins who go: I don't want your fancy copy-on-write filesystem on my machine, it breaks all the time, it's not supported code, we already did that before, so go away. In that case, if you have something mission-critical and you don't want to take risks, you can use vfs, because it just uses plain copies. Each time you start a container, it makes a copy: inefficient, slow, it uses tons of memory and disk space and whatever, but at least it doesn't rely on risky code. And also, if you're porting Docker to another platform, like Solaris or BSD, it will work, because it doesn't rely on anything Linux-specific.
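Here's the rough Btrfs sketch I promised (assuming /data is on a Btrfs filesystem; paths hypothetical):

    $ btrfs subvolume create /data/code
    Create subvolume '/data/code'
    $ # ...put your 5 gigs of code in /data/code...
    $ btrfs subvolume snapshot /data/code /data/experiment
    Create a snapshot of '/data/code' in '/data/experiment'

The snapshot is instant; you can hack away in /data/experiment while /data/code stays untouched, and only the blocks you actually change consume new disk space.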
OK, the conclusion. People generally ask: hey, which storage driver should I use for my setup? Because the nice thing about storage drivers is like the nice thing about standards: there are so many to choose from. The bottom line: if you're doing platform-as-a-service or something else where you need high density, use AUFS if it's available on your system, or OverlayFS, because those are memory-efficient. If you have big files sitting in the read-only layers and you want to write to them, like that big geographic database, for instance (you do some stuff with OpenStreetMap data, and it's great because you can have the whole dataset and make experiments on it without having to make a full copy each time), then use Btrfs or device mapper. And if you really want me to tell you which one to pick, I would tell you: pick the one that you know best. If you already have some experience with device mapper because you've done some stuff with LVM, then use device mapper. If you have some experience with Btrfs because you've already used it on some of your systems and you kind of know how to drive it, then use Btrfs. Otherwise, just try them out, benchmark them, and see what works best for your specific workload. As always.

OK, that's it. I will now take a few questions.
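For reference when you do that benchmarking: checking which driver you're on, and forcing another one, looks roughly like this (flag names as of Docker versions around the time of this talk; on older versions the daemon is started with docker -d instead of docker daemon):

    $ docker info | grep 'Storage Driver'
    Storage Driver: aufs
    $ docker daemon --storage-driver=overlay    # or devicemapper, btrfs, vfs...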