All right, next up we've got Tycho Andersen, who's going to be talking to us about an operator-centric way of updating application containers. Maybe. Yeah, all right. Cool. I can hear myself, even.

Hi, everyone. My name is Tycho Andersen. I work at Cisco on a Linux/container platform, and what I'm going to talk to you about today are some ideas we have about how to do updates better. A lot of this is a work in progress, so there's a slide at the end with links to GitHub. Please come help. But I'll try to point out what we're thinking about. And the last thing I should say: my boss would be pissed if I flew all the way here and didn't say we're hiring. So, we're hiring. Check.

Just a little bit of history. First there were system containers. LXC used tarballs; OpenVZ had a runtime where you could do some fancy stuff with a thing called ploop, but for image distribution it was mostly just a tarball too. And people generally found building root filesystems painful: there wasn't a lot of tooling for creating a rootfs. LXC would run them, OpenVZ would run them with the fancy ploop thing, but if you as a user wanted to create your own rootfs, it wasn't super easy.

Then application containers came along, with two application container formats that look mostly the same, and they made building rootfses easy. People really liked Docker because it had the Dockerfile: you could install stuff in a nice way and it would give you an image at the end, and that was very nice. So that's where we are today: we have both the Docker and OCI formats, people use them, and they can build containers, which is all very nice. But we have a problem with updating and general management of these things, because tar is kind of an old format.
So, just some basics before we move on, so I can set up the problem. I'm going to describe all of this in terms of the OCI format, because the tooling we've implemented targets the OCI format, but the Docker format is roughly the same.

What it looks like is: there's an index, which has a list of manifests. (All my diagrams are off because I had to switch from 16:9 to 4:3, so for the first OCI layout, imagine the arrow pointing from index.json to the bottom thing.) The manifest is a content-addressed hash of a JSON blob that describes information about the image. Then there's the config (shift that up one in your mind), which describes properties of the image: what environment variables to set, what the entry point is, things like that. And then there are the two layers, which are the actual bits on disk in the container image. In particular, each layer is a tar file, optionally gzip-compressed. The image is basically made up of these sets of manifests, configs, and layers.

So those are the basics; now the drawbacks. One drawback is that a layer really is just a .tar.gz file, and a consequence is that there's no deduplication. The way these things are typically constructed, you build one layer at the bottom and then make some changes in the next layer. If you make a one-byte change in a one-gigabyte file, the tooling says "oh, this is different, I'll recompress this whole one-gigabyte file." So for that one-byte change you end up with two gigabytes of data where you really only need one. There's no deduplication of files, or even of similar bits across different files.

The whiteouts are also kind of painful. tar doesn't really have any concept of lower layers, so the OCI standard invents this thing called .wh.foo.
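To make the whiteout convention concrete, here's a small sketch (mine, not from any OCI tooling) of how an extractor merges per-layer file listings bottom-up, where an upper-layer entry named `.wh.<name>` hides `<name>` from the layers below:

```python
import posixpath

WH = ".wh."  # OCI whiteout prefix

def apply_layers(layers):
    """Merge per-layer file listings bottom-up, honoring OCI whiteouts:
    an entry named '.wh.<name>' removes <name> (and, if it's a directory,
    everything underneath it) from the merged result."""
    merged = set()
    for layer in layers:  # bottom-most layer first
        for entry in layer:
            parent, base = posixpath.split(entry)
            if base.startswith(WH):
                victim = posixpath.join(parent, base[len(WH):])
                # Drop the whited-out path and anything beneath it.
                merged = {p for p in merged
                          if p != victim and not p.startswith(victim + "/")}
            else:
                merged.add(entry)
    return merged

lower = ["etc/ssl/cert.pem", "usr/bin/python3"]
upper = ["etc/.wh.ssl", "usr/bin/pip"]  # the upper layer deletes etc/ssl
print(sorted(apply_layers([lower, upper])))  # ['usr/bin/pip', 'usr/bin/python3']
```

Note that the whiteout only hides the file in the merged view; the bytes still exist in the lower layer's blob, which is exactly the waste being described here.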
If a file in one of the upper layers has that prefix, then during extraction the file it names gets deleted, along with anything underneath it if it's a directory. Which is fine, but you still have all that data in the lower layers even though you're never going to use it. Again, it's a deduplication-adjacent problem, and it's not awesome.

Large layers are painful too. In particular, we have layers that are between 8 and 10 gigabytes, and while you can compress gzip in parallel if you're clever, you really can't decompress it in parallel. That means one core has to decompress 8 gigabytes of data, because it's just one big long tar stream. That's not ideal; it can be slow. It turns out I'm not the only one who's observed this: there's a gigantic blog post that goes into lots of detail about what else is wrong. These are the major drawbacks from my perspective, but there are others, and there's also a lot of history about how tar evolved to be the way it is, if you're interested in that sort of Unix spelunking.

So let's take a step back and think about what would actually be useful. In particular, at Cisco we're interested in image provenance: when the build system builds an image, it should sign that image, and from then on we can take the signature and validate it, and know that this really is the image the build server built, it's okay to run, nothing bad has happened. We also want auditability: if we have a running machine with a bunch of containers deployed, we want to be able to ask that machine, "are you running what was signed at build time?" There are a number of ways to implement that, which I'll talk about in a bit. And one other thing is updateability, which is the work-in-progress part.
We would like to be able to swap out some dependencies without having to go back to the developer and say, "hey, can you rebuild us a new container, because there's this CVE in libssl" or whatever. We'd like to swap out SSL while keeping all the other stuff. And the last goal is to use less space: given all the deduplication problems I was just talking about, we shouldn't be shipping bits around in production that nobody's ever going to use.

If we look at image provenance, there's a clever property: you only have to sign index.json, nothing else, because all the layers and everything else are content-addressed. index.json contains a SHA-256 hash of the content, so if somebody changes a bit, all you have to do is verify that the hash matches the file name, and if it doesn't, things are bad and you can throw an error. So image provenance isn't really that hard, and the image format design supports it very nicely.

For auditability you'd like to do the same thing, and the image format could lend itself to it, except that you have to extract all these tar files, and as soon as you write them onto a filesystem, you can't put the genie back in the bottle. There's no way to go from an extracted filesystem back to the tar format in order to check that the signature is valid, and even if you could, you'd still have to keep another copy of the tar file around just to make sure everything was there. So that's not awesome.
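The "sign only index.json" idea can be sketched as follows. This is my illustration, not the real tooling; `write_blob` and `verify_blobs` are hypothetical names, and a real OCI layout has more structure than this. The point is just that if every blob's content hashes to its file name, one signature over the index transitively covers everything:

```python
import hashlib
import os
import tempfile

def write_blob(root: str, data: bytes) -> str:
    """Store data content-addressed, OCI-style: blobs/sha256/<hex digest>."""
    d = os.path.join(root, "blobs", "sha256")
    os.makedirs(d, exist_ok=True)
    name = hashlib.sha256(data).hexdigest()
    with open(os.path.join(d, name), "wb") as f:
        f.write(data)
    return name

def verify_blobs(root: str) -> list:
    """Return names of blobs whose content no longer hashes to their file
    name. If this comes back empty, a signature over index.json alone
    covers the whole image, since every reference in it is a digest."""
    d = os.path.join(root, "blobs", "sha256")
    bad = []
    for name in os.listdir(d):
        with open(os.path.join(d, name), "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != name:
                bad.append(name)
    return bad

root = tempfile.mkdtemp()
blob = write_blob(root, b"pretend this is a layer")
print(verify_blobs(root))   # [] -- everything checks out
# Now tamper with the blob; verification flags it by name.
with open(os.path.join(root, "blobs", "sha256", blob), "wb") as f:
    f.write(b"tampered")
print(verify_blobs(root))   # the tampered blob's digest is reported
```

The auditability problem described above is exactly that this check only works while the blobs still exist; once they've been extracted onto a filesystem and thrown away, there's nothing left to hash.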
There is actually a way to do this that's supported in the Linux kernel today: a feature called IMA (or "ima", depending on who you ask). It stands for Integrity Measurement Architecture, and what it really means is that you can put a hash or a signature in an extended attribute of a file, and at run time the kernel will look for this special extended attribute and hash the whole file during the open call. So you can be sure that when the kernel hands you back a valid file descriptor, the hash matched and none of the data has been tampered with. That's very nice, because tar supports extended attributes, so we could actually shoehorn the use of IMA into the OCI image format without a lot of changes, and it would mostly just work, as long as you distribute a policy and a few other things. For the most part, you can put the metadata in the existing format. (To answer the question from the audience: yes, the checksums are stored in the extended attributes.)

So the question is, why don't we just do this, since it seems easy? The first reason is that then you have to use IMA, and if you've ever dealt with it before, it's sort of painful. The policy language is a little funky; they use a lot of very specific terminology that really is only specific to them. The way they talk about keyrings and that sort of thing is different from the way everyone else in the whole world understands it. And it's also really not necessary: if we had a better design, we'd only need one signature on that manifest, because we already have the information we need, so we could get this for free. Given that the tar file isn't really ideal for this sort of thing anyway, if we're going to throw it away, maybe we can be more clever and get this for free. We could use SquashFS instead of a tar file, and that gets us most of the way there. So what is SquashFS?
It's a mountable, read-only filesystem. In particular, if you read the kernel documentation, it says SquashFS is intended for general read-only filesystem use and for archival use, in particular in cases where a .tar.gz file may be used. Okay, sounds good; kernel developers wrote that, they're smart people, maybe we should use it.

One of the advantages of SquashFS is that the metadata is stored separately. If you know how a tar file works, it's basically just a concatenation: first there's a header, then, if there's going to be any data, the data, then another header, then more data, and so on. So if you want to open the last file in a tar archive, you have to seek all the way through the whole thing before you figure out "okay, this is the file I want" and open it, and if the next file somebody opens is the second-to-last file, you have to seek all the way through again. You could imagine building an index, but these folks already did: the metadata is stored separately, with pointers to the points where the file data is actually stored, so it's seekable. And the last thing is that they support parallel compression, so the problem I described earlier, where one big stream has to be decompressed on a single core, goes away.

So what would this actually look like? Basically, we just use SquashFS filesystems instead of the tar blobs. Then we can mount each layer as a SquashFS directly out of the image, and merge the whole thing with overlayfs. Thus we only have one copy of the data, it's seekable, it's fast, and we can do the signature verification, because the blobs aren't being extracted onto some filesystem or mutated in any way; they're mounted straight out of the image. And everyone's happy. Except there are some issues.

One issue with overlayfs is that the way you pass directories, the way you say "put the layers in this order", is with mount options, and mount options are currently limited to 4096 characters, one page. (I guess on ARM, with larger pages, this probably isn't a problem, but if you're on x86 it is.) Roughly what that means is you get about 55 layers, so if your container has more than 55 layers, this strategy won't work for you. I say approximately because it depends on the path you're mounting from: if you're mounting from some very deep path you get a lot fewer, but for something reasonable, /var/lib/whatever/oci/something-something, the math works out to roughly 55 layers. I don't know if anybody here builds containers or images with more than 55 layers, but we have roughly 200. We can work around this, but it's a limitation.

Overlayfs also has a non-customizable whiteout format. In particular, it does whiteouts differently than OCI does: it uses a device node of major:minor type 0:0, whereas the OCI spec says to use the .wh. prefix. So if you generate an image like this, it's not exactly an OCI image, because (a) it's using SquashFS in the first place, but also (b) if you want to use SquashFS this way, you have to use these device nodes instead of the whiteout prefixes. That's sort of annoying, and it's hard-coded in the kernel, so it's difficult to change. So that's not awesome. A minor thing: overlayfs doesn't support exactly one layer, so if you have a container with exactly one layer you have to do some fiddling around to get it to work. The tooling I've written does all that, but it's a thing to remember, and it's relevant because base images have this shape.

There are also some issues with SquashFS itself. When we were looking at playing around with this, the first thing is that it's not really active,
and the last commit in the kernel tree was from August of 2018, so people are not sending a lot of patches. Maybe that means it's done, but I don't think so, because there are also really no userspace libraries for generating blobs. In particular, the way you generate a SquashFS is with this tool called mksquashfs, and basically there's just a whole bunch of command-line arguments you can pass it. If you look at the code for what I've done, we basically build up this massive command line of "exclude these files but include these ones" so that we can generate exactly the layer diff we want for a particular layer, which isn't really awesome. It's kind of a brutal hack.

SquashFS also doesn't support some filesystem primitives that containers use, the biggest one being ACLs. For example, we sometimes use CentOS, and CentOS uses ACLs in various places. The classic example is ping: ping needs CAP_NET_RAW to be able to send the right kind of packets out, and everyone used to ship it setuid, and then there was all this discussion about why the hell ping is setuid, so they started using ACLs and capabilities and such. Anyway, SquashFS doesn't support ACLs, which is sort of annoying, and there are others. So there probably does need to be some work on SquashFS if we continue down this path.

But we're doing this anyway, even with all these problems. And I guess one thing to say here is that we're doing it in the way I've described, threading all these hacks together, because we're really trying to see if this will work. This is one of those places where other people's input would be appreciated. I think I have a slide later, but there's been some talk about an OCI v2 and what that would look like. I know there's been a lot of work in systemd on this tool called casync, content-addressable sync, which addresses a lot of the
deduplication issues, but doesn't necessarily address the signing and auditing issues. So if you're interested in this sort of problem, come talk to me; we'd be interested in collaborating on potentially designing a new image format, or whatever exactly that looks like.

That's all been sort of in the weeds, but one of the things we're really interested in is updating containers, and the original pitch of this talk was an operator-centric way to update containers. So what does that look like? Think of the ways to do code management as a spectrum. On one end of the spectrum there are Docker or OCI images, which are bit-for-bit exactly what the developer built. You get exactly those same bits: the same version of SSL, the same version of Python, the same version of Java, the same version of every dependency in the whole world that the developer used. So you know exactly what's going to run, and that's very nice. But then you have all these problems where you have to go back to the developer and ask, "can you build us another version with an updated SSL dependency?", because while you got exactly the same bits that developer had, that means you have all the same bugs they had: software bugs, security issues, whatever. You've got to patch stuff.

At the other end there's traditional application packaging. That's the way we used to do it: you would build some thing outside of a container, you'd build, say, a deb package, that package would list its dependencies, and you'd get some version that isn't exactly the right match, because whatever you were using on your local machine when you built the deb is different from the production environment. And so some
little bug somewhere causes things to screw up, and that's annoying, and that's why we all switched to Docker in the first place.

So this is a continuum, and I guess the insight here is that you probably want something in the middle. In particular, you may know, "okay, I really depend on this exact version of Python, because, I don't know, the garbage collector has this particular behavior and we really care about that because we're a cool HFT firm." So maybe you really know that Python is super important. But over here, for some library that isn't that important, or is mostly unused, or is like SSL, where there are maybe not a lot of functionality updates but there are definitely security updates, you don't necessarily care exactly which version you're using; you just want the latest one. So what you'd really like is: in some cases you use the exact same version, and in other cases you don't, and you use whatever the latest is and allow people to update it out from under you.

One of the tools I wrote is called Stacker; for the purposes of this talk, you can think of it like Docker. This is the basic format. I'm not going to explain a lot of it, but basically what you can see is that there are two applications, A and B, and they both depend on OpenSSL and Python 3. There's some way you install them: you clone the repo and run some install scripts, but the first thing you do is yum install those two libraries. And in this world there's really no great way to say "I'm going to rip this layer out and stick this other one in." When there's an OpenSSL bug, there's no great way to say "this is the layer that corresponds to the OpenSSL package" so that, as an operator who wants to change the image, I can yank out the SSL and stick in my patched one.

So you might imagine thinking about this problem slightly differently, like this. (I'm going to draw some diagrams of the layers, and the colors are relevant for that.) You might write your application install script so that on the left-hand side you have two specifications, "this is how to build something called SSL" and "this is how to build something called Python 3", and on the right you say: start from this base, then add this other thing somebody built called SSL, then add this third thing somebody built called Python 3, and then install my application on top of that.

So what does that look like? The SSL layer is built with, say, the CentOS base as the bottom two layers and the SSL layer on top. The Python 3 layer looks identical, because it starts from the same CentOS base, but with a Python layer stuck on top. Then for the end result of our total build (remember the apply syntax): we apply ssl:latest, see that the bottom two layers are the same, so we just apply the one layer that was different; we similarly apply python:latest, again only the layer that was different; and then we install our application on top. So we end up with something where the bottom is the base image, we have these two layers we applied on top that were just the deltas for SSL and Python, and then the application delta on top. And the nice thing is, if I want Python latest-plus-one, all I do is...
the only thing I have to do is change this one layer, and then it's happy. So that's the idea: to build some tooling around this. We have some runtime tooling to do this; unfortunately not all of it is open source. But anyway, that's the idea.

The last thing I'll talk about is size; that was another complaint, and we basically punt on it, mostly because the other problems are more important to us.

So yeah, this is my call to action. There's an issue open about what a v2 of the OCI image format might look like, and there's some discussion on that thread. I guess there's also some discussion on Twitter; I'm not a Twitter user, unfortunately, so I can't help you there. But the question is: what would a new container image format look like? We're sort of doing this now because we're interested in it now, but I can imagine that we can solve both the size problem and the provenance problem if we come up with a clever solution, and we need help to come up with a clever solution. That's where you come in. So, thank you. I think I have about three minutes for questions.

Q: You did not open source this. What is the reason?
A: We have open sourced the tools for building it. There are two: Stacker is the tool to build images with the special apply syntax I described, and atomfs is the filesystem piece that mounts the OCI images, so that's part of the runtime. But we have a bunch of code that's built on top of this that is not open sourced right now.

Q: One of the reasons containers and this kind of thing are used, not everywhere but in some places, is for preserving software. If you start specifying your layers as "latest", chances are you'll actually get something completely different if you try to rebuild your numerical calculation platform five years later. Have you thought about a way of specifying something like "latest, but with these preferred versions", so you really get what you mean?
A: Yes. We mostly use semantic versioning for our layers, and it functions exactly the way you'd think semantic versions function. We don't actually do this now, but you could presumably do some globbing, like fancy package managers (cargo, the Go module system, and so on) do, where they take the latest of some minor version; you could do a bunch of math there. But basically semantic versioning is the way we're handling it internally. Other questions?

Q: All right, thank you. I have a question: did I understand correctly that Stacker does something similar to what the files sections from packaging do, i.e. calculate which file paths should be part of which layer?
A: It's possible; I don't actually know what those files sections do.
Q: They just list the files or directories.
A: Yeah, exactly, so it's this idea of basically computing a binary diff over the layers. I think there was maybe one more?

Q: Is it possible to use SquashFS inside user namespaces?
A: I don't think so; I don't think it has FS_USERNS_MOUNT. You can use squashfuse, so you can do it through squashfuse, but not SquashFS proper.
Q: Yeah, in Ubuntu and other distros we run SquashFS-based images.
A: Yeah, and like I say, SquashFS isn't really the greatest format, except that it's one that works for this use case right now, today, and we don't have to spend a lot of time inventing a new format. But if we're going to do all this other stuff anyway, implementing ideas from casync or whatever, then potentially when we do that format we'll do better, and also maybe make it safe for user namespaces. But yeah, come help. Thank you.