I go by BJID everywhere on the internet: Twitter, GitHub, Stack Overflow. To slightly disappoint you: if you think I'm talking about shipping containers to production, that's not what this talk is about. If you're disappointed right now, I'll try to make you slightly excited again: we ship about 30,000 containers daily, we run an average of 11 containers per build, and we run about 2,800 builds daily. So this talk is about how we rebuilt our CI system, powered by Docker.

Why did we choose to rebuild our CI system in the first place? We had these problems. First, our builds were too slow. The feedback cycle for developers was too long: they would write code, commit it to master, and the build would run for two hours on CI. Then we'd merge master into release, and that build would take another two hours, which means a single commit took four hours to get to production if they really wanted to ship or fix something. Not great. Builds also weren't easily reproducible: things would pass on a developer's machine but then fail on Jenkins. And our build hosts got mutated by the builds themselves and by the builds' dependencies. So we'd see situations like a Java 7 library that requires the Java 7 JDK getting the Java 8 JDK instead, so the build refuses to run because of its dependencies, or things pulled straight out of Slack conversations like "Postgres seems wedged on this Jenkins machine, so I'll need to go bounce it," or "Postgres isn't running at all." These painful experiences made us say, in the spring of this year, that we really wanted to address and fix these problems.

So we came up with very simple objectives. First, we wanted to make the builds way faster. We wanted to make the builds reproducible. We also wanted to make the builds host agnostic, so you could build on any Unix machine and run on any Unix machine. So we went on an adventure. This is actually my colleague James, who went on this adventure of discovery, and he discovered Docker.

So why is Docker suited to the problems we faced, the problems we were trying to solve? Briefly, the things that are nice about Docker. Docker has image/Dockerfile inheritance: you can say, FROM this base image, I want to build off of it, do some number of things, and create a new image. Docker also supports image-level caching. If application A depends on image X and application B also depends on image X, then when application A runs first on a machine and pulls down image X, and application B runs next, Docker will say, "I see this image exists locally," and won't pull it again. Or if image X was just built on this machine, Docker sees it has been built previously and won't try pulling it, which is very great for making things fast. Docker also reuses layers, even across images: if Docker notices that the SHA sum of a layer it needs for image Y matches a layer that already exists in image X, it won't pull that layer from the registry; it just reuses the layer that exists in image X.
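As a rough illustration of that inheritance and layer reuse (the image names and packages here are made up, not our actual images), two Dockerfiles that share a base only pay for the base's layers once on a given host:

    # base/Dockerfile -- a shared base image, built and pushed once
    FROM debian:wheezy
    RUN apt-get update && apt-get install -y curl ca-certificates

    # app_a/Dockerfile -- inherits every layer above via FROM
    FROM registry.example.com/base:latest
    COPY . /app_a

    # app_b/Dockerfile -- pays nothing extra for the base: those layers are
    # already cached locally from building or pulling app_a's image
    FROM registry.example.com/base:latest
    COPY . /app_b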
The same goes for a push: Docker will say, "I know the registry already has this layer," so it won't push that layer, it just skips that part, which again is great for making things fast. Docker is portable across machines, so you can build your image on a Fedora box and run it on an Ubuntu box and it's fine. And Docker is widely adopted; the alternatives don't have as much adoption as Docker does. So it was easy for us to pick Docker and say: if we run into problems, people have most likely already seen the same problems we're going to run into, and we can leverage their solutions and run with it.

So we decided to put Docker to use. We leveraged Docker's inheritance between images and Dockerfiles and ended up with this kind of tree. This is a very minimal representation of a very complex tree that we run at Braintree. We start with the base image, the wheezy base; everything common to every Braintree application lives in that wheezy base. We actually have a couple of other bases: we have Jessie, and for those who run Alpine, an Alpine base image. Everything specific to Braintree lives inside that base image. Then services like RabbitMQ, Redis, and Postgres each have the things specific to that service living inside their own images. Then languages have language-specific images: Ruby 2.3 has the things specific only to Ruby 2.3, and the same goes for Ruby 1.9, Java, Node, Elixir, and so on. Then an application says: I'll build off of the base image of my language, which has all my language-level dependencies, and I'll put all of this app's dependencies inside the application's base image. So application A, for example, depends on application A's base image, and all the apt-get installs, the dependencies, and in some cases the bundle install, live inside that base image. Then your builds run inside application A's image, which is just your code that keeps on changing. We structured this in such a way that any time anything changes at the top of the hierarchy, the entire part of the tree below it is completely rebuilt, and we rebuild the whole thing weekly. So if you don't pin versions of dependencies, you get pretty much the most recent version of your dependency within about a week, which is great for us.

To make all of this really work, we also run our own registry. We started with a registry running only in EC2. But our Jenkins master and a couple of Jenkins workers actually run in one of our physical data centers, and pretty quickly we noticed that the cost of a round trip to S3 and EC2 was very expensive for us. So we spun up a new pool of registries in our physical data center. They all talk to the same S3 backend as the one running in EC2, but we add a layer of caching in nginx. The first time you do a pull, whether from a developer's physical development machine or from a Jenkins worker, nginx says, "I don't know about this layer, so I'm going to go ahead and talk to the registry."
Once it has done that the first time, on subsequent requests nginx says, "I know about this layer," and doesn't even send the request upstream. We do it just once and never again, which kills all the requests we kept sending to S3 and makes things way faster. We also do SSL termination and load balancing through HAProxy, using source IP load balancing, so you always talk to the same registry you talked to the first time. That made things much better than they were, because at some point pulls and pushes, well, pulls especially, had become very expensive, and because we're heavily pull based, as you'll see further down the line, we had to set up this registry pool in our physical data center.

So what's our build flow like? We start by building the image, which becomes the image I showed at the leaf of the tree two slides earlier. We spin up a container and run the specs inside it. If the specs pass, we seed the database: if this is a primary application, we go ahead and seed the database and prepare it, then do a docker commit of the project and database images. We do a double tagging of our images: we tag first with the application's git SHA and then with whatever branch we're running off of. So a downstream build, or even a human downstream, can say, "Given this metadata, which is probably the git SHA, I want the registry to give me the image that matches this git SHA," and it will provide it, because that's how we tag. Or you can just request the most recent master build of something. We do the same thing with our database: we compute an MD5 sum of all the files involved in generating the seed for the database, use that to tag the database image, and also do a master (or whatever branch) tag, because we run a couple of branches. Then we push, a docker push of the project and database images, and we trigger the downstream builds. For our core application, that ends up triggering over 800 builds, which means once we've built this thing once, those builds can just rely on it. That was the part that had been very painful for things like our client library builds: they depend on four or five different applications, and previously each build would run a database migration and seed the database for each one of them, so a build that should take five minutes would run for over 50 minutes and still not complete.

One result of this work is that our builds are reproducible. Because of how we tag our images, if something fails on CI, you can confidently say, "I can pull down the exact image that failed on CI, run it on my machine, and I'm guaranteed to completely reproduce the failure."
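A rough sketch of that double-tagging flow as shell commands; the registry host, image names, seed file paths, and the GIT_SHA / container variables are illustrative, not our real ones:

    # the database tag is an MD5 sum over every file involved in generating the seed
    DB_MD5=$(cat db/structure.sql db/seeds/*.rb | md5sum | awk '{print $1}')

    # commit the spec container and the seeded database container to images,
    # tagged first with the git SHA / seed MD5 ...
    docker commit "$APP_CONTAINER" registry.example.com/sample_app:"$GIT_SHA"
    docker commit "$DB_CONTAINER"  registry.example.com/sample_app_db:"$DB_MD5"

    # ... and second with the branch this build ran off of
    docker tag registry.example.com/sample_app:"$GIT_SHA"   registry.example.com/sample_app:master
    docker tag registry.example.com/sample_app_db:"$DB_MD5" registry.example.com/sample_app_db:master

    # push both tags so downstream builds (and humans) can pull by SHA, MD5, or branch
    docker push registry.example.com/sample_app:"$GIT_SHA"
    docker push registry.example.com/sample_app:master
    docker push registry.example.com/sample_app_db:"$DB_MD5"
    docker push registry.example.com/sample_app_db:master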
Our builds are also significantly faster. This graph shows the average build time in minutes on the old CI and the new CI: the red bars are the old CI, the green bars are the new CI, and this was an immediate, significant benefit that made it very obvious this was the right way to go.

Given this success, if that were all, my talk would just end here. But good things don't come easy. We achieved the objectives, but it came with other problems. We were chaining Dockerfiles and Makefiles: this Makefile invoking that Makefile invoking that Makefile, and then compose files extending compose files extending compose files, until you ended up with five or six different compose files. We were also duplicating things across compose files: because application A depends on B, C, D, E, and F, we had to copy everything we'd written for B, C, D, E, and F into A just to spin up the images for all of that build's dependencies, which was insane. It was a very bad user experience, and we were going to deliver this to the whole dev team at Braintree, so we obviously knew we were going to get serious pushback; people would not really want to use this. It looked like a big bag of candy with all sorts of things inside it.

So we decided to make things better. We came up with a new objective: replace this chain of tools with something better. We decided to write a DSL, a DSL that would replace the compose files and the Makefiles. The DSL is written in the lingua franca of every Braintree developer, Ruby; every Braintree dev should be, and is, able to write Ruby. It's heavily inspired by Rake. It's extendable in Ruby, because it's written in Ruby. And it reduced and removed duplication; you can chase down duplication in almost any language, but this helped us completely remove the duplication we had previously.

The DSL looks like this, and it maps onto the concepts in compose. You say service postgres, and it has an image: "use this image to spin up any container for this service." Then service sample_app, with its image. The job definition is the meat of it: the job says, "I depend on sample_app and postgres," and the body of the job is any Ruby code, any Ruby that will run, so you can even invoke your shell directly from there. Then sample_app links to postgres with the alias sample_app_db; it has an environment variable POSTGRES_HOST pointing to sample_app_db and POSTGRES_PORT pointing to 5433. Then it has a compose command, rake: it creates a compose object and does a run on it. That translates to this docker-compose file. The tool itself is called jake.
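Since the DSL itself isn't public yet, here is only a rough reconstruction of what that jake file looks like, based on the description above; the method names are guesses, not the real API:

    # hypothetical sketch -- method names are illustrative, not the actual DSL
    service :postgres,   image: "registry.example.com/postgres:latest"
    service :sample_app, image: "registry.example.com/sample_app:latest"

    job :test do
      depends_on :sample_app, :postgres              # containers this job needs running

      link :sample_app, to: :postgres, alias: "sample_app_db"
      env  "POSTGRES_HOST" => "sample_app_db",
           "POSTGRES_PORT" => "5433"

      compose_command "rake"                         # what to run in the sample_app container
      compose.run                                    # build a compose object and run it
    end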
If you run jake test, it just goes ahead and does a docker-compose run; it imitates what you would have done with docker-compose run sample_app against that compose file. So you might wonder: what exactly is fancy about this? How does it make things better than just using a compose file? Well, let's go to a slightly more complex example.

I introduce a new concept here called an application. Application here refers to the application from two slides earlier, the one that defines its own jake file. It says: the application I'm specifying lives in a namespace called sample_app; you can find it at this local path, but if it isn't there, it can exist at this remote path, which is the remote path here; and always use this revision, which is fixed here, though you can use Ruby to dynamically define whatever revision you want. It also says: I'm defining a service called foo with this image. Now, you can have any number of jobs inside what we call the jake file, and any number of services; it doesn't restrict you. And you can write Ruby code inside it; as I said, it's valid Ruby. The job test here depends on sample_app from the sample_app namespace, on postgres from the sample_app namespace, and on foo, which we just defined inside our own namespace. Then foo links to sample_app from the sample_app namespace, aliased as sample_app, and to postgres from the sample_app namespace, aliased as postgres. The compose command is rake: you create a compose object and do a run on it. It equates to the same kind of thing as before. It also gives you this nice thing of having the host user ID available inside your container, so you can tell exactly which user is running the daemon and which user ran the build itself; I'll explain later why we added this and why it's useful down the line.

Given this, we've been able to power some really complex builds and write very complex jobs like this one. This is actually our European application here; until we wrote this, it was almost impossible to write a job that fully exercised our entire European application flow. Now we simulate, pretty much in reality, exactly what happens in production with this one job. It spins up 22 containers and equates to 221 lines of compose file; for brevity I've trimmed the contents of the compose file. If you draw a graph of the links, this is what it looks like, which shows you the complexity of the job itself. And we now power even more complex jobs; people at Braintree have written more complex things after discovering this, based on the power of the DSL we've written. We're also planning to open source it, in case you were going to ask that question.
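Again as a hedged reconstruction rather than the real syntax, the namespaced example described earlier might look roughly like this; the paths, method names, and image names are all illustrative:

    # hypothetical sketch -- names and methods are illustrative
    application :sample_app,
      local_path:  "../sample_app",                        # use this checkout if present
      remote_path: "git@git.example.com:team/sample_app",  # otherwise fetch from here
      revision:    "master"                                # any Ruby expression works here

    service :foo, image: "registry.example.com/foo:latest"

    job :test do
      depends_on "sample_app/sample_app", "sample_app/postgres", :foo

      link :foo, to: "sample_app/sample_app", alias: "sample_app"
      link :foo, to: "sample_app/postgres",   alias: "postgres"

      compose_command "rake"
      compose.run
    end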
So let's talk about the things that surprised us with Docker. This first one is documented behavior, but it's very annoying, surprising behavior. If a directory doesn't exist on your filesystem and you do a docker-compose run specifying that directory as a volume, telling it to mount at some point in the container, then instead of Docker freaking out and saying, "I don't know about this directory," Docker silently ignores that, creates the directory on the local path, and mounts it.

Let me show you an example. We run ls at the start and see the directory is completely empty. We do a docker run with a volume flag specifying a directory that doesn't exist on the local path, saying it should mount this directory b, which does not exist, to /a, and then just run a simple /bin/bash. Docker goes ahead and does not freak out; it just silently creates these things for you. You check and /a actually exists inside the container; you can write things into /a from inside the container, and it happily lets you. Then you get out of the container and do an ls of the directory from the outside, and you see the directory has actually been created. It's been created as root, which makes it even worse, and the things written into it are owned by root, so you can't even remove it. This was very, very painful for us, because we mount volumes for JUnit output in order to render our test results in the Jenkins UI.

So the tip is: if you really want to use volumes, make sure the user ID or group ID of the user running the daemon matches the user ID or group ID of the user that runs inside the container. Otherwise you'll run into all sorts of problems with volumes, or you'll resort to the tricks we resorted to, because we had gone too far down the line before we realized this. In our wheezy base we actually have that first line where, because we manage our users and groups with Puppet, we can guarantee the docker group will always be 918 everywhere, and you cannot run the Docker daemon if you don't belong to the docker group, and then we put the umask there. Sadly, none of the standard tricks for setting a umask in Linux work with Docker; there's an open issue where people are trying to convince the Docker team to expose a flag so you can set the umask. But you have to explicitly set the umask and do all these things so that, if the container crashes, the user that ran the thing can actually do a correct cleanup, because Jenkins will also run a git clean -dfx and that would fail. So that was very painful for us.
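As a shell sketch of the demo described above (the paths are illustrative), the sequence looks roughly like this:

    $ ls /tmp/demo                           # the directory is empty; /tmp/demo/b does not exist
    $ docker run -it -v /tmp/demo/b:/a ubuntu /bin/bash
    root@container:/# ls -d /a               # /a exists inside the container
    /a
    root@container:/# echo hello > /a/file   # and you can happily write into it
    root@container:/# exit
    $ ls -l /tmp/demo                        # Docker silently created b/ on the host, owned by root
    drwxr-xr-x 2 root root 4096 ... b
    $ ls -l /tmp/demo/b                      # and the file inside it is owned by root too
    -rw-r--r-- 1 root root 6 ... file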
Here's something else that surprised me, a feature I don't understand. Let me show you an example. Given this very simple compose file, where foo links to redis, you do a docker-compose run foo. Compose does the right thing here: it starts redis and then starts foo.

Now you do a docker-compose ps and you see both of them running. You do a docker-compose stop and you expect, well, what you'd expect: you expect it to stop both of them, right? But docker-compose will surprise you. It stops redis and does nothing else, so foo is still running. docker-compose ps knows foo is still running and knows redis is stopped. Then you say, okay, I want to be really harsh here, and do the forced removal; it removes just redis and ignores foo. You do a docker-compose ps and you still see that foo is still running. So you have to bring out the big sledgehammer, Docker itself, and forcibly remove everything. This was very surprising, and it took us a while to figure out; we kept getting failures saying container names had already been taken, things like that. Yeah, it took us a while to figure this out.

With every project there's always some very painful experience. These next ones don't qualify as surprising; they're painful because they cost us a while to figure out the solutions, or for some of them even to figure out what exactly was wrong. Sometime around March we ran into transient networking problems where, randomly, a build would either hang or fail with a TCP timeout. We'd open up the box, strace the process, and see it stuck trying to make a TCP connection; we would manually make the same connection and it would work fine, but the process itself couldn't make the connection, or would time out making it. This kept happening intermittently, with no predictable pattern. We Googled and didn't see anyone else running into this, except for one issue that was opened and almost immediately closed with "I can't sufficiently reproduce this problem and I haven't seen it again." Trying to trace back exactly where and how this started, we found it correlated with when we started trying to make things super fast by preloading our AMIs and VM images with Docker images, so the images already existed and jobs had less to do once they started running. Once we stopped preloading the AMIs with so many images and started busting our caches every day, the problem completely disappeared. Every day we do a docker rmi of all the images, delete everything Docker knows about, and then restart Docker; once we started busting our caches like that, all of this just went away.
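A sketch of that daily cache bust, assuming a cron-style cleanup; the exact commands we run may differ:

    # run once a day on each build host: blow away every cached container and image,
    # then restart the daemon so the next builds start from a clean slate
    docker ps -aq    | xargs --no-run-if-empty docker rm -f
    docker images -q | xargs --no-run-if-empty docker rmi -f
    service docker restart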
We also ran into random kernel problems. This was very annoying, because the crash dumps we were getting from the kernel crashes didn't correlate: we'd find Xen-related things and Docker-related things, but we couldn't really correlate them; you'd get one crash dump here today and a different crash dump there tomorrow. We ran different versions of the Xen host, different versions of Xen, and still saw the problems recur. As with almost every kernel problem I've faced in my career, the first thing you resort to is: let me run the most recent Linux kernel and see if that fixes it. So about two months ago we pulled down 3.18, which was around the most recent stable tag at the time, compiled it, and the problems just magically disappeared. So it's probably something that has been fixed in the mainline kernel.

Thank you for listening. Special thanks to my teammate there, my colleague, and a couple of people who aren't here; we worked on this together and saw it through. Questions?

Q: Hi, thank you for your presentation. Quick question: you mentioned you made a custom DSL. What was the process of evaluating writing the DSL versus patching docker-compose? A DSL adds another layer on top; I'm curious about the process.

A: Someone actually went off and spiked this in just one weekend. So we felt we could do it properly in a week or two if we invested the time. We had four people on the team, and we sacrificed a pair to work on this continuously for a week. After a week they came up with a proof of concept that was very interesting, and by the second week it was completely polished.

Q: You mentioned you were using AUFS, is that correct?

A: Yeah.

Q: Had you considered any other union filesystems, or was AUFS the only option?

A: Because this is not production, we chose the simplest thing to get it off the ground.

Q: Just as a follow-up, how did you deal with persistent volumes for things like databases? Is that something where you basically rebuild the database on every deploy?

A: We don't use persistent volumes. We put everything inside the image and push the image, because we want to do things once and reuse them: we push it so that humans or builds can refer to that MD5 sum tag to pull whatever they want. We also wanted builds to run anywhere, independent of the host, and using persistent volumes would have meant restricting where a build could run.

Q: You seem to be building a lot of Docker images. How do you handle data retention? How do you get all the images you no longer need cleaned up so the registry doesn't fill up like crazy?

A: Well, this is a discussion that crops up once in a while, and we have not solved this problem. It's in S3, and for now we've thought of S3 as an infinite pool of space until we really need to deal with it. We've kicked the can down the road; that's the simple answer.

Q: Infinite space, I like that. Any other questions? Okay, is this your first conference?

A: Yeah, this is my first conference.

Q: Give it up, yeah.