Hello everyone. My name is Megh Dutt and I manage container platform development at PayPal. Welcome to the talk. Before we dive in, a quick show of hands: who is familiar with Apache Mesos? Okay. And how about Docker Compose? Okay.

So today's topic is how to run container pods with Docker Compose on Apache Mesos. And that is really the goal of the talk: with both Apache Mesos and Docker treated as first-class citizens, how do you run container pods? By first class I mean that developers and operations use the Docker runtime, the Docker toolset, the volumes, the network plugins via libnetwork, and not just the Docker image with a different runtime like rkt underneath. You treat Docker as first class in your ecosystem, and you already have Apache Mesos running. In PayPal's case, we were running Mesos before containers came into the picture, running POSIX processes, and when containers came in we did not want to switch our cluster manager just because it lacked pod support for some time. The other aspect was that we wanted developers to run container pods locally on their laptop or desktop without worrying about cluster managers at all, and then, when they take the application to QA and production, to use the same specification in those environments, so there is no drift and no translation errors. The solution we built, and released into the open just last week, is a Docker Compose Mesos executor, and we will go through all the details in this talk.

So what exactly are pods? Loosely, a pod is a collection of Docker containers that you bundle together and schedule and deploy as a unit. You treat them as a single scaling unit, so every instance of the service is one copy of this collection of containers.
Now, two important things to keep in mind. First, the containers you bundle together as this unit will most likely share namespaces. The network namespace is the most common: one container in the pod owns the primary network interface and the IP associated with it, and the rest of the containers in the pod collapse into that same interface and IP. You can collapse the PID or IPC namespace as well, but network is the most common.

Second, this group of containers treated as a unit is capped by a common, top-level cgroup, so that the total resources consumed by the pod, whether memory, CPU, or disk, are limited. That ensures it does not steal resources from other pods running on the host. The individual containers in the pod may or may not be constrained; if unconstrained, they can fight each other for resources within the pod. That is up to the configuration. But the whole unit needs to be capped.

And this last point is very important: if you are using a cluster manager with separate containers and you use placement constraints just to make sure they all land on the same host, and you say "that is my pod", it isn't, because you are most likely not collapsing network namespaces and you are most definitely not capping all those containers under a common cgroup to limit resources. Co-location using constraints is not a pod.

So why are pods needed? The first use case is definitely true for PayPal: as we migrate our legacy workloads running in VMs into the container ecosystem, pods help with a lift-and-shift strategy. Say you have three processes or applications running in a VM; some of them might be using localhost communication, and you just don't know all the details.
So you model them in a pod, treat them as a unit, and bring it to the cluster manager. That actually buys you time to start refactoring; you can't go to microservices on day one. And as you bring these pods onto the same hosts, you may notice there is a common container across pods that you can refactor into a singleton system-service container. Those things happen organically, and you don't have to worry about all of it beforehand. So pods really help with the lift and shift of legacy workloads.

The second use case: microservices, or composite services built from containers, are well supported through pods. Whether it's the sidecar, ambassador, or adapter pattern, you can model it through pods, and it is a very common pattern.

Third, in traditional deployment environments you often have a pre-deployment step on the node you are deploying to, then the actual application deployment, then post-deployment steps. If you are running a pod, some of those tasks can be achieved through short-lived containers: data containers, or containers that run some steps and then go away. That can be cleanly modeled through a pod as well. Those are the three use cases I can think of for why pods are needed.

With that, a quick recap of Docker Compose and Mesos, and then we will dive into the solution. Compose has been in the community for a long time. It used to be called Fig; Docker acquired the team, and it was called Compose after that. It provides a specification for running containers through a YAML file, so you don't have to remember all the docker run options and pass them on the command line. On top of that it has ordering primitives and certain other semantics, and you can both specify and run everything through the Compose command line.
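As a tiny illustration of that last point, here is a docker run invocation and a roughly equivalent Compose file; the image and options are just placeholders for the example:

```yaml
# Roughly equivalent to:
#   docker run -d -p 8080:80 -e MODE=prod --name web nginx:1.11
version: "2.1"
services:
  web:
    image: nginx:1.11       # placeholder image
    ports:
      - "8080:80"           # host:container port mapping
    environment:
      - MODE=prod           # placeholder environment variable
```

With the file in place, `docker-compose up -d` starts the container, so the run options live in version control instead of in someone's shell history.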
Now, two things have happened lately. There is the 2.x Compose file format series; I believe 2.1 is the latest. That series strictly preserves all the local features, by which I mean Compose always interacted with a single Docker engine: you launch all the containers against one engine, most likely running locally on your laptop, though it could be in a remote location. The 3.x series is basically Compose with Docker Swarm, the cluster-management integration. As they introduced options like the deploy section, they took away certain parameters that only made sense in a local, pod-like environment. So 3.x actually deprecates certain features. Now, 2.x and 3.x are both alive, and I think they are working on collapsing them in some form, but I wanted to call that out.

So if you want to model pods through Docker Compose, let's see what the Compose tooling already gives us. You can define all the containers in that YAML file and they are bundled together; that is certainly realized in 2.x. If you have written your own libnetwork plugins or volume plugins, you can specify them in the Compose file, whether pre-created or created right through the Compose file itself. You can collapse network namespaces, or other namespaces; it doesn't happen by default, so in the Compose file you have to say "use the namespace of this other container", but you can do that for any combination of the containers defined in the file. We will see examples when we do the demo. You can also define the ordering of containers in the pod. There has long been a depends_on primitive in Compose, so you can say: start container B after container A has started, and sometimes you need that sequence.
But what used to happen is that if the process inside container B came up faster than the process inside container A, you effectively had the reverse order, and a lot of legacy applications broke. People injected various kinds of wait containers; there was nothing supported natively. In 2.x there is now support for container health checks, so you can say: start container B only when container A's health check has passed. So 2.x gives you much stricter ordering guarantees.

As I said, you can point to externally created volumes and networks. Multiple Compose files can be specified and are merged, so you can have a base file, a QA-environment file, and a production file, and merge those definitions. It's pretty powerful, and we definitely use that at PayPal. Last but not least, you can run a bunch of these pods through Compose on the same host without conflicts, by naming them differently with certain configurations.

One thing definitely missing here, going back to that point in the pod definition, is bringing all these containers under a common cgroup. Even if you use Compose, each of these containers gets its own top-level cgroup, so that part doesn't happen. And the collapsing of the namespaces, as I said, you have to request explicitly. We will revisit these points during the demo.

I wanted to do a recap of the Mesos architecture. Mesos is a cluster manager with a master/worker model. There is a leading Mesos master, and the other masters are just there for the quorum; they are passive standbys. You have different frameworks; it's a two-level scheduler, and we'll dive a little deeper into that. Marathon is a framework for services; similarly there is Aurora, and you can have Cassandra frameworks, Kafka frameworks.
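The health-check-based ordering described above can be expressed in a 2.1-format file roughly like this; the image names and the check command are placeholders:

```yaml
version: "2.1"
services:
  db:
    image: postgres:9.6
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]  # placeholder readiness probe
      interval: 5s
      timeout: 3s
      retries: 10
  web:
    image: mycompany/web:1.0        # hypothetical application image
    depends_on:
      db:
        condition: service_healthy  # start web only after db's health check passes
```

Multi-file merging works the same way from the command line: something like `docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d` merges the definitions, with later files overriding earlier ones.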
And the Mesos agent runs on every host where your actual workload runs. Mesos agents have a concept of a containerizer, which is Mesos's own definition of a container, and we will cover what that is. Now, the executor is your task lifecycle manager, you can think of it that way, and T1, T2 are the actual workloads being run. They could be Docker containers, KVM instances, unikernels, POSIX processes. In the whole Mesos ecosystem, the primitives are such that the frameworks, the masters, and the agent don't know what actual workload you are running. The executor is the one that implements the primitives, knows what workload is running, and maintains the lifecycle of those workloads. So Mesos is really workload-agnostic in that sense.

Okay, some of the key abstractions in the Mesos world. The agents, at the bottom, are where your application workloads run. Agents advertise their resources to the master: this is how much memory, CPU, and disk I have; you could have GPU too, you can configure what you want. The master in turn offers those resources to the frameworks, so you can have multiple frameworks competing for these resources for the different workloads they need to run. A framework then decides which offers best match its application workloads, and basically says to the master: I want to pick the resources offered from host one and host five, and by the way, here is the executor to launch the task with. The master takes those details, contacts those agents, and launches the task, saying: use that executor. Again, the master doesn't know what kind of workload is running; it just knows how many resources will be consumed, and the agent likewise. That is the set of abstractions in the Mesos model.
And different executors can coexist on the same host, running different workloads. So, to double-click on how we implemented an executor dedicated to Docker Compose workloads, the Docker Compose executor: if you look at the top box, that's the Mesos agent, and it uses the default Mesos containerizer. Now, the containerizer has a concept of isolators, and isolators define the top-level Mesos container I mentioned. In this case we use the cgroups memory and CPU isolators, which give you the bottom gray box: your parent container, created with a parent cgroup, in Mesos. Your executor is the main task running under this cgroup, so Mesos is essentially monitoring this parent task in the top-level parent cgroup. What the Compose executor then does is launch the actual pod, the collection of containers, under that parent cgroup, maintaining the hierarchy.

Going to the next point: this achieves the pod criterion that the containers coming up in the pod share a common cgroup. You can define cgroup CFS hard limits to throttle the maximum CPU, and you can set memory limits so that the entire pod gets OOM-killed if it exceeds them. Now, for the individual containers in the pod, Docker today gives you options, --cpus as of the latest versions, and --memory, to cap each container's resources. But even if you don't do that, you have a parent cgroup holding the resources for the entire pod. So even if one of the containers misbehaves and steals resources, the minute the pod goes over the parent cgroup's limits, it is taken care of by Mesos and the cgroup itself. That's what I said: individual containers will not be limited unless specified, but they cannot go over the parent.
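A rough sketch of what that looks like at the container level; the parent cgroup path and the per-container limits below are purely illustrative, and in practice the executor derives the real path from the Mesos task:

```yaml
version: "2.1"
services:
  web:
    image: mycompany/node-app:1.0    # hypothetical image
    cgroup_parent: /mesos/demo-task  # illustrative parent cgroup created by Mesos
    mem_limit: 256m                  # optional per-container memory cap
    cpu_quota: 25000                 # optional per-container CFS quota (25% of one CPU)
```

Even with `mem_limit` and `cpu_quota` omitted, the `cgroup_parent` line means every container in the pod stays bounded by the pod-level limits.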
And the last point: when you have a cgroup hierarchy, CPU is throttled automatically, but for memory you have to make sure the use_hierarchy flag is enabled; that ensures things stay in order.

Now we will get into the details of the Compose executor, how it works, and what features it brings by default. Going back to the programming model, as an executor it implements the callbacks that maintain the lifecycle of a pod. These are the primitives it implements against the Mesos agent so that the agent can launch a pod and kill a pod, in terms of the task primitives. By default, you can give it a series of Compose files, as we will see shortly in the demo. Then there is the cgroup addition: docker run takes a --cgroup-parent option, so the executor figures out your parent Mesos cgroup and adds it to the containers in the pod. It adds other important labels, like the executor ID and task ID, so you can run docker ps and filter on them if needed. It does a bunch of things like that. It collapses the network namespace by default across all containers in the pod, so localhost works for all the containers communicating with each other inside the pod, and they share the same IP.

There is also a pod health-check monitor that gets launched. This is not only monitoring whether one of the containers in the pod crashed; in that case the default behavior is to kill the pod completely, because we don't try to judge whether it was an important container or just a sidecar. But with the addition of Docker health checks, it often happens that your container is up while its health check is failing; it is in a bad state. We detect that and also kill the pod, because when you are running under a cluster manager, it's better to kill it and have a fresh pod come up. And it supports running multiple Compose files.
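For instance, the labels injected into the pod's containers might look like this in the generated file; the ID values here are made up, and the real ones come from Mesos at launch time:

```yaml
version: "2.1"
services:
  web:
    image: mycompany/node-app:1.0      # hypothetical image
    labels:
      - "executor_id=dce-demo-0"       # illustrative value; real executor ID from Mesos
      - "task_id=demo-task-1"          # illustrative value; real task ID from Mesos
```

On the host you could then run something like `docker ps --filter "label=task_id=demo-task-1"` to list just that pod's containers.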
The project also defines, and this is outside the executor itself, a Mesos module, and we will cover what Mesos modules are, to prevent container pod leaks in case the executor crashes. Because remember, the executor is managing the lifecycle of the pod; if it crashes, your containers would keep running, so the module helps clean those up, and we will see that in the demo as well.

DCE-Go, I did not explain the name: Docker Compose Executor, implemented in Go. There was an earlier version in Java, and of course the JVM drags down performance if you're running 40 pods on a large node. So it's in Go, and a lot of features have been implemented in this specific version. What exactly plugins are, we'll cover on the next slide. And last but not least, any existing Mesos framework, whether it's Marathon or Aurora or Singularity, whatever you have for running services, can just plug and play without any changes in the framework.

Okay, with that, let me check the time. So what exactly are plugins? Plugins provide a way to easily extend the inner workings of the Docker Compose executor; think of plugins as customizing the behavior as you launch the pod. We have hooks before and after the LaunchTask primitive, and similarly before and after KillTask. So at runtime you might decide which pre-existing network to configure the pod with. Remember, the idea is that the developer ran the pod locally with bridge or host networking and is submitting the same pod files in the production or QA environment, and at runtime you decide: I'm using a container networking solution, here are the network details, here are maybe some volumes you are mounting, certain runtime labels you are injecting.
So you can do all that massaging before the actual pod is launched, and you can even acquire resources and free up resources. That is how plugins help you customize behavior. And you can have multiple plugins, chained in order, so different teams can contribute.

We do provide a default plugin. It adds labels like the Mesos task ID and executor ID to every container in the pod, so that if you are on the host using docker ps --filter and want to get to the containers of a certain Mesos task quickly, you can. By default it adds the Mesos parent task cgroup to all the containers. And, as we saw before, Compose doesn't collapse the network namespaces by default, so what happens in this case is that we create a hidden infrastructure container in the pod, you can define it in the plugin, which creates the main network interface and gets the IP, and all the other containers in the pod collapse onto this infrastructure container. This is similar to how Kubernetes pods work: they create an infrastructure container too, you just don't see it.

Okay, with that, what exactly are Mesos modules? We specifically use a module type called a hook. Just as plugins extend the Docker Compose executor's default behavior, and one thing about plugins is that they are all modules compiled into the executor, Mesos modules are actually shared libraries, dynamically loaded into the master or the agent, that extend their internal functionality. Very, very important: Mesos modules have to be built against the version of Mesos you run in the cluster. In the project we've put out in the open, you have full instructions for building modules, and we provide a Docker container that helps build the Mesos module against any Mesos version you want.
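The network-namespace collapse the default plugin performs can be sketched as a generated file like the one below. A "networkproxy" service, standing in for the infrastructure container, owns the interface and the aggregated ports, and every other container joins its namespace; all names and images here are illustrative:

```yaml
version: "2.1"
services:
  networkproxy:
    image: busybox                 # stand-in for the infrastructure container
    command: sleep 86400           # just holds the network namespace open
    ports:
      - "8080:80"                  # ports from all pod containers aggregate here
  web:
    image: mycompany/node-app:1.0           # hypothetical image
    network_mode: "service:networkproxy"    # join the infrastructure container's netns
  nginx:
    image: nginx:1.11
    network_mode: "service:networkproxy"    # same IP; containers reach each other on localhost
```

Note that once a service uses `network_mode: "service:..."` it cannot publish ports of its own, which is why the port mappings have to move onto the namespace owner.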
We have documented the good practices for building modules, because that is not documented very well elsewhere. There are different classes of modules: there are allocator modules, which can swap out Mesos's default DRF allocation algorithm on the scheduling side; there are isolator modules, which we saw on that slide with the different isolators inside a containerizer; and then there is a module type called hooks. Hooks are basically a way to tie into events, a task has been launched, a task has been killed, and when these things happen, the hook endpoints are called by the Mesos agent or the master. We very specifically implement one of those events, the executor-removal event, and we will see this in the demo: when an executor crashes, the Mesos agent still detects that the executor exited, and you get an additional way to intercept that and check whether the pods were really cleaned up or were leaked. That hook is guaranteed to get called.

Okay, so how does the current ecosystem around pods look? As of last week, let's put it that way, because every cluster manager is moving very fast. Docker Swarm, as of 1.2.6, the latest version I think (it will move to the 17.x versioning as well), does not support local pods; there is no native support for them. What you have is the Docker Compose 3.x format: you can launch the set of containers into Docker Swarm, they will land on multiple hosts in the cluster, and by default they get an overlay network so all these containers can talk to each other, but they are not treated as a single unit to be co-located, sharing cgroups or namespaces. So it is not a pod. I think they are working on something, so in the future that should help.
Kubernetes has excellent support for pods; in fact, they are the ones who coined the term. But they don't treat Docker as first class. By that I mean: they have different volume specs when you want to do storage integrations; they don't follow libnetwork, they have their own spec for network integration, CNI. And if you look at the CRI runtime, which to date has called into the Docker engine, it is actually going to switch to containerd, there is an active project, because containerd has been donated to the CNCF, and they will skip the Docker engine entirely to launch Kubernetes pods; they won't need the Docker runtime in the future. And of course their pod spec is different from the Compose spec, and not all Docker commands work in a Kubernetes cluster; as I said, they set up the networking differently, and they provide equivalent commands, but not all Docker commands will work. So pretty much the image is the common minimum, but it's a different environment. Nothing bad about it, but that's the direction Kubernetes is heading.

And Mesos, for a long, long time, natively supported running Docker containers through a Docker containerizer, but it would only spawn one container per task, so you could never run a collection of containers as one task. In 1.1 they added pod support through experimental task groups and nested containers, but this is not Docker-specific: they have again taken the primitive approach of "a collection of tasks which you can model as a pod", where the tasks could be containers, POSIX processes, or something else, and obviously that specification is different from something like the Compose spec. They are also going towards a universal containerizer approach, to swap out the Docker runtime, saying: we will implement all the different isolators, whether it's a volume isolator or a network isolator, to create a container runtime that can consume a Docker image, but it's
not the Docker runtime. However, as I mentioned in the last point, Mesos still continues to be the most flexible cluster manager out there, because it can run literally any sort of workload: unikernels, containers, KVMs, POSIX processes. It's not tied to containers, and that is why we could build something like the Docker Compose executor on Mesos.

With that, I will switch to the demo to see all of this in action. I have a Vagrant setup and everything runs locally, because I did not want to trust the Wi-Fi connection; even the Docker registry is running locally in Vagrant. So let's take a quick look at what we have. We have the Mesos master view, showing no tasks running in the cluster right now. We have Aurora, a framework in Mesos to run services or long-running tasks, and we have Marathon, another popular, competing framework for launching long-running tasks. With that, first we are going to launch a workload in Aurora. Aurora comes with a CLI client; we wrote a Thrift Go library against it, which is open-sourced as well. We will also launch tasks in Marathon in parallel, so we have things up and running as we go.

So let's go here, to Marathon, create an application in JSON mode. What we are trying to do is launch a pod. We have given it 0.5 CPU and 64 MB of memory, said we need three dynamic ports, specified the executor, and then the URIs, which are the standard way in Mesos to say which resource artifacts you want the Mesos agent to download before starting the workload. In this case the app bundle has the Docker Compose files inside it, along with the application. With that, we launch it. Now, if we go to Mesos, we see the workload started by Aurora has already launched, so let's go to the sandbox to see what else is happening there. There is a folder
here. You see the docker-compose YAML file; let's say that is the file the developer tested in their development environment. If you open it, this is actually pure Compose: you mention the version at the top (is the font okay?). You can see there is a Node.js service and an nginx service, and in this case nginx is doing the SSL termination for Node; there's a reason for that, Node is not very efficient at it. In this local environment they are using bridge networking, and they are even hard-coding which host ports they want, as we see from this example. And they define a health check for the web (Node.js) container. They can also have a different Compose file; Compose merges those files.

Now, when the Compose executor launched this, it first created the infrastructure container. The infrastructure container is the hidden container; it is defined as a service there, and the first thing you see is that it gathered all the ports defined across the different Compose files onto the network container itself, because for Docker, the primary container that sets up the interface has to own the ports as well. So it does that. In cgroup_parent, you see we specify the parent Mesos cgroup, so that is there as well, and labels like the executor ID and task ID, those Mesos labels, have been added. We are using bridge networking in this case. Now if I look at the generated docker-compose file, we generated this file from the base file, the ports sections have vanished, as you see, because the ports have now landed on the network proxy, and here is network_mode, which now collapses into the network namespace of the network proxy. That is how things look as we collapse namespaces.

And if we go back here, we see Marathon has also started its task. Now, one thing we will look at here, to see the plugin in action, is the same example which
we launched: if I open the generated docker-compose file, an additional test label has been attached to the container, and that is what the plugin implemented. The plugin definitions are here, so let me open the config. It says the general plugin is bundled with the executor by default, but you can define other plugins; the code is already compiled into the executor, and it activates your plugins as you configure them.

Now we will do two quick things. If I run docker ps, we see the pod Marathon launched two minutes ago, and the set of containers Aurora launched four minutes ago. What I am going to do is kill the Node.js container in the pod run by Marathon, as if simulating a container crash. If you go here to Mesos, open the sandbox, and open the logs, the logs show you exactly what is happening in the executor. You can see it detected that the pod failed, and it is signaling the Mesos agent that it is killing the entire pod and the task should be considered failed. When Mesos gets that, it goes and replaces the task; if you look, it is running just now, because Mesos got the signal, the pod was killed, and the pod has been restarted. If I do docker ps here, it has just come up, 26 seconds ago.

One last example I want to show is a rare, rare scenario where the module comes into the picture. The first executor is the one Marathon is running; the second is what Aurora is running. Imagine you are upgrading your Mesos agent: while your Mesos agent is not running, the tasks continue to run in this world. So I am going to stop the agent, and then kill the executor; let me see, 3, 1, 2, 0. Your executor has basically crashed. So if I do docker ps,
everything is still running. Your containers are supposed to keep running while the agent is down, because checkpointing is enabled; but with the executor crashed, your containers are now leaking, even though the containers were being managed by the Marathon task. So now, let's say maintenance has completed and the agent is coming back up. If we do a quick docker ps, you see the containers have basically vanished, because the agent, as it started up, used the checkpointing feature in Mesos to figure out which executors are supposed to be running and which are not, and for the exited executor it fired the hook we intercept, which cleaned up the containers the executor could not clean up itself. And if I go back to the Mesos dashboard, you can see the containers are now being respawned again, because they were killed.

And in the true sense, if you want to kill the entire pod because you really want it gone, I don't want to create applications anymore, I can just destroy the application, which calls the lifecycle-management methods in the executor to kill the pod, and I can blow both of them away. That is pretty much the end of the demo; you can do a kill in Aurora as well, and that should do it. It takes a while for Mesos to get the signal; one of the tasks has already been killed, and the second will be killed shortly.

So that was pretty much the demo. The executor is out there in the open, the first link; the second link is the Java executor, which is deprecated. And I took the architecture diagrams from this Mesos talk, didn't want to reinvent the wheel here. Quick recap: you have Mesos as your cluster manager of choice, you love the Docker engine and all the Docker tooling with it and don't want to let go of that, and you want to solve this problem at scale. That is the recap of the talk. The engineers working on this project, Kumar and Mia, are here as well, so
we can take any questions after this, or I can take questions at this point. Yes?

[Audience question about Mesos task groups, partly inaudible.] Right, so one thing about the task-groups and nested-container feature is, first, you have to make changes in the framework. Marathon has started adding, I think has added, the feature, but if you are using Aurora or Singularity, to even consume that feature you first have to change the framework, whereas with the approach we have taken you can just plug and play with the frameworks. The second thing is, we really wanted developers not to care about cluster managers as they develop in their local environment: let them define the pod specs and so on, and then take the exact same spec file and run it under a cluster manager. You can't do that, because, as we said, Kubernetes and the task groups each have a different way to define the same containers. Of course we could write translators and things like that, so we will just wait and watch how things evolve.

[Audience question about Docker engine stability, partly inaudible.] Yeah, so one thing is, if you have a cluster manager behind the scenes, some of those issues often get masked, because if the engine crashes it does, by default, take the containers down with it, unless you are using the live-restore flag, and the cluster will just replace the running tasks on another healthy node. If you don't have a cluster manager behind the scenes, then yes, you have the problem that your engine is in a bad state. But we run this at large scale, especially with the latest versions of Docker, and we haven't seen the engine-stability issues of the past, which led a lot of people to say they don't need the Docker runtime; we haven't faced that. We have run this executor at scale in public clouds as well, with up to around 50,000 of these pods running, and haven't really faced any issues.

[Audience question about pods versus distributed services, partly inaudible.] Yeah, so that's the common pod-versus-application-group question. If you are running, let's say, a web tier and a
Redis, and you say, cool, that's distributed, that is not a pod; you continue to run those separately. Your pod is: I have a main application container, but I need to run nginx next to it to do my SSL termination, or I have a sidecar container doing certain work that is not the main application container's job, but that the application still needs to run. And there is the ambassador pattern, there are adapter patterns where you are shipping monitoring logs. It is the application itself broken into components, not the distributed nature of the system.

[Follow-up question.] No, so: a single service broken into smaller parts, that is what a pod is. If your system needs service A, service B, service C, service D, those are genuinely different things, not co-located together; but service A is co-located together with its own components. For example, the example we showed: at PayPal, when we run Node.js, we run nginx alongside for, say, SSL termination, exactly like that. A lot of people use Prometheus for, say, monitoring and stats: your application container might be doing legacy logging of messages, which a sidecar container can consume and then send to Prometheus, so it's like an adapter container running with your application container. Similarly, you can have ambassador containers, where you don't want your main endpoint to be Redis directly but run some sort of proxy, so that you can switch from Redis to memcached or something like that. And you can run service-discovery sidecar containers. It is a very, very common pattern; in the enterprise, 90% of the time you will have a situation where you need sidecar containers. There was actually a great talk by Brendan Burns at DockerCon 2015 about composite containers, where he goes through all the patterns of a pod.

Sounds good. I think we'll give five minutes back. Thank you.