Welcome to MesosCon Asia — great to see such a good crowd. It's the first MesosCon in Asia, as you guys know, so it's awesome that there's a huge turnout. I'm hoping it's going to be even bigger in the next few years. Or as Trump would say, it's going to be huge. Let's hope so.

A little bit about us. Myself, I'm Vinod Kone, and this is Jie Yu. Both of us have been long-time Mesos committers, as you can see. And coincidentally, we also have very similar trajectories in our professional lives: we did our PhDs, then went to Twitter, and then came to Mesosphere.

I wanted to start this talk off with the history of how we did containers up to this point in Mesos. As you probably know, Mesos has been around for a while, and we've had some form of support for containers for a very long time. In fact, as far back as 0.10, we had rudimentary cgroups support for CPU and memory, and we were able to isolate tasks and executors. Then we added pluggable isolators, which was a pretty big change to the Mesos architecture, so that you can add your own organization-specific isolation techniques. Then we added support for Docker containers when Docker became a big thing in the DevOps community: first we delegated most of the work to the Docker daemon, and later we added support for running Docker images without the Docker daemon. Continuing with that innovation, what we are talking about today is a new type of containerization in Mesos called nested containers. Most of you probably know this from Kubernetes as a pod, but ours is a lot more generic than that, and we'll go into the details.

Before going too deep, I want to take a step back and set the context of how things work in Mesos. If you're familiar with the Mesos architecture, a typical Mesos cluster looks like this: you have a bunch of masters, typically three to five, and then the rest of your cluster runs these processes called Mesos agents.
The agents connect to the master, and the master gives offers to these components called schedulers. Schedulers are the components that schedule work on the nodes — whether that work is a thread, a process, or a container is up to the scheduler.

So what happens when a scheduler tries to schedule some work onto a node? Let's zoom into the agent node to see what components run there and what primitives we have. Mesos has always had these primitives of executors and tasks. The way a scheduler schedules work is it sends this thing called a task, which is typically associated with an executor, and the executor runs the task as it sees fit. It's a contract between the executor and the scheduler on how to run a task. An executor was always able to run multiple tasks, and an agent, of course, could run multiple executors. But the key thing to note here is that executors map one-to-one to containers: each executor and all of its tasks run inside one single container.

So what's the issue with this? What are the limitations of this model? The first limitation is that it's hard to manage a group of tasks as a single unit. Even though you can send multiple tasks to the executor, it's not easy to give all-or-nothing semantics. The second limitation is that since executor and container are one-to-one, you could only have one image for the whole executor and task group. If you wanted different tasks to use different Docker images, for example, that wasn't possible. And because all the tasks run within the same container, they all share the same cgroup, which meant you could not isolate the tasks within an executor from each other. If you wanted isolation between tasks, you had to run single-task executors; that was the only way to get isolation.
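To make the classic model concrete, here is a rough sketch of the kind of launch operation a scheduler would send. The field shapes loosely follow the Mesos v1 scheduler API but are heavily simplified for illustration, and the executor ID and task names are made up; the point is just that every task points at the same executor, and therefore ends up in the same container.

```python
# Simplified, illustrative shape of a classic LAUNCH offer operation.
# In this model, one executor and all of its tasks land in a single
# container, so they share one image and one cgroup.

def make_launch(executor_id, task_names):
    """Build a simplified LAUNCH operation for tasks sharing one executor."""
    executor = {"executor_id": {"value": executor_id}}
    return {
        "type": "LAUNCH",
        "launch": {
            "task_infos": [
                {
                    "name": name,
                    "task_id": {"value": f"{executor_id}.{name}"},
                    "executor": executor,  # every task points at the same executor
                }
                for name in task_names
            ]
        },
    }

op = make_launch("web-executor", ["serve", "metrics"])
```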
These limitations were getting in the way of use cases we wanted to solve, so we wanted to come up with a new primitive that addresses them. This new primitive is a Mesos pod, if you will. The idea is very simple: we wanted a primitive that lets us co-schedule and co-manage a group of containers as a single atomic unit. We wanted these containers, or tasks, to be able to share some resources, like a network or a volume, so that it's easy to coordinate work between them. But at the same time, we want them not to share other resources, like their images: we probably want some tasks to use a different image than other tasks in the same executor. More importantly — and this is where this is more generic than a Kubernetes pod — we want this group of containers to be dynamically updatable. Whether it's growing or shrinking the pod or adding containers to it, we wanted to be able to do that with a Mesos pod. And we also wanted hierarchical isolation, which means not just two levels like a Kubernetes pod, where there is a pod and all the containers sit at one level, but arbitrary levels of nesting.

So what do we get? What use cases can we actually solve with this kind of primitive? Let's look at a few use cases that are really important in today's world. The first is the sidecar pattern that you might have seen in microservices architectures. The idea here is simple: you have a main container, let's say memcached, and you want a sidecar container which extends the functionality of the main container. For example, you could have a log-saver container that collects logs from memcached and ships them to some remote location.
The nice thing is that this log-saver container could be reused for lots of different main applications, whether it's memcached, or Redis, or your web app, or whatever, and they might be managed by different teams. That's the great thing about the sidecar pattern.

The other use case we wanted to address is the notion of a transient container. This is for the case where you do not know upfront all the containers that you want to launch in your pod. For example, you probably want to launch the backup container for your Cassandra sometime later, on a cron schedule, or maybe it's user-triggered: a user clicks a button and you want to start a backup. But you still want the backup to share some of the same namespaces as the Cassandra container so that it's easy to communicate and get data out.

The last use case we wanted to address is running something like Kubernetes or Jenkins on Mesos and making that really easy. For a kubelet to run as a Mesos task — since the kubelet runs its pods at a level below itself in terms of cgroups — you want arbitrary levels of nesting for this to work nicely in the Mesos ecosystem. So we wanted to be able to solve this use case as well.

So what are the primitives we came up with? There are two essential primitives that we designed for Mesos 1.1, called task groups and nested containers. I'm going to briefly talk about task groups, and then I'll let Jie talk about nested containers in detail. The first primitive we introduced is called a task group, and this is simply a collection of tasks. The main idea is that this task group is atomically delivered to the executor. At any point in its life cycle, if one of the tasks gets killed for whatever reason, the whole task group gets shut down. For this we added a new offer operation called LAUNCH_GROUP: much like we had LAUNCH before to launch a task, now we have LAUNCH_GROUP.
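A rough sketch of what a LAUNCH_GROUP operation carries. The field shapes loosely follow the Mesos v1 scheduler API but are simplified for illustration, and the executor ID and shell commands here are invented: the essential difference from the classic LAUNCH is that the tasks travel together in a TaskGroupInfo, delivered to the executor as one unit.

```python
# Simplified, illustrative shape of the new LAUNCH_GROUP offer operation:
# one executor plus a task group that is delivered atomically.

def make_launch_group(executor_id, tasks):
    """Build a simplified LAUNCH_GROUP operation from (name, shell command) pairs."""
    return {
        "type": "LAUNCH_GROUP",
        "launch_group": {
            "executor": {"executor_id": {"value": executor_id}},
            "task_group": {
                "tasks": [
                    {
                        "name": name,
                        "task_id": {"value": f"{executor_id}.{name}"},
                        "command": {"value": cmd},
                    }
                    for name, cmd in tasks
                ]
            },
        },
    }

op = make_launch_group("pod-executor", [
    ("producer", "while true; do date > /shared/out.$RANDOM; sleep 1; done"),
    ("consumer", "while true; do cat /shared/*; sleep 1; done"),
])
```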
And similarly, on the executor side, we have a launch group event that executors can use to receive this group of tasks and run them as containers. So next up, I'll hand it over to Jie to talk about nested containers.

Hi everyone. As Vinod mentioned, we added two primitives: the task group and the nested container. I'll focus on the nested container primitive. Vinod already covered its main properties: it supports arbitrary levels of nesting; we can reuse all the existing isolators; and, more importantly, we can create nested containers dynamically, rather than requiring the user to specify how many containers there are before launching the pod. To support this, we added an agent API — an API on the agent that lets you dynamically launch, wait on, and kill a nested container.

Let's look at an example. Say you have a container already running, with an executor running inside it. When the executor needs to create a nested container, it calls the agent API for launching a nested container, and the agent creates that nested container underneath the top-level container. But there is still a question: once you've created this nested container, you need to wait for it to finish and to know whether it exited normally. So you call another API, wait, which blocks until the nested container terminates and then returns its final exit status. Based on that, the executor can decide whether to restart the container or kill the whole pod — that decision is left to the caller. Finally, sometimes when your main container finishes, you want to kill another container in the pod. You can do that too: there's an agent API called kill that kills the given nested container.

Let me emphasize why we support multiple levels of nesting. One use case is debugging: we treat a debug container as just another nested container, placed underneath the container you want to debug. For example, an operator says, "I want to debug this nginx nested container." What they can do is use the same API to launch a nested container underneath the nginx container. That debug container can bundle its own tools, such as GDB, and debug the nginx container. Its creation and destruction are no different from any other nested container's, which simplifies the whole design.

As for the semantics of nested containers: we did not prescribe specific semantics; the semantics are defined by the isolators. If you write your own isolator, you can make your own extensions — for example, whether nested containers share namespaces or cgroups is decided by the isolators. What our default isolators currently do is share the network and UTS namespaces, but not the mount namespace, because each container in a pod may have a different Docker image.
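A sketch of the three agent calls described above. The call type names (LAUNCH_NESTED_CONTAINER, WAIT_NESTED_CONTAINER, KILL_NESTED_CONTAINER) come from the Mesos agent v1 API, but the payload shapes here are simplified, and the parent container ID and command are made up; in a real executor these JSON bodies would be POSTed to the agent's API endpoint.

```python
import uuid

# Simplified bodies for the three nested-container agent calls.
def nested_container_calls(parent_container_id, command):
    """Build launch/wait/kill call bodies for one nested container."""
    # The child ContainerID carries a `parent` field -- that is what nests
    # it underneath the executor's top-level container.
    child = {"value": str(uuid.uuid4()), "parent": {"value": parent_container_id}}
    launch = {
        "type": "LAUNCH_NESTED_CONTAINER",
        "launch_nested_container": {
            "container_id": child,
            "command": {"value": command},
        },
    }
    wait = {  # blocks until the nested container exits, then returns its status
        "type": "WAIT_NESTED_CONTAINER",
        "wait_nested_container": {"container_id": child},
    }
    kill = {
        "type": "KILL_NESTED_CONTAINER",
        "kill_nested_container": {"container_id": child},
    }
    return launch, wait, kill

launch, wait, kill = nested_container_calls("top-level-container", "gdb -p 1")
```

Launching a debug container under an existing nested container would look the same: the debug container's ID simply names the nginx container as its parent.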
The cgroup is currently also shared, but our plan is to soon add the option not to share it, so that resources can be isolated between the containers within a pod.

For all of this, we also added a new default executor to replace the old command executor, because the command executor can only handle a single task, while the default executor can handle task groups. It uses the new v1 HTTP API — so if you haven't used v1 yet, you might consider it; it should be quite stable by now. Our plan is to eventually retire the command executor. The default executor's current restart policy is this: when any container in the pod exits with a nonzero status, it kills the whole pod. That's a very simple restart policy for now, but in the future we can add richer restart policies; the framework will tell the executor what its restart policy is, and the executor will implement it.

Lastly, I want to do a demo of the nested container and task group functionality. On my laptop I have DC/OS set up in a box, and I'll use DC/OS for the demo. First I create a service; I'll switch directly to JSON mode since that's simpler. What this task does is very simple. It's called producer-consumer, and the pod has two containers: a producer and a consumer. The producer does something very simple: the producer and consumer share a volume, and the producer keeps writing files into that volume, one file per second, where each file contains the current time. What the consumer does is keep reading the files in that shared volume. The notable thing is that the two containers use different Docker images — one uses alpine and the other uses busybox — and none of this uses the Docker daemon; all of these Docker images are launched through the unified containerizer.

Let's deploy it. You can see it here in the UI; I set instances to one, and you can see the whole pod instance. If you click on the task, you can see the producer and the consumer: this is a pod, and these are the containers inside the pod. You can click into a particular container to see its status. You can also look at this in the Mesos UI, where you can see a consumer and a producer, both running. Then you can go into the sandbox to look at the state. Let's go into the consumer and look at its output: you can see the consumer keeps printing the files in the shared volume, each containing the current time. So this demo shows how to share a volume, which is one reason people want to use pods.

The other demo is networking-related: a client and a server. This is also very simple — one server and one client in a pod, communicating over localhost. The server starts with nc, and whenever there's a connection, it prints "MesosCon Asia" back to the client; it uses the busybox Docker image. The client is very simple: it just keeps connecting to the server over localhost.
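A local simulation of what the demo's client/server pod does — no Mesos, no busybox, no nc, just plain Python sockets on this machine. Because the pod's containers share the network namespace, the two sides really do talk over localhost like this; the reply text follows the demo.

```python
import socket
import threading

# Server side: answer every connection with "MesosCon Asia", like the
# demo's nc-based server container.
def server(listener):
    while True:
        conn, _ = listener.accept()
        conn.sendall(b"MesosCon Asia")
        conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))   # pick any free port on loopback
listener.listen()
port = listener.getsockname()[1]
threading.Thread(target=server, args=(listener,), daemon=True).start()

# Client side: connect over localhost and read the reply, like the
# demo's busybox client container does on a timer.
def client_once():
    with socket.create_connection(("127.0.0.1", port)) as c:
        return c.recv(64).decode()

reply = client_once()
```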
The client connects on a timer, and it also uses busybox; all of these are Docker containers launched through the unified containerizer. One more interesting thing: this also has IP-per-container, using the overlay solution in DC/OS, so you can put the container on an overlay network and get an IP per container. Let me launch this one... there you go. You can see the pod start, and the client and server start. You can also go over to the Mesos UI and take a look: if you click on the client and look at its stdout, you'll see it keeps printing "MesosCon Asia", which means the two containers keep communicating with each other over localhost. That's about it for my simple demo. Thank you, everyone.

I guess that's all we had. Any questions? You can chat with us after the talk. Do we have time for questions? No? So let's take questions in the hallway. Cool, thanks guys.