My name is Aaron Wood. I work for Verizon Labs. Tim Hansen, senior software engineer, also at Verizon Labs. This is a presentation about our fault-tolerant framework. We wrote our own scheduler built right on top of the open-source protobufs for Mesos, and we're going to talk a bit about our scheduler and framework, as well as how we integrated CNI without Docker.

In the very beginning, we had kind of an easy choice to make, I would say. A lot of the frameworks I've seen so far are still using the older V0 API, which requires bindings and has a bit of a different architecture underneath for how the framework communicates with Mesos and how Mesos communicates back. So as you can see, it was pretty easy to pick the V1 API, because it's just a simple event stream that we can subscribe to. We can compress it. We don't need any bindings, so we could pick any language we want. I mean, technically you could do the same with the V0 API, you'd just have to write your own bindings. But for us this was really easy because we could just pick it up and go. We decided to stick with the protobuf payloads instead of supporting both protobuf and JSON. Internally we didn't really have a use case to support both, so we just did V1 with the compressed protobuf payloads to be the most efficient.

We use Go as our language of choice for our framework and our SDK. Verizon Labs is really big on Go. A lot of the new development going on there is using Go, so we have a vested interest in it. A lot of things started in it, and for a lot of the people who were brought in it was a new language they learned from the ground up at Verizon Labs. Not everyone, but for a lot of people it was new. What Go gives us specifically is increased developer speed. It's a higher-level language, but it's still really fast and doesn't take up much memory. And since a scheduler in Mesos is mostly just I/O — you're communicating with Mesos and coordinating tasks, resources, and scheduling — we don't really need anything lower-level; we're not crunching numbers. So this was perfect for us. The concurrency primitives in Go are great, so it makes it really easy to do threading and concurrency and all that good stuff. And we get single binaries in the end for both our scheduler and executor, which makes them really easy to deploy. If for some reason we ever wanted to deploy these in a container, we could just take a scratch or BusyBox or Alpine image, which is like two to five megabytes, drop the binary in, and deploy it that way.

Let's talk about the SDK. As we started our development on the scheduler, we noticed a lot of common functionality, so we said, hey, we should just make an SDK for Go. There is another Go SDK out there, written by a fellow from Mesosphere — blanking on his name right now — but a lot of what they had done was customize their protobufs. So it was hard to take the protobuf definitions we had from Mesos and just compile them; what we ended up with were the custom protobufs they had utilized. It wasn't easy to carry over the protobuf definitions from the open-source version compared to the one we had found online. So we ended up building our own SDK, because any time there's a version update and the protobufs change, we want to just run protoc, generate the Go bindings, and go, right? We're done with it. It's a lot easier.
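To make that a bit more concrete, here's a rough sketch of what subscribing to the V1 scheduler API looks like from Go. This is a minimal illustration, not our SDK's actual code: the endpoint and headers are the standard /api/v1/scheduler ones, but the `mesos` and `scheduler` packages stand in for whatever bindings you generate with protoc, so the exact type and field names will depend on your setup.

```go
// Minimal sketch of a V1 scheduler SUBSCRIBE call (not the SDK's actual code).
// Assumes protoc-generated bindings in hypothetical packages "mesos" and "scheduler".
package main

import (
	"bytes"
	"log"
	"net/http"

	"github.com/golang/protobuf/proto"

	mesos "example.com/yourproject/gen/mesos"         // hypothetical generated package
	scheduler "example.com/yourproject/gen/scheduler" // hypothetical generated package
)

func subscribe(master string, info *mesos.FrameworkInfo) (*http.Response, error) {
	call := &scheduler.Call{
		Type:      scheduler.Call_SUBSCRIBE.Enum(),
		Subscribe: &scheduler.Call_Subscribe{FrameworkInfo: info},
	}
	body, err := proto.Marshal(call)
	if err != nil {
		return nil, err
	}

	req, err := http.NewRequest("POST", "http://"+master+"/api/v1/scheduler", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	// Protobuf in both directions; you could ask for JSON instead, we just never needed to.
	req.Header.Set("Content-Type", "application/x-protobuf")
	req.Header.Set("Accept", "application/x-protobuf")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	// The response body stays open: it is the RecordIO event stream you keep reading
	// from. The Mesos-Stream-Id header comes back here and has to be echoed on every
	// later call (ACCEPT, DECLINE, and so on).
	log.Printf("stream id: %s", resp.Header.Get("Mesos-Stream-Id"))
	return resp, nil
}
```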
Some of the common patterns we had: the scheduler framework itself and what that looks like, task lifecycle management, and lifecycle management of the offers and resources coming in from the cluster. Also the RecordIO event decoder. If anybody's familiar with writing frameworks: when you connect and subscribe to the master, it gives you back a stream in a format called RecordIO — it tells you how many bytes you have and then gives you a payload message. All of that decoding is done in the SDK. We also have an abstraction for persistent storage backends. We use etcd, but we didn't want to make that decision for anybody else, so if you want to use ZooKeeper or anything else, you can choose your own backend. And we have our own protobufs as well that we've defined for the scheduler and executor.

So containers are just containers; there's nothing really too special about them. Part of the story is that we started out using Docker with Marathon. It worked fine for most of our use cases, but we ran into a lot of issues once we started optimizing things and wanting additional features that Docker didn't give us, like an additional network interface — which is where CNI comes in. So we had a real big push to go towards the UCR and also to get away from vendor lock-in. We didn't want to be stuck with Docker; we wanted to be able to use runc or any other container runtime.

Some of the reasons we went towards the UCR: again, like I said, we didn't need Docker and all its dependencies to be installed. Operationally, when we had customers using our cluster, we noticed a lot that the Docker daemon would get stuck or crash, or systemd would cause some issue that propagated up and wedged the Docker daemon, and that was causing a lot of pain for our customers. With the UCR we haven't had as many issues in terms of bugs in production. There's a reduced attack surface, and you can be very specific about exactly what you want to run. There's also no need to manage secrets the old way. A little while ago — I can't remember which Mesos release it was — you had to use kind of a hack: use the fetcher to grab the Docker config JSON with the encoded registry credentials, load it into your container, and use that to authenticate to a registry and pull down your image. Now you can just send it right in the protobuf. And we also have OCI support now. Docker has that as well, but the UCR has broader support for container image specifications compared to Docker.

The UCR isn't without its faults, though. We hit a lot of issues on any version of Mesos less than 1.2, so if anyone's looking to use the UCR, I'd recommend sticking with 1.2 and up. I've put up some of the JIRAs here; if anyone's interested in looking them up, feel free. Generally the bottom section describes a bunch of issues around the overlay handling and whiteout files — the whiteouts not being applied properly — and some of the image backend problems we had. I think one of them was that older versions of Mesos default to the copy backend, so when you pull down a Docker image it uses this copy backend, and it failed on something when it was untarring an image with symlinks and just exploded. There are a couple of other issues surrounding that.
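Going back to the RecordIO stream I mentioned a minute ago, here's roughly what that decoding boils down to. This is a simplified sketch, not the SDK's exact code: each record on the wire is an ASCII length, a newline, and then that many bytes of payload, which you then unmarshal into your generated Event type.

```go
// Minimal RecordIO decoder sketch for the V1 event stream.
package recordio

import (
	"bufio"
	"io"
	"strconv"
)

type Decoder struct {
	r *bufio.Reader
}

func NewDecoder(r io.Reader) *Decoder {
	return &Decoder{r: bufio.NewReader(r)}
}

// Next returns the raw bytes of the next record. Callers unmarshal them into
// their generated Event type with proto.Unmarshal.
func (d *Decoder) Next() ([]byte, error) {
	// Read the "<length>\n" prefix.
	line, err := d.r.ReadString('\n')
	if err != nil {
		return nil, err
	}
	n, err := strconv.Atoi(line[:len(line)-1])
	if err != nil {
		return nil, err
	}
	// Read exactly that many payload bytes.
	buf := make([]byte, n)
	if _, err := io.ReadFull(d.r, buf); err != nil {
		return nil, err
	}
	return buf, nil
}
```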
Back on the UCR: I think there are even some issues with the other image backends on the older versions. There's a bit more than this, but generally the issues here affect 1.0, 1.1, and a lot of the minor versions in between, so we just skipped those versions entirely — we had the luxury to do that. And we would really like to see user namespace support in the UCR. I know this is something they're thinking about working on; there are some issues right now that make it really difficult to implement. I think they're hoping to wait for the dust to settle on UID shifting for mounts, so that when you have two containers sharing a volume and both pointing at it, it doesn't turn into a total disaster with different UIDs everywhere. Seccomp, I think, they're moving towards pretty quickly; it's a much easier thing to implement. Initially we were thinking of doing this stuff in the custom executor that's part of our framework, but I think it makes more sense to have it in Mesos in the long run, because then everyone can benefit from it and you don't have to have your own custom executor if you don't want one — you can just use the default executor. So it was definitely not a smooth path, but once we got through all these issues it's been good. It's been really good.

So I had recorded a demo initially, but I want to do a live demo instead — take the risk, why not. I just want to show launching maybe 500 tasks on our framework in our cluster, give you an idea of how easy it is to use, and show you what it's like to do this with a custom framework. Do you know how I can switch this over to the screen? We can write a framework, but we can't operate a Mac. It's not our specialty. Not a Mac guy. Yeah, sorry guys. Here we go. I'll just mirror displays and enlarge. You guys see that okay? I guess I can zoom in a little bit too.

So this is the Mesos UI for the cluster that's running in our cloud, which we're actually working on in addition to our framework — we're building out our own cloud, and the framework runs on top of it. I just want to show you the tasks running and how smooth the process is. We've tunneled into our cluster, and we're just going to run a load tester that we have: 500 tasks. Go for it. What this is doing is hitting our endpoint; our API just returns this simple little JSON payload back telling you the task was successfully queued. It will keep going until it's done, and you'll see the tasks appear in the web UI for the master. So we can see all the tasks here running. It's very simple and extremely fast. We've done some other tests too — I launched 50,000 tasks because it was awesome. There was really no reason to do it, but it was able to launch 50,000 tasks in roughly, I think, 40-something seconds. So it's quite quick. I believe a demo we had done — I think it was two years ago — Larry did a demo on Marathon, and Marathon did it in about 70 seconds. So we're actually faster than Marathon in terms of task deployment. The tasks I'm launching are pretty basic: just something very general that sleeps for some random period within a range and then finishes. One of the things that's a little bit unique about our framework is we made it flexible enough to run not only long-running tasks but also jobs.
So when things fail or crash hard, we'll actually get failure statuses, but if you have something that runs, completes, and exits zero, it will be marked finished. Some other frameworks will try to reschedule a task regardless of how it exits; we'll just regard a clean exit as finished. So for a long-running process, we assume that if it stops running, something's gone wrong and it crashed; otherwise, if it exits successfully, we assume that's okay.

And this is kind of interesting, sort of a side note: I found that the newer versions of Mesos, if you view the UI over a tunnel, keep trying to hit the internal IPs of the leader it detects, or of any agent you go to, so you might see some of these errors pop up and I have to keep refreshing manually to get around that. I think most of these tasks should be done. Yeah, so everything is completed, everything's finished, you can see. It's really pretty simple. Once you have the basics of a framework up and running — however you want to accept data to launch tasks and actually run them — it's really quick. Go has helped us be really efficient: this takes practically no CPU even under heavy load, and not much memory — maybe 15 megabytes at most when he was hitting it with 50,000 tasks. It's like nothing, absolutely nothing.

We also open sourced our framework, so if you'd like to check it out, it's at github.com/verizonlabs/hydrogen. We'd love to take pull requests and any reviews — you can tell us our code sucks. Anything would be awesome; we'd really appreciate it. And if you run it too, just for fun, at home on some Raspberry Pis or whatever, that'd be cool too. We want to support all use cases. We really tried to make the SDK because we wanted more people to utilize it, not just make it specific to Verizon, so it'd be great to see what other people use it for. Our focus was really smaller clusters and workloads. Marathon is better for larger clusters, like a thousand-plus nodes, but most companies probably run anywhere from 50 to 100 or a few hundred nodes for their workloads.

Another aspect we want to talk about is the Container Network Interface. Like I mentioned earlier, the main issue we had with Docker was that we couldn't use multiple network interfaces. This was a big problem for us on our front end: we wanted a public IP in the container and a backend IP. We also wanted to be able to plug in, say, a storage network — whatever. We wanted to give a container multiple interfaces, and we couldn't do that. We also wanted to change the way we were doing networking. Docker pretty much just gave us the bridge and NAT, right? You get your 172.17.0.0/16, it hands addresses out, and that's it. And that was problematic for some of our applications that are very network-heavy and do long-lived TCP connections. You can see the supported types of plugins here for the Container Network Interface. How many people here have heard of CNI and know what it is? Just raise your hand. Okay, you already know then, I won't go over it — you're well aware of how CNI works and all the things it supports.
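For anyone who does want a quick refresher anyway: a CNI network on the agent is just a JSON config file that the UCR picks up from the directory you point it at with the agent's --network_cni_config_dir flag. Something roughly like this — the names and addresses are made up, and it anticipates the bridge-plus-host-local setup we describe next:

```json
{
  "cniVersion": "0.3.1",
  "name": "datanet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": false,
  "ipam": {
    "type": "host-local",
    "subnet": "10.12.3.0/24",
    "routes": [{ "dst": "0.0.0.0/0" }]
  }
}
```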
The good thing is that since it's a standard and the UCR supports it, we can leverage that, make our own plugin if we want to, and hook as many interfaces as we want into each container. So what we've done — just to go back one second, sorry — is we've stuck with bridge for now, but we use the host-local IPAM. We run on a 10/8 network for IPv4, and we give each host the .1 of its subnet, so it acts as a router. We route on the host, and each /24 gives us 254 usable addresses for containers per host. We actually don't need a DHCP server, because each host has a specified address format: it's 10.rack.slot for our chassis in the physical data center. What that allows us to do is not really keep track of anything: the host just needs to know about its own /24, it's all unique across every host, and we're good. So we removed the need for DHCP right there, which was pretty cool, because even managing DHCP got a little annoying when it would go down, other tasks would go down, and it sort of escalates from there.

In our lab we also use macvlan, and we've done ipvlan — we test a lot of different things. We've written our own plugins as well. We're trying to accomplish EVPN, which would be really cool to do, with VXLAN for multi-tenancy inside our clusters. So there are other things we're looking into, and CNI is definitely really exciting because it lets us do all these different networking types in the data center. Since you already know about CNI and how it works, I won't go over this slide; it's just the bland picture of where everything goes and how it actually interfaces with Mesos.

All you need to do now in our task definition — you can see this on GitHub in the README, how to launch a task on our framework — is define a network, just like CNI: tell it, okay, it's a bridge, it's macvlan, whatever it might be, and give it a name: datanet, mynetwork, whatever you want. Then in the network protobuf underneath the covers, we just reference that name. You don't even have to say CNI; you just say "mynetwork" inside your task, and you can put multiple of them — it takes a list — and when you do that you end up with any number of network interfaces, each of which can be a unique type. One could be bridge, one could be macvlan, one could be VXLAN. You can do whatever you want with it, which is pretty awesome.

We've done that in our cluster to segregate our storage, and we also have a network for our front end, so we can define a front-end network with public IPs and pick right from that pool — we no longer have to manage state about who has which IP where. People can also pick static IPs that are reserved, and it will always give them that static IP, so we can use anycast on the front end: six, seven, eight, a hundred instances of some application use anycast, and whatever is closest just gets routed via eBGP and gets hit. Any questions about CNI? I just want to make sure everybody knows what it is and I'm not skimming over anything. Okay, cool.

I just wanted to touch on one other thing before we move on. We didn't put too much about this in the talk because it's a bit more specific, but I wanted to say that we did make our framework HA, so if anyone's thinking about making a framework, I'd say that in general it's very easy to do.
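Under the covers this maps onto the Mesos NetworkInfo message: the container info on the task carries a list of them, one entry per named CNI network. A rough sketch with hypothetical protoc-generated bindings (the exact field names will depend on how you generate them):

```go
package main

import (
	"github.com/golang/protobuf/proto"

	mesos "example.com/yourproject/gen/mesos" // hypothetical generated bindings
)

// withNetworks attaches CNI networks to a task by name; the UCR looks each
// name up in its CNI config directory and wires up one interface per entry.
func withNetworks(task *mesos.TaskInfo, names ...string) {
	infos := make([]*mesos.NetworkInfo, 0, len(names))
	for _, n := range names {
		infos = append(infos, &mesos.NetworkInfo{Name: proto.String(n)})
	}
	task.Container = &mesos.ContainerInfo{
		Type:         mesos.ContainerInfo_MESOS.Enum(), // UCR, not Docker
		NetworkInfos: infos,
	}
}

// Usage (network names are just examples):
//   withNetworks(task, "datanet", "frontend")
```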
On the HA side: if you're using a distributed key-value store underneath, or if you're using ZooKeeper, it's even easier because ZooKeeper has more primitives. But with etcd, all we really needed to do was — I mean, it's a little more complicated than this, but generally — have all the framework instances come up, one wins the leader election, and the ones that don't win connect to the leader. If that leader ever goes down, they all basically fight for the leader position again, one of them wins, and we go back to this connection model. The reason we keep a TCP connection across all of our framework instances is that ZooKeeper internally has this ephemeral znode concept — I believe underneath it's just a TCP connection, so when you die, your connection gets cut and your key goes away. We've basically emulated that, and after getting that part solved it was very easy, so I just wanted to touch on it quickly. One quick note about that as well: we have etcd as our backing store, which already runs the Raft consensus algorithm, so there was really no need for us to implement another consensus algorithm — that would have been sort of a waste of time. Basically, what he's describing is that each instance tries to take a TTL'd key in etcd; whoever gets there first becomes the leader, and etcd takes care of distributing that state. Once that key goes away and hasn't been refreshed for some configurable amount of time, we can pick a new leader, and we assume the old one died or hit a network partition.

But back to CNI. I mentioned a lot of this already: one of the biggest things for us was end-user visibility in a multi-tenant environment. When somebody looks at their application, you don't want them to be able to see other people's networks. You don't want them to be able to connect to them, ruin them, send crazy amounts of broadcast traffic over them, or whatever else they might do. So the isolation is really key for us, and it allows us to do what a VM gives you — where it looks like you have your own private bridge or L2 network — with containers, with multiple tenants on a host, and you can manage each CNI network per service or per customer, however you want to slice it. And there are all the other benefits that come with standardization, right? There's no vendor lock-in, and we can move to different standards on top of it. Like I mentioned earlier, we can change to an overlay mechanism like VXLAN or NVGRE without the end user noticing. So if we originally used a bridge for a customer, we can say, hey, we're going to do VXLAN, and we can just do it and it will work underneath. It also gives you decent IPAM — IP address management. Like I said, CNI doesn't have DHCP on its own; what it will do is basically make a call out to another DHCP management system. So you can run, say, the ISC DHCP server and that will take care of it for you, and CNI basically calls some hooks into that.

Sorry for the tiny text here, but these are some logs from our agents. You can see there are two networks, devnet and datanet. One container got 11.50.11.2 and the other got 10.50.11.12/24 — these are just two networks I made as an example, but the logs show the IPs coming through. This is host-local with a bridge, so these are just two bridge networks on the host, and they got attached into the UCR.
And you can see here — I just used nsenter, got into the namespace, did an ip address show, and you can see the two veth interfaces there. They're both LOWER_UP, you can talk over everything, and the routes are there if you want to get out. One thing we did run into is that depending on which network comes up first, the default route sometimes goes out the wrong interface, so sometimes we need to make sure the default route is correct for that application. If you have a front-end network and someone's trying to get out to the internet, but the default route goes out the internal network, it's not going to work. So there's some ordering that has to happen there, but this is just an example and I wanted to explain that caveat. Has anybody else enabled multiple network interfaces in the UCR? I haven't heard of this in the community, and I'm really curious whether somebody else has gotten it to work or done anything with it. Okay. I'm really excited about it.

All right. There's no IPv6 support in Mesos right now. It's being actively worked on, and CNI just got some preliminary features for it, but I'd be really interested in getting that done, because with IPv6, IP-per-container becomes extremely easy — you don't have to worry about stepping on toes. We don't have support for dynamic traffic filtering. This can be done in other ways, but it would be really nice to have a hook inside of CNI; it could just be another plugin, but that's not really there yet. And support for dynamic updates to existing network configurations: I believe before Mesos 1.2 you actually had to restart the agent to read the CNI configs into memory. Now you can change the network configs and it should just update — you don't have to restart the agent and restart all the containers and cause a lot of issues.

Before we wrap up for questions, I just want to say that we get asked a lot why we made our own framework. It's generic; Marathon obviously has more features and has been out a lot longer, and it's kind of a generic framework that anyone can use. But I think the real power here is some future work that we're going to do, plus a lot of the CNI and storage stuff we have going on in the background. One of the things we've talked about doing in the near future is tying the framework into some more advanced scheduling algorithms. We want to efficiently schedule tasks onto nodes across fault domains and shut down everything we're not using; then, when we do need to scale up, we can integrate our framework with ipmitool, or make the IPMI call ourselves, or however we want to do it, and power machines on dynamically as we need them. We can also do a lot of fancy stuff in our custom executor. Like I said earlier, we were thinking of doing user namespaces and seccomp in there — it might be better to have that in Mesos and focus on doing it there instead — but we had some security requirements where we wanted to make use of keys in the TPM, and we can integrate with that. We can do a lot of stuff that is very unique to our platform, and since we're building both the platform and the framework, we can be very efficient and control everything from start to end. So while it might be generic, it gives us a lot of flexibility — really a lot of flexibility — and we can move very fast and fix things very fast. And I have to add, it was mentioned earlier today: just getting away from ZooKeeper has been excellent.
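To give a feel for how little code the leader election takes with etcd, here's one way to do it with the stock clientv3 concurrency helpers. This is a sketch of the TTL-key idea described earlier, not necessarily how Hydrogen implements it; the key prefix and TTL are just examples.

```go
// Each scheduler instance campaigns on the same key under a leased session;
// whoever writes first is the leader. If it stops renewing the lease, the key
// expires and the remaining instances campaign again.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func runForLeader(endpoints []string, id string) error {
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		return err
	}
	defer cli.Close()

	// The session lease TTL is the "configurable amount of time" after which the
	// others decide the leader is gone (dead, or on the wrong side of a partition).
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		return err
	}
	defer sess.Close()

	election := concurrency.NewElection(sess, "/hydrogen/leader") // example key prefix
	if err := election.Campaign(context.Background(), id); err != nil {
		return err
	}
	log.Printf("%s is now the leader", id)
	// ... run the scheduler loop; followers block in Campaign until the key expires.
	return nil
}
```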
etcd has been really good for us so far, and one of the things we're evaluating with Mesos is zetcd, which I believe is from CoreOS. It's basically a layer that sits in the middle: you point Mesos at it, and it accepts an incoming ZooKeeper connection — Mesos thinks it's talking to ZooKeeper — but it translates to etcd on the back end. So the 500 tasks that I launched earlier onto our cluster and showed you running, that's actually an etcd backend, and we have five masters and I think five instances of our framework and three or five instances of etcd. I think things are changing and getting easier, and people are going to have more options; it just gives a lot of flexibility. So we'll just open it up to questions. Any questions? Yes.

Hi, you mentioned the framework you guys wrote is intended for smaller clusters. What is limiting it from scaling to clusters of larger sizes?

Nothing, right now. We can test on larger clusters, but we haven't done that. It certainly was built for smaller clusters, but you can certainly try to use it on larger ones. One of the performance issues we had was scaling past 1,000 nodes even with Marathon, and what ends up happening, I think, is that ZooKeeper itself struggles to keep up with all the state, especially if you have a lot of tasks, and the policy management — or I should say lifecycle management — of tasks becomes very difficult, because with Marathon a task will retry forever, and application owners will say "I want it to start as fast as possible," so if you have a task that's just dying, it'll flap forever and never die. We're a little bit opinionated about that in our framework: we will kill the task after so many retries and say your stuff is broken, you should probably go fix it. So some of those little tweaks around policy management — and, I guess, end-user education on how to better use the cluster and not just throw a bunch of broken tasks at it — help it scale, as well as health checks. I know the health checks are now out of experimental, the ones that go from...
Mesos finally has health checks that go back to the executor instead of the scheduler, and that helps scale as well. There are some other issues too — the design of Mesos-DNS, for example. We're starting to design our own DNS, because Mesos-DNS right now actually hits the master, pulls a huge list of state from it, and does so very frequently. We're basically trying to leave the master with just its core responsibilities in our framework, and we're hoping that will let us scale a lot more if we distribute the workload of cluster management across the executor and our own framework. So to come full circle on your question: I don't think there's anything necessarily limiting it, we just haven't tested on anything much larger than, I think, 100 nodes or somewhere around there. But we're actually going to take over a lab that I think is about 600 nodes, so we'll see how it runs there. And if you want to run it on a thousand nodes, that'd be awesome — talk to me, I'd love to know how that goes. One of the things we focused on around scaling was how we're storing state in etcd, how we're retrieving it, and how much we store, because we have hit a lot of issues with — I don't know if it's so much ZooKeeper or Marathon — just how much state Marathon stores and the limits of ZooKeeper and its znodes. The older versions of Marathon stored different amounts than maybe the newer versions do. We took a lot of time on that.

Sure, that was a question — two questions. What was the reason you created your own SDK? Are you using the DC/OS SDK?

No, this is our own SDK.

Any particular reason for doing that compared to using the SDK from DC/OS?

Yes. We didn't feel like using their SDK — it limits us. We wanted a lot of customization, and we're really after performance in our clusters. Having a YAML format to make your framework is cool — that's really good, I think, for basic tasks or very cookie-cutter things — but if you want to get really extensible, really squeeze performance out of your cluster, and get rid of a JVM, which makes my boss happy, then that's why we did it. It has its place, don't get me wrong; if you want to do that, absolutely use it. It just wasn't for our use case. We have almost two levels in the SDK: at the very basic level we have the protobuf bindings — which I think Mesos just now brought into their official Apache repos — and then we have a more opinionated layer on top that provides a default task manager, a default resource manager, the things that every scheduler would otherwise have to handle on its own no matter what, so we tried to make those a write-once kind of thing. We're also Go; a lot of the existing stuff from Mesosphere and in DC/OS is Java, which is fine, but we went with Go. Any other questions? Thanks guys. Thank you.