All right, so next up we've got Gabriel Hartman, a technical lead at Mesosphere, who's going to talk about an SDK for building stateful applications for DC/OS. Thanks, Ravi. Everybody can hear me OK? All right. Louder? Louder. All right. That better? All right. Cool. OK. We've spent about two years building an SDK so that you can write stateful applications on Mesos. And I'm going to talk about its high-level design, and then we're going to drill down into what the stateful problems it solves are, and there's going to be a bunch of demos along the way. Let's see. I skipped who I am. Ravi told you, I'm Gabriel Hartman. I've been a tech lead at Mesosphere for the last two years. That's my Twitter handle, but I don't use it. But everybody puts them on their slides, so I do too. All right. So we're going to do disclaimers. There's only one. And then we'll talk about stateful problems in general. I don't know, has anybody here written a stateful application on top of Mesos, like actually stored state? OK. OK. Cool. We'll talk about the framework landscape in general, and where this SDK, or this framework, fits into that picture. The SDK does many, many things, and we're only going to talk about a few of them. So I'll go over all the features real quick just so you know what they are. And then we'll dig into the state problem and where we're going. So, Mesos familiarity. Everybody here is at MesosCon, so everybody knows what an offer is, and a resource, and a framework. Is that a yes? OK. I see some nodding heads. So I'm not going to go over how Mesos works again. If that's a problem, please ask questions anytime. OK. So I break down stateful problems into kind of two main areas. And they all derive from one assumption: all these many stateful services have a legacy view of the world, that there's some persistent context where you can go and run commands any time you like.
This is very inconvenient in the containerized world because there is no persistent context. You bring up a container, you do something, and then it goes away. So here's one main category of that problem: you want to do something before you do the main thing. There's often this prepare task. Is anybody here familiar with HDFS? Anybody use HDFS? If you've ever deployed HDFS, there's format the name node, and bootstrap the name node, and then run the actual name node. There are all these tasks you have to do before you do the real thing. That's a common pattern that we saw. And then you want to do things after you deploy the service. So if anybody's run Cassandra, there are repair tasks that you want to run. You've got Cassandra running, but then you want to run a task in that container later, while it's still running, and do something without disrupting it. So that's what I mean by these maintenance or user-defined tasks. Here are some examples. You'll see HDFS comes up a lot. It's sort of a motivating case for the SDK because it is so legacy and is very, very difficult to deploy. If you go read HDFS's deployment docs, it's like 17 steps long, and they assume you can SSH into a node and mutate state. Same for Cassandra: replacing a dead node or a seed node. Have you seen these directions? Step one, two, three, four, five, six, seven, eight, nine, 10, 11, 12, 13 steps where you SSH into the node and read some file and query it. So we were able to automate this. I'm just saying that in general, stateful things have this kind of problem where they assume that you can go to the node and do something with it. Let's see, is that still on? All right, so let's talk about the framework landscape in general. You guys have probably heard of all these frameworks out there. There's Aurora, Marathon, Fenzo, Cook, Jarvis, Spark.
Some of these might be a little bit controversial to call frameworks, but generally these are the big, general-purpose frameworks that exist in the world. And now we've got this other one. It's a general-purpose framework. We call it DC/OS Commons, or the SDK. And I wanna try to place it in the world of frameworks so you can have some context about what it's trying to do. Generally speaking, there are two kinds of schedulers in the world: there's either a mono-scheduler or a multi-scheduler approach to deploying tasks on Mesos, right? Aurora, Marathon, Cook, these things consume all the offers in the cluster. They have sort of a global view of things. They're mono-schedulers. And then you've got the multi-scheduler approach, where you launch lots of different frameworks and they all sort of cooperate and fight for resources. We're over here, we're on the multi-scheduler end of the spectrum. So the SDK is a factory for frameworks, for schedulers. And then there's this other spectrum: generalized versus specific schedulers. Some people have written a scheduler that is just MongoDB, right? It's just Cassandra. It does one thing. And on the other end of the spectrum you have Marathon. It's very general. You say, I wanna run my application in this container. So on that spectrum, we're on the generalized end of things. And then there's another axis, right? Long-lived versus job-oriented. Many of these frameworks are focused on either running tasks that last for a long time or running tasks that come and go. Don't step on the wire. And we're on the long-lived end of the spectrum. So this is something that launches tasks and they last for a long time, generally speaking. So amongst our peers, I think the closest comparison is Aurora, Marathon, and Fenzo, right? We have these things in common: we wanna run long-lived things and we wanna provide rich orchestration of task deployment.
And then we have this additional emphasis in the SDK on stateful applications and being extensible. By extensible, I mean you should be able to create your instance of a framework, and the special little problems that are associated with your application, like Cassandra or HDFS, you should be able to solve just there. So to sum up, the SDK is a multi-scheduler, general-purpose framework generator that has a special emphasis on statefulness, orchestration, and extensibility. So here's the SDK features section. There are many, many, many of them. I would say, did it go away again? I would say that the core mechanic of the whole system is rolling configuration and software updates. So you say, I wanna make a change to my service, and it restarts all your tasks. We have a lot of other things. Having separate deployment and update plans means sometimes you wanna install a service one way, but you wanna run a different orchestration pattern when you do an update or an upgrade. So you can decide to use different deployment strategies based on the scenario. There are a million Mesos features that it integrates with. It does resource reservation for everything, all the time, right now. And it's getting pretty good at it. Is anybody here familiar with resource, or I'm sorry, reservation refinement? Anybody? It's a brand new feature in Mesos and we'll demo it later. It's pretty cool. And we also have these placement constraints. Have you guys used the Marathon placement constraint language? You've used that before? Yeah, that's fully natively supported by the SDK. So every time you describe a pod or a task, you can apply those same placement constraints. So far, we've built all these services using the SDK and they're all available in DC/OS right now. Kubernetes on DC/OS was announced recently, I think, and that was built on the SDK. Kafka, Cassandra, Elastic, and HDFS have all been deployed using the SDK. We can go look at them later.
EdgeLB is something for enterprise DC/OS, and that was built on it, and there are a bunch of other ones too. So let's talk about it in general. What does it look like and how does it behave? And then we'll talk a little bit about the special things it does for state. So we have pods. I don't know, there are like four different definitions of pods; we made up another one. It's just a group of tasks. So here we have a pod named hello. We want two instances of that pod, and it has one task in it called server. It has a goal state of running, a command, and some resources. Is that clear? Everything here is sort of native Mesos terminology. This is a Mesos task. A pod is an executor in this case. The only thing that's not Mesos-y is this goal state of running. So you say, when this task crashes or something bad happens to it, we try to maintain its goal state all the time. The other choice besides running is finished. So those configuration tasks I was talking about before, that you would want to run before your main thing, those usually have a goal state of finished, like format name node. Okay, I formatted the name node, now I'll run the real thing. So the first one has a goal state of finished. If you want to talk about it, there are two main pieces, pods and plans. Pods are what your service is, and plans are how you're going to deploy your service. So we have a pretty rich orchestration system. You don't have to understand every little detail about what this is, but basically we have two pods, hello and world. They can be deployed using various strategies, and they're maintained according to their goal state for you, forever. Okay, so if you remember the stateful problems I talked about at the beginning, I'd like to present this idea as the main idea that resolves those problems. And it's the fact that we've decoupled tasks from resources in Mesos.
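As a rough sketch, the hello pod on the slide might look something like this in the SDK's YAML service spec. I'm writing the field names from memory of the DC/OS Commons examples, so treat the exact keys as approximate rather than authoritative:

```yaml
name: "hello-world"
pods:
  hello:
    count: 2                      # two instances of this pod
    # Marathon-style placement constraints are supported per pod:
    placement: '[["hostname", "UNIQUE"]]'
    tasks:
      server:
        goal: RUNNING             # the SDK restarts this task as needed to maintain the goal
        cmd: "echo hello >> hello-container-path/output && sleep 1000"
        cpus: 0.5
        memory: 256
```

The goal state is the non-Mesos piece he mentions: RUNNING means keep it alive forever, FINISHED means run it once to completion.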
So in Mesos you can reserve resources, which we do, and then you can run tasks on those resources. Now, in every scheduler I've ever seen, you always launch a task together with its resources, and you never launch a different task with the same resources. And that's what we allow here. You see, we define this resource set. It's just a bag of resources, and you can apply different tasks to that resource set. So here's that prepare task I was talking about. You can run this prepare task on that resource set. It has some effect on your persistent volume, which hangs around, and then when you later run the server task, it can see the effect of that prepare task in its persistent volume, because they're sharing resources. So far I've presented the service spec and the plans. So this is everything that I've shown you. It's like 45 lines of YAML and you've defined a service. I think we're gonna demo this in a second. And what you're gonna see is, this is what a plan looks like. You said, here's the deployment plan, and here's the hello pod being deployed, and here's the world pod being deployed, and they even have their prepare and server tasks. See, they're complete and running. And you'll see that the goal states of finished and running pan out in Mesos. So we prepared world pod zero, we prepared world pod one, and then they started running. And so the goal states are achieved, right? That first task finished, and then we ran the next one. So let me do a demo real quick. This is DC/OS just starting a scheduler. It's nothing, everybody hold your breath. Okay, so here's the scheduler coming up. That's the hello world scheduler. Takes a few seconds, brings up an API server. It's gonna, there you go, it's running. Then in a second the tasks are gonna come up. There we go. So hello zero and hello server are up, and then we're gonna see those world pods come up in a second.
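The shared resource set pattern being demoed might be defined roughly like this. Again, this is my approximation of the DC/OS Commons YAML, not a copy of the real hello-world example:

```yaml
pods:
  world:
    count: 2
    resource-sets:
      world-resources:            # one bag of reserved resources...
        cpus: 0.5
        memory: 256
        volume:
          path: "world-container-path"
          type: ROOT
          size: 64
    tasks:
      prepare:
        goal: FINISHED            # runs once against the resource set, then exits
        resource-set: world-resources
        cmd: "echo prepared >> world-container-path/output"
      server:
        goal: RUNNING             # ...then the long-lived task reuses the same resources
        resource-set: world-resources
        cmd: "cat world-container-path/output && sleep 1000"
```

Because prepare and server point at the same resource set, the server task sees whatever prepare wrote into the persistent volume.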
Yeah, and see, you'll see this world zero prepare thing is finished and this thing is running. Now let's go look. If you recall, let's see here. If I go up here, there's a bug. You gotta go out here, you see the output. So we're maintaining this cumulative state, right? That first prepare task did something to that shared context, and then the next thing is able to see that context and continue working, right? Which is one of the core problems that stateful services are presented with. So let's go back here. All right, so that's one use case, where you do this preparatory task and then are able to see the results of that preparation later on. Yeah, generating this cumulative state. So why did we bother building this feature? Basically, this is something that HDFS does all the time. I can show you our HDFS plan. So what does it look like when you build a production-ready one of these? It's about 355 lines of YAML instead of the 45 in the hello world example that I was showing. Now if we look here, you'll see that I have these same finished and running goal states, right? Format is one of those finished tasks that runs before the others. Then we run the name node. Bootstrap is one of those finished guys, and then we run the name node. ZKFC format is another one, right? You wanna run this, finish it, and then run the main thing. And then there's this other feature that I alluded to earlier. So this is how you deploy HDFS, but it's not how you update it. If you wanna do a rolling update, you run all these things together, right? You restart the name nodes together with their ZKFC nodes. You restart the data nodes together. Okay, so the next main feature, beyond this cumulative effect where you can sequence operations on the same resources, is what I call sidecars. I think it's a pretty normal term, but basically your service is running, and your executors and your tasks stay up, and then you run another task.
Inside that executor, it can go see the other task's context and do things to it, right? So let's look at that. So here's an example of using resource sets again to enable this sidecar pattern. You see I've defined another resource set called the sidecar resource. Hold on. And now we have three tasks: the prepare and the server that we had before, and I've added another finished task, sidecar. Okay, I can just inject this task into the executor, into the shared context, and run it on demand. And you can kick off such work by defining a plan, I call it a sidecar plan, that says go run those sidecar tasks on the world pod. And so you can just hit an API endpoint and say, hey, run the sidecar plan, and it's gonna run all those tasks in your executor context and perform sort of a maintenance operation on your service. So hold on, I wanted to uninstall hello world. We'll wait for that to be uninstalled, one second. So we're gonna see the same thing. We're gonna be able to launch those sidecar tasks into the running container's context by executing the sidecar plan. And this is the result you're gonna see. Can you see these timestamps? You see it says three minutes up here and two minutes down here. That means that the world servers stayed up the whole time and I was able to inject a sidecar task into that container without affecting the running service, right? You wanna maintain the service being up. So I can show that, let me get the scheduler coming up. Just one second. Let it deploy the service again, and then we will run the sidecar plan. And what we're gonna see at the end of this, remember we had that prepare, and then we had server, and the output. Now you're gonna see prepare, server, and then you're gonna see sidecar injected into that same context that everybody's sharing now. Give it a second. And so you can list all the plans. These are DC/OS CLI commands, but all they're hitting is a REST API served by the scheduler.
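A sidecar task plus the plan that triggers it might look roughly like this. The plan syntax here is my approximation of the DC/OS Commons format; the phase and step shape should be treated as a sketch:

```yaml
pods:
  world:
    count: 2
    resource-sets:
      sidecar-resources:          # a second, small resource bag just for the sidecar
        cpus: 0.1
        memory: 32
    tasks:
      sidecar:
        goal: FINISHED            # runs to completion each time the plan is started
        resource-set: sidecar-resources
        cmd: "echo sidecar ran >> world-container-path/output"

plans:
  sidecar:                        # started on demand via the scheduler's plans API
    strategy: serial
    phases:
      sidecar-phase:
        strategy: serial
        pod: world
        steps:
          - default: [[sidecar]]
```

You'd then kick it off with something like `dcos hello-world plan start sidecar`, though the exact CLI invocation may differ from this sketch.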
So if you say plan show deploy, you'll see the deployment plan in a second. There you go. So the deployment has completed. So let's go start that sidecar plan. There, it's already almost done, you see. So those prepare tasks are done and the sidecar tasks are done, and if we go look into that volume, we'll see that it's affecting that persistent state, that shared context that stateful services depend on. Cool. So why did we build this feature? The real motivating use case for this feature was Cassandra, because Cassandra has a number of tasks that you wanna roll out across the cluster. There's repair, there's cleanup, there's backup and restore. We can go look at those. So you see we defined a number of plans. Here's the backup-S3 plan for Cassandra. This is a lot more complicated than that world scenario, but it backs up the schema, it creates snapshots, it uploads those to S3, and then it cleans up the snapshots. And you've got the inverse over here for restore. You can go fetch everything from S3, restore the schema, and restore the snapshots. And this all occurs while your Cassandra cluster is still running, unaffected. You can define how you wanna run these plans. Would you like to do them serially, or are you in a hurry and wanna run all the backups in parallel? Do you wanna do it 10% at a time? All this is up to you. You can define how you wanna operate your system. Okay, so far we've gone over sort of the stateful aspects of the service and some of the orchestration capabilities of the service. And I wanted to touch briefly on the extensibility of this. So far you've only seen YAML, and briefly. That YAML, what we call the service spec, is instantiated as a Java object, and you are free to modify it before the service starts, at will. That example service I talked about, EdgeLB, it's an edge load balancer. It uses this a lot. It programmatically modifies the port characteristics of its service. Let's see.
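The multi-phase backup plan he's describing might be structured something like this. The phase and task names below are illustrative, not taken from the real Cassandra service, and the strategy choices just show where serial versus parallel rollout would plug in:

```yaml
plans:
  backup-s3:
    strategy: serial
    phases:
      backup-schema:
        strategy: serial
        pod: node
        steps:
          - default: [[backup-schema]]
      snapshot:
        strategy: parallel        # snapshots can run on all nodes at once
        pod: node
        steps:
          - default: [[snapshot]]
      upload:
        strategy: serial          # serial if you're cautious, parallel if you're in a hurry
        pod: node
        steps:
          - default: [[upload-s3]]
      cleanup-snapshot:
        strategy: serial
        pod: node
        steps:
          - default: [[cleanup-snapshot]]
```

Each phase's strategy is the knob he mentions: run a phase serially across the ring, or fan it out in parallel, without touching the running Cassandra servers.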
And then sort of the first-class, easy things that we allow you to extend your service with are APIs. So you saw all those commands I was running; there's a whole bunch of them. So if you say dcos hello-world, I think it'll spew them all. Yeah, there are many, many, many endpoints that you get for free in your scheduler, but you can add your own. Kafka does this. So if you wanted to expose the ability to create Kafka topics through your API server, you can do this. So that's APIs, and then there's failure recovery. Anybody know how to replace a name node that's failed? In HDFS there are two name nodes, usually. And if one of them permanently fails, you need to bootstrap it off of the other one, copy over all the bits, and then start it. And so that's custom recovery logic for a permanent failure case. You can extend that. So you can define your own custom recovery logic in Java and add it to the service. Let's see if I can just show you that. Yeah, here's the HDFS example for failure recovery. You see? You were able to write a little Java. I don't know, maybe that's too small. Here you go. You wrote a little Java. It says, oh, I'm replacing name node zero or name node one. Oh, bootstrap off the other guy and then start the name node. This is not a lot of Java, but you have this programmatic extensibility so you can handle all the special failure cases that many stateful services have. Cassandra does the same thing. So Cassandra, when you wanna replace a Cassandra node, you have to tell it, or if you wanna do it right, you should tell it which IP address it's replacing so that it can take over the token range. So you see, you should modify the command that you run in this special case.
Instead of just starting the Cassandra node, you say Cassandra node, replace IP, and then you give it this special IP address, and then you rebuild the pod spec, and then you'll see down here, if it's a seed node, you need to do this extra special thing, right? All this if-else, if-else logic makes sense in a programming language. It doesn't really make sense in YAML, so we allow you to extend your services in this way. Okay, so nobody was familiar with reservation refinement, a new feature in Mesos that MPark wrote. Thank you, MPark. You guys know about static reservations, where you can start an agent with statically reserved resources. Now, I showed you a bunch of neat things that you can do with stateful services, great. But still, there are probably edge cases where you wanna permanently isolate your storage services. You say, here are my 10 storage nodes. I'm gonna isolate them from the rest of my cluster, so I wanna statically reserve those resources. If you're running multiple frameworks, in the past you would be stuck with saying, I wanna statically reserve these resources for Cassandra zero, I wanna statically reserve these resources for Cassandra one, and you sort of have to know, every single time you wanna install a service, what reservations you wanna perform. Now with reservation refinement, you can say, look, I'm gonna reserve these 10 nodes for the storage role, okay? They're just storage. And every time I wanna install a Cassandra, I say, hey Cassandra, go consume some portion of that storage role. So you can say, I'm gonna refine that resource, I'm gonna refine that reservation. It was reserved to storage, and now I'm gonna reserve it even further and do storage slash Cassandra. And the real power of that is that you maintain a clean offer stream.
So each one of the frameworks that consumes resources in that way doesn't have to cooperate with other frameworks and say, oh, I'm using some portion of the storage role, and you should not use it. Like, how would you even do that? You do it by having a one-to-one mapping between frameworks and roles. So I have a demo that shows this working. I gotta uninstall the last one. One second. I don't know if you guys are familiar with DC/OS, but out of the box, DC/OS has something called public agents or public slaves, and those are statically reserved. So I'm gonna refine the reservation of those statically reserved resources, and I'm gonna run some of my tasks on those statically reserved resources that have been refined, and some of them on the normal private agent pool. I have the YAML for that, if you'd like to see it. So you see, the only thing I had to do was say the pre-reserved role here, the statically reserved role, is slave public. And now all these hello tasks are gonna refine that reservation. So let's see if we uninstalled already. Is it still, let me attach the wire to the slides. So what we should see, what we're going to see, is that we're gonna get three hello pods that are on these refined resources, and three world pods that are on the normal resources. Oh, not working. So, can you guys see that? You see the role here for the hello pods? It's slave public slash hello world role. God, take my word for it. Do you see that? Those are refined resources. So, there's only one framework in the world that has slave public slash hello world role resources, and so it has a clean offer stream. Those resources are for it and it alone. This is actually also made possible by the use of multi-role. So you see, maybe you do, maybe you don't. The framework is registered with two roles: the hello world role, see? The hello world role and the slave public slash hello world role.
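The setting he's pointing at might look roughly like this in the service spec. I believe the field is `pre-reserved-role` in DC/OS Commons, but the surrounding keys here are a sketch from memory, not the actual demo YAML:

```yaml
pods:
  hello:
    count: 3
    pre-reserved-role: "slave_public"   # refine the statically reserved public-agent role
    tasks:
      server:
        goal: RUNNING
        cmd: "echo hello && sleep 1000"
        cpus: 0.5
        memory: 256
  world:
    count: 3                            # no pre-reserved-role: these pods land in the
    tasks:                              # default role, i.e. the normal private agent pool
      server:
        goal: RUNNING
        cmd: "echo world && sleep 1000"
        cpus: 0.5
        memory: 256
```

The scheduler then registers with both its own role and the refined slave_public path, which is the multi-role piece he describes next.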
So it's able to create reservations in both of those roles. So you need both multi-role and reservation refinement to make this work. One more thing I want to show is that I can install another one, and they both refine resources, and they don't conflict. I wanted to show that real quickly, that it behaves. So we can install another one with a different name, and it's gonna do the same thing. See, hello world two is staging there. So that's the new scheduler coming up, and you're gonna see that it's going to refine resources from slave public into hello world role two. And so now they each have clean, reserved resource streams through multi-role registration and reservation refinement, which is kind of neat. So the idea here is that, read slave public as storage. You could have pre-reserved a whole bunch of storage nodes, and then you could have installed Cassandra, HDFS, Elastic, and Kafka into that pre-reserved storage pool, and they would have all played nicely without the data center operator having to talk to the guy who's installing the stateful services. See, hello world two role. Okay, okay. So where are we gonna go from here? At the beginning I talked about these sort of axes that this framework, or this SDK, lives on. Right now we're very, very focused on long-lived services, but that's gonna move, right? We're gonna start to support sort of short-lived, job-oriented tasks. It's a very general-purpose framework right now, and I'd say it's gonna move to be more specific. So what do I mean by that? Right now it's like a really fancy deployment mechanism for somebody else's software, right? Like, I wanna deploy Cassandra. Somebody else wrote a distributed system already and I'm just deploying it. In the future we'd like the SDK to be able to be used to build new distributed systems, so it's not just deploying Kafka brokers or something. Right now it is 100% focused on stateful services.
It reserves every resource. It's very interested in volumes. It doesn't let you blow your foot off. If a task goes away, it doesn't automatically replace it and throw away your data. It's very focused on that. But I'd like it to move in a more stateless direction. So we'll have pods that are not solely focused on state but can be stateless. And by that I mean, if they crash, that's fine. We can launch them somewhere else. It's not obsessed with the idea that it has to come back in exactly the same place, with exactly the same resources, with exactly the same volume. Then there's a bunch of stuff to just make it more fun to use. This is something we're working on: better operations and tooling, debug and stop pods. So I keep talking about how this shared context is what makes stateful services so annoying to work with. By these debug pods, what I mean is that you should be able to say to one of those hello pods, or a Cassandra pod or something, put it in debug mode, which is gonna be the moral equivalent of: restart that pod, suppress the command, and make it sleep forever. And so you just have the container sitting there not doing anything. And it's like a VM. It's like a server. And you can go in there and modify the state and run interactive commands and fix it. So for example, HDFS name nodes. What do you do when your HDFS metadata is corrupted? You run this command, it's like name node recover, and it's interactive, and it asks you questions and you answer them. How do you do that in a containerized world? I think you do it like this: you stand up this debug pod and you start modifying the state. And then when you're done, you let it do its thing again. The programmatic extensibility that you saw in the Java is really focused on a couple of use cases, like APIs, or modifying the service spec at start time, or custom recovery, but, five minutes left, those aren't super easy to use and we'd like to make it better. Service composition.
Okay, so you have a bunch of all these services, right? You wrote Cassandra, you wrote Elastic. I don't know if you guys were here for the Esri talk. They have this big stack of like five different services. It would be nice to have Esri in a box. You push a button and it says, oh, I want one Elastic that looks like this, I want a Spark that looks like that, I want a Kafka that looks like this. Boom. And our development tooling could use some work. That's it. Does anybody have any questions? I think we have just five minutes left. I'm sorry. Elasticity meaning, so you want to, okay, so the question was, how do we expand Cassandra capacity in an online fashion, like add nodes while leaving the service uninterrupted? So the core mechanism of the SDK is that YAML, whatever you want to call it, that service spec: you compare the old configuration and the new configuration, right? So your old configuration said, I wanted three nodes of Cassandra. Your new configuration says you want six nodes of Cassandra. It is smart and goes, oh, you already got three. We'll leave those alone, and it adds three more with a readiness check, and says, oh, here's a new one. It joins the ring, waits for it to be up and normal. Adds the next one, joins the ring, waits for it to be up and normal. Does that answer your question? Oh, no, that's not a thing you need to do. We don't scale in right now. It's something that, I don't know, we might do someday. It's very dangerous to scale in, and the risk of data loss is very concerning to me. So if somebody had this service spec that had like six nodes in it, and then somebody screwed up and said, oh, now it has three nodes, what should I do? Should I knock off the last three nodes and throw that data away? I don't really know where that data is.
So if we were to ever do it, I would probably say that you'd have user input saying, yeah, I wanna go from six to three, and I want those three nodes to go away and no others, right? We have readiness checks. It watches what's going on. I'm sorry, go ahead. The question was, does the service support the capability of waiting for something to happen or be ready before you move on? Yes, there are health checks and readiness checks. A readiness check means run this thing until it passes, and then that means that that step of the deployment plan is finished and we move on to the next thing. Health checks are, oh, if something is sick, we should kill it, and it just runs on defined intervals, like run this every 30 seconds; if it fails, that's a problem. Cool, anybody else? Yes, sir. I'm sorry. Rescinding offers. Oh, yeah, sure. The question was, stateful services are very sensitive to maintenance, and do we support the rescinding of offers? Yes, we support rescinding of offers when they're in process by the scheduler. So if you have like a thousand offers come to you, we have an offer queue of a particular size. We drop some, and then while we're processing those offers, if a rescind request comes in, we say, oh, we removed that from the queue and we're not gonna process it anymore. But you're talking about taking running tasks down. We have not crossed that Rubicon. That is correct. Yes, I'm unaware of any framework that's taking full advantage of the Mesos maintenance primitives. It is unfortunate. Got time for one more question. Okay, anybody? I'll talk to you later. I know you. What's up? Not yet. The question is, do we support external disks for these data services? Not right now. It's all local persistent volumes. We're sort of waiting on CSI. Once CSI lands, we'll support that, and everything that CSI can do, this will be able to do. I think that's all the time I had. Thank you.