All right, good afternoon and welcome to our last MesosCon University session, actually the last one today and for this MesosCon. Ben is a software engineer at Mesosphere, and he's going to tell us what it takes to build your first stateful DC/OS service. In particular he'll be using the SDK, which is the underpinning for all the great services you heard about today in the keynote; the SMACK stack services are all basically based on the SDK, and today it's your turn to learn it and write your first stateful DC/OS service. Enjoy.

Okay, is this microphone working? Yes, sweet. So, I'm Ben Wood. I work on the SDK team. We keep saying "SDK"; I'll explain a bit more about what that actually means.

Cool, so first things first: you should go to the bit.ly link, bit.ly/first-stateful. Every single command you see up here is in there, with section titles and everything, so you'll be able to follow along without trying to hand-type incredibly arcane commands, as well as a bunch of YAML. After that: there are a lot of you, and we don't have clusters for every single one of you, so we're going to do pairs, or maybe triples. Basically, if you're sitting by yourself, you're doing it wrong; sit next to someone, and when we hand out clusters you'll pair up with somebody. You can definitely still follow along and type all the code on your own personal box.

You'll also want to do the docker pull of mesosphere/dcos-commons:latest. That's also in the gist, so if you went to the bit.ly link first you'll see it and can just copy and paste. I mention it now because our Docker image is a little bit big, but this is actually the image that all of our engineers and our CI system use for working with the SDK.

All right, so is everybody on board with the bit.ly link and the Docker pull? Good; vague acknowledgement, great. Our agenda: I'm going to go over an intro of what the SDK is, what the motivation behind it is, and how it goes about accomplishing what it sets out to do. Then we'll do a bit of developer setup, basically getting at least one laptop for every couple of people hooked up to a cluster, so you'll actually be able to deploy, do a little bit of debugging, and also just be set up in general to write the code on your own machine. Then we'll do a very simple hello world to introduce you to the development cycle with the SDK, and then we'll do memcached. Yes, I kept saying "stateful," he said it so many times, and memcached is not technically stateful; it is an ephemeral cache. I know, that's fair.
However: you would have a very unfortunate time trying to do what we're going to do with memcached on Marathon, and we will leverage things like persistent volumes a little bit. There's also some element of my colleagues and I having stolen all the low-hanging fruit for implementing stateful distributed services; we already did Kafka, Cassandra, CockroachDB, Elastic, and HDFS. I'll probably show a few of those at the end, in terms of what the real, productionized YAML and the little bit of Java look like.

So basically we'll do configuration templates and sidecar plans, which are two of the more useful features of the SDK for making a robust framework, and then we'll go over what you can't do, which is mostly going to be me telling you everything the SDK can do, plus a little bit of where we're going with it.

Okay, the intro. Neat. Who am I? I'm a software engineer at Mesosphere, I work on the SDK team, and there are about 12 of us. Gabriel Hartmann, who gave a talk earlier today, is sort of the original author and one of the tech leads; I'm the sort of co-tech lead. I focus a lot on process, on how the team can ship a really high-quality SDK, and a bit more on taking the SDK and building really robust data services with it. My background is basically real-time performance monitoring: I worked at a couple of companies where we sucked in a bunch of data from users around performance, like website performance, and provided value to Fortune 2000 companies, and in doing that I also did a lot of infrastructure automation.

And the SDK: what does "the SDK" mean? The SDK is fundamentally a GitHub project under the Mesosphere org called dcos-commons. Sometimes we'll call it the DC/OS SDK, but hey, the repo has the name it has.

So let's walk through a little bit of why the SDK exists. Distributed systems are hard. Mesos is a pretty good approach to abstracting the hardware of a distributed infrastructure. Who here has written a Mesos scheduler? It's really hard. An onboarding task at Mesosphere, if you're an engineer (we do not make the accountants do this), is to write a Mesos framework, and it's like, neat. The person who most recently joined our team
was like, "Oh yeah, I made this cool batch scheduler, it only took me a few hours, and now I'm going to try persistent volumes." And we all kind of just snickered and were like, okay. And then two days later Evan's like... and we're like, yeah, we did that already.

So Mesos is a fantastic hardware abstraction, but it is wily: you really have to deal with a lot of weird edge cases, and I think we've seen that there are only a few truly successful schedulers. What I think we've seen is people trying to do specific data services. ArangoDB has made a pretty good framework, but they have written tens of thousands of lines of C++ to do it, and no one should have to write more than zero lines of C++.

Additionally, the SDK essentially targets DC/OS, and DC/OS is a very powerful system. You can think of it as Mesos plus Marathon plus a bunch of really fantastic orchestration and service discovery, all these wonderful things. But it's also kind of hard to leverage: you can leverage it through Marathon, but if you want to do it yourself, it's kind of tricky.

So how can we try to solve this with an actual platform, with an SDK? You might say, well, Cassandra is pretty different from HDFS. Yeah, but if you squint at most distributed systems, if you sort of blur your vision, they start to look really similar. It's like 90 percent similarity. They all care about deployment; they care about how you recover when any given node type goes down; they care about upgrades; they need service discovery; you want performance metrics. There's all this stuff, and you can see that DC/OS and Mesos have the primitives for all of it, and we want to provide you a clean abstraction combining all of that together.

So what is the SDK? It is a declarative orchestration abstraction for Apache Mesos and DC/OS. Fundamentally, it is an Apache Mesos scheduler factory: you give it some input in the form of YAML and a bit of Java, and what you get on the other side is basically a jar, and that jar knows how to deploy your service very well.

In terms of where it lives: if you want to look at the docs, which are probably the best entry point, there's a link in the gist at the bit.ly address. There are docs upon docs upon docs of all the features; if we have a bit of time at the end I'll take us through them so you can pull out the key value.

So what do you get if you use the SDK? You might think, "Well, I can just write a Mesos scheduler; the thing I'm doing is pretty simple." Sure: there are a handful of very good Apache Mesos schedulers out there in the world, but Apple's never going to open source Jarvis (well, maybe never say never), and they're all very specific. Primarily, the two particularly good ones, Aurora and Marathon, are these sort of mono-schedulers.
And they're kind of hard to customize, hard to extend. I have had to look at the Marathon code base; it is a good, powerful piece of software, but so much Scala.

So we've built an incredibly good default scheduler. We've written six services on top of it that we're selling to folks to run their data solutions on, so we've gotten very good at building stateful data services with this SDK. It covers deployment, updates, and recovery; you can do powerful custom orchestration; and it also couples tightly with DC/OS in terms of security, networking, and service discovery. Fundamentally it would more or less work out of the box on top of raw Apache Mesos, except that raw Apache Mesos doesn't force you to have DNS. That's sort of the fundamental thing that's missing unless you have Mesos-DNS or similar; DC/OS has Spartan, the fancier DNS we recently added.

Additionally, the DC/OS CLI has a concept of modules, so every service gets an auto-generated operator module for interacting with it. And what is that module doing? It's basically just talking to a REST API that is served by the scheduler.

I talked about powerful custom orchestration: there's the plans logic we'll go into, but additionally you can extend scheduler behavior very deeply by writing a bit of Java. We have okay hooks for diving into the actual Java that interprets the declarative API and turns it into orchestration.

Additionally, there are 12 people who work on this full time. Every time something lands in Apache Mesos, as soon as DC/OS can supply it to us, boom, we implement it. We have a very tight relationship with Apache Mesos for obvious reasons: we work at Mesosphere, and we want all those new features. We're also pushing the other way; a lot of the things that have come out for Mesos and for DC/OS are pushed forward by the drive towards running these big, stateful, complicated systems on top of these platforms. So there's a virtuous cycle (words escape me).

Okay, so what is the SDK ultimately? Essentially the SDK has three core interfaces. There are two programming-time interfaces, the declarative API and the programmatic API, and then at runtime you have the REST API for interacting with the service. The declarative API is in YAML; the programmatic API is in Java.

So the declarative API: what does it look like? You have a service name, neat, and you then have pods and plans. Pods are what you are going to run; plans are how you are going to run it, when you're going to run it, and what it should do.

So, pods. A pod is a set of tasks. Translated into Mesos land, a pod is a Mesos executor, and each task in the pod is its own Mesos task rather than part of a Mesos task group. The reason for that is that Mesos task groups are atomic units: everything in a task group has to fail together. We don't want that if we're running your database; we want the database task to be separate from the maintenance tasks so that they can be turned on and off separately, fail separately, and so on.
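The slide with the example pods isn't captured in the transcript, so as a stand-in, a two-pod spec in the SDK's declarative YAML looks roughly like this. The pod and task names, commands, and resource numbers are illustrative rather than the actual slide:

```yaml
pods:
  pod-a:
    count: 1
    tasks:
      task-a-1:
        goal: RUNNING        # keep this alive; relaunch it if it ever exits
        cmd: "./run-server"
        cpus: 1.0
        memory: 1024
      task-a-2:
        goal: FINISHED       # run until it completes successfully, then leave it alone
        cmd: "./prepare-data"
        cpus: 0.5
        memory: 256
  pod-b:
    count: 5
    tasks:
      task-b-1:
        goal: RUNNING
        cmd: "./run-worker"
        cpus: 1.0
        memory: 1024
```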
So as you can see here, we basically have two different pods. You can define a count, and there are dozens and dozens of different parameters in the YAML, all documented in the documentation I linked you to. Essentially you can do all the obvious basics: you can define a command, you can set CPUs, you can set memory. It's not shown here, but we'll see it in the tutorial: you can set a Docker image. We don't do the standard thing where a lot of orchestration systems just run whatever the Docker image says it's supposed to run. We basically use Docker as a file format to bundle dependencies, and then we launch everything with the Universal Container Runtime; we don't touch Docker at all.

Okay, so that's the declarative API in terms of pods: again, this is "what do we want to run." Now, plans are pretty powerful, and they allow us to define deployment. There is a built-in default: if you define no plans, what happens when you install this? What's going to happen is that when our scheduler parses this, it just sees, "Oh, I want five of pod-b and one of pod-a." It's going to serially launch the one pod-a, but it's only going to turn on task-a-1, because that has a goal state of RUNNING; the naive default doesn't know what to do with a task whose goal state is FINISHED. And then for pod-b it's going to serially launch five of those. Obviously that is the trivial, naive case; for none of our real services is the naive case actually correct, but it's good to have a default.

So plans cover a number of things. They cover deployment; we also have a separate update plan; and we'll have, in the future, the ability to do relatively complex upgrade plans for things like database migrations, file-format conversion, stuff like that. Additionally, you can define something called sidecar plans. Sidecars are essentially auxiliary tasks within a pod. So if we look back here, task-a-2 would be, well, not necessarily a sidecar, but it has a goal state of FINISHED. There are two possible goal states: RUNNING and FINISHED. RUNNING means always make sure I am on; if I fail, bring me back. FINISHED means keep trying until I succeed and then stop; do not bring me back.

Now, there's a series of different interactions you can build on that. Obviously you can have a bootstrapping task: a great example is our HDFS framework, where we format the name nodes (those are essentially tasks with a FINISHED goal state), and then after that we actually start the name node server. Additionally you can do sidecars where you define a plan that an operator is going to run. We're going to go over one today: since we're doing memcached, as I said, one of the things we're going to do is define a sidecar that flushes the cache.
You can tell individual memcached instances, "Hey, flush your cache." We're going to define a sidecar that allows an operator to say, "Hey, flush the entire cache of everything," and it'll go through, either serially or in parallel, and flush the cache on every instance.

So plans are built up like this. Here, deploy is the plan. It has an overall strategy of serial, which means: for all of my phases, go through them serially. Within a phase (here we have a pod-a phase) you can again have a strategy; a phase operates against a pod. And then there are steps, which you don't actually see here but will see later: a step is essentially what should happen for an individual pod instance. You can have different configurations of this sort of thing. Our Elastic framework is actually a really good example of different deploy and update plans: for deployment, Elastic doesn't care, just throw all the nodes out there; but for updates you do care, for all sorts of reasons like availability, so it has a separate update plan that is much more careful about the sequencing of how it rolls out new binaries.

All right, the programmatic API. I'm not really going to show you any examples; if we have time at the end I'll go over a couple. Basically, if you look at anything but Elastic, they all have custom recovery, and I'll at least show you where to look so you can read the examples. Essentially, the scheduler can be extended by writing Java to add things like custom recovery. A great example is Cassandra. When you replace a Cassandra node, you need to somehow communicate to the ring that this node is going away. There's a distinction between restarting and replacing: restarting a node means cycle it in place; replacing means throw it away, unreserve the resources, and find somewhere new to put it. Replace is for "the rack broke" or "we're taking that rack away, so all of those instances need to go away." So for replace, you have to do a little bit of custom logic to check that, oh, this is a permanent replacement, okay, I need to go tell the ring, "Hey, this one's going away." Another example is our own Edge-LB framework, which does something similar. Oftentimes the things you do in Java are essentially either hacking around the fact that the SDK does not yet support something, or something that's on our roadmap. Or it might be as simple as "I need to iterate over a list," and neat, you can write some Java to do that. Essentially, the service spec is the in-Java representation of the YAML, and you can take that service spec and, say you have an env var that is a big list, add a port for every single entry in it, things like that.

The REST API: these are the endpoints you actually really care about at runtime. First, endpoints, which is all about telling you how to connect to your services; we'll see an example of that with memcached. Neat, I deployed a service, but I actually really care about talking to it, and the endpoints API is all about revealing to you, the user, how to actually talk to it. You can very carefully curate what does and doesn't get revealed; it's not just a barf of every port and IP address this service has ever used.
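The plan slide isn't reproduced either; as a rough sketch, a deploy plan with phases and steps looks something like the following. The phase names are made up, and the exact step syntax is an assumption based on the SDK's hello-world style examples rather than anything shown in the talk:

```yaml
plans:
  deploy:
    strategy: serial          # go through the phases one after another
    phases:
      pod-a-phase:
        strategy: serial      # handle pod-a instances one at a time
        pod: pod-a
        steps:
          - default: [[task-a-2], [task-a-1]]   # per instance: run the FINISHED task, then start the server
      pod-b-phase:
        strategy: parallel    # pod-b instances can all come up at once
        pod: pod-b
```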
Then there's plans, which is all about getting information about the individual plans: deploy and recovery, the two built-ins (deploy is how it brings stuff up; recovery is a reactive plan, basically looking for failed tasks), plus any sidecar plans. You can also perform a number of different operations on any given plan. And pod is just a way to look at pod and task state. You can perform operations on pods too. As I alluded to, you can restart a pod, which basically means restart it in place and bring up any tasks that should be running according to the current plans; and then there is replace, which means throw it all away, unreserve the resources, destroy the data, and move it somewhere else. You shouldn't do replace, ever... or often, at least; there are cases where you might want to. You should test replace when you write a framework, though. You really should; it is a great test of whether your recovery is actually good.

Okay, so how do SDK services actually run in DC/OS? What is the true output of the SDK? Fundamentally, what you care about is that you need to install it, so the truly, fully compiled output of the SDK is a DC/OS package. What is a DC/OS package? It's a set of instructions: some package metadata, some resources that the package will utilize (basically URIs), an options file that essentially defines all the parameters of the service, and a templated Marathon app. Cosmos is the package manager for DC/OS, and Cosmos knows how to translate those first three files and that Marathon app definition into a launched Marathon app. So fundamentally, what we spit out is a jar and a package definition, and when we hand that to Cosmos it ultimately launches a single instance of the scheduler as a Marathon app. Which is great: Marathon will make sure the scheduler sticks around. We operate under the idea that the scheduler doesn't need to be HA or anything, because your tasks keep living if the scheduler is dead for a tiny little bit, and Marathon will always bring it back. In terms of configuration updates and things like that, what it means is that Marathon is ultimately killing the scheduler and bringing it back up with a new configuration whenever you do any sort of configuration update. Upgrades are basically just configuration updates that change binaries, so it's the same deal.
You're doing a package update, and you're ultimately just restarting the Marathon app with a new configuration; it's going to see "oh, this URI changed" and go download it. So essentially: the user issues dcos package install; Cosmos sees that, renders the Marathon app, and launches it; Marathon receives that request and launches a single scheduler instance; the service scheduler then comes up, registers with Mesos as a framework, starts to receive offers, and puts tasks out. You can see in my terrible topology diagram what that means: the masters are off to one side, and the scheduler is coexisting with its tasks and its pods; each of those pods is a little executor that's running some number of tasks.

Okay, so does anybody have, like, three minutes of questions? Anybody have questions? No? All right, neat.

Okay, a note on polyglots: the SDK is complex. It contains Go for the CLI and our task bootstrap, Java for the core scheduler (about 55,000 lines of it), Python 3 for testing, and there's some bash in there. What that means is we're all very sad about it, but the good news is we recently added a Docker container so you have all of that bundled together. That is supposed to be the Docker container you are all pulling already.

So what are we going to have to do to get set up? We're basically going to download a template. I have gone through the labor of making a nice little GitHub repository where all we're going to need to do is touch the YAML. We are not going to touch the Java (we'll take a peek at it), and we're not going to have to go through the setup of bootstrapping the whole configuration and all that stuff. We're just going to touch the YAML; it'll speed things along nicely. We're going to do everything inside the Docker container. I'm going to show you some AWS credentials that are very temporary and will disappear after this. And then we'll do a DC/OS CLI setup. I forgot to take it off the slide, but we're not actually going to use the private key.

So, neat. This is all in the gist that I had you grab. Basically you just want to do a git clone of this repo, then change into that directory. (Yeah, I have my GitHub set up a little weird.) Does that make sense to people? You can also switch it to... here, I'll show the other way too. This is going to be a great test of our conference Wi-Fi; I'll throw the other one in the gist as well.

Okay. Everybody good? Got the repo? Did we change into the directory? If you didn't change into the directory, change into the directory. Neat. All right, so enter the Docker container. I mean, you can change the working directory and where I mounted it, feel free, but this command will get you in there with all the right things. It is also in the gist, and it uses the defaults of the repo. The YAML file is already there; we're basically just going to hit build once we're all set up.

All right, everybody in the Docker container that wants to be? Oh, it's still downloading. All right, neat. I was pretty sure this would happen; I'm very glad we have not optimized that image for size yet. All right, so let's just start going over it, and we'll check in in ten minutes on how we're doing on the Docker image. Raise your hand if you want the Docker image and you do not have it yet. Okay, we're at like 50 percent. All right, neat: if you have a phone you can tether with, it might be faster. Yeah, all right.
I love conference Wi-Fi. If I were better at my job, I would have told you all to download this sooner. Okay. (Thanks, Google, you're bad at emoji.) So let's talk about what we're going to see. Actually, I have so much time to kill, I'm just going to do it like this.

Okay, so this is the template. Essentially, if you ran the build script, it would build and output what we call a stub universe. The Universe is the package repository; there are a couple of different ways to generate one, and a stub universe is basically a JSON file that you can add to your cluster as a repository and then install from. So let's take a quick peek. Due to the wonders of Java, what we end up with is a very deeply nested svc.yml. As you can see, I'm calling it memcache already, because that's where we're headed and I didn't feel like changing it later. I'm setting here that the scheduler is going to run everything as the user nobody, and I'm basically just saying, neat, let's have a pod that says hello to us. So we basically have: here's our pod, hello. It has a single task called server; its goal state is RUNNING; its command is echo hello world; it's going to use 0.1 CPUs and 256 MB of RAM. The outcome of this is a task that is just going to loop constantly, because it's going to keep finishing, it's going to exit cleanly, and then our scheduler is going to identify, "No, that is bad, I want to keep you alive."

So I pop open... okay, unlike you suckers, I already have the Docker container. So now I'm in here, and as you may have noticed, we have this delightful thing where I just give you some AWS credentials. I'm kind of just going to go ahead: if you already have the Docker container, follow along; if you don't, I think by the time you get it you should be able to jump ahead. We're going to slow down as soon as we start actually interacting with the cluster. Actually, we're going to have to do the whole cluster provisioning thing, so we should get to that anyway.

Now for the fun part. So I have my own cluster. All of these clusters are DC/OS 1.10; are they Enterprise or open, Jörg? Oh, yes, Enterprise. So if we pop over here... I lost the gist... Jörg has very kindly spun up lots and lots of clusters for us. So many. As I said, do we have enough for everybody to have one, or are we doing pairs? Still pairs, okay. So I was going to point at people and tell you which one you are; let's see how this goes. Can you give me edit on this, Jörg, so I can... yeah, let's just do that. All right, so momentarily we will all be able to edit this, and then you will engage in a very fun distributed-systems problem, which is a bunch of human beings all trying to pick the same thing at the same time. There we go. All right. So basically find a buddy and go ahead and grab a cluster; just write your name down so we know it's taken, and once you've got it, look at me, like stare at me, so I know that you're done. Or raise your hand or something. Everybody should have edit access; you might just have to reload.

All right, so once you have your cluster, the command you want to run is dcos cluster setup, space, and then your cluster URL. First it's going to ask you a yes-or-no question, and you should blindly say yes; you're not signing a EULA or anything, it's just an SSH-style prompt. Then the user is bootstrapuser and the password is deleteme. That's just the default user that you get when you first install DC/OS Enterprise. All right.
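Reconstructed from that walkthrough, the template's svc.yml looks roughly like this; the field names follow the SDK's usual YAML keys, so treat the exact spelling as approximate:

```yaml
name: memcache          # already named for where we're headed
scheduler:
  user: nobody          # the scheduler runs everything as "nobody"
pods:
  hello:
    count: 1
    tasks:
      server:
        goal: RUNNING   # echo exits immediately, so the scheduler keeps relaunching it
        cmd: "echo 'hello world'"
        cpus: 0.1
        memory: 256
```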
Does anybody still not have the Docker image? Impressive; the same exact people. All right, we're making great progress with the internet.

Okay, so we all want to find, in our favorite text editor, the YAML file. You can see here it's very simple; it is basically just going to loop and say hello world. The first thing we're going to do, once we have everything set up and you've got your cluster, is run the build with the aws target. Basically, it's going to build, and it's then going to upload the built assets plus the stub universe to an S3 bucket. This is also going to take a while, so this is a great time if people have questions.

What was the error? Interesting. Yeah, there should be a very clear either access-denied or... okay, we'll see; that should work fine.

Yes, sir? Right. Yeah, so the question is basically: can you use it for things like jobs, where you scale out, then you finish, and you want it all to go away? Yeah. Someone's doing that right now on the team; we're working on that for Spark, essentially. It's a very common ask, and it also speaks to a general thing you see, like the Edge-LB work we've released: there's this concept of a meta-scheduler, where you want to deploy things that are themselves schedulers. That's what Spark is; the Spark dispatcher is a meta-scheduler that deploys Spark jobs that are themselves schedulers. We're working towards having sufficient support for doing that sort of "go do this, then finish, then go away" kind of thing.

Yes, sir? Right, Marathon is focused... sorry, the question is: in comparison to Marathon, what is the benefit of using the SDK? So, Marathon is quite good at stateless scheduling. It has some amount of support for persistence, but not a ton, and additionally it doesn't have particularly complex orchestration. A Marathon app is, as far as I know, basically a single image; a Marathon app would be like a single pod type, so if you wanted to copy what the SDK does, you would need to launch a bunch of Marathon apps. So that's one: the orchestration is quite simple. Two, our support for persistent resources is a good amount stronger. And additionally, when we place tasks, when we deploy pods, we do a full reservation of everything, which means that we can always restart again and again in the same spot and never have to compete with other users of the cluster, whereas Marathon doesn't have those guarantees.

Could it be the AWS... yeah, it could be the AWS credentials; when you accidentally put credentials on the internet, the hammering has gotten a lot faster. But we shall see. Gradle's very large. Okay, so let's just sort of go through the presentation; we can always come back.

I see, so the sort of question is whether "stateful" is associated with things like REX-Ray, and I guess you're specifically asking about the ability to move volumes between hosts? Yeah, so our view would be: in terms of state, we don't actually support external storage providers yet. Well, the SDK does not; DC/OS and Mesos do, the SDK doesn't.
So we focus on host-mounted persistent volumes via Mesos. For stateful services there are many reasons you want your tasks to keep landing in the same spot: you might have pinned them to very specific hosts that you want things to land on because those hosts have fat drives for Cassandra, that kind of thing, and many of the network-attached storage choices might not have the performance that you want for a data service. So in terms of "stateful," it's very much that tasks don't get to just flit around the cluster. Memcached is not a great example of that; it's a good example for this tutorial because it's very quick to throw together, but in the sort of Cassandra-and-HDFS world, you very much want things to keep landing in exactly the same spot.

[On a follow-up about a dead host:] You'd either bring the host back, or you would replace the pods that are sitting on that host, and they will get moved to a different host. No user intervention happens on its own; we do not touch data destructively unless someone says "touch this data destructively." And yeah, they would just be monitoring: they'd be monitoring Cassandra, and through that they would see that the node went away, and then they would be able to go and reschedule it somewhere else.

Okay, so hello world, a very simple example. What does the development cycle look like with the SDK? Essentially: we build, then we install, then we test whatever we wanted to test, then we do an uninstall. (These slides are old.) Basically, uninstall: what does it do? It puts the scheduler into an uninstall mode, it then proceeds to unreserve everything it had reserved, and then it walks away, leaving a very clean cluster that you can put other things on. And then you just loop through that again and again. Correct, yeah: if you're sharing a cluster, you could both run it, and then you'll sort of just compete for whose memcache runs or not.

Okay, so let's look at how we'll actually start to do memcached. I've built a Docker image that has memcached on it; it's not going to come up and try to run it on its own or anything like that. The simplest possible case of memcached is to say: yes, I want three of these, using this Docker image, and then I'm going to put the command in as memcached. It's going to launch three pods, each of them with a single task, it's going to reserve one CPU and a gig of memory for each one, and memcached is just going to turn on. Neat. It's the simplest, most basic possible example.

Okay, so: this was bad. This has created a toxic framework that is doing something very bad that Mesos has problems with, which is port reservations. Mesos has zero enforcement about ports; it is the honor system, and not a particularly honorable system. What we did silently here in the background by turning on memcached is that we stole port 11211. So, to be good citizens, we can come in here instead and actually leverage some nice features; obviously, say I want to stack a bunch of different memcached frameworks (I have a bunch of developers and I want to stack them all on the same cluster), and I don't really care about them overlapping because they're all different memcacheds. It's just some nice syntactic sugar.
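For reference, that first, port-stealing version of the spec is roughly the following; the Docker image name is a placeholder, since the talk doesn't name the actual image:

```yaml
name: memcache
scheduler:
  user: nobody
pods:
  cache:
    count: 3
    image: example/memcached:latest   # placeholder for the presenter's memcached image
    tasks:
      server:
        goal: RUNNING
        cmd: "memcached"              # listens on the default port 11211, which was never reserved
        cpus: 1.0
        memory: 1024
```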
Back to that syntactic sugar: it's basically, hey, give me a random port, put it in this env var, and then advertise it. "Advertise," as we'll see in a bit, basically just means that when someone calls the endpoints command, tell them that this endpoint is available. And then you can see we just modify the command to say: hey, you should listen on the Mesos container IP, which is the private IP of the agent (it's whatever IP you start the agent with), and you should use this port, the memcached port, which again is coming from that random port it was assigned. Pretty much always you're going to get 1025, because there are no other ports taken, but that's just a side effect of having your own little empty cluster.

So again, this sort of stresses: we make all of these reservations, but some of them are not enforced. A great example is when you have multiple tasks that each have their own different CPU and memory settings. Cool; they're just going to share, because everything in a pod shares an executor, and everything under an executor shares a cgroup. There's no enforcement there yet; there probably will be someday. Essentially, within a pod, the things that are shared are the network namespace and the cgroup, and the only real isolation is that the sandboxes and mounts are different. If you want a volume to be available to all tasks in a pod, you need to put it at the pod level; if you put it at the task level, it will not be visible to everybody. Same for the sandbox: if you do something in the sandbox of one task, you will not see it in the other task. Same for the environment: tasks inherit the environment of their executor, but they do not share environments with their task neighbors.

(Meanwhile, this is doing a go get of 50 kilobytes that takes a very long time on this network.)

Okay, so let's talk a bit about developing: task exec is your friend. task exec is basically a way to specify, with what is actually a regex, which task to hop into. "cache-0-server" will match, and if there's only a single task that matches, it'll hop into it; otherwise it'll tell you, here are all the tasks that match the regex you've given me. So: task exec, -it, /bin/bash. It's exactly the same as if you had SSH'd in, but the very cool thing is that it's exactly as if you were that container. task exec just plops a little container in at the same level of nesting, so you share the environment of whatever task you're hopping into and you can see everything it can see. Super useful. I used it while I was writing this to do some debugging around a shared volume, because I couldn't figure out why one task could see it and one task couldn't. I still don't really know, but I made it work, so I feel very accomplished.

And then, in terms of trying it out: the container also includes netcat, so once we get it running we'll be able to do some simple netcat commands against memcached. Part of the reason I picked memcached is that it's very easy to talk to with simple clients.
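The port-reserving version of the server task described above looks roughly like this; the ports keys (port: 0 for "pick one for me", env-key, advertise) follow the SDK's YAML conventions from memory, so treat the exact names as an assumption:

```yaml
tasks:
  server:
    goal: RUNNING
    # listen on the agent's private IP and on whatever port Mesos hands us
    cmd: "memcached -l $MESOS_CONTAINER_IP -p $MEMCACHED_PORT"
    cpus: 1.0
    memory: 1024
    ports:
      memcached:
        port: 0                  # 0 = dynamic: reserve some free port from the offer
        env-key: MEMCACHED_PORT  # expose the chosen port in the task environment
        advertise: true          # list this endpoint in the endpoints command output
```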
So, configuration templates. Obviously, when you have these complex services, pretty much all of them have large, unwieldy configuration files that you need to pipe things into. The way we handle that is that we essentially give you the ability to write Mustache templates that are then populated from the environment of the task, using our bootstrap utility. The bootstrap utility is available in all your frameworks; you do have to explicitly include it as a URI, at least for now, and then you run it whenever you want. You might not necessarily want to run it as the very first thing in your command; you might want to do some other setup, calculate some env vars that are then going to get templated into your template files; but obviously you're going to want to run it before you run your server task.

Bootstrap does a couple of other nice things. It'll allow you to wait for DNS resolution on certain things: both Mesos-DNS and Spartan, the slightly fancier DNS of DC/OS, are not instant; records have to propagate to some extent. A very common loop in our stuff is to basically sit there waiting: okay, wait a minute until DNS has resolved, because if this thing launches and tries to talk to its fellows before the record exists, it's just going to freak out. So wait until the DNS record actually exists before you proceed. What other fun things can it do? Oh yeah, it can also install certs; we default that to true. That isn't really applicable going forward in 1.10; we have a custom executor in 1.9, and that needs the certs for certain things.

Okay, so if we look at our configuration templates: you can see that here we are adding configs; we'll call this one memcached, and there's the template. That magic env var is just a convenient way to pick up the template's location; this actually got a little bit better in the next version of the SDK, where you won't have to remember to throw it in there. Now, you might notice: wait, why is there Mustache in my service YAML? The way this works is that the scheduler reads an essentially mustached YAML and uses its environment to populate it, and that environment is coming from the Marathon app, which is ultimately coming from the options that the user supplied when they installed the package. So you can see here: we're basically taking this very simple configuration file and throwing in the memory limit, the memcached port, and the Mesos container IP, and there's also "listen on localhost," so it's a little bit easier to talk to. You can see we're also adding the memory limit to the environment. And then we're doing bootstrap, then memcached, and I'm just catting memcached.conf onto the command, because in my container I couldn't figure out how to get memcached to read it from the right place. So we're just doing a cat.

Cool, let's see how we're doing over here. Yes, okay, so jump all the way back: this is a very simple hello world, so we can take a look at yours in a bit; I'm not sure why it didn't upload. Is anybody else able to do a build? Has anybody else's build worked?
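To recap the config-template wiring just described: the template and the task definition look roughly like the sketch below. The exact memcached flags, the MEMORY_LIMIT value, and the configs key names are approximations of what was on screen, not the repo's actual files.

```
# memcached.conf.mustache: rendered by bootstrap from the task's environment
-m {{MEMORY_LIMIT}}
-p {{MEMCACHED_PORT}}
-l {{MESOS_CONTAINER_IP}},127.0.0.1
```

And the task that renders and uses it (bootstrap itself has to be pulled in as a URI, as mentioned above):

```yaml
tasks:
  server:
    goal: RUNNING
    # render the template first, then hand the resulting flags to memcached
    cmd: "./bootstrap && memcached $(cat memcached.conf)"
    cpus: 1.0
    memory: 1024
    env:
      MEMORY_LIMIT: 512          # illustrative value; it just has to be in the task environment
    configs:
      memcached:
        template: memcached.conf.mustache
        dest: memcached.conf
```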
I would double-check your AWS configuration. Okay, neat. So, essentially, that dev loop again: you build; okay, neat, you've now built. What you wind up with at the end is this: right here we're basically adding what you just built as a stub universe, and then the package install just installs it with whatever the default options are. As you noticed, we aren't really templating very much in this. An actual productionized install would have a bunch of Mustache templating and a bunch of things in your Marathon environment and a bunch of things in your options file, but that is neither a compelling demo nor something that wouldn't scare you away, so it's not super great for today. We're going to do the very simple version where we're just touching our YAML, at least for today.

So I go ahead. I have noticed the "installing CLI subcommand" step is taking an atrociously long time... there we go, it worked, neat. So dcos memcache is just the CLI module for this framework, and we can do fun things with it. Cool; as you can see, the deploy plan that it calculated for itself is very simple: it has the hello pod, it has hello-0-server, and it is bringing it up and saying, yeah, it's complete. The fun part is that because this one, if you recall, is literally just an echo, it's just going to keep finishing, so you'll see here that our recovery plan is going to sit there constantly trying to restart the thing. If you look at the output of dcos task, you'll see it's running right now... well, staging. And if we go into what I would refer to as the Mesos framework-developer UI, or just the Mesos UI (because it is nice and speedy), you can see here we have a bunch of hello-0-server tasks, all going through to FINISHED. If we hop in here, we can look at their standard out, and you can see "hello world." Neat.

So again, that dev cycle: okay, we're going to move on to the next one. What do we want to do? We want to do dcos package uninstall memcache. This is going to, very annoyingly (sorry, I wrote this code), prompt you to type in the name, and it's also going to very scarily tell you that this is going to delete all of your persistent data, logs, configurations, database artifacts, etc. That's because that is exactly what uninstall does: it is going to go through and unreserve all your volumes, it's going to destroy your scheduler; it is a gnarly and dangerous process. Don't do an uninstall unless you don't want the data anymore. You can actually pass it a --yes and it will skip that warning.
So maybe don't tell everyone that you work with about that. So what happens? We'll actually see... there we go. You can actually see that hello-0-server is sitting here in a failed state, and the reason is just that the scheduler has just been rebooted and hasn't reconciled yet, hasn't acknowledged that task state.

So, the way uninstall works is that we restart the scheduler. Any new goal state for the scheduler is a restart: any configuration change, anything like that, we restart the scheduler with a new, sort of immutable goal state, and in the uninstall case that goal state is "get rid of everything and then die." So essentially what it does is: Cosmos restarts it in uninstall mode; it kills all of its tasks; it receives back all of the reservations that it had made; it unreserves all of those things; and then it deletes all of the bookkeeping that it keeps in ZooKeeper (we track everything in ZooKeeper). Then it basically says "I'm done" via its deploy plan, and Cosmos deletes it via Marathon. Essentially, Cosmos has to do that final step because you need user credentials for it, so Cosmos holds the user's credentials in limbo while it's waiting. Yeah, so there we go. And just to explain what happened there: if your scheduler is down and a task fails, you actually get this weird state where it shows as active but failed, and it's just because no scheduler has acknowledged, "Oh yes, I agree that task is dead now."

All right, so it's gone. We're going to go to our handy-dandy cheat sheet and copy in the super-basic memcached YAML, and do a fortunately much faster build. The network connection here is actually bizarrely asymmetric, and the uploads are like four times the bandwidth of the downloads, which I do not understand and never will. Does anybody have any questions about what we've covered so far? Yes, sir; which one are you? Thanks, Jörg. Anyone else? Technical difficulties? Let's do any technical difficulties. Otherwise... no? Okay, question. Yes.

So, to be very clear, the question was: can I run Kafka or Cassandra using this? Yes. The Kafka, Cassandra, Elastic, HDFS, Confluent Kafka, and DSE packages are this; they're YAML and a bit of Java, and that is all they are. Ah, I see; no, it is a multi-scheduler model: every service has its own scheduler. It is not a mono-scheduler; it is not Aurora or Marathon or the new one Uber is building. It is a different model, the model of multiple schedulers versus one. And yes, you can build all of them with the SDK, and we have.

Yeah, so the question was, when you have a... right. So a pod lines up with an executor, and the pod's tasks line up with individual Mesos tasks. When we deploy a pod, we statically reserve all of its resources. We look at the total sum. There's a concept of resource sets, where essentially some number of tasks under the pod can all use the same resource set; it means those tasks cannot run at the same time, but they don't each need their own set of resources. So we look at the sum total of the resource sets plus the individual resource allocations within the tasks, and we statically reserve that entire chunk, and we never touch it. We don't shrink it or grow it depending on what's running.
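A resource set in the YAML looks roughly like this; the key names (resource-sets, resource-set) are recalled from the SDK's spec format rather than shown in the talk, so treat them as an assumption. The point is that both tasks draw on one shared, statically reserved chunk, which is why they can never run at the same time:

```yaml
pods:
  node:
    count: 3
    resource-sets:
      sidecar-resources:        # one reservation shared by tasks that never run concurrently
        cpus: 0.5
        memory: 512
    tasks:
      backup:
        goal: FINISHED
        cmd: "./backup.sh"      # illustrative commands
        resource-set: sidecar-resources
      restore:
        goal: FINISHED
        cmd: "./restore.sh"
        resource-set: sidecar-resources
```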
They're all running if one of them dies No, I was I was gonna maybe hand it to someone to talk into but Yeah, sorry, so the question Yeah, so okay, so you have ten tasks in your pod all of them are running one of them dies There's no changes to see groups or anything right like it is all one big atomic unit as far as the C group is concerned Okay, so neat So again, I very conveniently can just copy this in basically going to remove the existing package repository add in the new stub universe and then install it with defaults over here All right, so our scheduler is coming up In terms of look at some fun scheduler logs So you'll see the scheduler Scheduler is brand new. It had no previous configs. So it's just gonna see its current config It's gonna say neat. There's no other configs to compare it to skipping kidding skipping config diff. There's no old target config and So this point is just waiting to start its API server It's already you can see here we acquire a zk lock just to make sure that there are no two schedulers running that think they are the same scheduler and then it is waiting to Have its API server up and running. No, it just got desynced Okay, let's walk through this. So I think it's definitely useful to sort of see like how does the scheduler Interact there we go. Okay So So you can see here. We basically we get a offer set, right? We then process those offers We essentially build a set of launch criteria for the cash-throw server You can see here We're then checking in evaluation stages. We're just seeing like okay. It has enough CPU. It has enough memory It has the disk we're looking for which isn't really any So there is the executor we have to give it 256 megs of like host disk or it doesn't work So we always end up doing that So you can see here what we essentially do So the astute and frequent mesos users may be familiar with the fact that there is no Feedback on operations with mesos except for one which is launch So what we do is we stack everything into a big stack, right? So you see here like these are not Sequential operations. These are all sent as a single set to mesos and we then get feedback on the launch Right. So like any one of these like if the reserve like if this fourth reserve failed We'll just get a failure on the launch and the way we process that is we can then just sort of walk away from those resources and It'll come back around get offered to us and we'll be like wait I don't care about these resources. I never used them. We'll throw them away And we will have proceeded with different offers to try to schedule Okay, so You can kind of see you kind of get a sense for it So you can see here Here's its final step, right so it's saying okay It's processing the deploy plan Has the candidate of cash to server It then looks at offers so it builds enough offer evaluation pipeline and then looks at the offers How to pass for all of them? It's going to do the five reservation option operations And then it's going to finally issue the launch so You can see here. We have all the cash servers running So let's go ahead and poke at it So I can do DCS task exec dash it catch zero server so you can see there it pops me right into the sandbox of that task and If I do Environment you can actually see so I'm sharing exactly the same environment And see there's my mesos container IP. There's my Memcache version framework name, etc You see framework hosts. So one thing I have a comment on so every task winds up with a Two DNS entries, right? 
There's a Mesos-DNS entry, and then the primary entry that we use is the Spartan entry, which looks like this: this task's DNS address would be cache-0-server.memcache.autoip.dcos.thisdcos.directory. Because long DNS names are fun; I guess that can be the only reason they made it that long, is my assumption.

Okay, neat. So if we netcat to localhost (because we haven't configured anything yet), see, there we're talking to memcached. Cool. So memcached is running, at least, but as I already talked about, we didn't actually reserve the port that we're using, so we're being very bad framework citizens. In lieu of doing that cycle yet another time, I'm going to kind of skip over this one and go right into adding a template. First, because uninstall does take a non-zero amount of time, we're going to kick off the uninstall so we will have a clean cluster, and copy this over.

So you can see here: bootstrap is already in the Marathon environment; that is just a URI to a pre-built version of the bootstrap utility. When we release the SDK (the current GA release is version 0.30.0), you actually see that here: in terms of the dependencies you'd be pulling in, it's just these three, and then we also release fixed versions of the custom executor, for 1.9 clusters, and the bootstrap utility, for all clusters. Those can be consumed, and you can see that in the resource.json right here: there's our executor and there's our bootstrap, and they're both version 0.30.0.

Okay. So again, just quickly: here's our config template. We're going to run bootstrap, and that will template everything into it; we obviously need the config template to do that, so we'll pop a new file in, copy this over, and do that fun thing where I kill time while it builds.

All right, do we have any questions about what we've gone over so far? Things you're wondering about, like "can the SDK solve this problem for me, or this other problem"? Okay, who wants to give an example of something they would be interested in building with the SDK? Okay... oh, yeah, sorry, what was that? Gotcha. Yeah, that would definitely be a good one. Yes, you can do that. You can do HDFS, and if you can do HDFS you can do just about anything, except for a relational database; we have not added the correct primitives for relational databases yet.

All right, so, throwing that in there: one of the scheduler's cool tricks is that it serves as a very simple artifact server. There is no built-in blob store for Mesos or DC/OS, so we need somewhere to stick those configuration templates, and the scheduler is that place. The dist files are where we put stuff that gets distributed with the jar; the scheduler comes up, one of its APIs is artifacts, and it then serves those out. See, cool: you can actually see the config templates that got shipped in there. Yeah, there we go; you can see here, here is the un-templated version. And then if we look at standard error, we can see the output of bootstrap. Bootstrap by default will print your environment, which is very, very helpful when you're debugging things. So we can see here that the dynamic port has been put in the environment via that environment variable, you have the memory limit there, and then you can actually see here that bootstrap sat there: by default, bootstrap will just try to resolve its own host
and then move on. So it was able to resolve it, and then it very nicely logs for you: hey, here's the template I took, here is the final output I wrote. And then we can see that bootstrap was successful, and here you can see the very verbose logging of memcached. And, sort of to prove that I'm not lying and that it's actually using the config, let's hop back into that same task. If I try localhost 11211: connection refused. Okay, what if I do localhost and the memcached port? Hey, what do you know. Cool. Okay, exit back out of there.

Okay, cool: sidecar plans. Like we talked about, sidecar plans are essentially a way to expose to operators ways to interact with the service while leveraging all that orchestration. So let's say, why use this over Marathon? Okay, say I wrote a Marathon app that is just a bunch of memcached servers. Cool. What if I want to flush the cache? Well, that's easy, you know: you just write a shell script that calls netcat and echoes the flush command. Okay, cool. How do I run that against all of them? Well, that's easy: you write another script that finds all of them and then issues that command against each... and it starts to get very complicated. Sidecars are a great way to basically have a plan that runs against some number of the pods, via additional little tasks that you've added into the pod definitions. You can do all sorts of things. What are good examples? In Cassandra we have backup and restore: you can issue a backup, and you can also restore from the backup, via sidecar plans, where it'll send the data up to Azure or S3. There are various other Cassandra sidecars, and I think we have some in Elastic, but I'm not sure.

So, neat. What is our goal? We want to flush each cache on each node. Flush each thing. Wow, what a great sentence. All right. So the first thing we're going to do is add an executor-level volume. Here's the thing: I have a dynamic port, very cool; where does that dynamic port wind up? It winds up in the environment of my task. Oh, but I need another task for flushing the cache, and that doesn't share the same environment. Well, shoot. How can I get around this? Well, I, as an SDK developer, might just go add pod-level ports, but in the interim what I can do is throw a volume on the pod. So, basically, an executor-level volume, just a tiny little one; I'll put it at the path "shared". This will allow me to share state between the sidecar task and the primary task.

So then we write it down. All right, let's write down the dynamic port: we're going to echo the value of the memcached port into shared/memcached-port. And then there's our sidecar task, which we give a goal state of FINISHED, meaning: hey, complete this and then you're done; run it until it completes, and after that, good. We're then going to echo flush_all into netcat (with -q 1, which just makes netcat finish instead of hanging around), pointed at localhost, and we'll just cat shared/memcached-port onto the end of that to supply the port. Putting it together, the pod plus the sidecar and its plans look roughly like the sketch below.
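This is a reconstruction of the walkthrough rather than the exact file from the repo: the image name is a placeholder, the netcat flags are approximate, and the plan and step key names follow the SDK's usual YAML format, so treat them as assumptions.

```yaml
pods:
  cache:
    count: 3
    image: example/memcached:latest       # placeholder image, as before
    uris:
      - "{{BOOTSTRAP_URI}}"               # the bootstrap utility is pulled in explicitly
    volume:                               # tiny pod-level volume, visible to every task in the pod
      path: shared
      type: ROOT
      size: 16
    tasks:
      server:
        goal: RUNNING
        # write the dynamic port where the sidecar can find it, render the config, then start memcached
        cmd: "echo $MEMCACHED_PORT > shared/memcached-port && ./bootstrap && memcached $(cat memcached.conf)"
        cpus: 1.0
        memory: 1024
        ports:
          memcached:
            port: 0
            env-key: MEMCACHED_PORT
            advertise: true
        configs:
          memcached:
            template: memcached.conf.mustache
            dest: memcached.conf
      flush-cache:
        goal: FINISHED                    # only runs when a plan asks for it; not restarted afterwards
        cmd: "echo flush_all | nc -q 1 localhost $(cat shared/memcached-port)"
        cpus: 0.1
        memory: 32

plans:
  deploy:                                 # once any plan is defined, deploy has to be spelled out too
    strategy: parallel
    phases:
      cache-deploy:
        strategy: parallel
        pod: cache
        steps:
          - default: [[server]]
  flush-all-serial:
    strategy: serial                      # flush one instance, wait, then the next
    phases:
      flush:
        strategy: serial
        pod: cache
        steps:
          - default: [[flush-cache]]
  flush-all-parallel:
    strategy: parallel                    # flush every instance at once
    phases:
      flush:
        strategy: parallel
        pod: cache
        steps:
          - default: [[flush-cache]]
```

An operator then kicks these off by starting the flush-all-serial or flush-all-parallel plan, as demonstrated a bit later in the session.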
You can see here that I'm using a very minimal amount of CPU and memory for that. Again, the total footprint of your executor is a tiny little bit of overhead for the executor itself, plus the total CPUs and memory of all of your tasks added together. That's why, if you have lots and lots of sidecar tasks, you want them to use a single resource set. A great example of that is Cassandra: Cassandra has something like ten sidecar tasks, and if they all had their own sufficient resources allocated to them, the footprint of Cassandra would be "here's Cassandra, and then here's the sidecars." Instead, with a single resource set, you're able to share resources between them, just never at the same time, for obvious reasons.

Okay, cool, so our overall plans. As soon as you add a single custom plan, you then have to write down your deploy plan too, or it gets sad, which is technically a bug; I filed the bug against that guy. It's because, once you have custom plans, you're never going to have a naive deploy; the naive deploy is never going to work. All right, so we'll write down flush-all-serial and flush-all-parallel, the only difference being a serial versus parallel strategy. Serial means: do this one, wait for it to finish, do the next one, wait for it to finish, and so on. Parallel means: do them all, all right now.

So let's go ahead, and if we pop over here... I forgot to say uninstall already, so kick off the build. Does anybody have any questions about sidecars or volumes? Yes, sir. Gotcha, so, right, the way we get around this is when we are evaluating... yeah, sorry, yes: the question is, you have some number of sidecar tasks, so how do you ensure that you can actually launch them where you're intending to launch them? The way we do that is, when we make those reservations, we make the reservation for the entire footprint of the pod. Meaning, if you had eight tasks and each one of those tasks requests one CPU and a gig of memory, the total footprint we are looking for in a single offer is going to be eight CPUs and eight gigs of memory. The reason for that is we want to make sure that we can super duper for sure run that sidecar when you ask us to. It would actually be kind of difficult to say, well, let's wait until we get an offer that's on exactly the right agent with exactly the right number of CPUs; no, we want to statically reserve the entire footprint of the pod when we launch it, so that at any time we can launch the tasks: the ones whose goal is a finished state reach their finished state and we stop trying to launch them, and the ones with running goals start running and we keep them running. Does that make sense? Any other questions? Okay, start this installing.
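While that installs, here is approximately what those two flush plans, plus the now-explicit deploy plan, could look like; the plan, phase, and task names follow the hypothetical sketch above rather than the literal demo YAML:

```yaml
plans:
  # Once you define any custom plan, you also spell out deploy explicitly.
  deploy:
    strategy: serial
    phases:
      server-deploy:
        strategy: parallel
        pod: memcache
        steps:
          - default: [[server]]
  flush-all-serial:
    strategy: serial          # one pod at a time, wait for each to finish
    phases:
      flush:
        strategy: serial
        pod: memcache
        steps:
          - default: [[flush-cache]]
  flush-all-parallel:
    strategy: parallel        # fire against every pod at once
    phases:
      flush:
        strategy: parallel
        pod: memcache
        steps:
          - default: [[flush-cache]]
```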
I can see that like, okay, it can see the shared volume But like why can't it see the file and I can compare the two different tasks and see like what are the permissions? What weird thing have I done and then via that I was able to correct the bizarre way that I was writing That file and we're actually gonna work on features that are a better version than having to edit your thing and Put in sleeps, okay, cool So you can see here. Let's actually And show deploy so you can see our deploy plan is complete. Very cool Hey, let's make sure nothing's in recovery Nope, no recoveries have been initiated ever Let's do memcache. It's like what was the name of that plan? Okay, cool flush all serial or flush all parallel plan start flush all Serial so you can see there. It's quite quick The first one already finished if we go in here Look at standard out. So That is what you get when you when you send flush all to memcache just returns. Okay. Very cool Very simple So let's use this a little bit. So See here. So here's the total set of commands So we have actions we can take on plans actions. We can take on pods There's some debugging stuff down here Updates or a feature in DCS Enterprise Edition only on 110 plus basically It's a way in your package metadata that you can codify an upgrade path Meaning like you can go from this version of the package to this other version of the package And it'll like super duper for sure work. No other paths being allowed outside of that mechanism All right, so Let's look at the pods So we have all these pods cool So if you look at let's Touch this bad boy. So cash to server, right? So I think someone had asked like what do you do when an agent dies? Okay, cool. Well, let's say it's catastrophic failure, right? Like this agent is just never coming back Then you would go in here You would say pod replace what that's going to do is As you can see so we Stopped so it killed catch zero server in the background the scheduler is It is unreserving all those resources and then it is rescheduling it right Fun thing about replace replace does not guarantee agent movement However, if that agent had actually been dead, right, like there's no way you would have landed on it again Because mazos is not offering it to anyone. So again, like replace is destructive, right? So replace is basically saying like every volume unreserve it destroy All right, it's not unreserved for the volume. It's just destroy. So and same deal replace Has it's much less destructive cousin restart So you see here just started it again I can do a restart Now if we look at our recovery plan We'll see that that's where those were showing up. Yeah, you can see there the restart goes through we can see it just started again Go in look we'll see it's Bootstrapped nicely Okay, what can't the SDK do? Yes, sir. Yeah, Carlos would be a great way to do it. Yeah No, nothing built it and I would say the The the SDK scheduler should not do that right like separation of concerns like you know use Jenkins use Chronos use Cron Like the scheduler are we are fairly opinionated our view is that its job is to be a very good maze of scheduler and Schedule tasks, right like that that is what it does. It is not an API server for Connecting all sorts of stuff like it is it is singular purpose and it is very good at the thing that it does So speaking of what what does it do right? So so what do we do today? 
So, speaking of what it does do: what do we do today? We have horizontal scale-out, meaning hey, I want you to add more instances of a pod, and vertical scaling, meaning I want to go up or down on CPUs and memory for my pods. If I increase CPU or memory, it's going to do a rolling restart, following whatever the update plan is, rolling out across the whole cluster.

Service discovery: you get automated DNS entries that are very predictable and reliable, and easy to consume within the framework for orchestration. Virtual networks: DC/OS supports virtual networks, basically via CNI, and we interface nicely with that; there's a built-in one in DC/OS you can try out, it's just called "dcos", the DC/OS overlay.

We have readiness checks and health checks. Readiness checks are "I am not ready until this check returns thumbs up"; health checks are "I am bad and should be killed when this health check is unhealthy a certain number of times." For pretty much anything productionized you definitely want a readiness check. For Cassandra, a great readiness check is: does nodetool think you're part of the ring? Health checks, on the other hand, can actually hurt more than they help sometimes. A great example is Elastic: we were finding the health checks were actually really annoying for Elastic, because you might have a bad GC, your GCs might over time line up, and then suddenly the node doesn't respond to the check, and then it starts killing nodes, and you're like, wait a second, no, it would have been fine, just leave it. So yeah, health checks can be kind of a double-edged sword.

You can do custom recovery: you can write some Java to have custom recovery around specific node types, things like that. I gave the example of Cassandra; CockroachDB, for theirs, has a bit of extra logic around whether you're the master or one of the replicas, and the same deal for initial start, which I think is a little bit different from a node being brought back.

Resource sets I talked about but didn't show an example of. Resource sets are the idea that you define a resource set at the pod level, and then tasks can share that resource set. They cannot use it at the same time, but it's essentially a way to have lots and lots of tasks with a relatively small resource footprint for a pod.

Operator-friendly tools: you saw the APIs around pod management and plan management.
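For illustration, readiness checks, health checks, and a shared resource set might be expressed something like this in the same hypothetical memcache spec; the check commands and exact field values are indicative, not the production YAML from our packages:

```yaml
pods:
  memcache:
    count: {{NODE_COUNT}}
    # One small pool of resources that sidecar tasks draw from (never at the
    # same time), instead of each sidecar carrying its own cpus/memory.
    resource-sets:
      sidecar-resources:
        cpus: 0.1
        memory: 32
    tasks:
      server:
        goal: RUNNING
        cpus: {{MEMCACHE_CPUS}}
        memory: {{MEMCACHE_MEM}}
        ports:
          memcache:
            port: 0
            env-key: MEMCACHE_PORT
        cmd: ./bootstrap && exec memcached -p $MEMCACHE_PORT
        # Not "ready" (and deploy won't advance) until this returns success.
        readiness-check:
          cmd: echo stats | nc -q 1 localhost $MEMCACHE_PORT
          interval: 5
          delay: 0
          timeout: 10
        # Killed and restarted if this fails too many times in a row.
        health-check:
          cmd: echo version | nc -q 1 localhost $MEMCACHE_PORT
          interval: 30
          grace-period: 30
          max-consecutive-failures: 3
          delay: 0
          timeout: 10
      flush-cache:
        goal: FINISHED
        resource-set: sidecar-resources   # shares the pool defined above
        cmd: echo flush_all | nc -q 1 localhost $(cat shared/memcache-port)
```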
And endpoints; oh, I should show endpoints. Let's see... so endpoints, right: if I want to actually connect and do something meaningful, how do I do that? ...Thank you; I'd like to think I would have eventually figured out the command, but I probably wouldn't have. So, endpoints: you just have one here, memcache, and you can see it provides all of the IP addresses and the DNS names. Oh, a feature that I didn't show: DC/OS has the concept of a VIP, a virtual IP; it's just a load-balanced IP address. You can use one of those. Obviously a lot of data services don't necessarily consume them, because for data services the clients are pretty smart: they want to know all the IP addresses of everyone they can talk to and then do their own client-side logic. But you can definitely use them if you want to. Cool.

Sidecars: the ability to define tasks that can then be run in their own little plans as additional maintenance or operational procedures. Placement constraints: we have full support for the Marathon-style placement constraints. hostname:UNIQUE is kind of the default one that we always have, because, hey, data services shouldn't land on the same host. You can do things like have it match a regex of hostnames: say you have specific storage instances you want your database to land on, you would just put in a regex that matches the right hostnames. You could always fall back to the brute-force regex of "this hostname or this one or this one."

Configuration templating, as you saw: the ability to template out configuration files, just like you would with Chef or Puppet or others. Rolling updates: any time you update the scheduler with a new configuration, it is going to diff it, see "hey, how do I move to that new state," and proceed in a safe manner to that new state, basically following whatever the update plan is, or, if there is no update plan, the deploy plan. Rolling upgrades, so that's binaries: same deal, where the diff is "hey, the binary changed, it's got a new URL it's supposed to download Java from," so let's roll out the new Java to Cassandra. We have support for GPUs, so if your cluster has GPUs, we can use them, yay.

Fine-grained plan control: you can define relatively complex plans, and there are APIs I didn't show for interacting with plans, both for getting yourself out of sticky situations, where you can force-complete steps, and for stopping and starting plans, things like that. There are two additional strategies I didn't show, plus you can write your own custom strategies: those are canary-serial and canary-parallel, and they basically mean do one, and then wait until I tell you to do the next one. That's a great candidate for something like a configuration update of Kafka: I should try one first; sure, I did it in staging, but let's try one in production before we march onward.
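Here is a sketch of what a placement constraint, a VIP, and a canary-style plan might look like; the exact placement syntax and the canary strategy keyword have shifted across SDK versions, so treat these values as indicative rather than exact:

```yaml
pods:
  memcache:
    count: {{NODE_COUNT}}
    # Marathon-style constraint: never co-locate two memcache pods on one host.
    # Something like 'hostname:LIKE:storage-node-.*' would pin pods to matching hosts.
    placement: 'hostname:UNIQUE'
    tasks:
      server:
        goal: RUNNING
        cpus: {{MEMCACHE_CPUS}}
        memory: {{MEMCACHE_MEM}}
        cmd: ./bootstrap && exec memcached -p $MEMCACHE_PORT
        ports:
          memcache:
            port: 0
            env-key: MEMCACHE_PORT
            # Optional load-balanced VIP on top of the per-task DNS names.
            vip:
              prefix: memcache
              port: 11211

plans:
  update:
    # "Do one, then wait for an operator to continue." The exact keyword has
    # varied across SDK versions (canary / serial-canary), so check your version.
    strategy: canary
    phases:
      rolling-update:
        strategy: canary
        pod: memcache
        steps:
          - default: [[server]]
```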
And then in EE we have deep support for secrets in DC/OS. We also have support for security: DC/OS Enterprise has strict mode, which basically enforces a bunch of ACLs in both Mesos land and DC/OS land. And then finally, in EE, we have automated TLS provisioning, so that's basically being able to say "I want a TLS certificate with the right names for my task," that kind of thing.

So what did I not talk about? There is no horizontal scale-in. Yeah, I know... yes, you guys never get to leave. All right, horizontal scale-in: it's tough. It's hard to scale down; for some services I don't even know what it means to say "scale down," and some are kind of easy. At a future date the SDK will build the primitives to make this possible, but it's not something we support today.

Racks: support for racks is in Mesos. I think support for racks in DC/OS is a PR that's about to close, and then we're going to add racks, which will be great. Graceful shutdown: we sort of have it done, but not in a way that people should use yet. Graceful shutdown means: send a signal that is agreed upon, then wait some amount of time, and then send the other signal that means "no, you actually die now." External volumes, so REX-Ray, CSI: we're going to sort of skip the REX-Rays of the world and go right to CSI, the Container Storage Interface. When that kind of stuff is done, and Portworx and others implement it, we will consume it to allow you to provision volumes on the fly for persistence and things like that. Stateless pods: that's the idea of having a pod that is a bit more ephemeral; it's kind of coupled in some ways to scale-in, but it's useful for things like analytics, stuff like that. We also don't integrate with the maintenance primitives of Mesos yet; I think no one really does, although supposedly one framework does, so maybe we'll be the second one.

So yeah, I think that's... oh yeah, the SDK team is real big. This is actually the order in which people joined the team; I haven't done that much of the SDK. Gabriel is right there if you want to also ask him questions after this; he is the first person on this list, just to put him on the spot. He also had a talk earlier today, if you want to watch the recording of that, and he's got a talk from last year too: his talk from last year was basically "here are the principles of what we're going to build," and this year it was "we built it." So yeah, I think that is it. Anybody have questions? Or do you just want to leave, because this was very boring?

Right, right, right. Yeah, so the question is: will the SDK at some point support some sort of format that would let you run it on vanilla Mesos? So, you get pretty close today; you could just do it on Marathon, right? What does Cosmos do? Cosmos basically just lets you define these options files that then get templated into a Marathon app. If you just hand-write the Marathon app that has, you know, the scheduler library, the part that is missing is DNS. You aren't guaranteed that with vanilla Mesos; vanilla Mesos doesn't come with a lot, it's a very good hardware abstraction, but it's kind of missing things like DNS.
DNS is the big one; CNI, maybe, but I don't think we would integrate with that today. Yeah, I mean, I would say if the community comes up with a solution, that would definitely be neat; it is not on our near-term roadmap, because everything runs on top of open DC/OS. So yeah, other questions?

Yes, sir. Yeah, rack awareness: so the question was, what do I mean by racks? With Cassandra, right, you don't want Cassandra nodes on the same rack, for sufficient availability, so you would want to be able to define what a rack means. Mesos has support for being able to say "this agent is on this rack"; that support just needs to exist in DC/OS, and then we'll consume it and obviously have some default constraints, like "don't put these on the same racks." Yeah, yeah, you can do it today: attributes are sort of the shim, and then Mesos has added racks as a first-class concept. Other folks? Okay, I release you to leave.