We can just let people come in as they need. So thanks for joining us. My name is Catherine Erickson, and I work on the product team at DataStax. And I'm Ravi Yadav; I work on platform partnerships at Mesosphere. Today we're gonna talk about why it's hard to build these production-quality, stateful services. We're gonna talk about what it took to build the one that we GA'd this week, the evolution of the framework, what worked and what didn't, the current state and where we are today, how a couple of our customers are using it, and then we're gonna jump into demos. I have some recorded demos, and I have a live cluster set up; we'll see how it goes. I know you're not here for a vendor talk, and I don't wanna give you a vendor talk, but we're talking about stateful services, so you need to understand what DataStax is, and as needed I'm going to explain how it communicates and what makes it a complex framework to build and maintain. So DataStax Enterprise is a peer-to-peer NoSQL database, with Cassandra at the core. On top of Cassandra, we add search, analytics, graph, and other enterprise features. It's often used for personalization apps and messaging apps; we see a lot of sensor data and Internet of Things applications, fraud detection, and often enough playlists as well. Come on in, guys. So let's dig into what it really takes to build these production-quality services. I'll turn that over to Ravi. Thank you. So, what does it really mean to be a production-grade stateful service? When you're trying to build a stateful service, you have to think about different things: how would the service deploy? How would it be maintained over time? How would you find hosts for all the tasks? How would you go from zero to one? For example, if you're trying to make a configuration change, how would the service behave across your entire cluster?
How would you do resource accounting for reserved resources or persistent volumes? Then you need to think about things like how you reliably recover data: when a task goes down, how do you put it back where it was? So the point I want to convey is that writing complex stateful services like DSE is hard. You have to think through all the challenges and caveats of how it behaves in a container orchestrator like DC/OS. To solve for that, the team at Mesosphere has built an SDK called dcos-commons, which provides shared design patterns for deployment, maintenance, and resource accounting, and solves most of these complex operations for you. So here is a quick example of how an update works with the SDK. It's based on a goal-oriented design, which basically means you think about two states: the current state and the target state you want to achieve. In this case, on the left, the CPU is two and the memory is four gigs, and what you want to get to is one CPU and eight gigs of memory. So you define the configuration change: you change the CPU from two to one and the memory from four to eight. Based on that, the scheduler that runs for DSE unreserves one CPU and reserves four gigs of memory in your cluster, and the default scheduler for DSE that runs on DC/OS identifies that and restarts the node with the new configuration. For DSE, we want to dive a little deeper into the terminology in DataStax Enterprise and Cassandra. At the simplest unit, you have a node: a single instance of DataStax Enterprise. You usually wouldn't run the database as a single instance; instead you'd run it as a cluster, and a cluster can consist of one or more physical or logical data centers. The nodes in a cluster communicate using the gossip protocol, and gossip is used to keep track of the cluster and the individual nodes' health.
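The goal-oriented update Ravi describes — diff the current state against the target state, then emit the reserve/unreserve operations needed to converge — can be sketched in a few lines. This is an illustrative toy, not the actual dcos-commons API; all of the names here are made up:

```python
# Toy sketch of the SDK's goal-oriented design: the scheduler compares the
# current resource reservations against the target configuration and produces
# the operations needed to converge. Not the real dcos-commons interfaces.

def plan_update(current, target):
    """Return the reserve/unreserve operations to move from current to target."""
    ops = []
    for resource in sorted(set(current) | set(target)):
        delta = target.get(resource, 0) - current.get(resource, 0)
        if delta > 0:
            ops.append(("reserve", resource, delta))     # need more of this
        elif delta < 0:
            ops.append(("unreserve", resource, -delta))  # give some back
    return ops

# The example from the talk: 2 CPU / 4 GB -> 1 CPU / 8 GB.
print(plan_update({"cpus": 2, "mem_gb": 4}, {"cpus": 1, "mem_gb": 8}))
# -> [('unreserve', 'cpus', 1), ('reserve', 'mem_gb', 4)]
```

The scheduler then restarts the node's task with the new reservations, which is the "restart with the new configuration" step from the slide.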
These nodes talk to each other every second or so using this gossip protocol. But it isn't one-to-one communication: nodes talk to their neighbors and share the information they've learned with the rest of the cluster, so over a very short period of time it propagates throughout the entire cluster. And if a node is in questionable health, we'll route requests to alternate nodes. To dig a little more, I wanted to show you the write path. When a write comes in to a Cassandra or DataStax node, it's written to a sequential commit log and also to an in-memory table. When that in-memory table fills up, it's flushed to disk. Then, on a periodic basis, compaction runs: those tables are read back into memory, sorted, and written back to disk in an immutable fashion. So that's the communication on a single node. But the other piece of the stateful app — and we're gonna refer back to Ravi's config slide a couple times in this presentation — is configuration. DataStax Enterprise, depending on whether you're using search, analytics, graph, and some other components, can have upwards of 30 configuration files. And within the framework, we needed to expose these in a way that customers could use any of the settings they frequently change. As an example, the cassandra.yaml file has upwards of 1,000 settings that could be changed, and each one of these takes about five to ten minutes to expose in the framework. Of course, you have dse.yaml as well. And we expect production changes to the configuration: depending on what season it is, what the shopping trends are, what's going on with your customers, you may change heap settings and other tunable settings in production environments. So we expect that. And to expose all of this and provide a seamless customer experience, we use the SDK from Mesosphere. So Ravi's gonna talk a little bit about that SDK.
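The write path just described — append to a sequential commit log, buffer in an in-memory table (the memtable), flush to an immutable sorted file when full — can be sketched in miniature. This is a toy model for illustration; real Cassandra is far more involved:

```python
# Toy model of the Cassandra write path: every write is appended to a
# sequential commit log for durability and buffered in a memtable; when the
# memtable fills up, it's flushed to disk as an immutable, sorted table.
# Illustrative only -- not Cassandra's actual implementation.

class ToyNode:
    def __init__(self, memtable_limit=3):
        self.commit_log = []      # sequential, append-only
        self.memtable = {}        # in-memory, latest value per key
        self.sstables = []        # immutable sorted tables on "disk"
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. append to commit log
        self.memtable[key] = value             # 2. update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. flush when full

    def flush(self):
        # Sort the memtable and write it out immutably; start a fresh memtable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # Check the memtable first, then newest-to-oldest flushed tables.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

node = ToyNode()
for i in range(4):
    node.write(f"user{i}", f"v{i}")
print(len(node.sstables), node.read("user0"))   # -> 1 v0
```

Compaction, mentioned above, would periodically merge those flushed tables back into fewer, larger immutable files; it's omitted here for brevity.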
So the SDK aims to address all the challenges of writing a framework that we talked about initially. It provides a set of design patterns and common implementations which can be used to deploy something like DSE on DC/OS. These are the different components that make up the SDK: it provides robust deployment and maintenance of services — things like installation, configuration, and software updates — and it achieves all of that through a declarative, or goal-oriented, design. So now we're gonna talk about the evolution of the framework: how DataStax and Mesosphere started working together, how the framework evolved over time, what we learned in building it, how we deployed it in different customer environments, and how we fixed issues in the framework to solve for the customer. Thanks. So when we take on technology partners, we want a low-touch interaction that satisfies all the customer requests, right? We thought we'd build a service that's easy for us to develop and easy for us to maintain. If you think about this in the context of a Docker image, the things that you would expose as environment variables are the things I wanted to expose within the DC/OS service. And so we did this pretty quickly and delivered it to the first customer, and they said: that's actually not what we want. We don't wanna customize something and we don't wanna extend what you've done. We want you to expose everything that's possible and then bring that back to us. So they're saying, deliver us a plane that we can fly, because we don't wanna build one. And that's how we got to version two, which required some really big changes for both companies. The first was that we needed a dedicated engineering team for this integration. So we have one full-time person dedicated specifically to developing and maintaining this framework, and we work hand-in-hand with Mesosphere.
For months we met every day, then went to a couple days a week, and now as needed. As our center of excellence has grown, we've reduced our reliance on the partner, but it took a lot of work up front. We've gone through five beta versions of this framework since March, and we really beta'd the heck out of this thing: use cases, scenarios, everything that we think customers are gonna wanna use it for. And the other one is a big one: the joint support agreements. You can think about this as three levels of support: support from Mesosphere for DC/OS, support from DataStax for DataStax Enterprise, and then the joint support for this framework. As a real example, we had a customer come to us with a support ticket and they didn't know who to submit it to. They were trying to deploy a cluster and hit an error in DC/OS — but remember, they see their DataStax logs in DC/OS now as well. It turned out to be a really simple DNS communication error, but because this integration is so tight, customers don't know who to call. So what we had to do is make it okay for them to call either team and establish a process so that we could work together to solve these tickets. By doing this, we're providing a service that really does add value on top of what our platform offers today. One of the big ones is configuration management and the ability to roll back to previous configurations. So think about a retail company: on Black Friday, they expect a burst. They need more CPU and more RAM per server, they have slightly different heap settings, and on that first Black Friday they work really hard to get these settings just right. They don't wanna go through that exercise again the next year. So what's great about this integration is that you can keep your Black Friday configuration, and the week of Thanksgiving you can roll that configuration out — you can go back to that previous known-good state of the configuration.
And when that burst is over, you can roll back to your normal running state. Then what comes with that is vertical and horizontal scaling. DataStax and Cassandra customers are very used to the horizontal scaling concept — the guarantee of linear scalability — but what helps these customers with bursts is that they can add not just nodes but also resources, as needed. And then there's the uniform deployment of these apps. We don't really think of ourselves as an emerging technology anymore; NoSQL is the standard, so our customers have a higher level of expectations from us at this point. They're building these SMACK stack architectures, and there's a lot of upfront work. Once they do that upfront work, they don't wanna do it every time: they wanna be able to deploy the apps, and the glue that connects them, in a uniform fashion. And that's what you get from the integration. All right, so let's talk about the current state. We GA'd the 2.0 release this week, and we have two services in the catalog: DataStax DSE, that's the database, and DataStax Ops, that's the OpsCenter management GUI. You need the database; OpsCenter is optional, but most of our customers use it, of course, to get better insight into the cluster, to manage repairs, to manage backups, and to see some deeper metrics on what's going on with the system. So at this point we have full platform support — things you would expect, like advanced replication or multi-DC. All of the platform features are there, plus the new features that Mesosphere has exposed in the last couple releases of DC/OS. So you have node placement and node task failure recovery: a node goes down, and it'll attempt a health check to bring it back up a couple of times. But of course you don't want nodes flapping, so it tries a couple of times and then leaves that node down for manual intervention. We have strict mode support that lets customers run on air-gapped networks.
Multi-tenancy, so you can have multiple DataStax clusters per DC/OS cluster. Pod replacement with local storage: should you lose a pod, we want to deploy a new pod in an automated fashion and reattach that storage. We heard a lot about CNI during the keynote; that's what gives you multiple network interfaces per DC/OS host, so you can have multiple DataStax nodes and clusters deployed more densely across a smaller set of hosts. And then monitoring, to be able to see metrics and logs within the DC/OS interface. All right. So, some quick notes on customer deployments — and I think we've seen a lot of these throughout the week. The first was an example of repeated deployments of a smaller unit of microservices as an application stack. They have a cloud dev and test environment, and they need to deploy that to their on-prem entities in a very repeatable fashion. They have small windows to get this deployed, and it's difficult for them to manage it remotely, so they need to be sure that what they do in dev can be easily deployed and easily managed in a very uniform way. The other one that we're seeing is for a platform as a service: enterprises that want to build their own cloud, that want to make their resources available more efficiently. They do this by pulling in these services, understanding how they can work together, and then deploying them for their internal customers in a consumable way. All right. So we're 15 minutes in and we're gonna jump into some demos. The first is just gonna be a simple installation of DataStax Enterprise on DC/OS. I have these recorded, but if we wanna see something more advanced, we have the cluster up and running as well. So I'll walk over here. We go to the catalog, we see the certified service, and we wanna configure it. So you can see the cluster name; you can decide if you want a search, analytics, or graph data center, or all three, and how many nodes.
And then this is where you can start to see your memory settings, your storage settings, whether you want root or mount volumes. And then, like we said, the cassandra.yaml file — all of the things that could be exposed — and dse.yaml, where you would expect to expose your more enterprise features like LDAP. And then OpsCenter: are you going to use OpsCenter? What's the OpsCenter interface and where does it live? You deploy the configuration, you go to the service, and then, like you saw in the keynote, you can see it come up. You can also see how many of these clusters are deployed that day. I'm running these on an internal environment, which is a little slower, so I do a little pause-and-play so that we don't have to wait for the cluster to come up; but on average, this cluster takes about five minutes to come up in our lab. All right. So now we have a DataStax cluster up and running — three nodes, in really just about ten minutes to go through the settings if you understand the system pretty well. You can also do that same deployment from the command line; it's just a one-line command to deploy as well. So the next video shows how to install OpsCenter using the same technique. We see that DataStax is running in the services; now we want to install OpsCenter, so we configure that. There are a lot fewer options here. One of the takeaways, though, is that with this GA release we started with a single OpsCenter node, which isn't how you're going to run this. So in the next point release you will see OpsCenter running as a cluster, just as DataStax Enterprise does. And then under services you can see these stage and become active. There's also a nice CLI command — I think it's dcos, your service name, plan deploy — that'll let you watch these come up on a more granular level. All right, so I run dcos and I see that I don't have my CLI yet. You have a CLI per service, so I need to install the CLI for DataStax Ops, because that's what I've just installed.
So now I'm running dcos datastax-ops endpoints, and what I want to see is the endpoints for the tasks within those containers. The one that I want to look at is OpsCenter itself — I want to see the URL for the OpsCenter endpoint. So I add OpsCenter to the endpoints command, and then I can see the URL that's available. Then I have to remember how to switch windows. There we go. If you're doing this on your own — and I have an exercise I'll show you after this — you will have to wait a minute for OpsCenter to come up and recognize the service, but it's pretty quick. And then you can see that you have a three-node cluster running in DC/OS, now viewable through OpsCenter. The only reason I pre-recorded these videos is in case I had a message pop up, I dropped my laptop, or I spilled something. This is pretty easy to do and pretty easy to record, and it'll be pretty easy for you to try out after this as well. Did anybody use the CLI in DC/OS 1.8? No? Okay, well, if you did, try it again, because it's vastly improved at this point. So this next video is gonna show you a little bit about the CLI — and next time I give a talk, I'll use larger text in my videos. The first thing you have to do is install the CLI. You go to that little arrow and grab a copy-paste; on my system I have to add sudo to the curl, so if you see a difference, that's why. Then you have to think: what was my password for the last six years? Once you get that, you're ready to go. At this point you might notice that I'm actually using a different cluster than in the previous demo. I was waiting to get 1.10 installed on my system, so I was using one of Mesosphere's clusters till mine was up. So now we're installing the CLI for OpsCenter just like we did before. I type faster when I'm not recording, I swear. All right, so I wanna look at the endpoints for DataStax Enterprise — and I've installed Graph and Spark and Solr, so I have quite a few.
But the native client is the actual DataStax nodes, and those are the ones I want the endpoints for. I wanna see the IP addresses, but I also need the name of the node that I wanna go to. So now I'm saying: give me a terminal on this node — which it does pretty easily now, and if you've used 1.8 before, that's pretty awesome. cqlsh is what you use to interact with a Cassandra or DataStax cluster, so I'm just cqlsh-ing into a node. At this point we're going to create a keyspace — a keyspace is just our concept of a database. I created the keyspace customer, and I'm just gonna paste in a create-table command, and then we're gonna load some data. These are just some simple insert statements — I'm gonna give all of these to you as well. And then, as easy as that, we can query this database. Cross your fingers — don't worry, it's a video. There we go. All right. So the hardest part of that was remembering to put a semicolon at the end of my commands. It's really simple to use. All right. So we stated that one of the values here was being able to add a node within the interface. I'm gonna show you that here, but then we're also gonna look at it live in a few minutes as well. All right. So we go to the DataStax service and we say that we wanna edit. If you're used to the old way of doing things, where you needed your JSON file, this is still super friendly — you have it all right here. But what I wanna do is add a node, and that's a capability that I've exposed within my service, so I know that's the "add a pod" setting. I go there and I'm gonna increment from three to four. I'm gonna review that — you can see all of the choices that you've made, now and previously — and then we're gonna run the service. What I should have done is run some nodetool commands to show you that the cluster's not going down during this process. DC/OS is saying: find me a set of resources that satisfies this request.
It bootstraps the node, which is a Cassandra concept, and then the gossip protocol discovers that node as it's bootstrapping, gives it a subset of the data, and distributes the data that was on three nodes across four. So I install the CLI again, because I assume I've uninstalled the package between the last demo and this one. Then I wanna get my endpoints. I'm getting a little nervous here... there we go, she pulled it off. And now we need to get the IPs again. We wanna look at the native client, and the goal here is to see four nodes instead of three. And there you have it: we have four nodes now instead of three. All right, so we're only at 25 minutes and we have 40, so I wanna show you a few things that we've done before we open up for questions. Oh wait, my video personality has other ideas. So we're gonna look at the OpsCenter endpoints, and we wanna see in OpsCenter that we've gone from three to four nodes, right? Because we have this management GUI and we've added a node, it should be apparent within OpsCenter as well. So I go through the steps again to get this IP address, because apparently I didn't write it down — oh yes I did, there it is. So I'm just gonna refresh. This does take a second, and I actually did pause this video a moment so that when I refreshed it would show four. But anytime you add nodes in a DataStax cluster, they will be discovered and made visible within OpsCenter as well. So what I've done is taken the steps that I did in this exercise and made them easily available and consumable. So — can you see this here? There we go, we're mirroring. All right. In my GitHub account, which is just 012345, you will see Mesoscon, and in Mesoscon you will see each of these videos that we've gone through on how to do the deploy. They're the exact things that you've seen. Oh gosh, now you see how many kid videos I look at. Wheels on the bus, guys. All right.
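Earlier in the demo, adding a fourth node redistributed data that previously lived on three nodes. The reason only a subset of the data moves can be sketched with a toy hash ring — illustrative only, since real Cassandra uses vnodes and the Murmur3 partitioner rather than one MD5 token per node:

```python
# Toy consistent-hash ring showing why bootstrapping a fourth node only moves
# the new node's share of the keys instead of reshuffling everything.
# Illustrative only: real Cassandra uses vnodes and Murmur3, not this scheme.

import hashlib

def token(value):
    # Stable hash -> position on the ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def owner(key, nodes):
    # The node with the smallest token >= the key's token owns the key,
    # wrapping around the ring. One token per node in this toy version.
    ring = sorted((token(n), n) for n in nodes)
    t = token(key)
    for node_token, node in ring:
        if node_token >= t:
            return node
    return ring[0][1]   # wrap around to the first node on the ring

keys = [f"user{i}" for i in range(1000)]
three = ["node-0", "node-1", "node-2"]
four = three + ["node-3"]

moved = sum(owner(k, three) != owner(k, four) for k in keys)
print(f"{moved} of {len(keys)} keys moved")   # only node-3's share moves
```

Every key that changes owner is now owned by the new node; everything else stays put, which is what lets the cluster keep serving reads and writes while the fourth node bootstraps.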
And then it's gonna take you through the exact steps that we went through here. I'm just so glad that wasn't full of Taylor Swift — that would have been worse. All right. I have a couple commands for setting up your display, and then this one goes even a little deeper: I have some short descriptions of each step that you're doing. You'll need to remember to change your IP address — I have a note in there, but if you're like me, you're not gonna read it in that much detail. So you create your keyspace, you add a table, we insert some data. And if you wanna stop there, that's what we did in this demo and you're good to go. If you wanna keep going, I have a primary-key exercise here that's based on Star Wars movies, I think — I'll switch it out to the Star Wars one, that's more fun. And then some hands-on with looking at Cassandra consistency and being able to trace your queries, and then a little more hands-on fun with Solr, to be able to create your core and do some Solr queries through CQL. And so this is pretty simple: if you deploy DC/OS now, you'll see the DataStax package under certified; you can very quickly deploy it, go through the steps in the video, and run through this tutorial on your own. All right. So one thing that we didn't show, that I've been asked about a lot, is not just adding a node but changing configurations. Is that something we wanna see in more detail? How to do configurations? Okay. All right. So we have our environment here, and let's see. That's the agent CPU — and we want the CPUs for the node. And so at this point we cross our fingers that my cluster has enough unallocated resources, because I'm saying that for every node in this cluster I wanna move from two to four CPUs. And we really will do this live, because I don't know what's provisioned in the cluster. All right. So we reviewed it and we'll run it, and you can see it go through.
And if the resources aren't there, we'll be able to see Mesos say resources unavailable — it's a very clear message — but it looks like the resources are there, and up and running it will be. So that's an oversimplification of what's really taken a lot of hard work by teams of people to build. I think for anybody taking on the framework now, it's not going to be as hard: we worked through a lot of issues, and we worked together. We're seven months into the SDK, which has matured a lot, and very quickly. So hopefully we worked through some things so that you won't have to. — Yeah, it looks like I didn't have the CPUs. — But we're happy to answer questions about how people are using this, what they're using it for, deeper questions about the framework and service that I didn't go into. We're happy to answer any of those. Good. So, point upgrades are pretty easy, right? Because we're not making major changes, and only rarely are there changes to the SSTables. But — was it 1.9 or 1.10 that upgrades were exposed in? 1.10? Yeah. So we've just released our first GA, right? So we haven't exposed an upgrade process yet. But it's there: you expose the steps of that upgrade process in DC/OS, and then it rolls through the cluster to enable that upgrade. One limitation right now: Cassandra has this concept of rack awareness, where if you have 100 nodes, you don't wanna upgrade one at a time — you wanna upgrade one rack of nodes at a time. And a rack guarantees you that the same replica isn't stored on two nodes in that rack. So if you have three racks and you can sustain the performance loss of taking down an entire rack, then that would be your upgrade unit — but more likely a subset of the nodes in the upgrade unit. Right now we do use placement constraints through DC/OS, and those will work with rack awareness in one of the next versions. Safe harbor. Yeah, it'll be through DC/OS, and we'll expose it.
Do you wanna talk any more, Keith, about how it's exposed — how we have hooks for a partner like DataStax? — So I'll repeat all this, because maybe it's recorded. This was released with 1.10. It's an enterprise feature of DC/OS, and we've added the plumbing so that a partner can do an upgrade. We have safety checks in place: on a package, you can define these are the versions I can upgrade from, or downgrade to — so we have support for downgrades as well. We have guardrails in there so you can make sure that a customer doesn't upgrade to an incompatible version. And we have other niceties that we've built in there, and we'll eventually plumb that through to our UI. And then, like you were saying, we don't have support for fault domains yet, or racks. We'll add that, and then we'll have more advanced strategies for upgrades, so that you can upgrade three nodes per rack, then go to the next set. — The other thing we get asked about a lot is the upgrade path of the two independent services. If you've used DataStax Enterprise, the way it previously worked was that our agents weren't backwards compatible with different versions of OpsCenter. But with a newer release of OpsCenter, all agents — which run on the DSE nodes — are backwards compatible. So as you upgrade OpsCenter, you aren't forced to upgrade the database at the same time. And if you think about your production use case, upgrading that management GUI for some new features and goodness is a pretty easy decision, while upgrading your database is something that you usually plan for in more detail. So that's something that's there and ready. Sure. Any more questions? Anything you wanna see? All right. Well, we appreciate you coming. If you have any questions, if you wanna know more about the framework, if you wanna dig in deeper, we're happy to hang around and answer any questions that you have. All right. Thanks, everyone. Thank you.
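As a closing aside, the rack-aware upgrade strategy discussed in the Q&A — roll through the cluster in batches that never span two racks, so at most one replica of any row is down at a time — can be sketched like this. The code is hypothetical, not the actual DC/OS upgrade plumbing:

```python
# Sketch of a rack-aware rolling upgrade: group nodes by rack and upgrade in
# batches that stay within a single rack, since a rack guarantees the same
# replica isn't stored on two of its nodes. Illustrative only.

from itertools import islice

def upgrade_batches(nodes_by_rack, batch_size):
    """Yield (rack, batch) pairs; each batch stays within one rack."""
    for rack in sorted(nodes_by_rack):
        nodes = iter(sorted(nodes_by_rack[rack]))
        while True:
            batch = list(islice(nodes, batch_size))
            if not batch:
                break
            yield rack, batch

cluster = {
    "rack-1": ["node-1", "node-4", "node-7"],
    "rack-2": ["node-2", "node-5", "node-8"],
    "rack-3": ["node-3", "node-6", "node-9"],
}

# "Upgrade three nodes per rack, then go to the next set" is batch_size=3;
# here batch_size=2 shows the batching more clearly.
for rack, batch in upgrade_batches(cluster, batch_size=2):
    print(rack, batch)
```

With batch_size equal to the rack size, the whole rack becomes the upgrade unit, which is the case where you can sustain the performance loss of taking down an entire rack.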