All right. Hello, everyone. My name's Ted Young. This is Caleb Miles. And we're going to tell you today about a research project we've been working on at Pivotal about running services on Cloud Foundry, specifically managing persistent data on Cloud Foundry and what would be involved in doing that. So to kick it off, let's talk about the current state of what you can run on Cloud Foundry. Currently, we only allow you to directly run what we call elastic compute nodes, basically stateless web processes. And the best way to codify what those are allowed to do was stated in the 12-factor web app manifesto that was released a while back. It describes a very high-level overview of the contract between the developer and the platform when you're running these stateless services. And after running stateless applications for a little while, we've become a little bit dissatisfied. It is kind of difficult to write very compelling data service products using the platform today. And a lot of that has to do with the fact that you have to manage the lifecycle of these data services outside of Cloud Foundry to service lifecycle requests. And you get a lot of complexity in your service brokers. There's also some inefficiency in having to segregate your services from your applications, so you leave some resources on the table, which no one wants to do. And you also have two ways of setting up access credentials, one in your service partition and one in your application partition, which also introduces a bunch of overhead that you may not want. So if we're going to extend our application model beyond just the 12-factor web app, how much of this needs to change? So this is the 12-factor app in its completeness. It's kind of a lot of things there. And at first, we were sort of like, wow, how much of this gets thrown out the window once you start running data services? And the answer, it turns out, is not very much. So these two points change.
And everything else pretty much runs the same way. Processes: before, when it was just stateless web apps, you could say you are just executing stateless processes that are indistinguishable. You can't tell them apart. And we're going to change that now. And we're going to say, for data apps, what you do is you expose persistent data via one or more addressable processes. So these processes are individually addressable now. And the other aspect that changes is disposability. So before, disposability was very vague. It just said these things should start fast and fail fast. And what we're saying with data applications is you now need to add the concept of an interrupt and a drain signal to manage shutdown and replication. So there's some nuance attached to these two points. So digging into the first point a little more deeply about processes, exposing persistent data via one or more addressable processes: each individual member can now be directly addressed. Whether you expose that direct address publicly or not is still up to you. You may only want these things to be individually addressable by, say, a service broker or some other client that's only running in the cloud, as opposed to externally outside the cloud. But nevertheless, any client that wants to talk to this data service is probably going to have an algorithm where it needs to know which member it's talking to. It can't just round robin between the individual members. Each member has a unique state, which we're going to persist in a volume. But otherwise, the members are homogeneous. So we're still talking about a homogeneous cluster. So for example, if you have an application that has, say, data nodes and then some kind of service discovery or cluster manager node, you would deploy that as two applications. So one application would be all your homogeneous data nodes. The other application would be the service broker or cluster manager application, which again would be homogeneous.
And this is where the individual addressability really comes into play, as stitching these clusters together via whatever orchestration is right for that particular data service. But you're still not saying, when we start this thing up, instance one, you get started with this set of configuration; instance two, you're being started with some completely different set of configuration. So we're not saying we're going all the way and supporting every possible way of deploying services. So in handling the disposability, really the ephemeral nature of resources in the cloud, we need to expose two kinds of signal handling: an interrupt and a drain. We're thinking that the interrupt is the cluster telling this data node instance, hey, you need to stop your work for a little while and inform anyone that's monitoring you that you're going to be gone for some short period of time. Maybe we're scheduling you on a different node. We may be rolling the underlying host for an OS update. We don't know. But we think you're going to come back pretty soon. We also have this idea of a drain, which again tells those people who are following you that this node is going to go offline, and you need to take some actions in terms of rebalancing the data that used to be on this node. So we know that we want to be designing distributed systems that have these self-healing, self-moderating processes. And we need to expose these lifecycle hooks to allow that mechanism to work. OK, so this is a sort of imaginary, hand-wavy example of what this would look like, a walkthrough using the CF CLI. So what we're showing here is the steps that would be necessary to create a Redis service that you could expose through the Service Broker API while running all the components on Cloud Foundry. So the first command you would say would be create a floating volume, and we'll get into it in a bit, what a floating volume is versus a fixed volume.
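As a concrete illustration, the two hooks might look something like this inside a data node process. This is a minimal sketch only: the talk doesn't specify a delivery mechanism, so the choice of SIGUSR1 for interrupt and SIGUSR2 for drain, and all the names here, are assumptions.

```python
import signal

class DataNode:
    """Tracks what a data-service member should do on shutdown signals.
    Signal numbers are assumed (SIGUSR1 = interrupt, SIGUSR2 = drain)."""

    def __init__(self):
        self.state = "serving"

    def on_interrupt(self, signum=None, frame=None):
        # Interrupt: pause work and tell monitors we'll be back shortly.
        # Data stays in place, so no rebalancing is needed.
        self.state = "paused"

    def on_drain(self, signum=None, frame=None):
        # Drain: we're leaving for an unknown amount of time, so hand
        # our data off to the rest of the cluster before going down.
        self.state = "draining"

    def install(self):
        # Wire the two hypothetical platform signals to the handlers.
        signal.signal(signal.SIGUSR1, self.on_interrupt)
        signal.signal(signal.SIGUSR2, self.on_drain)
```

What each handler actually does (pausing replication, streaming data to peers) is, as the talk says, entirely application-specific.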
You're going to say create a floating volume of Redis data. You're going to tell it how big each instance of this volume is going to be. And then you're also going to mention some reserved memory. So this is telling it, when it's thinking about scheduling these volumes: the applications you might be binding to these volumes might be a certain size now, but you want to make sure that there are more resources available later if you want to attach additional or larger processes. So you allocate some memory when you talk about your volumes. Then you would create your Redis data cluster. This rootfs directive is a hand-wavy way of saying, let's just use the default Docker Redis image. During our spike implementation, we actually got that working on Diego. And you say, mount volume, the name of your Redis data up there, which actually should be Redis data, not my Redis data. So that's saying, when you push that application, only schedule that application with the Redis data volume mounted to it. Then you would create a separate application called your Redis broker. We're again imagining that we've created a Docker image that contains everything needed to run this Redis broker. And then we hand-wave around whatever the command would be to allow that application access to the Redis cluster application with all the permissions needed to be a broker for that application. So that would allow it to see the addresses of the individual members of the Redis cluster. And it would allow it to scale up, scale down, have access to administrative ports, whatever things you need to actually be the broker. Once you set up this Redis broker, the system notices it's a broker. And you can then use create service to create a service instance off of that broker and then bind that service to an application. So that would be the way you would walk through this whole thing.
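Pulling those spoken steps together, the imaginary CLI session might read roughly like this. None of these verbs or flags exist in the real cf CLI; this is pseudocode for the hand-wavy walkthrough above, with invented names throughout.

```
# All commands and flags below are imaginary -- illustration only.
cf create-volume redis-data --type floating --size 1G --reserved-memory 256M
cf push redis-cluster --rootfs docker:///redis --mount-volume redis-data
cf push redis-broker --rootfs docker:///example/redis-broker
cf allow-broker-access redis-broker redis-cluster   # hand-waved permissions step
cf create-service redis shared my-redis             # broker is now registered
cf bind-service my-app my-redis
```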
So now you have an application using Redis, the Redis broker, and the Redis cluster, all of that running on one platform. So that's the goal here. Quickly, I want to point out, we were only focused in this research grant on looking at how volumes affect the scheduling. But you can see, once we've got that, there are these other questions we want to answer next, which is: how do things like network security, permissions, things like that, how are all those things going to work in this new world? And so we want to get this volume stuff out the door quickly, because we think that beta is what's going to allow us to address these other questions in a little bit more of a real-world scenario. So real quick, we're going to go over an example of how we implemented this in our spike. So this is just going to be a description of the components we ended up making on Diego and a walkthrough of an example scheduling, just so you can get a visceral feel for what this would look like. So real quick, if you don't know Diego, a couple of concepts you need to know about. A cell is an individual virtual machine running in the container farm. So this is a virtual machine that then runs Cloud Foundry applications in containers. An application is a set of instances that are running in containers across the cell cluster. The scheduler is a daemon that keeps the current state of the system in its head, along with the desired state of the system. And when it notices a discrepancy, it schedules work so that whatever isn't running starts running, and whatever shouldn't be running is stopped. So that's the scheduler. So to those concepts, we're going to add a couple of new concepts. We're going to have the concept of a volume, which is a resource that the scheduler knows how to schedule.
So individual hosts can offer resources via the volume manager, which is a cell component that abstracts away all the different platform-specific stuff around how those volumes are implemented, and only exposes them as a much simpler API that only contains the information the scheduler cares about. And when the scheduler schedules a volume, it thinks about it as a volume set. So when you are scheduling individual volumes, you're scheduling them against an application that is actually running many instances. So your Hadoop cluster is running, like, 50 instances. And so when you're scheduling the volume for that cluster, you're actually scheduling a set of 50 volumes. And it's important for the scheduler to think about that as a set, because it's those volumes that it needs to make fault tolerant relative to each other. You don't want it putting all 50 volumes on the same host or something like that. So it needs to know that these represent resources that need to be scheduled fault tolerant relative to each other. So in our experiment, we only really got to examine host-based storage. But we're thinking that there are really two broad categories of storage: storage that's in some way attached to the host in terms of its lifecycle or availability or replacement, and another type that can float between one or more hosts. This is your Ceph, your Gluster, your EMC storage appliance that is able to be attached to more than one host, maybe not at the same time. So we have these fixed volumes, which again are connected to the lifecycle of the underlying host. And we're imagining these fixed volumes would provide storage that's very local to the processes that are running. So we're imagining things like Cassandra might run on top of this, which needs a lot of throughput, and so we'd want to use local disks. And again, we have this floating type, which can float around the cluster.
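The fixed-versus-floating distinction could be sketched as the kind of answer the volume manager gives the scheduler. A minimal sketch in Python; the function and parameter names are assumptions, since the talk only says the volume manager exposes "a much simpler API":

```python
# Sketch of the volume manager's answer to "where can this volume go?"
# Names and shapes here are invented for illustration.

def visible_cells(volume_type, home_cell, all_cells):
    """Return the cells a volume of the given type can be mounted on."""
    if volume_type == "fixed":
        # Tied to one host's lifecycle: local disk, high throughput.
        return [home_cell]
    if volume_type == "floating":
        # Network-backed (NFS, Ceph, a storage appliance): offered on
        # every cell that could attach the backing store.
        return list(all_cells)
    raise ValueError("unknown volume type: %r" % volume_type)
```

This is exactly the picture on the next slide: floating disks appear on all the cells, fixed ones only where they were created.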
So we see that our floating disks appear on all the cells, because they could be attached to any of the cells, whereas the fixed ones are specific to the cell where they were created and actually scheduled. Cool. So again, let's drill into the scheduling a little bit. So we see that fixed volumes, as I was saying, are scheduled on the cells where they're going to be used, because they're very close to the actual processes that are going to be consuming them. So we see that the fixed volumes are scheduled on each individual cell. And like Ted was saying, it's very important from an availability and fault tolerance perspective to not land all these volumes on the same host. So the scheduler will try and spread them across your available hosts, so you have a highly available, fault-tolerant solution. Floating volumes, on the other hand, because they could be backed by some kind of network storage, can appear on any of the cells. So when the scheduler is going around looking at how to schedule things, scheduling the applications that are actually going to run on these cells, it has a much wider choice, because all the floating volumes are available on any of the cells. Cool. So now we're gonna walk through an example of scheduling, okay? So just to explain this diagram, which we're gonna walk through in several slides: you've got a cell cluster. So those squares are all the different cells where we could run containers. They're divided into two availability zones, availability zone one and two. You've got the scheduler, which has a square representing what it's currently thinking about scheduling. And then next to it you have the database, which shows the state of the system. So we've got some desired state and actual state. And in this first slide, you can see we've got some desired state which consists of an application set, so a grouping of six instances, and a corresponding set of fixed volumes. And we're not running any of this currently.
So the first thing that happens is the scheduler notices that it's not running stuff that you want. And so it loads it into its brain and tries to decide where it's going to place it. Placing the fixed volumes is the most important thing, because wherever those land, it can't move them later. So it thinks about those first. And so it places them equally across the two availability zones. And you notice it doesn't pile them up on any particular cell. Then it schedules the application that wants to run against those volumes. And this is easy. It has no choice about where it can put them. So it just runs them right where it placed the volumes. And now our actual state matches our desired state, and we're done. Then we come in and we change our desired state. So now we're desiring a new application that has two instances and wants to be attached to a floating volume. So the scheduler notices this, loads it into its head, and first it schedules the floating volume. So notice what happens here. When it schedules this, one instance becomes available on any node in the first availability zone, and the other instance becomes available on any node in the second availability zone. So this is the volume manager saying that the way these floating instances work under the hood is you can attach anywhere you like, but you can't attach processes to the same volume across availability zones. It won't let you do that. But once it's placed those, when it comes time to place the actual instances, it's got a lot of choice here, right? It could place them on a number of cells, and so it picks two free cells that aren't running anything else. And there you have it. Cool. So to quickly sum up, when it comes to operating this, there's a shared contract between the developer and the cloud operator, and we just want to quickly review what these new operational requirements are for these two groups. So for the developer, you need to be responsible for implementing drain and interrupt handlers.
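The placement pass just described can be mimicked with a toy round-robin: spread the fixed volumes evenly across availability zones without reusing a cell, then run each instance wherever its volume landed. This is a sketch only; Diego's real scheduling logic is more involved, and this data layout is invented for illustration.

```python
from itertools import cycle

def place_fixed_volumes(volume_count, zones):
    """zones maps zone name -> list of free cell ids.
    Returns a list of (zone, cell) placements, one per volume."""
    placements = []
    zone_iter = cycle(sorted(zones))                  # alternate between AZs
    next_cell = {z: iter(zones[z]) for z in zones}
    for _ in range(volume_count):
        zone = next(zone_iter)
        cell = next(next_cell[zone])                  # never pile up on a cell
        placements.append((zone, cell))
    return placements

def place_instances(volume_placements):
    """Fixed volumes can't move, so instances have no choice: they run
    right where their volumes were placed."""
    return [cell for _zone, cell in volume_placements]
```

With six volumes and two zones of three free cells each, this yields three volumes per zone, each on its own cell, matching the slide.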
So how your application drains or shuts itself down properly is entirely application-specific. So how Cassandra is gonna do this is different from Redis cluster, different from Hadoop. So whoever is building those packages has to figure out what drain and interrupt mean for that application. They have to understand the lifecycle policy. So the lifecycle policy is what dictates, for example, how many instances might be told to drain at the same time. So the developer needs to know what that is so they can ensure that their application is going to be tolerant to whatever the cloud operations people might be doing to it. So if you can say your application can handle having two instances draining at a time, then when cloud ops is rolling the cluster, they understand that they can shut off at most two cells at a time. So that lifecycle policy is a very important policy that binds the developers and the cloud ops people. And as a developer you also need to understand the differences in availability and performance between fixed and floating volumes. So fixed volumes are going to be much more performant, because they're local and they're not doing any kind of replication or backup for you. Floating volumes are going to be much more available, but they're going to have slower throughput, because they're going to be replicated or remote. Cloud ops, conversely, need to know about the lifecycle policy because they're going to use that as cover for any maintenance that they want to do. So anytime, as an operations person, you want to figure out how to migrate your cloud, this lifecycle policy is going to tell you what you can do without breaking your SLA with the services that are running on top of your cloud. Fixed volumes in particular interact with the host lifecycle. So when you're thinking about rolling your hosts, you have to think about how the fixed volumes have been allocated on top of those hosts.
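That "at most two cells at a time" arithmetic is exactly what the lifecycle policy buys cloud ops. Here's a small sketch of how an operator tool might batch a cluster roll against the app's declared drain tolerance; the policy shape and names are assumptions, not anything from the spike.

```python
def plan_roll(cell_instances, drain_tolerance):
    """Group cells into maintenance batches so that no batch drains
    more app instances than the app declared it can tolerate at once.
    cell_instances maps cell id -> number of app instances on it."""
    batches, current, current_load = [], [], 0
    for cell, count in sorted(cell_instances.items()):
        if count > drain_tolerance:
            raise ValueError("cell %s alone exceeds the policy" % cell)
        if current_load + count > drain_tolerance:
            # Close out this batch and start a new one.
            batches.append(current)
            current, current_load = [], 0
        current.append(cell)
        current_load += count
    if current:
        batches.append(current)
    return batches
```

So with one instance per cell and a declared tolerance of two draining instances, ops gets batches of two cells, exactly the example in the talk.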
And of course, over the course of doing this, you're going to encounter services that are not draining or interrupting properly; they're hanging. And so while we want to keep cloud ops and DevOps totally separate, cloud ops is going to be in a position to notice when things have gotten stuck and take the appropriate response to resolve the issue. And that's again where the lifecycle policy comes in, empowering these two groups of people to not have to have each other sitting at the same screen while, like, a cluster deploy is going on, or while a new version of the developer's software has to go out. So it's about the developers understanding what the constraints are on their use of the cluster, and about the operations team knowing what their freedoms are in terms of how they can maintain the cluster. And it's not mentioned on this slide, but one of the things that really helps is moving a lot of the policy decisions away from the cloud ops people and more to the CF administrators. Right now, if you're running services using BOSH, the BOSH operations are pretty tied to the cloud operations. And that means any policies around where you're running your services and how that's going to work, it's now cloud operations who has to think about that. And what we're hoping is, once all that stuff is on the other side of the fence running on our platform, you can use all of the administrative tools, like organizations and spaces, things like that, placement pools, in order to control policy. So that's just putting the three tasks back in the right place: developers caring about how their apps work, cloud ops caring about how the platform works, and administrators caring about policy. So that's pretty much it. Just to cover this once again, the high-level stuff that affects the interface of Cloud Foundry is the lifecycle policy, drain, and interrupt. Those are the new concepts. We're excited about this because our spike was not actually very hard to implement.
So we believe that it should be possible to roll these changes out fairly quickly, given a concerted effort. Another thing that we were happy about was the volume manager allowed us to keep the Diego platform decoupled from BOSH or the IaaS. We were a little concerned that once you start talking about managing volumes and stuff, you start having more sticky IaaS-specific stuff showing up in Diego, and that doesn't seem to be the case. Yeah, and that's again where it's very helpful to have the two distinct types of volume concepts, the fixed and the floating. Because you can have them both implemented in whatever way you as a cluster provider, like an infrastructure provider, think is the right way to do it. Right, and lastly, we think a beta version of this, say rolled out on top of Lattice, is very important, because once you've got volumes as a first-class resource, you can start working on the other parts of the problem: how do you run a service broker? How do you make the network secure? It's easier to think about these problems if you've already got a basic service up and running on top of Diego. And that's all we got, folks, so we're gonna open up to questions. See one right there? For the two signals. OK, so the question is, can we give some examples for the two signals, like why you'd get one versus the other and what you would do, OK? Sure. Again, so if you are an infrastructure operator, you might have, like, an OS update that fixes some critical kernel bug, and you need to roll that out to your cluster. So what you want this to be is minimally disruptive to everything that's running on top of the platform.
So in order to effect this update, you'll send interrupt signals to every instance, and that will allow the applications running on top of the platform to, well, in the usual case, phone home and say, I don't need to be monitored for the next five minutes or so, because I'm only gonna go down and come right back up, and my data's still gonna be here, because we're not gonna shuffle you across the cluster. Alternatively, the cluster may need to actually move an application instance off a cell that's overloaded or something. In that case, you'll wanna send a drain signal to the application instance, saying, you need to rebalance my data because I'm being rescheduled and I don't know how long it's going to take. So those are two concrete examples where an application instance will want to handle the two different signals differently. Now, your application may choose to handle both of them the same. It may choose to handle one and not the other. It's really dependent on the application, what kind of replication you're doing internally. I have two questions. One, just to understand, when you talk volumes, you mean block storage? Yeah, well, the way our spike work was implemented, we did implement these as block devices that would be exposed to the underlying container technology as a bind mount. And then you can put on it whatever you like. It depends on what you want. And I was wondering, in many cases your IaaS will have quite sophisticated services in terms of dealing with volumes, such as replication, storage and so forth. Do you think that changes to the CPI might be needed to take more advantage of them? So again, that depends on your cluster infrastructure operators. So if you're running your Diego on AWS, for example, you may want all your fixed volumes implemented as instance store for your VMs, and your floating volumes implemented as EBS or their elastic file system.
It all depends on what your infrastructure provides, or what your organization has to provide for storage. Hi, in the picture with all the geometric shapes, the floating volume, right, that is attached to all the cells. Does it mean all cells can simultaneously access the volume? In that case, what about access issues, for coherency or any locking issues? Right, so this slide implies that the kind of volume access being made available here is one where you could schedule it on any cell. So if this is EBS, that would mean it would pick a cell, and at that point, when it placed it, the EBS volume would actually mount on that cell, at which point it would no longer be available on other cells, because with EBS you can place it wherever you want, but once you've placed it, it doesn't move anymore. So if this was EBS, what you would see on this slide is the availability of that volume disappear from the other cells. So that would be the volume manager saying, oh, now that this is mounted here, you can't have it anywhere else. If this was NFS storage of some sort, where it actually, literally does allow you to access the same volume from multiple hosts, then you would see those other offerings remain up. So you could potentially attach two application instances to the same volume on different cells. Sorry, could you talk into the mic? Oh, no mic. Sorry. How do you make those volumes float? Is it through NFS, or do you use something else? So that, again, depends on what you, as a cluster infrastructure provider, want to provide. You could have your floating volumes implemented with NFS, you could have them implemented with Ceph, you could have them implemented with an EMC appliance. It's really up to you and your organization to decide what's the right fit, what's the right tool for you. Yeah, and the volume manager is the part of the cell that encapsulates all of that platform-specific stuff and hides it from the developers, who don't really care.
What about multiple volumes per application? For example, with Cassandra you usually want to have a separate commit log and data directory for performance reasons. So can a single cell or application have multiple volumes? Yes, yes, you absolutely can do that. For this research project, we didn't look at that, because as soon as you start talking about multiple resources, the constraint problem becomes more complicated. I think the interface you'd end up with at that point would be something more like Kubernetes pods, because you need to tell it, at the same time, here, I want all this stuff, so figure out all the constraints that need to be met. So that would change once you start adding multiple volumes. But there's no reason you couldn't, right? And so, yes, striping, all those things that Cassandra and HDFS want to do, you should be able to do with the system. Anyone else? [Inaudible question.] Is it pluggable, you're asking? Yes, so the idea here was the volume manager would be the pluggable component. We don't want the whole system to have plug-ins and lifecycle hooks and all this stuff everywhere, because that stuff is very dangerous and hard to pull out once you want to change how the system works. And so the volume manager represents the component that would encapsulate all the platform-specific stuff. So you could see that being where the various plug-ins would go for having a Ceph implementation, or AWS, NFS, whatever it is that you're going to offer. Sorry, I couldn't hear you. Could you shout it? So that's what we're talking about with the addressability. It's really up to your data application to figure out, given the ability to talk to any particular instance, how do you organize that? So whether you want to expose that via Consul or environment variables or just, like, a plain configuration file, it's really up to you. And we just offer these base primitives, and you can set them up in what makes sense for your application.
So one example here with the Redis cluster: you have a Redis broker, right? So what makes a Redis cluster a cluster is every Redis instance has an admin port. They're all started in clustered mode, but then you have another service know where they are, know about their addresses, and tell individual members, hey, join this cluster, here's another member or two in the cluster. And once you've linked them all up manually like that, then they form the cluster and they all talk to each other. Likewise, when an instance comes up, the Redis broker introduces it to the cluster. And then if an instance comes down, it tells the rest of the cluster, hey, that's gone. So you have the Redis broker managing all of that. And I'm getting a sign that we're out of time for questions. So if you have any further questions, please find us after this. We're happy to talk.