All right, good afternoon, everyone. Thanks for coming; I hope you all had a great lunch break. My name is Eric Malm, from Pivotal, and I'm the product manager for the CF Runtime Diego team. Today, I'd like to give you an overview of the new Diego runtime system.

So what is Diego? Diego is Cloud Foundry's new container runtime. As you may have noticed, if you've looked at it, it introduces a few new components. Well, maybe a few more than that. Actually, there are a lot more components. So I'd like to give you some sense of how we can make sense of this entire new system that's going to live at the heart of Cloud Foundry.

Even though there are a lot more components, they each have very specific responsibilities, and we can organize them into coherent subsystems. What we think of as the core of Diego is actually composed of a few different components collaborating, and they have a few dependencies. A lot of the other components we've introduced as part of Diego exist to translate the rest of the system to operate with this new core. For example, there's a whole set of services that collaborate with this Diego core and with Cloud Controller to make sure that CF can run its workloads on the new system. As part of that, there are a number of modules that know how to run very specific types of applications, buildpack apps, but also Docker images and Windows apps, and they integrate with that CC bridge to help us run those workloads. And then some of the other components we've introduced interact with and extend the notions of routing on the platform.

So let me give you a visual depiction of how these subsystems are connected in an existing Cloud Foundry deployment. You can see on the edge here that we've incorporated all of these subsystems with the existing ones we're used to from Cloud Foundry, in this case, the Cloud Controller and the Gorouters. I mentioned the Diego core: its responsibility is to take containerized workloads, determine how best to place them across a distributed system, and manage their lifecycle. To do that effectively, its components need a consistent distributed store to collaborate across. But the core of Diego itself is not responsible for creating these containers. Instead, it delegates that to its Garden dependency. Garden is the component that knows how to actually create a container and how to execute processes inside of it. That started off with Garden Linux, but it's also a point of extension in the system, so that we can support a notion of containerization for Windows applications as well. And the last of the dependencies the Diego core has is on Consul, which allows its own components to coordinate in sophisticated ways. All of this subsystem is operable entirely independently of the rest of Cloud Foundry.

These other layers then help us integrate it into existing Cloud Foundry deployments. I mentioned the CC bridge and these app lifecycle extension points. They cooperate with Cloud Controller to let it run its existing notions of buildpack-based applications, Ruby, Node, Java, but also to extend that to support workloads based on Docker images and on running .NET apps on Windows.
And then likewise, the route-emitter component we think of as a bridge to the existing Gorouter tier, allowing it to receive the same kinds of route registration messages it expected from the DEAs in the previous architecture, while still operating with this new backend. And this has also given us the opportunity to introduce entirely new routing layers to Cloud Foundry. One of those is SSH access to containers, which we just view as an SSH routing system.

Okay, so I'd like to give you a more detailed picture of that Diego core and how we arrange it in a typical deployment scenario. In that core, there are really three types of VMs. Down at the bottom, we have a large group of VMs that we call the cells in the Diego deployment. These are where we're actually going to create containers and execute workloads in them. On each cell there's a local Garden process running. That can be Garden Linux or Garden Windows, and we're all looking forward to the coming Garden-runC as the next generation of what Garden Linux means. But Garden itself, although it knows how to create containers and run processes in them, doesn't participate in this broader distributed system. So the other Diego component that lives on the cells is something we call the rep. It's responsible for broadcasting the presence of that cell to the rest of the system, receiving work from the system, reporting status back, and driving Garden to actually create the containers.

Okay, so each one of those cells is ready to receive some work, but we also need a way to specify that work to the system itself. That's where this top VM, the database, comes in. It hosts another Diego service that we call the BBS, or Bulletin Board System. This is what presents the public API for clients of the Diego core to specify the workloads they want to run in the cluster. The BBS also knows about the different lifecycle policies for those workloads and enforces them. We've also co-located the BBS on these database nodes with the etcd cluster that it uses for persistence.

Now, even though the BBS knows a lot about the lifecycle policy for these individual units of work, it is not itself responsible for making the placement decisions. Instead, it delegates that responsibility to a service we call the auctioneer. The auctioneer is responsible for communicating with all the cells registered in the deployment and making optimal placement decisions, telling them what work to do and in what order. And then last but not least, we have the converger. It's responsible for periodically analyzing the global state of the work running in the Diego core and taking any corrective actions. If there are missing instances, it will recreate them; it'll tell the BBS, you're missing some stuff, please make this. If there are extra instances, it'll tell the system to stop them. And if there is work that is ready to be rescheduled, it'll sweep through and tell the system to do that as well.
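To make the converger's job concrete, here's a schematic sketch in Go, my own illustration rather than Diego's actual converger code, of a single convergence pass, under the simplifying assumption that global state reduces to desired and actual instance counts per LRP:

```go
// A schematic convergence pass: compare desired vs. actual instances and
// compute the corrective actions to request. This illustrates the idea,
// not Diego's actual converger implementation.
package converge

// Snapshot is a simplified view of global state: instance counts per LRP guid.
type Snapshot struct {
	Desired map[string]int // how many instances each LRP should have
	Actual  map[string]int // how many instances are currently claimed/running
}

// Actions lists the corrections to apply.
type Actions struct {
	Start map[string]int // missing instances to recreate
	Stop  map[string]int // extra instances to stop
}

// Converge computes the drift between desired and actual state.
func Converge(s Snapshot) Actions {
	out := Actions{Start: map[string]int{}, Stop: map[string]int{}}
	for guid, want := range s.Desired {
		if have := s.Actual[guid]; have < want {
			out.Start[guid] = want - have // "you're missing some stuff, please make this"
		}
	}
	for guid, have := range s.Actual {
		if want := s.Desired[guid]; have > want {
			out.Stop[guid] = have - want // extra instances get stopped
		}
	}
	return out
}
```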
Okay, so I mentioned that the BBS is in charge of enforcing lifecycle policies for different types of workloads, and let me tell you about the two types we support in Diego.

The first of those is the notion of a long-running process, or LRP for short. These are the characteristics we desire for these kinds of processes. The first is that they should be continually running: we expect this workload always to be running when it's desired. So if it stops, even with a successful exit code, we consider it to have failed, and we reschedule it to restart it. We've also structured these LRPs to be scalable: you have one specification of the unit of work you want to run, but you can scale it up to an arbitrary number of instances. And in terms of its operation, Diego wants to make these instances as available as possible, so it may temporarily run more than the desired number of instances in order to have a seamless transition, say during an update to the system itself.

In contrast to that, the other type of work we schedule on Diego is the notion of a one-off task, and this is almost the exact opposite of a long-running process. Tasks are expected to terminate at some point, and the data we track around them records whether or not they terminated successfully. If they terminated successfully, they can convey a small result back to the client. They're inherently non-scalable: we intend them to be one-offs. And we're also more careful about trying to schedule them consistently, to make sure we're running them at most once. There are more opportunities for the system to say, I can't be sure that I scheduled this exactly once, so maybe I'll defer to scheduling it zero times and let the client deal with that.

These two types of work are abstractions of the kinds of work we already run in Cloud Foundry. The long-running processes are an abstraction of the notion of the app process that Cloud Controller wants to run: a buildpack app, but now also a Docker app or a Windows app. And the one-off tasks are a representation of the kinds of staging tasks that Cloud Controller often wants to run, as a compilation or vendoring step before running that buildpack app, or to extract some metadata from a Docker application. But as these concepts have solidified in the Diego runtime, we've found other use cases for them as well. In particular, these one-off tasks are very appropriate for running one-off jobs in the context of your own application. That's a new type of workload that the V3 version of the Cloud Controller API is exposing.
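As a rough illustration of what a task looks like to the BBS, here's a minimal sketch using the BBS Go models (code.cloudfoundry.org/bbs/models). The staging binary path and callback URL are hypothetical placeholders, not the exact values Cloud Controller uses:

```go
// A minimal, illustrative one-off task definition. The result file is how a
// successfully terminated task conveys a small result back to the client,
// and the completion callback is how the client learns the task is done.
package cfexamples

import "code.cloudfoundry.org/bbs/models"

func stagingTask() *models.TaskDefinition {
	return &models.TaskDefinition{
		RootFs:   "preloaded:cflinuxfs2", // same preloaded stack as buildpack apps
		MemoryMb: 1024,
		DiskMb:   2048,
		Action: models.WrapAction(&models.RunAction{
			Path: "/tmp/lifecycle/builder", // hypothetical staging command
			User: "vcap",
		}),
		ResultFile:            "/tmp/result.json", // small result, read on success
		CompletionCallbackUrl: "https://stager.internal.example/v1/staging/done", // hypothetical
	}
}
```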
Okay, so let me show you in a little more detail some of the contents of the long-running processes that arise as we try to run CF applications on Diego. Let's take the example of a buildpack app, say a Rails app. One of the things we need to know to run this type of work as a container on the backend is: what are the file system contents of this application? If you're a buildpack app, then you're based on the same common, consistent root file system as every other buildpack app. This is something Cloud Controller refers to as a stack, and it knows it simply as cflinuxfs2 for the Trusty-based stack. When we're running on Diego, we have the option to preload those root file systems onto some of the cells. So when we identify that root file system, we use this preloaded scheme to indicate that it should already be present on some of your cells, and they will know that. But that's not sufficient to run this particular type of application: we also need to load the contents of the application itself into this base root file system. For a buildpack app, that's the notion of a droplet. So for this buildpack app, the file system contents of the container are the union of those two sets of information.

But then we also need to know what to run in that file system context. In this case, for a Rails app, maybe we want to run Rails in server mode with the correct gem context. Finally, we need to put some limits on this to run it efficiently and effectively in a multi-tenant container environment. That's where we can specify that maybe this application should have no more than half a gigabyte of memory. In terms of its disk usage, we only need to account for that extra droplet layer, because we already have the preloaded file system available on the cells that can run it; so maybe we only need a quarter of a gig of disk space for the part that differs for this app. And then finally, maybe we want to run three instances of this.

Or we can use the same language to specify how to run a Docker image as a container on the platform. For example, let's say we want to run the library Redis image from Docker Hub. That tells us all the information we need for the contents of that container: the whole thing comes from Docker Hub. The action will be whatever the entry point specifies in that Docker image, which we can extract during a staging task before we run it as an app. And likewise, maybe we have different limits we want to place on this container when it's running. We might need less memory, because we know how much data we intend to store in this Redis instance, but it might need more disk, because we're taking the entire contents of that Docker image, and we don't know that it's shared with any other containers, the way we do for the preloaded root file systems. And then finally, maybe we only want to run one instance of this, because these Redis instances are not clustered together, and if we ran multiple copies and load-balanced across them, we would get inconsistent behavior.
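Pulling that together, here's a hedged sketch of roughly what such a long-running process looks like in the BBS Go models (code.cloudfoundry.org/bbs/models); the guid and exact start command are illustrative, not the precise payload Cloud Controller generates:

```go
// An illustrative DesiredLRP for the Rails app described above: a preloaded
// rootfs (plus a droplet layered in) for the filesystem, a run action,
// resource limits, and an instance count. Values are examples only.
package cfexamples

import "code.cloudfoundry.org/bbs/models"

func railsLRP() *models.DesiredLRP {
	return &models.DesiredLRP{
		ProcessGuid: "rails-app-guid",       // hypothetical identifier
		Domain:      "cf-apps",
		RootFs:      "preloaded:cflinuxfs2", // already present on (some) cells
		Instances:   3,                      // scale to three instances
		MemoryMb:    512,                    // no more than half a gigabyte of memory
		DiskMb:      256,                    // only the droplet layer counts against disk
		Ports:       []uint32{8080},
		Action: models.WrapAction(&models.RunAction{
			Path: "bundle",
			Args: []string{"exec", "rails", "server"}, // Rails in server mode
			User: "vcap",
		}),
	}
}
```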
Okay, so let's focus on this Rails-based app, and let me walk you through how Diego actually schedules it to run in the core. Here's a sample small deployment of Diego. We've got the BBS and the auctioneer, and then a few cells that already exist in the deployment. In particular, we have two Linux-based cells, running the Linux operating system and then Garden Linux on top of that, or maybe Garden-runC, and they're already running some existing work: a couple of containers on the first one and a single container on the second. Below that, we have a Windows VM running Garden Windows, and it too has some work it's performing.

Okay, so the first step is for Cloud Controller, through the bridge, to tell the BBS: I'd like to run this Ruby-based application, and give me three instances of it. The BBS records that in its database. It records the specification of that work entirely, but it also makes slots for each of those instances, index zero, one, and two, and it has them grayed out because it hasn't assigned them to anything yet; nothing in the system has claimed them as running. Its next step is to tell the auctioneer there's some outstanding work that needs to be placed somewhere and started. The auctioneer receives that request and batches up requests for new work to place; it could be in the middle of placing some existing work, and it's not going to trip over itself trying to place new work in the middle of that. Let's say this is the complete batch of work the auctioneer has been told to run. Here's how it places it.

Its first step is to contact all the cells registered in the deployment and get a snapshot of their current state. From there, it can make decisions about optimal placement and update that picture in real time as it makes its decisions. So first it's going to decide where to place the index zero instance of this new LRP. Looking across the capacity of those cells, it identifies the second cell as the best place to put that instance, because it's running the fewest containers and maybe also using the least amount of memory. Now it's going to place the index one instance, looking at its updated representation of what these cells are running. One of the constraints the auctioneer operates by is that it wants to spread instances of the same LRP out as widely as possible. So even though it looks like those two Linux cells are tied, it's going to prefer the first one for placement, because it's not running any other instances of this LRP. Now it's got one final instance to place, the index two instance. If we were going solely by anti-affinity, it would want to place it on that last cell, since it's not running any instances of this LRP. But that cell is ineligible to run this workload, because it's a Windows cell and not a Linux cell. So instead it decides to put that index two instance on the second cell as well.

Now that the auctioneer has made all of those decisions, it transmits them just to the cells that need to run new work, and those cells immediately reserve space for these individual instances. At that point the auctioneer is done, and the cells are in charge of managing the local work they've been assigned. Their next task is to communicate back to the BBS and double-check that nobody else has claimed that workload in the meantime. Let's say cell number two does that first, and it tells the BBS, I'm trying to claim these two units of work. Nobody else has, so the BBS says, great, you're good to go. At that point, the cell can start creating those containers and then, once they're created, run these processes inside them. Cell number one does the same thing, claims the index one instance, and starts it up. Let's say that cell starts its container really fast and detects that the process is healthy. It then reports back to the BBS: not only have I claimed this, but it's actively running. And the BBS updates its representation of that.

Okay, let's say cell number two does not have such a good time and ends up crashing that index two instance. It also reports that back to the BBS, to say this one crashed, sorry, and then it cleans up its container. If this is the first time that instance has crashed, the BBS's policy for scheduling LRPs says: all right, you get a free restart. We'll give you three free restarts, but if you keep crashing continually, then we'll back off from restarting you.
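Here's a toy version in Go of the placement preferences that walkthrough illustrates. This is my own simplification, not the auctioneer's actual scoring code, assuming cell state reduces to a platform, a container count, and per-LRP instance counts:

```go
// Toy placement: filter cells by platform eligibility, then prefer cells
// with the fewest instances of this LRP (anti-affinity), breaking ties by
// overall load. A simplification of the auctioneer's behavior, not its code.
package auction

import "sort"

type CellState struct {
	Guid        string
	Platform    string         // "linux" or "windows"
	Containers  int            // containers already running on this cell
	InstancesOf map[string]int // LRP guid -> instances already placed here
}

// PickCell chooses a cell for one instance of the given LRP, or nil if no
// registered cell is eligible (a placement error).
func PickCell(cells []CellState, lrpGuid, platform string) *CellState {
	var eligible []*CellState
	for i := range cells {
		if cells[i].Platform == platform { // a Windows cell can't take Linux work
			eligible = append(eligible, &cells[i])
		}
	}
	if len(eligible) == 0 {
		return nil
	}
	sort.Slice(eligible, func(i, j int) bool {
		a, b := eligible[i], eligible[j]
		if a.InstancesOf[lrpGuid] != b.InstancesOf[lrpGuid] {
			return a.InstancesOf[lrpGuid] < b.InstancesOf[lrpGuid] // spread instances out
		}
		return a.Containers < b.Containers // then prefer the emptiest cell
	})
	return eligible[0]
}
```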
Okay, so we've seen that many of these components, even in the core of Diego, are crucial for it to operate successfully: we can't get any work placed if we don't have the auctioneer available to make decisions about it. That means in any kind of real deployment, we want to run multiple copies of it. But we also need to recognize that it has this important global responsibility, and we wouldn't want multiple copies of it running simultaneously and independently, making decisions that conflict with each other.

So this is where these components coordinate with each other to make sure that even though we have multiple copies deployed, only one is active and the rest are dormant. They do that via our Consul dependency. These auctioneers contend over a particular key in Consul that represents a lock. They'll all try to grab it, but only one of them will succeed. That's where we use the consistency of the Raft-based key-value store that Consul presents. So now auctioneer number two has grabbed that lock, and it broadcasts to the rest of the system that any requests for placement should go to it. So when the BBS shows up with new work to place, it talks to the active one.

Well, let's say we're now updating the deployment. Maybe this is a controlled update: we're using BOSH to update all the software in the cluster. So we're gracefully shutting down the auctioneer, and as part of that, it releases its lock in Consul's key-value store. Now the standby auctioneer, which is continually trying to grab the lock, will succeed and become active. Then when BOSH brings back that second instance of the auctioneer, it'll come back in dormant mode. Okay, but let's say something more catastrophic happens and that first auctioneer just crashes. Eventually, within seconds, Consul will release that lock because nobody's maintaining it, and then that second auctioneer can come in, grab it, and become active within just a few seconds.

So we use this pattern to coordinate the auctioneers. It turns out that the BBS and the converger also have that kind of global responsibility, so they also hold locks represented in Consul. And some of the components in these bridge layers hold these locks as well, because they have similar global responsibilities. In general, if you want to know how to run an HA deployment of these services: you need at least two of each to be sure that one can fail over to the other appropriately, and if you're deploying by availability zone, we recommend at least one per availability zone. We also use Consul for other purposes: we advertise stateless services through Consul DNS, and the cells store their presence records there so the auctioneer can find them.
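The lock itself is just a contended key with a session behind it. As a minimal sketch, assuming the standard Consul Go client (github.com/hashicorp/consul/api) and an illustrative key path rather than the one Diego actually uses, the pattern looks roughly like this:

```go
// Lock-based leader election: every instance blocks trying to acquire the
// same Consul key; the winner becomes active, the rest wait as warm standbys.
// If the active holder crashes, its session expires and the lock is released.
package ha

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func RunWhenActive(work func()) error {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		return err
	}
	lock, err := client.LockKey("v1/locks/auctioneer") // illustrative key path
	if err != nil {
		return err
	}
	lostCh, err := lock.Lock(nil) // blocks until this process holds the lock
	if err != nil {
		return err
	}
	defer lock.Unlock() // a graceful shutdown releases the lock immediately

	log.Println("lock acquired; this instance is now active")
	go work()

	<-lostCh // lock lost (e.g. session expired): stop acting as the active one
	return nil
}
```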
Okay, so that's a detailed tour of the core of Diego. Let me give you some context on what some of these intermediate layers do as well: the CC bridge and the routing layers. If we look at the CC bridge, there are a lot of little components on there, but they each have very tightly focused responsibilities. For example, the Stager component is responsible for receiving staging requests from Cloud Controller and translating them into those Diego tasks to submit to the BBS. It also registers itself as the place to receive a callback when that task is done, so whether the task succeeds or fails, the Stager is notified and lets CC know about it. Likewise, when Cloud Controller wants to run an app on the Diego backend, it doesn't talk directly to the BBS; it talks to the nsync listener component, telling it to start the application, and that gets translated into a long-running process for Diego to manage. And when Cloud Controller wants to know how that process is doing, how its instances are running, it talks to the TPS listener, which asks the BBS about the LRP's state.

Okay, so there are also a couple of components on this bridge that have more global responsibilities. The most important one is what we call the nsync bulker. It periodically asks the BBS, what LRPs are you running on behalf of Cloud Controller? And it asks Cloud Controller, what apps should Diego be running on its backend? Then it reconciles the two, treating Cloud Controller as the source of truth: if there are any missing LRPs, it'll make them; if there are any extra ones, it will delete them; and if any should be scaled differently, it will rescale them. Because this is a global responsibility over a lot of data, and it could be making decisions that conflict with itself if it ran in parallel, it also holds one of these global locks managed through Consul. And then last, and maybe least, we have the TPS watcher. It has one job: it watches for instances that crash and translates those into crash events for Cloud Controller to report to end users. Because we don't want duplicate events, it too holds a lock managed in Consul.

Okay, so in the last few minutes, let me give you an overview of how we've integrated the routing tiers with this new Diego backend. We have the familiar Gorouters down here in the lower right, and then this component we call the route-emitter deployed alongside them. The route-emitter pays attention to all the work that's scheduled in the BBS. When we specify a new LRP, not only can we specify what should run in the container, we can also attach some networking information to it. In particular, we can tell Garden, I want to expose ports, in this case, say, 8080 and 9999. And as part of that specification, Cloud Controller or any other client can record that it wants to associate some of those ports with certain types of routers. In this case, for the Gorouter, we have a tag called cf-router, and it can specify the domain name that's associated with the app and the port that traffic should come into the container on.

So the route-emitter detects this new work as desired and starts building up its routing table in memory. At this point it knows that something is interested in routing foo.com requests to instances of this app, this LRP, but there aren't any instances yet. When one shows up, some additional information present on the running instance is the address of the cell that instance is running on and the port mapping: if something outside the cell wants to communicate with this container on container-side port 8080, it should instead talk to this other port that's exposed on the cell. The route-emitter gets that update too, and now it effectively does an inner join on the LRP's ID and that port to register the new endpoint under that domain. As it gets these updates, it sends that information out to all the Gorouters.

Well, this is now an extensible mechanism that supports other types of routing on the platform. If we want to do TCP routing, we can associate some TCP routing data with that 9999 port instead. And there's an analogous component in the routing release, called the tcp-emitter, that does the same kind of watching for desired routing data and instance backend information and makes those associations. In this case, it knows it wants to associate traffic destined for a particular routing group and a particular external port with this LRP's port 9999, and it can update those TCP routers dynamically with the backend information as those instances come online or die off.
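To make that concrete, here's a sketch of the routing metadata a client might attach to an LRP, in roughly the shape the route-emitter and tcp-emitter watch for; the router group GUID and external port are made-up values, and the exact field names are illustrative rather than authoritative:

```go
// Illustrative route data for one LRP: HTTP traffic for foo.com mapped to
// container port 8080 under the "cf-router" key, and a TCP route mapping an
// external port to container port 9999 under the "tcp-router" key.
package cfexamples

import "encoding/json"

func exampleRoutes() map[string]*json.RawMessage {
	httpRoutes := json.RawMessage(`[
		{"hostnames": ["foo.com"], "port": 8080}
	]`)
	tcpRoutes := json.RawMessage(`[
		{"router_group_guid": "example-router-group", "external_port": 60000, "container_port": 9999}
	]`)
	return map[string]*json.RawMessage{
		"cf-router":  &httpRoutes,
		"tcp-router": &tcpRoutes,
	}
}
```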
Okay, so that is a large part of the functionality of what's in Diego release and how it works. Let me tell you a little bit about the next steps we're planning to take. If you went to Jim and Luan's presentation this morning, they told you about what we've done in terms of performance analysis of Diego. We know that it runs great at 10,000 instances, and it will run adequately at, say, 50,000 instances. But what we really want to hit is 250,000 instances, or maybe more, maybe up to a million containers managed. In order to do that, one of the bottlenecks we've identified is etcd itself, in version 2. So we're almost done switching Diego over to run with a relational store, which is a much better fit for the kinds of data operations we need to perform on the workloads we're managing.

You've also heard a lot about the persistence effort. In fact, we saw a demo this morning of WordPress running scaled up on Cloud Foundry, which is pretty amazing. The persistence team is now working on ways to attach different types of storage, such as block storage volumes for individual instances. We also have a container-to-container networking team that's integrating with Diego to provide richer types of networking between containers, with policy to enforce. Those are all things happening on teams adjacent to Diego.

On the core Diego team itself, we're very interested in exploring notions of active rebalancing within the cluster, so we can make greater use of the available space on cells by making more sophisticated placement decisions. Right now, we give absolute primacy to running instances: we never move them, except when we're draining cells during, say, a BOSH lifecycle event. But in some cases, that can mean we temporarily end up with poor anti-affinity for instances, just based on the current capacity of the cells. If we can gradually start massaging those placements to make better long-term guarantees about the health of these instances, we're very interested in that. And then we're also committed to making Diego an operable system: improving the metrics coming out of it, and improving any other tooling that may be useful for operators as they interact with and inspect the system.

So we've got a couple of minutes left; I'd be happy to field any questions.

Yeah, so the question is, what are the release plans in terms of organizing these? We've intentionally avoided incorporating Diego release directly into CF release, and our manifest-generation scripts are the primary means we use to create a Diego deployment that integrates with a CF one. There's a broader plan to take CF release and break it into these kinds of subsystems: Cloud Controller, UAA, the routing releases. And the release integration team is working to produce a better system for manifest generation that will incorporate the existing CF release components as well as the Diego ones. So we'll have a unified manifest-generation system coming soon, but for now it's a separate deployment manifest.

Yeah, so right now we don't support any notion of cross-locality for that scheduling; all those scheduling decisions are independent. I think that's another interesting placement direction to go in, especially in the presence of this container networking policy, where maybe you know that two different LRPs should be communicating with each other.
So that's one direction we could take placement policy, depending on the use cases. Okay, well, I think we're about out of time, so thank you very much. If you're interested in more.