Hello, everyone. Welcome to the session about highly available Cloud Foundry on Kube. I'm Vlad Iovanov. I work as a technical lead for Cloud Foundry at SUSE, and I'm going to talk to you today about SUSE Cloud Application Platform: how we've containerized it, and how we're trying to make it highly available.

So what is it? It's a platform as a service that's based on Cloud Foundry. It runs applications and has a large number of components, just like Cloud Foundry does, and it's deployed on Kube using Helm. Probably everyone here knows a few of these components. CAP is made of the exact same stuff as Cloud Foundry; it's actually built from the same sources. So we have here the exact same pieces as Cloud Foundry, and this is a view that shows them grouped by their function. Every container in the system is represented by a green box. So again, it's the same Cloud Foundry that you know and love, just containerized.

And we can see that there are quite a few components. Containerizing them is only part of the story; that can happen automatically, and once you do that, it becomes easy. But actually running and orchestrating them can be difficult, and making that happen in a highly available fashion can be even more difficult.

On the next slide, you'll notice that some boxes have turned red. It's useful to think about the critical pieces of the system that you want to be highly available more than anything else. In this case, you can see that the elastic runtime is red and routing is red as well. These are the things that run your applications and route traffic to them, and this is very important. You want the system to degrade gracefully, but the applications, and of course the traffic flowing to them, should still be online. This is the most important property of Cloud Foundry, let's say: you want the applications to stay online. It's OK if you can't deploy applications for a few minutes, or for an hour, while you deal with the problem that occurred, but you want the applications to stay online. Being aware of these critical pieces is important because it drives us to make specific decisions about where we locate components. For example, we have special affinity and anti-affinity rules for the router and for the Diego cells that keep their replicas on separate nodes.

So how are we containerizing Cloud Foundry? We all know that Cloud Foundry is usually deployed using BOSH, and BOSH is a tool chain that allows deployment of highly complicated systems on top of VMs. But we wanted to do containers on top of Kubernetes, so we developed a tool called fissile to essentially convert BOSH releases into Docker images. So we're still using BOSH releases. BOSH takes you from your source code all the way to a deployment running in your environment, but a lot of BOSH is about how you package your sources: packages, jobs, stemcells, and so on. We use all of that release information, the specs for the packages and for the jobs, and how you compile things, to create the container images. And we actually build the stemcell just like you build a normal BOSH stemcell. At some point during the BOSH stemcell VM creation process, there's a split: on the one side, you convert the OS image into a BOSH stemcell that's specific to the CPI you're going to use, for Azure, AWS, and so on. For us, on the other hand, we turn that OS image into a Docker image, and that's the basis for all of the containers that make up our Cloud Application Platform.
So it's essentially the same; we just skip the CPI parts that turn each BOSH stemcell into an IaaS-specific VM image. And since we use the exact same sources, the capi release, the diego release, and so on, we believe that we can be a distribution certified by the Cloud Foundry Foundation.

OK, so going back a bit to high availability: we want to think about the mechanisms we need to make each component of Cloud Foundry highly available. And we have two flavors that we work with. There are things that can be load balanced, like the Cloud Controller, the routers, or the Diego cells. And then you have things that cluster, for example MariaDB, etcd, or Consul; these are a bit more special. The replicas that you instantiate need to be aware of each other. For example, replica number one of MySQL needs to have a specific address where you can reach it, and if it goes down and comes back up, it needs to come back up as MySQL number one, so that the other replicas in the network can identify it, and it can identify itself. For the load-balanced pieces, you can just add more routers, put a load balancer in front of them, and it's all OK. You don't need to be able to identify a specific router, and you can start them all up at the same time. So it's a bit easier to run these load-balanced components than the clustered ones.

Next to these two flavors, we also have components that follow an active-passive model. For example, I think the Diego database does this. You can run multiple replicas of it, but only one will be active at a time. They will connect to Consul and grab a lock from there; one of them will be elected as the master, and the other ones will be passive, meaning that whenever the master goes down, one of the passives will be promoted and become the active component. So we need to be able to support all of these flavors, all of these configurations.

And this is where Kubernetes comes in. We have various Kube primitives that allow us to configure a deployment of Cloud Foundry. We have the fissile tool that I talked about earlier, which turns BOSH releases into container images; it also creates Kube configs and Helm charts. Helm charts are basically templatized Kube configs that describe everything you need in order to stand Cloud Foundry up. So we have services, storage classes, deployments, stateful sets, probes of various kinds, and pods. I'm going to talk about a few of these in detail so you can understand how we use them and what they offer.

First, we have Kube Services. Services describe how we want to talk to a component. For each of those components that you saw at the beginning, we have Services that describe how you talk to them, and that includes the port and the protocol that you use. After you describe a Service and turn it on for a pod, you also get an address for that service; for example, you get consul.cf.svc.cluster.local. That's a well-known address for that service, in our case Consul, and for that port, 8500. And this will also load balance. So you create a Service for the Consul component, and what you get is an address, plus load balancing for it. We define multiple of these for each component, and for each port and protocol that we need in the system. So if you were to look at our Helm charts, you'll see a bunch of these Services pop up.
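To give you an idea of the shape of these, here is a minimal sketch of what one of the generated Services could look like. The name, namespace, and selector label here are illustrative, not the exact ones from our charts:

```yaml
# Minimal sketch of a generated Service for the Consul component.
# Metadata and the selector label are illustrative placeholders.
apiVersion: v1
kind: Service
metadata:
  name: consul
  namespace: cf
spec:
  selector:
    app: consul          # hypothetical label on the Consul pods
  ports:
    - name: consul-api
      protocol: TCP
      port: 8500         # the well-known Consul HTTP API port
```

With something like this in place, any pod in the cluster can reach Consul at consul.cf.svc.cluster.local:8500, and Kube spreads the connections across the Consul replicas.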
And you can imagine that the load balancing is very useful for our HA requirements.

Next, stateful sets. These are very important because they allow us to support clustered components like MariaDB. In a Kube Deployment, pod replicas have no real distinction; you can't tell one from the other. They get a random host name when they start up, and that's basically it, and you have no control over when they start. So if I want, say, 20 routers, they'll all start up, they'll be spread across Kubernetes nodes, and they'll eventually come up in whatever order. With a stateful set, you actually get an index. So all of those replicas there are distinct: they're the same, but they're different, and you can identify them by an index. Kube will also make sure that each index is started in a particular order. Say you have, again, three NATS replicas: NATS number one will not come up until NATS number zero is up and ready. (We're going to talk about what it means to be alive and ready in a second; that's very important.) So they don't all start up at once. Kubernetes will make sure that they start in an ordered fashion, so that you can do database migrations in index zero, and then when index one comes up, it can talk to index zero; it knows that it's there, and it can share information with it. So this enables us to deploy clusterable things like MariaDB.

Furthermore, with Deployments, when you create a service for pods, you get load balancing that's basically round robin across all the pods that you deployed, and you can't really target an individual pod by a well-known address. But stateful sets give you this capability. So here, if I have a stateful set for NATS, I actually have nats-1, with the index of the replica in the name, and I can be sure that it'll always target NATS number one. So individual members of a clusterable component can talk to each other by a well-known address.

Then we have probes. There are two types of probes that we use. There's a liveness probe that detects whether a container is running or not, and based on this, we restart. For example, if something bad has happened in the container, the liveness probe should be able to detect that, and then Kube will take action and restart the pod. And since we automatically generate these configurations, in almost all cases we use Monit to figure out if something is wrong inside the container. Monit has an API. Like I said, if you look inside one of our containers, it's basically the same as a BOSH VM: you'll find Monit there, you'll find the /var/vcap directories, and so on. And we use Monit to tell whether the container is OK and alive or not.

And then you have the readiness probes, and these are very cool. Basically, the readiness probe tells Kube whether a container is ready to accept traffic. This is exactly what we want for the active-passive model. If one of the containers grabs the lock from Consul and becomes the master, it'll tell Kubernetes via the readiness probe that it's ready to accept traffic. All the other ones won't be able to do that, and they won't show up as ready. And that's OK. I mean, if you do kubectl get pods for your namespace, you will see a bunch of pods that are not ready, but that's OK. That doesn't mean they're not healthy; they're just not ready to accept traffic.
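To make the stateful set and probe ideas concrete, here is a rough sketch of how a clustered component like NATS could be declared. The image name, the Monit check script, and the labels are placeholders, not the actual content of our charts:

```yaml
# Sketch of a StatefulSet for NATS; names and image are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nats
  namespace: cf
spec:
  serviceName: nats-set      # headless Service giving each pod its own DNS name
  replicas: 3                # nats-0, nats-1, nats-2, started strictly in order
  selector:
    matchLabels:
      app: nats
  template:
    metadata:
      labels:
        app: nats
    spec:
      containers:
        - name: nats
          image: registry.example.com/cf/nats:latest   # hypothetical image
          livenessProbe:
            exec:
              # Hypothetical wrapper that asks Monit whether all jobs in the
              # container are running; a failure makes Kube restart the pod.
              command: ["/opt/fissile/check-monit.sh"]
            initialDelaySeconds: 60
          readinessProbe:
            tcpSocket:
              port: 4222     # NATS client port; ready once it accepts connections
```

Each replica then has a stable, well-known address of the form nats-1.nats-set.cf.svc.cluster.local, which is what lets the members of a cluster find each other.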
And if you kill the one that actually is ready, you'll see that one of the other ones will grab the lock, become ready, and then Kube will start routing traffic to it. Of course, this is also important for clusterable components: you don't want them to accept traffic while data is still being migrated. So that's a very useful capability.

OK, so now that we know the primitives that we use in Kube to achieve high availability, I want to talk a bit about how we're exposing it with Helm. Helm is where everything comes together. In the end, even though all of this is possible, we also want to make it easy for the operator. When someone is managing their CF deployment, it should be easy for them to scale their cluster up and down, move from a basic deployment to an HA one, and so on. I'm not sure if you can read this, but you can see the count for NATS there, for example. For the operator, it's as easy as changing that value from one to three. What happens in the background is that the Helm templates pick up the change in value and change the replica count of the NATS stateful set, and a bunch of environment variables change as well. The operator is not aware of all this complexity; they just change one to three, and then a bunch of stuff happens in the background. Pods get restarted; components that need to be aware that NATS has gone from non-HA to HA get restarted. Essentially, the cluster transitions from a basic deployment to an HA one seamlessly. And this is all possible using these Helm templates.
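To give a feel for it, this is roughly the kind of change in the values file we are talking about; the exact key names in our charts may differ:

```yaml
# Sketch of the operator-facing sizing knobs in values.yaml.
# Key names are illustrative; going HA is just editing the counts.
sizing:
  nats:
    count: 3       # was 1; Helm re-renders the NATS stateful set with 3 replicas
  router:
    count: 2       # additional routers simply join the existing load balancing
```

The operator applies this with a plain helm upgrade, and Kubernetes reconciles the cluster toward the new replica counts.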
So what have we achieved so far? We can horizontally scale the critical pieces to make sure that user apps stay online and suffer no downtime. We can make all of the CAP components HA, so when you do rolling upgrades, all the components stay up and running and the service doesn't degrade. We talked about the fact that it would be OK if the service degraded as long as your applications stayed online, but we actually want to make sure that when you do an upgrade, there's no degradation in the service at all. And then, we want to survive the chaos monkey. A little disclaimer here: this is not the actual Netflix chaos monkey. We just wrote a simple script that picks a random pod from the deployment and kills it.

So I'm going to show a video here, because this has actually happened. This was recorded across three hours. This is a full deployment of the Cloud Application Platform (oh, you can't actually see this) running on Kubernetes, and it has an application deployed on it with four instances. We have this chaos monkey script that kills something every three minutes, and we have some scripts that constantly monitor the API of Cloud Foundry and the application that's deployed there, counting how many times they succeeded and how many times they failed. This was done over three hours and compressed to a few minutes. So hopefully you can read this. On the left here, you see the actual list of all the pods running in the system. You can see a few of the components that are active-passive: you have the Diego database there, where one of them is the master and one of them is passive, so it's not ready. And as we move forward, we're going to start seeing things being killed.

On the right, you see the application requests that have happened, how many of them were OK and how many of them failed, and the same for the API. I'm going to fast-forward through this a bit; the video does accelerate, as you'll see. Essentially, you'll see things popping in and out of existence as they get killed. What we want here is to see no application requests failing and no API requests failing. We ran this for about three hours, and you'll see that we killed essentially everything; there are no special pieces here. We kill MySQL, we kill NATS, we kill the cells, the API, all of these things. And if you look at the pod list, you can see which ones are deployed as stateful sets, because they have an index (0, 1, 2), versus the ones that are deployed as a Kube Deployment, which just have random strings in their names.

In total, over these three hours, we had 54 killings of various components in the system. There was 99.5% application availability and 99.8% availability for the API. That's actually the other way around from what we want: we want more app availability than API availability. Like I said, we don't really care that much if you lose the ability to deploy a new application for a few minutes, but you do care if the application is down. So we still have some work to do, but I think we'll get there.

So in conclusion, I'd like to say that Kube is an awesome host for Cloud Foundry. It can make it run in a highly available fashion, it can do this seamlessly, and it has all the primitives required to do it. And we've deployed the exact same bits elsewhere; our product is open source, and you can see it at github.com/SUSE/scf. We have a beta release there that you can deploy by yourself. The reason we think Kube is an awesome host is that we deployed the beta bits on Google Container Engine and on Azure Container Service with very few changes, just enabling some kernel parameters on the VMs that run there. So the exact same Helm release that we deployed on top of our own container management stack ran fine on Google Container Engine and Azure Container Service. So Cloud Foundry loves Kube, I think.

So I'm going to open it up for questions. If anyone has any questions, I'd be happy to answer. Yeah, please, go ahead.

So the question is whether we tested the loss of an entire host. We haven't done that yet; we're getting to it. This was a small deployment; it was not spread across multiple AZs or anything like that, it was actually just sitting on one VM. We just wanted to test the actual concept: that all of the Kube pieces work as expected, that the services route correctly, that stateful sets come back online without any loss of data, and so on.

How does the deployment compare to a BOSH deployment? The speed of the deployment. So, depending on how fast your hardware is, after the images get downloaded from the registry (and you can think of this like a BOSH compiled release, because the container images have all the compiled binaries inside them), you'll get stuff running in about five minutes. So if you were to deploy on Azure, on medium VMs for the Kube nodes, everything will stand up in about five minutes.

So the question is about a parallel between Kubernetes as a host versus OpenStack and BOSH as a host for Cloud Foundry deployments, and how the host system that you deploy on has various configuration settings, like timeouts, that can affect the health of the system. Kube gives you a lot of control over how these types of things are set up for the deployment that you're doing: timeouts for readiness probes, for liveness probes, and so on.
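As a rough illustration, these knobs sit directly on the container spec; all the values here are examples, not our defaults:

```yaml
# Illustrative probe tuning on a container spec; values are examples only.
livenessProbe:
  exec:
    command: ["/opt/fissile/check-monit.sh"]   # same hypothetical Monit check as before
  initialDelaySeconds: 120   # give the BOSH jobs time to start before checking
  periodSeconds: 30          # how often to run the check
  timeoutSeconds: 10         # how long one check may take before it counts as failed
  failureThreshold: 3        # restart only after several consecutive failures
readinessProbe:
  tcpSocket:
    port: 8500
  periodSeconds: 10
```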
We haven't seen an issue moving from Azure to Google to our own Kube deployment so far. We do have some specific requirements for the Kube that's supposed to host us: things like enabling privileged mode, having an overlay network available, and having some specific kernel parameters enabled. But overall, I think it should be easier, because Kubernetes exposes the timeouts and the network topology to us, so that we can make the decisions instead of the actual infrastructure. So it should be OK, we think, or at least better.

So the question is whether we were able to do an upgrade, like a helm upgrade, while the system was running. We're working on that right now. The one reason we can't do it with the beta release that's available there is how we treat secrets. Basically, every time we do an upgrade, all the secrets get rotated. So most things will come up, but then some don't, because all the secrets got rotated.

The question is: if you create a secret inside Kube, is it available in Cloud Foundry? No, it's not. From your application, you don't see the environment of the hosting container. Just like in a VM, where the app is shielded from the environment variables of the VM, the app is shielded from all of that in Kubernetes.

So, why did we choose Kubernetes? First, I think you mentioned the CPI. We don't actually use a CPI for this; this is not deployed using BOSH, it's deployed using Helm. There is a CPI for Kube, I think from SAP, but I believe that's still in an early phase. Why did we choose Kube? Because it has all the features that we require, and I think the Kube community has tremendous momentum behind it, and it seems to work great. We looked at all its capabilities, specifically around stateful sets and probes, and we realized that we can deploy on par with what you can do with BOSH and VMs.

Can you please repeat? So the question is what's running inside of the Diego cells. The exact same thing. Like I said, we build from the exact same sources. The Diego cell is one of those components that has to be privileged in order to be able to run containers in containers. We realize that sometimes sounds scary, running containers in containers, because it kind of sounds like VMs in VMs. That's not actually the case, because you're sharing the kernel; you're not running a kernel inside another kernel. It's just cgroups and namespaces at the end of the day. So we haven't seen a performance degradation or anything like that. It's Garden with a runC backend, and we actually use GrootFS. If you don't know it, GrootFS is a new project in upstream Cloud Foundry that allows you to use Btrfs or OverlayFS instead of the old AUFS, which helps us greatly when containerizing Cloud Foundry.

We haven't done those tests yet. Are you looking at application performance or infrastructure performance? OK, yeah, we'll take a note and look at that.

Yeah, so how do we keep our Helm chart essentially on par with the BOSH manifest? Fissile takes the BOSH releases and uses the exact same descriptors that the packages and jobs specify, those spec files. It consumes all of those, compiles everything, and then we wanted to expose configuration to the user in an opinionated way. So we don't expose everything that's available from the BOSH releases; we only expose things via environment variables. Technically, if you wanted to, you could grab all of the Docker images, run them manually with specific environment variables, and you'd get a Cloud Foundry out of it. The Helm templates only expose the environment variables that we choose as a distro of Cloud Foundry, and the other ones are basically automatically generated. So you'll see (let's see if I can go back) you actually see something like this. This slide doesn't contain the actual environment variables that I was discussing, but imagine you have env.DOMAIN there, where DOMAIN would be the domain of the system, or the cluster admin password. We use those environment variables and feed them into the actual BOSH properties, and we feel this makes it much easier: you don't have to repeat yourself as much, and you have one way to configure the entire system, one values file, essentially, in Helm. So we don't expose it one to one.
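To make that concrete, the operator-facing part of the values file looks roughly like this; the variable names here are illustrative stand-ins:

```yaml
# Sketch of the env block in values.yaml; variable names are illustrative.
env:
  DOMAIN: cf.example.com             # system domain, fanned out to every job that needs it
  CLUSTER_ADMIN_PASSWORD: changeme   # fed into the corresponding BOSH properties
```

The generated templates then map each of these variables onto the BOSH properties of every job that consumes them, so one value configures the whole system consistently.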
Hopefully that answers the question. Thank you.