All right, everybody, welcome to this talk about platform availability. The schedule says "Cloud Foundry Availability", but the ideas presented here can be applied to basically any distributed system, so "platform availability" works just as well as a title. What you're going to see in this talk is the idea of continuous uptime improvement. In order to continuously improve the availability of any system, you need to somehow measure its availability, and in order to measure it, you have to define what availability means to you and how you'd like to calculate your metric. That seems to be an easy question to ask, but it is actually a bit more complicated than it appears at first thought. Beyond that, we're going to look into a framework for how to measure and calculate platform availability. There are many ways to define availability metrics that are meaningful to you, so we are just looking into one example here, one methodology. Of course, once you have thought about what availability means, you still have to measure it, so we'll have a quick glance at that as well. In total we have around 25 minutes, I guess; I might run a bit longer than that. It's the last talk, so I hope you'll forgive me a few extra minutes.
I'm trying to present all examples in the context of Cloud Foundry, obviously. The last topic, because it is related to availability and often asked about by our customers, is what maintenance windows mean in the context of platforms, and how the availability and maintenance window concepts play together. For example, we used to take down services for a few hours or even a day to maintain them; you can obviously ask that question for more than just platforms. We'll have a look at how this is handled today. A few words about myself: I'm Julian Fischer, CEO of anynines. We are specialized in building tailored platform solutions based on Cloud Foundry and Kubernetes. The idea of the talk is, as I said, to enable continuous uptime improvement, because wherever you start with your platform, your environment may change over time: you get more customers, you introduce more data services, you deploy new versions of your platform, new Cloud Foundry versions, you add Kubernetes. Depending on the size of your organization, this may be a simple monitoring metric you put together, so you'd be asking, why talk about this at all? Or you are in a large organization where management tells you what the availability should be, say three, four, or five nines, and you'll be asking yourself: all right, what does that mean? How do I actually come up with a number that's meaningful? In the end, what we would like to establish as an organization is a process that continuously measures and monitors the availability of the platform and then derives measures that could be beneficial, or that at least ensure that other changes are not decreasing the availability. That learning loop pretty much looks like the lean startup build-measure-learn loop, just for uptime.
As I said, the question of what availability is, is not that easy to answer, but the most common way to express availability is by counting nines, and there's a table here using the example of availability per month. If you take that percentage relative to a year instead, you get different absolute downtimes on the right side, but either way you can see that the more nines you have, the less downtime is tolerated under a particular uptime requirement. People always ask us where the company name comes from: you can see one nine, two nines, three nines, and basically any nines. That's where the name anynines comes from. I believe that you can design a system to a certain availability, and you should always have that in mind. So what is your uptime requirement? All right, let's pick one, maybe 99.9%, for example. The question is, what are we actually talking about? Are we talking about one Cloud Foundry runtime? Are we talking about five of them? Are we talking about different environments? If you really ask the question of what your platform is comprised of, you will see, and we've been building and operating platforms for customers for an eternity, that each environment is different. So first of all we have to describe what the platform consists of, and that affects how availability is defined and measured. We need to define the system; we need to describe it somehow. What the best way is to describe a system with regard to availability is one of the questions. We've been operating managed services long before Cloud Foundry: databases, applications, application servers, using Chef. One of the methodologies that seems to be a straightforward thing to do is to just draw a graph: what are the components that comprise your system, and how do they depend on one another? There's also a notation for this called reliability block diagrams; I think it comes from electrical engineering.
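The nines-to-downtime mapping from the table is easy to reproduce. Here is a small Python sketch; it assumes a 30-day month, so the exact figures depend on which month length an SLA table uses:

```python
# Allowed downtime per month for a given number of nines.
# Assumes a 30-day month (43200 minutes); real SLA tables may
# use 30.44 days or per-year figures instead.

def allowed_downtime_minutes(availability: float,
                             period_minutes: float = 30 * 24 * 60) -> float:
    """Return the downtime budget for a given availability."""
    return (1.0 - availability) * period_minutes

for nines, availability in [(1, 0.9), (2, 0.99), (3, 0.999),
                            (4, 0.9999), (5, 0.99999)]:
    print(f"{nines} nine(s) = {availability:.5%} -> "
          f"{allowed_downtime_minutes(availability):8.2f} min/month")
```

For example, 99.9% over a 30-day month leaves a budget of roughly 43 minutes of downtime.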
The idea is that you will have serial and parallel compositions of components and, of course, hybrid combinations of those. Whenever you bring that into the shape of a graph, or in simple cases a tree, you might end up looking at something like this. This is, for example, what one of our anynines platform environments looks like. You see that there's a platform; that platform comprises several platform environments, and each platform environment is comprised of several subsystems. In this case there is a base system, where some of the shared components live; the Cloud Foundry subsystem, the Cloud Foundry runtime; and, in our case, our data services, which comprise eight different data services, their service brokers, their APIs, and several other components. So we're looking at an environment with more than 100 components, and these components are organized into a set of subsystems. As you can see, the Cloud Foundry environment splits into the API, the Diego subsystem, the monitoring subsystem, and the service broker subsystem. The data services themselves contain the actual automation that is controlled by the service brokers. Once we have figured out what the system actually looks like, and we have drawn such a graph, we can see which components we would like to look at, ideally down to the components that are atomic services, for example a Postgres backing the Cloud Controller. As you can see, the Cloud Controller is in the graph, but there are subordinate components below it; I just removed that level of detail from the graphic, because otherwise it would have been too cluttered. There are already a lot of components in there. So we've been looking for a way that we can apply to every platform environment.
As I said, these environments might be very different. Let's say one customer uses Cloud Foundry to deploy a single application and uses Cloud Foundry organizations to separate tenants, so each organization represents one tenant and contains the same application as every other organization within that Cloud Foundry. For a customer like that, the availability of the Cloud Controller API, which is used to deploy new application versions, has a very different meaning and importance compared to a public-facing developer platform, where developers will start calling you if they cannot push their applications. So we've been looking for a way to express both the composition of a system and the importance of its particular subsystems. The first aspect is how you express dependencies within one of your more elementary subsystems, such as Diego. In Diego you have a number of cells, and depending on how many of those cells are there, your applications will be up and running or they won't. Another type of dependency is the serial dependency, where you have two components and both components need to be present. There are formulas you can apply to that. In this case you can see that, assuming 99.5% availability for each component, a serial dependency observed over a longer period of time results in a lower availability in total; in this example you lose about half a percent. That's because in a serial composition, if one component fails, the overall service won't be available, and since the other component fails sometimes too, the combination of the two is less available than either alone. A failure of one component implies the failure of the entire service.
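The serial rule is just multiplication of the individual availabilities, assuming independent failures (a caveat that comes up again in the Q&A at the end). A minimal sketch:

```python
# Serial composition: every component must be up for the service
# to be up, so availabilities multiply (assuming independence).

def serial(*availabilities: float) -> float:
    """Compound availability of components in series."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Two components at 99.5% each: roughly 99.0% in series,
# i.e. about half a percent is lost.
print(f"{serial(0.995, 0.995):.6f}")
```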
Obviously, there's a second version where the components are in parallel. What this means is that, with each component again assuming the same availability of 99.5%, the service will be available if at least one of the components is present. For example, you have deployed your application with two instances, and let's assume you don't utilize both instances to a full degree but distribute incoming requests across them. If one instance dies, your application requests will still be served as long as one of the two instances survives. There's a formula for that too, which basically says that availability only decreases if both components fail together, so the combination of the two increases your overall availability. Assuming you had 99.5% per component to begin with, the parallel composition gives you nearly 100%; the compound availability increases. Obviously, you will also have situations with a slightly more complicated topology, where some things are composed in serial and other components are in parallel. In this example you can see a serial dependency as the outer structure, and one of its components is itself comprised of a parallel structure. All you do is take the parallel formula and put it, in braces, inside the serial formula; that's basic mathematics. The result, as you can see, is that the overall availability is considerably decreased because of the dominance of the serial part, the four components being in sequence. With each component assuming 99.5% availability, the serial composition dominates the decrease in availability. Then there are situations, for example with database clusters, where you actually want to express a partial availability: you can lose a certain amount of redundancy and your service would still survive.
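The parallel rule works on the complement: the pair is only down when every member is down. A sketch of both rules and the hybrid case; the hybrid topology here (a serial chain of four components, one of which is a parallel pair) is my reading of the slide:

```python
# Parallel composition: the service is down only if ALL components
# are down, so multiply the unavailabilities and take the complement.

def parallel(*availabilities: float) -> float:
    """Compound availability of redundant components in parallel."""
    downtime = 1.0
    for a in availabilities:
        downtime *= 1.0 - a
    return 1.0 - downtime

def serial(*availabilities: float) -> float:
    """Compound availability of components in series."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Parallel pair at 99.5% each: 1 - (1 - 0.995)^2 = 0.999975
print(f"{parallel(0.995, 0.995):.6f}")

# Hybrid: nest the parallel result inside the serial formula.
print(f"{serial(0.995, 0.995, 0.995, parallel(0.995, 0.995)):.6f}")
```

The hybrid result stays close to 98.5% despite the redundant pair, which is the "dominance of the serial part" from the slide.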
A MongoDB replica set, or a Postgres setup with asynchronous replication across three nodes and quorum-based leader election, may survive the failure of one of its three nodes. In order to express that availability, you can apply an equation like the one shown here. In this example, n is the number of nodes in the cluster, three replicas, and you need at least m of the nodes for the service to be available, with m equal to two. Assuming each node has an average availability of 99.5%, you will see that the overall availability is again close to 100%. You can also increase the number of replicas to five, so you have five nodes and maybe need only three of them to be available, and you can see that this gives you the possibility to survive even more node failures and still remain available. As I said, the serial, parallel, and quorum-based composite availabilities are usually applied to the more basic components, components where there's a direct relationship and the nodes are somehow equal; in database systems the nodes are usually equal enough that this formula applies. But as we saw in the dependency graph I showed earlier, the subsystems are comprised of very different components, and the question is how we actually get from the service availability of, let's say, a database or a Diego cell to the availability of the entire platform. You need to put in some thought, and we came up with the kind of obvious thing to do: you model the overall platform availability by recursively going through the dependency graph, determining the best applicable formula for each of the compositions you find there, and then building up your metric from that. I'll try to be visual about this, because it sounds like a lot of theory, but it actually isn't.
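The m-of-n quorum case is a binomial sum: the cluster is up whenever at least m of its n equal, independently failing nodes are up. A sketch (`math.comb` requires Python 3.8+):

```python
from math import comb

def m_of_n(n: int, m: int, a: float) -> float:
    """Availability of a cluster needing at least m of n equal nodes up.

    Sums the binomial probabilities of exactly k nodes being up,
    for k from m through n, assuming independent node failures.
    """
    return sum(comb(n, k) * a**k * (1.0 - a)**(n - k)
               for k in range(m, n + 1))

print(f"2-of-3: {m_of_n(3, 2, 0.995):.8f}")  # the replica-set example
print(f"3-of-5: {m_of_n(5, 3, 0.995):.8f}")  # survives two node failures
```

With 99.5% per node, the 2-of-3 quorum already exceeds four nines, and moving to 3-of-5 pushes the compound availability higher still.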
So we come again to that graph showing the platform, looking at one platform environment with a base system, a Cloud Foundry, and the anynines data services; the miscellaneous components we'll just neglect for now. Basically, you determine the availability for the sources in the dependency graph: at some point you need to measure the availability of the leaf components in the graph before you can compose them into compound services, and we'll have a look at how this works. You could go through the graph bottom-up, which is basically how the data will flow, but in this talk we'll go the other way around and look at it from top to bottom. Why do we do that? Let's say you have one region at this point in time, AWS us-east, and there you have your platform. The platform usually comprises several environments: a staging environment where you try out new releases of your platform, and a production environment holding the applications that are important to you and your customers. How do you express the fact that these environments are not equally important? You could surely apply other formulas, but one simple approach is to use weighted averages: you sum up the weighted availabilities and divide by the sum of the weighting factors. I prepared an example. If you have a development, a staging, and a production environment, you can weight them to express their relative importance to your overall availability. This is very neat if your boss tells you that you need 99.9% availability and you can come back and ask: all right, what do you mean by that? That question, I guess, will go unanswered; he will just be annoyed, because it's not an easy question to answer.
But if you come back and ask him: relative to your production environment, how important is staging? He might say, well, I don't care about staging. But maybe your developers or your platform engineers do care about it; maybe there is an impact on your overall development if that environment isn't there. So this allows you to assign a relative weight to each of your environments, which is then multiplied with its availability: you calculate the availability relative to the importance of each environment. To be more visual, the calculations are on the slides; if you want to download them afterwards and go through the calculation in greater detail, you're free to do so. Basically, each environment has a weighting factor, and from that you can determine the overall availability. As you can see in this example, we had a very good production availability of nearly 100%, but the platform availability has been dragged down by the relatively low availability of the staging environment, which is highly weighted. Maybe you want to change that weighting factor to look better in front of your boss. So this illustration shows that we use the weighted combination of the individual platform environments to determine the overall platform availability. We can now step down one level and look at how to combine the subsystems within a Cloud Foundry environment, which comprises the pink Cloud Foundry runtime and, in our case, the anynines data services. For applications to be available, you also need the backing services; that these two are equally important can be expressed in a similar way, in this case a weighted average again.
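The weighted average over environments can be sketched in a few lines; the weights and availabilities below are hypothetical, not the numbers from the slide:

```python
# Weighted average availability across environments of unequal
# importance. Weights are relative, e.g. "production counts ten
# times as much as development".

def weighted_availability(parts: dict) -> float:
    """parts maps a name to an (availability, weight) pair."""
    total_weight = sum(w for _, w in parts.values())
    return sum(a * w for a, w in parts.values()) / total_weight

# Illustrative figures only:
environments = {
    "development": (0.990, 1.0),
    "staging":     (0.985, 2.0),
    "production":  (0.9999, 10.0),
}
print(f"{weighted_availability(environments):.6f}")
```

A highly weighted but poorly available staging environment drags the platform figure down, exactly the effect described above.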
As you can see, you're now on level two, and you can basically repeat the process, looking into the availability of the Cloud Foundry runtime, which is comprised of the API (Cloud Controller and UAA), the monitoring subsystem with additional components that monitor the individual components, the Diego subsystem, and your service brokers, for example. Applying the same strategy again, you have your weighting factors, and here you could express the use case I mentioned earlier: the customer who has that one application and just uses Cloud Foundry as a multi-tenancy enabler might, for example, weight the Cloud Foundry API much lower than a customer with a public-facing platform. The idea was to come up with a quite simple tool that allows you to express the individual requirements of a platform and produce an availability metric that has meaning in the particular context of that one platform, instead of a calculation that may not be meaningful in a broader context. As you can see, you step down level by level, and you can repeat that. The weighted averages are interesting whenever you have subsystems with different components that may carry different weight in different scenarios. You go down that path until you hit components where you actually want to use the serial and parallel dependencies, and where you can possibly measure the availabilities instead of just calculating them. The question is how you obtain availability figures from atomic services, and the answer, obviously, is that you measure them. So the monitoring of availability is a very important part of your input. And there's another aspect: besides producing component availabilities that you can then feed into your fancy formulas, the availability monitoring also serves as input for the support team to diagnose platform failures.
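The recursive walk described above, leaves carrying measured availabilities and inner nodes carrying a composition rule, can be sketched as follows. The topology, names, and weights here are illustrative, not the actual anynines structure:

```python
# Recursive evaluation of an availability dependency graph.
# Leaves hold measured availabilities; inner nodes hold one of the
# composition rules: serial, parallel, or weighted average.

def evaluate(node: dict) -> float:
    kind = node["kind"]
    if kind == "leaf":
        return node["availability"]  # a measured value
    children = [evaluate(child) for child in node["children"]]
    if kind == "serial":
        result = 1.0
        for a in children:
            result *= a
        return result
    if kind == "parallel":
        down = 1.0
        for a in children:
            down *= 1.0 - a
        return 1.0 - down
    if kind == "weighted":
        weights = node["weights"]
        return sum(a * w for a, w in zip(children, weights)) / sum(weights)
    raise ValueError(f"unknown node kind: {kind}")

# A toy platform: runtime (weighted 3x) next to data services (1x).
platform = {
    "kind": "weighted", "weights": [3.0, 1.0],
    "children": [
        {"kind": "serial", "children": [            # Cloud Foundry runtime
            {"kind": "leaf", "availability": 0.999},    # API
            {"kind": "parallel", "children": [          # two Diego cells
                {"kind": "leaf", "availability": 0.995},
                {"kind": "leaf", "availability": 0.995},
            ]},
        ]},
        {"kind": "leaf", "availability": 0.998},    # data services
    ],
}
print(f"{evaluate(platform):.6f}")
```

Each dashboard level in the monitoring system then corresponds to one node of this evaluation.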
For example, if you have trouble with your Cloud Controller, this could be caused by the Postgres behind the Cloud Controller. If your availability monitoring is good enough, you can find out that you actually have a problem with your database rather than your Cloud Controller, which is where the first error came from. So you do not only know that you have a problem; you also get an indicator of where the problem might come from, and can subsequently give your platform operators what they need to make an informed decision. In Cloud Foundry, I guess we all use BOSH, and therefore we use Monit. Monit is a pretty neat tool and very good for monitoring atomic services; it gives you process self-healing. But it monitors the existence of processes, which is absolutely insufficient, because I'm pretty sure you've seen data services whose processes are present but not responding, and web applications can behave the same way. So it might be necessary to ask, for a certain component, how do I determine the availability of this component? The conclusion usually is that you need to perform some kind of functional testing, for example connecting to the database and querying whether a certain schema is there, just to see whether, at least heuristically, the system is available. That leads to the question of how much load you want to expose your system to just to perform that particular check, which also determines how often the check can be executed. There's a certain optimization problem here, but that really depends on the component you're looking at. So while Monit is pretty useful, the bigger picture usually involves a monitoring tool such as Prometheus, or anything else serving the same purpose, where you actually go back to that diagram, and for each subsystem you take the formula you derived and translate it into a markup that represents that formula.
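To illustrate the difference between "process exists" and "service works": the sketch below only probes TCP reachability, the cheapest functional check. A real Postgres check would instead open a client connection and query the schema, which is exactly where the load-versus-frequency trade-off mentioned above comes in. Host, port, and intervals are hypothetical:

```python
import socket
import time

def check_tcp(host: str, port: int, timeout: float = 2.0) -> bool:
    """Cheapest functional probe: can we open a TCP connection at all?

    A stronger probe would authenticate and run a query; this one
    already catches more than mere process existence does.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def observed_availability(probe, samples: int, interval_s: float = 0.0) -> float:
    """Run a probe periodically; return the fraction of successful checks."""
    up = 0
    for _ in range(samples):
        if probe():
            up += 1
        time.sleep(interval_s)
    return up / samples

# Example (hypothetical address): sample a Postgres port once per second.
# observed_availability(lambda: check_tcp("10.0.0.5", 5432),
#                       samples=60, interval_s=1.0)
```

The measured fraction is what feeds the leaf nodes of the dependency graph.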
That results in a visual dashboard showing you the availability of the systems and subsystems, representing the graph you've seen earlier. The good thing here is that you can basically traverse your dependency graph in your monitoring system: by clicking on those availability metrics, you go one level deeper and see the dependencies of that particular subsystem, and so on. That helps a lot, when an availability figure seems odd, to find out where it actually comes from. Over time, by applying that technique, you will have more and more dashboards assigned to those subsystems, and again you can do that until you come down to the more elementary services. I'd like to highlight one thing: the availability of a particular subsystem, say the Cloud Foundry runtime as we've seen, may come down to the question of whether its subsystems are available. But at the same time, users may say: well, I don't actually care about your fancy metric; what I care about is, can I deploy an application, or can I create a service instance? So you may want to look at availability monitoring more in the context of use cases than in the context of just determining a fancy number, because that's where the value for the customer actually lies: does the system fulfill what it's supposed to do? The example here, "can I create a Postgres instance", may involve: is the service broker there, is the virtual machine there, does it have storage attached, are the local resources and the other dependencies of that service present? You can look at the dependency tree for that particular use case instead of the overall subsystem, and conclude the healthiness of the use case from it. This is especially important in scenarios where particular use cases are much more important than others.
Having knowledge about the applications running on the system is one example: you know that one particular database is more important than the others, or you have a CI/CD pipeline running at a high pace and you really need that cf push experience to be there. Another interesting thing is to perform availability monitoring during upgrades. To give you an example, we repeatedly run upgrades while watching the availability of the data service subsystem, and it's interesting to see what actually happens during the upgrade: how does the upgrade affect the availability of the overall system? We do those tests in staging, for example, to conclude and report to the customer what the actual impact might be. As I said earlier, the question of what maintenance windows look like today is one we've seen repeatedly with customers. The definition says a maintenance window is a period of time designated in advance to perform preventive maintenance that could cause disruption of service. I think the question is: what does that mean for an application platform? Because one of the reasons to move to a platform is to avoid outages, and we're in the cloud, and the cloud never fails; you already see where this is going. Platform-wide outages are absolutely unacceptable, and they are unnecessary too. But then you've got those forms in your company, and that 300-page handbook about service security and standards for running software systems, and you need to fill out the paragraph about maintenance windows. I would say the reality of maintenance windows nowadays is that we are looking at platform environments with 3000 virtual machines managed by BOSH, for example, so how would you perform an upgrade in such a system? You have several subsystems, as seen earlier, and updating Diego, for example, means that a virtual machine is taken away and the applications affected by that are recreated somewhere else in the cluster.
In that particular example, you can see that you can take away up to two cells before the system actually experiences significant problems from failing Diego cells. An in-flight limit of one would leave enough spare capacity for applications to be deployed, while an in-flight limit of two would already fully utilize the system, so that wouldn't be a good idea in this simple example. On the data service side, you can see that during a rolling upgrade through clustered data service instances, such as a clustered Postgres, the application will see failovers but no outages. If you upgrade a single-node Postgres, recreating the virtual machine during the process, you will see an outage that might be five or ten minutes, depending on the upgrade. What this is saying is that the maintenance window question today is about setting your in-flight limits and the redundancies in your system accordingly: you should design your system to your uptime requirements, and within that process you have to determine the redundancy and weigh it against the infrastructure costs. So, summing it up: there is a systematic way to define platform availability. You still have the possibility to tailor it to your specific environment's needs, and you can actually make promises to your manager to have environments 99.99% available, and also make sense of such a requirement and come back with an explanation of what it actually means, by telling him how you weighted the individual components and verifying that the weighting represents the interests of your customers and your organization. The platform dependency graph was a graphic we've seen repeatedly, and that step of analyzing the dependencies in your system is one of the most crucial ones. You then define and measure the availability of atomic services, compose and weight the subsystem availabilities, and derive the metric from that.
Implementing your availability system in a way that gives you a differential diagnosis is meaningful, and as I said, the calculation of subsystem availabilities is usually a composition of other subsystem availabilities, a recursive path down to something you can measure. In order to get into a learning loop, you have to derive insights from what you see, for example looking at an upgrade, seeing how it affects availability during the upgrade, and maybe coming up with a better approach so that less impact is seen. In large systems we see continuous maintenance going on: you're just done with one update and basically the next update is already being prepared, so there are waves of updates constantly going through the system. It's therefore meaningful to design your system capacity so that these update waves won't interfere with your customer workloads. And that completes my talk for today. I'm ten minutes over my time, and not so many people are left, so I hope you enjoyed it. If you have questions, feel free to ask.

Regarding the serial dependencies you mentioned: this is basic probability, right? But one prerequisite is that the events are independent. Is this always the case? One outage might influence an outage of another service, or both could have the same root cause, and then the calculation is wrong.
Well, yes. The question is always to what degree you want to calculate the availability and to what degree you want to measure it. I simplified and said you measure basic components and compose the availability of subsystems from that; if you have the ability to measure the availability of the subsystem itself instead, you can actually reduce the error you were mentioning. Also, we didn't talk about infrastructure availability, for example network and bandwidth, and how this may affect your cluster availability. Postgres is a good example: if you have network trouble that causes a split brain, then even if all three nodes are basically healthy, you could have an outage of the subsystem. So yes, you're right, there are errors in this procedure that you have to be aware of. It was an answer to a specific requirement: management said, we want to have that availability of the platform, and our job was to determine a way to give them the availability figure and still provide a meaningful explanation, a way you can explain to the customer what it actually means. Does this perfectly cover all possible scenarios? Surely not. But we at least have continuous monitoring now, we have a history of data, and we can, for example, apply different techniques during an upgrade and see whether we have been able to reduce the impact of upgrades. That is the purpose of what I have been presenting, but there are many ways you could improve it. All right then, thank you very much.