All right, here we go. So yeah, welcome to the talk about delivering a production Cloud Foundry environment with BOSH. My name is Julian Fischer, as mentioned; I'm CEO of a platform company called anynines. We are a consultancy around Cloud Foundry. We started running Cloud Foundry back in 2013, offering a public Cloud Foundry in Europe, so we've been through a learning curve in the last years, and I'd like to share some of that experience with you today. Sadly, it's only 20 minutes, and, you know, you could just hit my play button and I would keep on talking about this for hours. So the most challenging part actually was to cut out the right pieces so that you get a coherent picture of what production readiness actually means. Let's give it a try.

The agenda for today is basically — once I've activated my trigger — so, the agenda is going to be a few words about BOSH. I think BOSH has been introduced very well in several talks today, so we'll just have a few emotional words about BOSH. Then we're looking at how to actually deploy a production-grade Cloud Foundry runtime, closing with some words about the data services part, as it is part of a Cloud Foundry, and an essential part.

Looking at BOSH: you know, anynines is a brand; the company behind it, Avarteq, has existed since 2008, and we followed a business model comparable to Engine Yard, where we take physical and virtual servers and automate the hell out of them, using Opscode Chef. Since 2010 we did a lot of Chef automation. So when we actually started using Cloud Foundry, our guys were like: oh yeah, a new automation technology, do we really need it?
Why the hell BOSH? So there was some resistance, and that resistance was mostly about, you know, changing the tool set, because it's a different approach. After using BOSH to deploy Cloud Foundry and getting used to it a little more, people slowly changed their minds, and after recognizing that you can do so much more with BOSH than just deploy Cloud Foundry, they actually fell in love with it. Today it's the preferred automation technology over at our company.

So why is that? Obviously Cloud Foundry, as a distributed system, poses large challenges around maintaining state and deploying loads of virtual machines, but also around configuring a distributed system — making those components actually collaborate and work together. The challenge for the automation toolchain here is not only about, you know, spinning up a virtual machine and configuring a database, but also taking care of the life cycle of such a system, because you want to update, you want to scale, you want to do lots of things during the operation of such a distributed system. And I've not seen anything that really comes close to BOSH's capabilities to deal with those challenges with a single tool set. I mean, you can use several tools and put them together and get a similar user experience, but what we really found appealing is that it does everything we actually need, together. Fair enough, there's a learning curve; let's say it takes more effort to really get going at first, but after that, I think the value really pays off.

All right, so one of the very interesting things is the infrastructure independence that's built into BOSH, and at anynines
we've felt the impact of that firsthand. We started the PaaS offering as a proof of concept. We had done OpenStack before that — with OpenStack Diablo we thought we could become an infrastructure company, and we went through the valley of darkness with OpenStack, learning that it is a nightmare to operate. So we wanted to ensure that we were not repeating that story with Cloud Foundry, and we started on rented VMware, as at that time, coming from a VMware background, Cloud Foundry seemed to be most stable on VMware. So we bootstrapped Cloud Foundry in just a few weeks, and after a while we recognized that our OpenStack could be good enough to run a Cloud Foundry on top of it. So we took the entire platform, with more than 200 apps — or 150 apps at that time — and migrated it to OpenStack with less than 30 minutes of downtime. It was two weeks of preparation, but the downtime of the apps was kept to a minimum. The reason that was possible is BOSH's abstraction from the infrastructure, and that we kept our toolchain in a way that would keep those promises.

Recently, when our OpenStack finally melted down and we recognized that we shouldn't run OpenStack at all, we migrated to Amazon, and that transition, too, was only possible because of the infrastructure independence that comes with BOSH. So we know how to run Cloud Foundry, we know how to break OpenStack, we know how to reassemble everything, and most of that is only possible because of BOSH.

We also have a data services suite built on top of BOSH, and, you know, customers were asking whether they could use that on premise; they have a different Linux operating system.
One they have to use because of internal policies. With Chef, you know, that would trigger a nightmare, because you have to go through all the cookbooks and ensure consistency between those different operating systems — every operating system, with its package manager, ends up having a different directory structure. With BOSH those problems are not really present, because it does a proper job of abstracting from all these things.

Looking at the data services part: one thing that's also very important, and comes in handy with the Cloud Foundry runtime as well, is that you have a separation between the abstract blueprint describing the distributed system and the actual deployment. Because, as you may know, a sandbox environment is most likely very different from a production environment, but you want to have one set of BOSH releases and then keep them separate from the deployments. That also comes in very handy when dealing with data services, where you want to have several data service plans and the resulting deployments should differ — but more on that a little later.

Whenever you have a clustered component, such as the DEA, and you are performing upgrades, you also want to be sure that you don't take down all DEAs at the same time. With a lot of automation technologies you'd have to build some smartness into your cookbooks to do that, but with BOSH you just don't have to do that.
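As an illustration of what drives that behaviour: in a BOSH deployment manifest, the rolling-update policy lives in the `update` block, roughly like this (the numbers are just example values, not a recommendation):

```yaml
update:
  canaries: 1            # update a single canary instance first
  max_in_flight: 1       # never take down more than one instance at a time
  canary_watch_time: 30000-60000   # ms to wait for the canary to be healthy
  update_watch_time: 30000-60000   # ms to wait for each updated instance
```

With `max_in_flight: 1`, BOSH walks through the DEAs one by one, so the cluster never loses more than one node to an upgrade.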
It's done for you. And also, it's not fire-and-forget automation, where you spin a virtual machine up and maybe perform some changes around it; it really takes on the responsibility of caring for the babies it actually spawned — taking care of the jobs it spawned, and recreating jobs and virtual machines for you. Take all that together — the VM provisioning, the management of persistent disks — and you will see that the whole life cycle of such a distributed system is already covered, and all you have to do is use BOSH. It just takes away a lot of functionality you would otherwise have to rebuild and put into your own code. So as I said: BOSH — we fell in love with it, and it has proven to be the right choice many, many times.

Let's get back to the Cloud Foundry topic, because we are here to bring production and Cloud Foundry together with BOSH. So what does a production Cloud Foundry mean? Yeah, thank you very much — there should be a mobiles-off sign somewhere. However: production — what does it mean? I mean, there are so many great companies here; they may operate Cloud Foundry at an even larger scale than we do, so I'm pretty sure this will actually mean something very different to you than it means to us. What I can say as a common denominator is that everything fails. I mean, you've seen so many distributed systems promising you high availability — but haven't you all seen these systems fail as well? I've seen every system fail in my career, so I think it's just natural that things break. And from our experience, our infrastructure was the major source of failure, whether because a physical server failed or because the infrastructure layer itself had some issues. It's just that these technologies are not perfect.
So failures occur. When defining production readiness in our team, we came to a definition that is as easy as this: a system is production ready if nobody has to get up when ordinary failures occur. Of course, the magic word here is "ordinary". So what's ordinary? It is when failures happen within an availability zone — we'll come to that concept a little later. The idea is to change the expectation that we can get rid of failures at all, even when using clustering. We just scope a certain set of failures, and we expect them to happen. But it will always be the case that failures happen that we don't expect — so how can we actually deal with that?

One approach is to design your system to fail. Again, you'll have to scope it in order to make things work. One of the most essential things — and I actually thought that nobody really needs to tell people this, but I've seen customers doing it horribly, horribly wrong — is that you need infrastructure availability zones. Whenever you build the infrastructure, you should ensure that there are at least three different availability zones. They should be separated as much as they can be, so that they are protected from physical events such as fire: dedicated energy sources, dedicated networking, and ideally also servers from different batches. That last one is something we just recently experienced: all our OpenStack servers were from one batch, we had an issue with the Intel network driver, and it took down 20 out of 24 servers at the same time with a kernel panic. At that moment your availability zones won't be worth anything, simply because the problem you have is at the scale of the entire infrastructure. So ensure that you have three availability zones.
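The "three" matters for every quorum-based component you'll run on top. As a quick sanity check of the arithmetic (plain Python, nothing BOSH-specific):

```python
def quorum(n):
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1

# With three nodes (one per availability zone), losing a whole zone
# still leaves 2 >= quorum(3) == 2 nodes: a majority survives and the
# cluster can keep electing a leader.
assert quorum(3) == 2

# With only two zones, losing one leaves 1 < quorum(2) == 2:
# no majority, so the surviving node cannot safely take over.
assert quorum(2) == 2
```

That asymmetry is the whole reason three zones is the minimum, not two.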
They should have separate switches and separate networks, and at the same time you have to ensure that there are low network latencies between them. Let's have a look at those two attributes. Why the number three? Simply because, whenever you use a quorum-based service and one availability zone goes down, you still have a majority and, you know, you'll be able to form a quorum. And the low latency is important because, whenever you use, let's say, a synchronously replicated database cluster, you don't want to run into a split-brain situation just because of network latency.

So how do you actually do that? Building it is surely infrastructure specific, but all the infrastructure software out there somehow manages to do it. How do you do it with BOSH? Before the recent versions, you had to create resource pools, put the availability zones into the resource pools, and then assign a resource pool to a job — which was kind of messy, because you mixed availability zones and resource pools together in your deployment manifest, making it a little verbose. With recent versions they luckily redesigned it to use a cloud-config file, where you can handle all that infrastructure-specific configuration separately. So you have your availability zones described in a separate file, and you can then refer just to the availability zones, which is a neat thing and makes things more readable. With that in mind — your infrastructure being structured well — you have the possibility, at the level of BOSH, to distribute virtual machines across availability zones and therefore influence their placement.

Now we can look at how to make a production Cloud Foundry. Again, this is more an example than a precise science at the moment; it might be different for you. Looking at the runtime, there's a generic strategy.
You can apply it to all the distributed systems you want to operate. The most important thing is that you eliminate all single points of failure, because whenever one availability zone goes down, you can be sure that every singleton could be affected. By chance it might luckily not be exactly that virtual machine — but it could be. So how do you defend yourself? You cluster everything: at least three replicas, spread across three availability zones.

The generic strategy is that you create a list of system components and their dependencies, and you go through every component and check whether it's a single point of failure. When you identify such a component, you dig into it and find out whether it can be clustered. If it can be clustered, you cluster it. It will take some effort — maybe you have to change your BOSH release, or whatever — but it will definitely save you time. If you find yourself identifying a component you cannot cluster, I'd definitely prepare for night shifts, because you'll have to get up and, you know, fix things in the middle of the night when exactly those components go down.

Looking at Cloud Foundry, at our runtime a few months ago: because we had bootstrapped from early versions, we found ourselves with a blob store based on NFS — which is a bad idea, because NFS doesn't scale and it's not redundant — and we had a Postgres behind the Cloud Controller as well as the UAA as a single point of failure. So what did we do to get rid of that?
The first step: we made some pull requests against the Cloud Controller a few years back, so that we could use OpenStack Swift as a blob store — or, you know, Amazon S3 as an alternative. Also, we took some time to cluster Postgres and use it as the database for the Cloud Controller and UAA. So it's worth looking at that component in more detail. It should be mentioned that, in the meantime, there is a very nice BOSH release of MySQL Galera out there — I think it's made by Pivotal, right? — so that can be an alternative to the Postgres solution. But in our case we already had Postgres on the map and had spent some time on it, and it's a good example of how you actually use BOSH in the context of taking something that is a single point of failure and turning it into a cluster. So we'll walk through that example to learn a little about what kind of challenges you might be facing during a journey like that.

The desired goal was to have a three-node database cluster — it's a little small on the slide, I'm sorry — with the Cloud Controller and UAA databases being spread across three availability zones. I gave a talk about the Postgres cluster in more detail, but a few things should be mentioned for those who weren't able to attend that. Of course we deploy and monitor that cluster using BOSH: you've got the BOSH director, and it's going to deploy three virtual machines for you and turn them into a cluster. As you can see, there are several components running on each virtual machine, one of which, of course, is the BOSH agent. There are two other components: the replication manager, as well as the Consul agent. Why do we need those? The reason is that Postgres has built-in capabilities to perform replication.
It's an asynchronous replication called streaming replication, but it does not come with failover detection or automatic failover facilities, so you actually have to find a cluster manager to help you with that. In earlier days we used Pacemaker for this, but it's a very poor choice to put on a cloud. So we found that the replication manager does a great job: the cluster manager will recognize failing databases and trigger the appropriate promote scripts, which help us perform the actual failover.

Let's have a look at some of the challenges in that failover scenario — again, these are the complexities you have to deal with while building a BOSH release. We have to wrap these components, and we have to find a way to do a failover that is specific to Postgres, but still operating-system and infrastructure agnostic. One of the challenges is that you have to provide credentials to access your database server. With asynchronous replication, you cannot just use any of the nodes, or load-balance across them, because you have to ensure that you always write against the database master. With that requirement, you can't use IP addresses, because they will change during failover: if your master server goes down, you want to write to the slave, so the IP addresses are out. And, at least at the point when we were looking at this, BOSH's internal PowerDNS was a single point of failure, so you don't want to use BOSH's internal DNS names for this either. So we decided to go with Consul and use its internal DNS. With a five-node Consul cluster spread across three availability zones, that's actually a redundant solution
we can live with. The resulting architecture then looks somehow like this: you have a Consul cluster, and the Consul cluster manages a DNS name. This DNS server has to be registered in your runners' /etc/resolv.conf, obviously. And with that, you can use this DNS name to always point to the master. So in case — fancy animation, isn't it? — this master goes down, what happens then? All right, you have two replicas, two other servers, but do you have to get up and fix it? No, we don't want that; we want this to happen automatically. This is where the replication manager comes into the game: it recognizes that the master server is gone, and in that case it triggers a promote script, which will also tell the remote Consul to change the DNS entry to the new master. One of the challenges with that approach is that you mustn't have DNS caching activated in your runners or your applications, which might cause trouble with some default Java JVM configurations. But it's definitely an infrastructure-agnostic way of dealing with the situation.

However: so far it's three o'clock in the morning, one of our servers went down, the sysops are asleep, and we are in degraded mode. What actually happens now is that BOSH's self-healing comes into the game: the BOSH Health Monitor will recognize that there's a virtual machine missing, and the Resurrector will instruct the director to bootstrap that virtual machine again. And because of the way this solution is built, when the replication manager comes up, it will recognize that there's a new master and turn the new node into a slave. So with BOSH's built-in functionality, not only are you able to create an automatic failover using Consul, wrapped in a BOSH release, but BOSH will also help you automatically recover from degraded mode. So now you're actually fine.
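The promote step described above can be sketched in a few lines. This is a toy illustration, not the actual anynines release: the payload shape follows Consul's `/v1/agent/service/register` HTTP API, but the service name, port, and node addresses are all invented.

```python
import json

# Sketch of the promote hook: when the replication manager detects the
# master is gone, it promotes a standby and re-registers the
# "postgres-master" service with the local Consul agent, so that the DNS
# name (e.g. postgres-master.service.consul) follows the new master.

CONSUL_AGENT = "http://127.0.0.1:8500"  # local Consul agent HTTP API

def consul_registration(node_ip, port=5432):
    """JSON body for PUT {CONSUL_AGENT}/v1/agent/service/register."""
    return json.dumps({"Name": "postgres-master",
                       "Address": node_ip,
                       "Port": port})

def promote(nodes, failed_master):
    """Pick a surviving node as the new master (the real replication
    manager uses replication state/priority here) and build its Consul
    registration."""
    survivors = [n for n in nodes if n != failed_master]
    new_master = survivors[0]
    return new_master, consul_registration(new_master)

nodes = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]  # one node per AZ
master, body = promote(nodes, failed_master="10.0.1.10")
# Apps keep writing to the DNS name; no IP reconfiguration needed.
```

The key property is that the applications never learn an IP: they only ever resolve the DNS name, which Consul re-points on promotion.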
And if you maybe fix the physical host which failed behind all that, you don't have to do anything else.

All right, now we've reached the checkpoint where every component of our runtime is clustered, so the runtime will actually survive failures of a single availability zone. Which leads us to the insight that a Cloud Foundry is only production ready when your data services are. The problem — or the challenge — with data services is that apps often strongly depend on them. So now you have this wonderful runtime that gives you scalability — you can scale your app from one instance to a hundred during a peak — and then you have just a single database cluster. That doesn't really make sense, because there's an unhealthy ratio between application instances and one physical database cluster. At some point this cluster might fail, either because it's overloaded or because, well, it just fails; even clusters may fail. In that case, a lot of application instances are affected, and from experience — and as I said, we've broken everything at some point — the customers will give you a hard time.

What we derived from that is that shared data services are not an option. I don't want to have a Postgres, or any data service, where a service instance is managed by the data service itself. Postgres databases are bad service instances; I'd rather go with dedicated Postgres servers or clusters as service instances. So our conclusion was that it's healthier to use on-demand provisioned, dedicated services instead. With that being said, the situation would look like this: you have hundreds of application instances, and whenever one of these service instances goes down,
maybe because it's a single-server Postgres and the corresponding availability zone went down, the problem is just way more contained. Which is not solving the problem — things still fail — but, you know, you have fewer customers yelling at you. So we actually want to contain problems. And how do you on-demand provision data service instances? Let's have BOSH do the dirty work.

A resulting architecture might look like this: the Cloud Controller talks to a generic service broker, which will then trigger the issuing of deployment attributes using a data-service-specific component called the SPI, which in turn talks to a deployment component slightly abstracting from BOSH, generating a deployment manifest. Let's say you have a BOSH release for Postgres and you want to create a service instance for a single Postgres server: that deployment manifest will differ from, let's say, a large Postgres cluster. That abstraction is made at the level of the deployer, using different templates, which are then rendered into deployment manifests; subsequently, deployment tasks are started and virtual machines are provisioned for you.

There are two major advantages to that strategy. The first is that you have stronger isolation between service instances, because you're using the means of the infrastructure. There's a clear contract between you and your customer: an eight-gig Postgres server with eight gigs of RAM — I mean, it's up to you how you use it — and nobody else will influence that instance, unless you do heavy overcommitment at the infrastructure level. And the second thing we found is that you actually only have to do two things when adding a new data service: create this SPI, whose main
responsibility is issuing and managing credentials when creating service bindings, and provide an appropriate BOSH release.

The communication flow looks a little like this: whenever you create a service instance, this is delegated to the service broker, which then prepares the deployment against the data-service-specific component. This is about the placeholders you will later fill into the deployment templates, which are rendered into a deployment manifest. With that, you talk to the deployer, select a template — let's say a single-server template — fill in the appropriate attributes, and generate a BOSH deployment from it. The Cloud Controller then keeps polling, asking whether the deployment has finished, and once it has, the service broker stores some service-instance-specific values, such as the credentials to access it — because when you later want to create a service binding, you have to know which database cluster to connect to.

As I said, this solution can be used for arbitrary data services. We in particular use it for RabbitMQ, Redis, MongoDB, and Postgres at the moment, and we estimate the effort of adding a new data service in production at around four to eight weeks, because you have to do a lot of testing and learn about the data service.

All right, so to wrap it up: how does the system look in the end? You have three availability zones, you have your runtime at the top and a service down below. Of course you can add several services, but, you know, for the sake of having it on the slide,
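The template-to-manifest step described above can be sketched like this. All names, plans, and fields here are invented for illustration — this is not the actual anynines deployer, just the idea of picking a plan-specific template and filling in the instance-specific values:

```python
# Toy sketch of the deployer: select a plan template, fill in the
# placeholders, and return a manifest-shaped structure that would be
# handed to BOSH as one deployment per service instance.

TEMPLATES = {
    "postgres-single":  {"instances": 1, "azs": ["z1"]},
    "postgres-cluster": {"instances": 3, "azs": ["z1", "z2", "z3"]},
}

def render_manifest(plan, instance_id):
    tpl = TEMPLATES[plan]
    return {
        # one dedicated BOSH deployment per service instance
        "name": f"postgres-{instance_id}",
        "releases": [{"name": "postgres", "version": "latest"}],
        "instance_groups": [{
            "name": "pg",
            "instances": tpl["instances"],
            "azs": tpl["azs"],   # spread cluster plans across all zones
        }],
    }

m = render_manifest("postgres-cluster", "4711")
# a three-node deployment, spread across the three availability zones
```

A single-server plan and a cluster plan share the release; only the template differs, which is exactly the separation of release and deployment mentioned earlier.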
it's just one. Most importantly, you'll have the availability zones configured in your deployment manifests, so that you influence the placement strategy and mission-critical virtual machines are distributed across the availability zones. And by using BOSH to deploy the service instances, you just apply the very same strategy and can ensure that, when deploying a Postgres cluster — of which you can then deploy an arbitrary number — its virtual machines will be distributed across the availability zones as well. With such a design you can actually survive the outage of an availability zone without any impact on your apps, besides a little hiccup while reconnecting to the database.

So how can we sum this up? Well, enthusiastically, I can say BOSH is a great companion for all Cloud Foundry related automation challenges, and that does not only include the runtime but also the data services. And while containers are a fancy and hip way of solving everything — you know, I think when you've got a hammer, everything you see is a nail — I think BOSH does and did a really great job on deploying applications to Cloud Foundry, but also, I mean, deploying Cloud Foundries, as well as solving the data service problem. So yeah, that was it, and if you have questions, feel free to ask.