Good afternoon and welcome to my talk. My name is Johannes Breuer, and I work at Dynatrace in the role of a technology strategist. Guess what, the technology I am working on is of course Cloud Foundry. Even though I am relatively new to this area, I have already gained really good insights into monitoring a Cloud Foundry infrastructure, and today I want to share that knowledge, especially in order to keep your applications running smoothly.

To get this talk started, I actually flipped the title and put application complexity and application health in the foreground, but just for now. Based on that, I will then do a deep dive until we reach the level where we can talk about cluster health.

When we take a look at the complexity of applications, I like to point out the statement that says that, on average, a single transaction uses 82 different technologies. 82. This number sounds pretty intense, but when we take a closer look at the journey a transaction takes, I am pretty sure we can come up with a justification that 82 is a valid number. Just to give you an example, here on the left side
we have the end user, who is using a mobile device or a notebook, with an operating system and a browser on it, and there they enter the URL of the application. This URL is then routed to a network interface, where it hits the public network. Then there are switches, Wi-Fi transmitters, LTE transmitters, and at some point the transaction may even pass a satellite. But I want to stop here, because it is now becoming a little too complex. Let us just assume that the transaction hits the data center at a certain point, and this data center can be hosted either on our own infrastructure or on the infrastructure of a public vendor.

Along this journey from the end user to the data center, we have to rely on certain technologies. For example, we cannot control the Wi-Fi or the wireless network. But what we can control is what is going on in our data center. To break the monitoring stack down into its different abstraction levels, I will show you the monitoring stack here. It shows that application health is on top of this stack. It is the highest abstraction level, because it is actually facing the end user. Underneath application health we see microservice health; in other words, this is the part where our microservices live, and there we want to understand how they behave. Each microservice is running in a container, and this is the next level, the level of processes. All processes are hosted somewhere on a virtual machine, and this is the level of the cluster. To a certain point we also should be aware of what is going on within our data center, and that is the lowest abstraction level here at the bottom.

I just want to briefly mention a few points about each of these layers. On the top we have the application again. At this level we are interested in the response time of our application, whether a client is somewhere facing a 400 or 500 error, and how the application is working for them.
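To make the idea of these stacked layers concrete, here is a minimal Python sketch of my own, not a Dynatrace API, in which overall application health rolls up from the layers beneath it:

```python
# Minimal sketch of the monitoring stack: application health sits on top
# and depends on every layer below it (microservice, container/process,
# cluster/VM, data center). All names here are illustrative.

LAYERS = ["application", "microservice", "container", "cluster", "datacenter"]

def overall_health(layer_health: dict) -> str:
    """Roll up health bottom-to-top: the app is only as healthy as the
    weakest layer underneath it."""
    if any(layer_health.get(layer, "unknown") == "red" for layer in LAYERS):
        return "red"
    if any(layer_health.get(layer, "unknown") in ("yellow", "unknown") for layer in LAYERS):
        return "yellow"
    return "green"

print(overall_health({l: "green" for l in LAYERS}))                        # green
print(overall_health({**{l: "green" for l in LAYERS}, "cluster": "red"}))  # red
```

The point of the sketch is simply that a "green" application can turn "red" because of a layer far below it, which is the theme of the rest of this talk.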
From a more advanced perspective, we also want to understand the distribution of the browsers using the application, and from where a person is navigating to the application. In other words, we want to get a breakdown by geolocation in order to pinpoint a problem to a certain region or area. All in all, understanding application health is strongly related to real user monitoring.

When it comes to the microservice level, we have key metrics like the CPU usage, the throughput, the failure rate, and the response time of each service. But it is also important to understand how the services interact with each other, in order to understand where the bottlenecks are and where I should start to scale up or scale down. This is shown by the picture at the bottom, where we can see the end-to-end communication that is going on within our microservice environment. When we reach the container level, we are interested in the CPU and memory usage and the I/O operations on the disk.

And last but not least, we have now reached the level of the cluster, and here we can talk about cluster health itself. We should take into consideration how the hosts are behaving that are hosting our application. But in the context of a Cloud Foundry infrastructure, we should also take a look at the components that keep the Cloud Foundry foundation up and running, because there are a lot of hosts out there that are necessary for Cloud Foundry to create the containers, to distribute load, to do the routing, and so on.

This now brings me back to my previous slide, where I said that application health builds on cluster health. But what is actually a cluster in the context of Cloud Foundry? To answer this question, I borrowed this architectural overview of a Cloud Foundry infrastructure.
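As a small illustration of those service-level key metrics, here is a hedged Python sketch. The record format and sample data are my own inventions, not something from the talk or a Dynatrace interface; it simply derives throughput, failure rate, and a response-time percentile from a batch of request records:

```python
# Illustrative sketch: computing microservice KPIs (throughput, failure
# rate, response time) from a batch of request samples. Each request is
# a (status_code, duration_ms) tuple; this format is invented here.

def service_kpis(requests, window_seconds=60):
    total = len(requests)
    failures = sum(1 for status, _ in requests if status >= 500)
    durations = sorted(d for _, d in requests)
    # nearest-rank style p95 over the sorted durations
    p95 = durations[int(0.95 * (len(durations) - 1))] if durations else 0.0
    return {
        "throughput_rps": total / window_seconds,
        "failure_rate": failures / total if total else 0.0,
        "p95_response_ms": p95,
    }

sample = [(200, 12.0), (200, 15.0), (500, 230.0), (200, 11.0)]
print(service_kpis(sample)["failure_rate"])  # 0.25
```

In a real monitoring product these numbers come from instrumented agents rather than hand-fed tuples, but the arithmetic behind the dashboard tiles is essentially this.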
It is available in the Cloud Foundry documentation, and this diagram shows each of the components and how they interact with each other. Here you can see, for example, the Diego Cell, the Diego Brain, the Diego database, and also the Cloud Controller. I do not have enough time to talk about all of them, but I want to show you two scenarios in order to give you an understanding of how the parts interact with each other.

The first scenario is from the viewpoint of a developer. Let us assume we have a person who wants to push an application. The first entry point is the Cloud Controller. The Cloud Controller receives the artifact and then moves it over to the Cloud Controller Bridge. This is the place where the buildpacks come in and where the droplet is generated. After that, the droplet is registered at the Bulletin Board System (BBS), and it is also prepared to be rolled out to a Diego Cell. But before rolling it out to a Diego Cell, the BBS sends an auction request to the Auctioneer, and after the Auctioneer accepts this request, it builds a Garden container and lets this container run on a Diego Cell. Finally, Diego also registers the application with the Gorouter in order to make it publicly available and to allow traffic to come in.

This brings me to the second scenario, this one from the viewpoint of an end user. Let us assume we have a person who wants to visit our application. The person enters the URL, the URL comes to the Gorouter, and from this point the Gorouter forwards the traffic to the application. That is pretty much it.

Now we have these two scenarios, one from a developer and one from an end user, and based on them we can conclude that the Auctioneer and the Gorouter are very important components when it comes to deploying an app and to keeping the communication of a Cloud Foundry infrastructure up and running. Since the Auctioneer is such a mission-critical component, I have summarized the main tasks of an Auctioneer here. First, it
holds an auction for each Task and each application. Second, the Auctioneer is responsible for distributing work using the auction algorithm. For this auction algorithm it considers, for example, in which availability zone a cell is running, how many resources are available, and how many instances of each application should be allocated. And last but not least, the Auctioneer also maintains a lock in order to ensure that just one Auctioneer can handle an auction at a time. We can now conclude that in case the Auctioneer fails, we are not able to deploy an app to a Diego Cell.

In order to characterize an Auctioneer, Pivotal has recommended three KPIs. The first one, which you can see here in the top left corner, is the average fetch duration. This is the time it takes the Auctioneer to get the state from the Diego Cells. It needs the state because the auction algorithm must decide where capacity is available and where the next application should be placed. Then we can see two other KPIs that simply tell us how many apps could not be successfully placed on a Diego Cell, and how many apps were successfully placed.

Then the Gorouter again.
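To make the auction idea a little more tangible, here is a deliberately simplified Python sketch of the kind of placement decision an auctioneer makes. The scoring rule below is my own toy version for illustration, not Diego's actual algorithm:

```python
# Toy sketch of an auction-style placement decision: prefer a cell that
# does not already run an instance of the app (to spread instances out),
# then prefer more free memory. This is NOT Diego's real scoring logic.

def pick_cell(cells, app_name):
    """cells: list of dicts with 'name', 'zone', 'free_memory_mb', 'apps'."""
    def score(cell):
        spread_bonus = 0 if app_name in cell["apps"] else 1000
        return spread_bonus + cell["free_memory_mb"]
    # minimal fit check: the cell must have room for one more instance
    candidates = [c for c in cells if c["free_memory_mb"] >= 256]
    if not candidates:
        raise RuntimeError("auction failed: no cell has capacity")
    return max(candidates, key=score)["name"]

cells = [
    {"name": "cell-a", "zone": "z1", "free_memory_mb": 512, "apps": {"shop"}},
    {"name": "cell-b", "zone": "z2", "free_memory_mb": 300, "apps": set()},
]
print(pick_cell(cells, "shop"))  # cell-b: less memory, but no "shop" instance yet
```

The real algorithm weighs availability zones, resource headroom, and desired instance counts as mentioned above; the sketch just shows why cell state must be fetched before every auction, which is exactly what the average fetch duration KPI measures.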
It has just one simple task: it has to keep the communication to Cloud Foundry, and within Cloud Foundry, up and running. When a Gorouter goes down, the entire communication comes to a standstill. For characterizing a Gorouter, Pivotal has also recommended four KPIs. The first one is the total number of requests; this number tells us how many requests are currently being processed by the Gorouter. Then we have the average response time, which tells us how responsive the Gorouter is. This KPI depends on the applications behind it and also on the workload that is currently hitting the Gorouter. And then we have two additional KPIs: the number of 502 errors and the number of all the other 500-range errors.

In order to monitor a Cloud Foundry environment, we at Dynatrace follow a full-stack monitoring approach, and this gives us full visibility within one Cloud Foundry cluster. What exactly I mean by that, I will now explain. The easiest way to monitor an application in your Cloud Foundry infrastructure is the application monitoring approach. This is the approach where you put the monitoring agent side by side with your application, within one container, and then you have the ability to get all the data and to monitor the application. This gives you visibility on the process, the microservice, and the application level, but you cannot get data from the host level. And this is the reason why we follow the full-stack monitoring approach: this approach instruments all hosts that are running in a Cloud Foundry foundation, and it gives you full visibility into each layer I mentioned before.

To implement this full-stack monitoring approach, we at Dynatrace are using the power of BOSH. I am pretty sure you are familiar with BOSH, but to bring everybody onto the same page.
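Those four router KPIs could be derived from access-log records. Here is a small Python sketch; the record format and field names are invented for illustration, not the Gorouter's actual log schema:

```python
# Illustrative sketch: deriving the four Gorouter KPIs (total requests,
# average response time, 502 count, other 5xx count) from access-log
# records. The (status, response_ms) tuple format is invented here.

def router_kpis(log_records):
    total = len(log_records)
    avg_ms = sum(ms for _, ms in log_records) / total if total else 0.0
    bad_gateway = sum(1 for status, _ in log_records if status == 502)
    other_5xx = sum(1 for status, _ in log_records
                    if 500 <= status <= 599 and status != 502)
    return {"total_requests": total, "avg_response_ms": avg_ms,
            "502_count": bad_gateway, "other_5xx_count": other_5xx}

records = [(200, 40.0), (502, 5.0), (500, 120.0), (200, 35.0)]
print(router_kpis(records)["502_count"])  # 1
```

Splitting 502s out from the other 5xx responses matters because a 502 points at the router failing to reach a backend, while other 5xx errors usually come from the application itself.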
Let me explain it based on this picture. Here we can see that BOSH sits in between the infrastructure and the runtime platform, and BOSH takes over the responsibility of managing all the VMs that are required for the Cloud Foundry platform. You can configure BOSH using the runtime config, and everything is written in YAML. I am sorry for that, but you have to be a friend of YAML when you are using BOSH.

To extend BOSH, or the infrastructure automation layer, BOSH provides the concept of BOSH add-ons. A BOSH add-on is also configured using its own runtime config, and it implements the same concepts as BOSH itself: you can specify a stemcell, a release, or a job within a BOSH add-on. We developed a BOSH add-on in order to roll out our monitoring agent on each host. If you want to have a look at this BOSH add-on, feel free to check out our GitHub repo and to visit bosh.io, because it is listed there as a publicly available BOSH add-on.

I have now talked a lot about how to monitor a Cloud Foundry environment, and to give you more hands-on experience, I want to show you a demo. For this demo I prepared the Sock Shop app. It is monitored using the full-stack approach, and it is running on Pivotal Cloud Foundry. Let me now simply switch over to the application to show you that it is working. This is the home screen. Then you have the chance to switch over to a catalogue, and here you can order a pair of socks in case you are logged in. We simply log in with a user. Okay, let us add this to the cart. It is a normal application, but I want you to pay attention to this button in the top right corner, because something will happen to it in a few minutes. Now, to see what is going on in the cluster,
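As a rough sketch of what such an add-on definition looks like, here is a minimal BOSH runtime-config fragment. The release and job names below are placeholders of mine, not the actual Dynatrace add-on:

```yaml
# Minimal sketch of a BOSH runtime config with an add-on that places a
# monitoring-agent job on every instance. Names are placeholders.
releases:
- name: example-monitoring-agent   # placeholder release name
  version: 1.0.0
addons:
- name: monitoring-agent
  jobs:
  - name: agent                    # placeholder job name
    release: example-monitoring-agent
```

Because the add-on lives in the runtime config rather than in any single deployment manifest, BOSH colocates the agent job on every VM it manages, which is exactly how one agent ends up on each host of the foundation.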
I switch over to Dynatrace for a second. In case you have never been in touch with Dynatrace, this is the first entry point, because here you can see all the dashboards. I have prepared one dashboard, called Cloud Foundry Infrastructure, and here we can see all the hosts that are currently running in our Cloud Foundry foundation. Then you can see the Diego Cells that are hosting my Sock Shop app, and here on the right side you can see the Gorouter KPIs I just mentioned, for example the latency and the current number of requests coming in.

In order to stress the Gorouter now, I have prepared a load test, which I am starting now. This load test pushes a lot of traffic to the Gorouter. I simply go back to the Sock Shop app and I click again on the catalogue, and I click again. Okay, could you see it? What is happening now is that the button in the top right corner disappeared. Why did it disappear? That is a kind of degradation. The reason is that the Gorouter is so overloaded with the requests I am currently sending that it cannot respond correctly, and it immediately returns a 500 error. This problem is not based on the application itself; it is based on the underlying infrastructure, and we can see this in Dynatrace by switching back to the dashboard. Here you can see that tons of requests are now coming in.
Well, actually not that much yet; let us refresh the website. We simply go over to the Gorouter. Here you can see an increase, and within a couple of seconds we should see a 500 error. This is because the front end could not load the data from the carts service, which was responsible for the button in the top right corner. I just give it a try. Here it is. Here we can see that the Gorouter immediately returned a 500 error because it is so overloaded. I think you can assume that when we now scale the Gorouter up from one instance to two, the latency will go back to a normal value, because the load is distributed across both instances.

Now I am close to the end of my talk, but before ending it, I just want to point out what the next generation of monitoring could look like. The next generation of monitoring is not just focused on identifying a problem and notifying someone that something is going wrong. I think, and we see this in the market, that the next generation of monitoring should also consider self-healing and auto-remediation actions in order to drive autonomous cloud concepts. Without auto-remediation actions in place, all occurring problems would be notified or escalated to a person on call, who then takes over responsibility. But there are certain problem patterns that come with recommended auto-remediation actions, so why should we not ask a machine to fix the problem, similar to what a person would do? I have prepared a showcase here with an example of a problem that occurs at 2 a.m.
in the morning. Normally we would escalate this problem to a person on call, but instead of doing that, we just run a couple of auto-remediation actions. The first one could be, for example, to check the CPU usage, and in case it is exhausted, we add an additional instance. Then we could also check how the garbage collection is working, and in case there is a bottleneck, we add more memory. The third remediation action could be to restart a service in case it hangs. At a certain point we still have to check whether we could mitigate the problem. In case we could not, we then do the final auto-remediation action, which would be a restart of the host. And in case this did not help either, we have to escalate the problem to a person who takes over responsibility.

To summarize my talk, and also this slide, I just want to point out that monitoring, and making cluster health available to operators, is important to keep everything up and running. We should also think about automating tasks in order to drive autonomous cloud ideas and to make ourselves fit for dealing with the multi-cloud and hybrid cloud solutions that will come and that will increase complexity to a certain level.

Now I am at the end. Thank you very much for listening. I am happy to answer questions now, or, since I am still working at the booth, simply stop by and ask me for the details or any other questions. Thank you very much. Any questions? Okay, then thanks.