Okay, let's start. Can you hear me? Okay, my name is Beyhan. It's a pleasure to be here today, thanks for coming. I will talk today about ways to monitor the BOSH Director. Why do I want to talk about this topic? I am a member of the BOSH Team Europe at SAP. We contribute to the BOSH core projects and we also own the BOSH OpenStack CPI. Additionally, we operate most of the BOSH Directors at SAP: we provide BOSH as a service inside SAP to other teams.

How many of you operate BOSH in production? Oh, okay, quite a number. Then most of you probably know why you need monitoring for BOSH. Today I will talk about why you need monitoring, then explain where you can find monitoring data for BOSH and how we monitor the BOSH Director at SAP. Additionally, I will explain what our monitoring and logging stacks look like and what kinds of alerts we have.

I probably don't need to tell you how important automation is, and most probably your automation depends on your BOSH Director. At least at SAP we have automation which depends on our BOSH Directors. All the updates, hotfixes, security patches, and other changes go automatically through pipelines and are propagated from landscape to landscape, and you need the BOSH Director to deploy those changes.

Another thing is resurrection. Resurrection means that when a VM deployed by BOSH is not running, BOSH will try to repair it. This works under the covers most of the time, and it's really nice when it works, but of course you need a working BOSH Director for it. Resurrection can prevent outages and customer impact in your production systems.

Another point is metrics. The director is a source of metrics which are important for operating your systems, and it's nice to have them visualized somewhere. We will talk later about what kinds of metrics.

The last point is actually the most important one for us at SAP: we use the BOSH Director to provision Cloud Foundry services. When a Cloud Foundry user on our cloud platform types cf create-service, the service may be provisioned by BOSH. We do not provision all services with BOSH, but some of them are. This is a directly customer-facing operation, which puts much more pressure on BOSH availability and increases the requirements a lot. That's why monitoring is a very important topic for us.

Before we look into how to monitor BOSH, let's have a short overview of BOSH, to better understand where the data actually comes from. The director can provision VMs on the supported infrastructures, and this is done through the CPI. For every supported infrastructure there is a CPI implementation, for example Azure, AWS, OpenStack, and so on. There is a well-defined interface between the director and the CPI, and all CPIs implement this interface.

On each VM provisioned on the infrastructure there is one agent. The agent is responsible for installing the software you want to run on the VM and also for monitoring that software. The agent gets the bits to install from the blobstore, and of course the director puts the bits there: if you do a bosh upload-release, the director uploads your release into the blobstore. The agent also reports the state of the processes and some metrics over the message bus, NATS. The health monitor listens to the agent reports and metrics, and if something is not as expected, it generates an alert on which the director reacts and tries to repair the situation. For example, if a VM is not in running state, the health monitor gets this report from the agent, generates an alert, and the director tries to repair the situation. This is how resurrection actually works. The director also uses NATS to send actions to the agents. And at the end we have the director database, where the state of the director is stored. So this is the short overview.
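To make the resurrection flow concrete: in the stock bosh release, resurrection is a health monitor plugin that you switch on via the director's manifest properties. Here is a minimal sketch; the property names are what I'd expect from a recent bosh release, so verify them against the release spec for your version:

```yaml
# Sketch: enable the health monitor's resurrector plugin in the
# director's deployment manifest. Property names assumed from the
# bosh release spec -- verify against your release version.
instance_groups:
- name: bosh
  jobs:
  - name: health_monitor
    release: bosh
    properties:
      hm:
        resurrector_enabled: true    # react to "VM not running" alerts
        resurrector:
          minimum_down_jobs: 5       # "meltdown" guards: pause resurrection
          percent_threshold: 0.2     # when too many VMs look down at once
          time_threshold: 600
```

The meltdown thresholds matter in practice: without them, an infrastructure-wide outage would make the resurrector try to recreate everything at once.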
Now let's have a look at where you can find data sources for monitoring BOSH. Let's talk about metrics first. We saw that the agents send metrics via the message bus to BOSH, and the metrics the agents send cover CPU load, memory usage, and disk usage. Of course you are asking yourself: wait a minute, I actually want to monitor the BOSH Director, and here we only get metrics for the VMs which are created by the BOSH Director. With this setup you don't have any metrics for the BOSH Director itself.

That's why at SAP we have the following setup: we deploy our BOSH-as-a-service directors through another BOSH, which we call the outer BOSH, and in this way we are able to get metrics for the BOSH Director VM itself. Additionally, we get resurrection here, which is nice to have: if the inner BOSH is not running, the outer BOSH will recreate the VM. And you can use the health monitor to forward those metrics to your monitoring stack.

For example, this is a dashboard which we built based on agent metrics: disk usage in percent. Through the agent you get information about the ephemeral, persistent, and system disks, and here you can see that the persistent disk usage was over 80%, which triggers an alert on our side. As you can see on the dashboard, we took action and the usage went down; that is to say, we increased the persistent disk of the BOSH Director itself. The agent metrics are a good source for planning the capacity of your BOSH Director, for example the size of the persistent disk. And if we look at the next dashboard, showing CPU load: based on those metrics you can plan how big the BOSH Director VM should be. We use those metrics to scale the director up when it's needed.

Another source of metrics on the BOSH Director VM is the NGINX itself. The director is a Ruby Rack application, and in front of the director there is an NGINX reverse proxy; all the actions which you execute with the BOSH CLI hit the NGINX on the BOSH Director. NGINX has a feature to provide metrics about itself, which is disabled by default; with a recent version of the BOSH Director you can enable it, and the metrics endpoint will be exposed locally on an NGINX port. If you want to collect those metrics, you have to forward them to your monitoring system yourself, because they are only exposed locally. This is very helpful if your BOSH users experience a slow BOSH or many failing requests: you can use those metrics to analyze what's going on. Is the NGINX the issue, or the director app behind the NGINX? It's a nice source for user-facing issues.

Another source of metrics is the local blobstore. The local blobstore is an NGINX speaking the WebDAV protocol. In the same way, you can enable metrics for the blobstore itself, and if you face issues with bosh upload-release, or issues with BOSH DNS, because BOSH DNS also uses the blobstore, those metrics can be helpful for you.
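To show the shape of how you would switch these endpoints on: with a bosh-deployment style manifest this would be an ops file. The property names below are placeholders I made up for illustration, not the real spec names, so look them up in the director and blobstore job specs of your bosh release:

```yaml
# Hypothetical ops file -- the property names are placeholders
# illustrating the shape of the change, NOT the real spec names.
# Check the bosh release job specs for your director version.
- type: replace
  path: /instance_groups/name=bosh/properties/director/nginx?/metrics_enabled?
  value: true
- type: replace
  path: /instance_groups/name=bosh/properties/blobstore/nginx?/metrics_enabled?
  value: true
```

Remember that both endpoints are local-only, so enabling them is only half the work; a collector on the VM still has to ship the values out.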
In case of an external blobstore, the blobstore provider can most probably provide you with metrics. We have a setup with the local blobstore, and that's why we use this way to get metrics.

Another source of metrics is NATS, the message bus. It also has a feature to provide metrics; again it's disabled by default, but with a recent version of the BOSH Director you can enable it. The metrics will also be exposed locally, on port 8222 on the director VM. These metrics are very helpful with BOSH DNS. When we introduced BOSH DNS to our production, we experienced issues with it, because BOSH DNS puts quite a big load on the message bus, and to analyze such issues those metrics can be very useful. But you can of course use them for other issues too.

Let's talk about another source of monitoring information: we also use logs to monitor the BOSH Director. For example, the NGINX again has an access log. We parse the access log and generate information about the latency of the requests, how much traffic our BOSH Director gets, and how many errors we see across all the requests. You can most probably get the same from the NGINX metrics; at the point when we implemented our monitoring, the NGINX metrics feature wasn't available in BOSH, we introduced it later. That's why our current implementation is based on logs for the request dashboards, but in the future we are planning to move to the metrics. For example, this is a dashboard which we generate from the NGINX access logs: the total incoming requests over time, by request status. Here you can see how many failing requests you have, how many are successful, how many you get in total, and so on. It's very useful. You can have the same for the blobstore: because it's also an NGINX, you can build the same dashboards to analyze what's going on with your blobstore.

Okay, these are the sources of monitoring information we use to monitor the BOSH Director. Let's have a look at what our monitoring and logging stacks look like. We use the health monitor to forward the agent metrics to our monitoring system. The health monitor has a plugin mechanism, and you can configure those plugins to forward the metrics to your monitoring endpoint. There are plugins for Datadog, for example, and for PagerDuty. We use the Graphite plugin, so in our case the health monitor forwards the metrics via the Graphite protocol. For NGINX and NATS we do not collect the metrics at the moment, but we have this in our backlog; the plan is to use Telegraf and forward them from the director VM to our monitoring system. For processing we use Riemann, and we also use Riemann to generate our alerts based on those metrics. The metrics are then stored in InfluxDB, and we use Grafana to create dashboards and visualize them.

For the logging stack, we use the syslog release to forward the logs. With the syslog release you can forward all the job logs from the director VM, or from any VM where you use it, and also everything logged to syslog. We forward them to Logstash, where the processing happens. Here you also have the option to generate alerts based on logs; in our case we do not generate any alerts from logs, but something like ElastAlert could be used here. And we use Kibana to create dashboards and visualize the logs.
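To make the health monitor side concrete: the Graphite plugin is configured through director manifest properties. A minimal sketch, assuming the hm.graphite* property names from the bosh release (verify against your release version); the address is a placeholder for your own Graphite host:

```yaml
# Sketch: point the health monitor's Graphite plugin at your metrics
# stack. hm.graphite_enabled / hm.graphite.* assumed from the bosh
# release health_monitor spec -- verify for your version.
instance_groups:
- name: bosh
  properties:
    hm:
      graphite_enabled: true
      graphite:
        address: graphite.example.local   # placeholder: your Graphite host
        port: 2003                        # Graphite plaintext protocol port
```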
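And for the log forwarding side, the syslog release can be attached as a runtime-config addon, so that every job's logs leave the VM. In the outer/inner setup I showed earlier, this runtime config would live on the outer director, so it lands on the inner director VM. A minimal sketch, assuming cloudfoundry/syslog-release; the Logstash address and port are placeholders:

```yaml
# Sketch: runtime-config addon using cloudfoundry/syslog-release to
# ship job logs to Logstash. Address/port are placeholders.
releases:
- name: syslog
  version: latest
addons:
- name: logs-to-logstash
  jobs:
  - name: syslog_forwarder
    release: syslog
    properties:
      syslog:
        address: logstash.example.local   # placeholder: your Logstash host
        port: 5514
        transport: tcp
```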
Now let's have a look at what kinds of alerts we have. We have only one alert which requires immediate reaction: the inner director is unresponsive. To trigger this alert, we have a monitoring agent on the BOSH Director itself; this is something we implemented ourselves. The monitoring agent queries Monit to get the status of all the processes Monit monitors, and it also exercises some of the director's endpoints. From this information the monitoring agent generates a report and sends it to our monitoring stack. If the state is unresponsive, we trigger a page.

If you have a setup like this, you can also handle director updates. When you update your director, the director is down for a moment, and you may not want a page just because the director is being updated. With this setup you can handle these cases: when the monitoring agent is shutting down, in its drain script you can generate a maintenance alert and put the director into maintenance, and when it comes up again you turn that back off, to avoid false-positive alerts.

We also have dashboard alerts. The dashboard alerts do not require immediate reaction; you have time to react the next day. For example, we have one for persistent disk usage over 80%. We found this is a good threshold which still leaves you enough time to react to the alert. For the ephemeral disk we have 60%: you need a bit more free space here, because during an update the ephemeral disk is heavily used, for example by the blobstore, and that's why we have a lower threshold there. We also experiment with HTTP-based alerts; we don't have a recommendation at the moment, but they are a very valuable source of alerts.

In the end, if you care about your productive systems, you should monitor your BOSH, because if BOSH is well, that also helps your productive systems. And the lesson we learned: keep your monitoring as simple as possible. You have to understand the requirements on your BOSH Director and implement the right monitoring for you, because if it's too complicated, people will not use the monitoring. And be careful with the alerts: if you have too many kinds of alerts, they will also get ignored very fast.
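Our monitoring agent is something we built in-house, so as a reference point: the closest stock alternative for paging is the health monitor's PagerDuty plugin, which pages on the events the health monitor raises for the deployments it watches. A sketch, assuming the hm.pagerduty* property names from the bosh release; the service key is a placeholder variable:

```yaml
# Sketch: stock alternative to a custom paging agent -- the health
# monitor's PagerDuty plugin. hm.pagerduty* names assumed from the
# bosh release health_monitor spec; verify for your version.
instance_groups:
- name: bosh
  properties:
    hm:
      pagerduty_enabled: true
      pagerduty:
        service_key: ((pagerduty_service_key))  # placeholder credential
```

Note the difference to our setup: the stock plugin pages on what the health monitor sees, while our agent also exercises the director's endpoints from the outside, which is what catches the "director unresponsive" case.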
Do you have any questions? Alright, if there are no questions from you guys, it's again me asking stupid questions. So you showed little snippets of dashboards. Does that mean I just have APIs and I have to build those dashboards myself, or is there any plan to provide some of that goodness to other people as well?

Maybe in the future we can improve the observability of BOSH. For the dashboards, at the moment I don't know of any plans to provide them to people. At the moment, this is all we have to monitor the BOSH Director.

So the question was what kinds of issues we had that led us to create the alerts and the dashboards. We had different kinds of issues, actually. First, our landscapes are huge; maybe we have one of the biggest landscapes operated by BOSH. And what we experienced, for example, is that our BOSH users saw many failing requests. They explained: I'm trying to deploy something and I get a bad gateway. That's why we introduced the monitoring based on the HTTP access logs, to understand how many requests we get and where the error actually happens: is it in the NGINX, in the director app behind it, or in the database? For a first understanding it was very helpful. We also had issues with BOSH DNS: NATS, the message bus, could not handle the load which BOSH DNS produced, and that's why we introduced the NATS metrics feature, for example.

Sorry, I should first repeat the question. The question was whether I can elaborate a little bit more on the BOSH DNS issues. Yes: BOSH DNS sends many NATS messages when you execute an update, and the update itself also needs the message bus. When you update a VM, the director waits for heartbeats from the agent, for example, or sends an action to the agent and waits for the response, also over the message bus. And when BOSH DNS puts so much load on the message bus, those messages time out, and that will cause your deployment to fail, for example.

Okay, we don't have any other questions. Thank you, and if you want to reach me, this is my Twitter handle. Thanks for coming.