Welcome, everybody. My name is Christoph Eichhorn, and I'm here today with Christian Dufel. We are both part of an SAP team in Karlsruhe, Germany, that provides the application logging service for SAP's Cloud Foundry. The title of today's presentation is Don't Fly Blind, which means that we are going to talk about how we enable application developers and application operators on SAP's Cloud Foundry to quickly gain insights about their apps. To get started, let's have a look at a microservice architecture, which might typically look like this. Incoming requests hit a router, which redirects each request to your application. There might be some user authentication and authorization in place, running on a separate Cloud Foundry app. And of course, your application won't do everything on its own: it will delegate subtasks to a couple of other applications, each of which is dedicated to just one specific subtask. The beauty of this is that you can develop each of these microservices in an agile fashion and scale them individually as required. So the question that we want to address now is what to do when something goes wrong. Say there's a customer using your service, and for some reason they don't get a response to their request in time, or even worse, no response at all. In these cases, users tend to get impatient very quickly, lose focus, and turn their attention to something else. This is a situation that you would like to prevent from happening. So in case you figure out that something like this happened, you might want to look into it and see what went wrong, and you could do so by looking at your logs. But if your application is running on a live system under heavy use, you might easily get blinded by the huge amount of logs that you have to go through to figure out what went wrong. So the questions that arise here are the following.
First of all, you have to figure out that something happened at all, so you have to know that there was an issue. Then you would like to know which of your microservices was the reason for that issue. Once you know that, you would like to know when exactly it happened, and also how often. And finally, you must make sure that you've got enough data to understand the issue in order to eventually fix it. There you have to remember that on a live system, the debug port should normally be closed, so you can't use that one. So what you could do, as already mentioned on the previous slide, is to look into cf logs to check what was going on. cf logs is really nice when you develop an app, because it gives you quick feedback, and that's really useful. But when you have a live system under heavy load, which is producing many, many logs, it might not be the best tool, because you're limited in searching your logs: you can't filter them by any structure that they have, the log history is limited unless you pipe them to a file on your local system, and it shows only logs related to one application. In a microservice architecture, you might want to look simultaneously into the logs originating from various applications. So what you could do is drain the logs to a third-party log archive, and the Loggregator component provides two ways to do so: the firehose and the syslog drain. For an application developer, the firehose usually won't be an option, because they shouldn't have access to it, and anyway, this is not a question that an application developer should be concerned with. So what you want is a service that does exactly that for you: one that collects the logs and prepares them in such a way that you can easily get insights from them. So what are the expectations that an application developer has towards such a service? First of all, it should, of course, be easy to use.
Then you want to be able to search within the logs for things that you're interested in. It would be nice if the logs came with some context, which makes it possible to filter for specific things. You would like to correlate logs that belong together, so you can have all the logs that, for example, are related to one request on one page. It would also be nice if this service made it possible for you to quickly grasp anomalies or things that you don't want to have. Some metrics would be nice to improve the balancing of your microservice architecture. And last, you would like to have the possibility to look into logs that were written some time ago, in order to do post-mortem analysis. So these are the things that we address with the service that we offer for developers deploying their apps to SAP's Cloud Foundry. The service is offered on the Cloud Foundry marketplace, so you can just create an instance of it. The service is powered by the Elastic Stack, and we've adapted it to the Cloud Foundry context with some tailored parsing rules. We also provide some predefined dashboards for these tasks. We run the service on a multi-tenant stack, which means that you can start with a small service instance, which is affordable, and scale later on. The user management on this multi-tenant stack is closely coupled to Cloud Foundry, which means that you will only get access to logs that are related to organizations or spaces that you are allowed to watch the logs for. In addition to this, we have a complementary open-source logging library for Java and Node.js that you can use to further enrich the logs so they come in in a structured way. So how does our service work? What we basically do is collect router logs and application logs. The router logs already come in a highly structured form, which allows us to retrieve a lot of useful information from them, like response times and many others.
Both of these log types are then enriched with context information, like, for example, the app name or the space name, and are then written to an Elasticsearch database. In addition to this, we offer a couple of Kibana dashboards that allow you to gain insights into these logs easily. One more thing here is that you don't have to group your apps beforehand, because all the logs are written to the same database. So even later on, you can still decide how you would like to group apps in order to evaluate things. So this is easy to use. All you need to do is go to the Cloud Foundry marketplace and find the service. Then you create a service instance and bind that instance to your application. And that's it. Once you've done this, the logs are being fed to the logging service, and you can analyze them. As already said, to do so, we offer six predefined Kibana dashboards, which cover different things. First, there is an overview dashboard, which allows you to get a quick insight into the overall health of your applications. Then we've got three dashboards which are closely related to router logs, and which give you information about the adoption and usage of your application, but also performance and network traffic. The next dashboard is meant to enable you to dig into the application logs themselves. And finally, we've got one dashboard which provides information about the logging service itself. So let's get started with the overview dashboard. The overview dashboard that we provide looks like this. You've got a timeline showing the number of logs in the observed time frame. In addition, there are a couple of KPIs displayed here in individual boxes, like, for example, the number of failed requests or the maximum response time that has been observed in that time frame. And in order to avoid confusion, for numerical values the field names that we use contain the unit of the value in the name.
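The create-and-bind workflow just described boils down to a few cf CLI commands. The service and plan names below are placeholders for illustration, not the documented offering; check `cf marketplace` on your own landscape for the actual names.

```shell
# Placeholder service/plan names -- verify with `cf marketplace` first.
cf marketplace                                   # discover the logging service
cf create-service application-logs lite my-logs  # create a service instance
cf bind-service my-app my-logs                   # bind it to your application
cf restage my-app                                # restage so the binding takes effect
```

After the restage, the app's logs flow into the service and show up on the dashboards.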
So we can, for example, see here for the maximum response time that this value is in milliseconds. Then we've got three tables at the bottom of this dashboard, which list the organizations, the spaces, and the components from which the logs that we are analyzing here originate. By choosing one of these components, orgs, or spaces, you can drill down and quickly find out whether one of the values that you observe here is related to one of your apps or not. And since these three tables are very useful for a drill-down, they are provided on each of the six dashboards. So the next dashboards are the router-log-based dashboards. The insights that we gain from router logs are things like response times, request sizes, and so on. But we can also aggregate them in a very useful way and create some histograms, like the one you can see here, which in one glimpse shows you whether you have an issue with long response times. It shows the distribution of response times, and if you had a component that takes very long to respond, this would show up as a bar on the right side of this graph. So the usage dashboard that we have gives you information about the adoption of your app. You can see how many requests came in for all your apps; that's the first timeline. But also for each of your components, and even for each endpoint, you get an idea of how many requests came in. Then we've got the performance dashboard, which basically gives you an idea about the responses that we had, how many requests failed, and what kind of response codes were emitted. We also see, again, the histogram that I previously explained with the response times, which in this case indicates that the vast majority of the response times were short. But in the lower left corner, you can see that there was at least one response that took longer, in this case 30 seconds.
And this is most probably related to one of these peaks here. So by just looking at this dashboard, we can already figure out quickly that, obviously, there was something going on, and it happened precisely at that point in time. On the next dashboard, we can gain insights about the network traffic. And as you can see, there is also a peak here, and this is most probably related to the large response time that we had previously, because it represents a huge response payload. So what do we have so far? We could figure out that we've got a problem somewhere. With these dashboards, we could narrow it down to one of our services, and we could also tightly narrow down the time frame in which it happened, in order to figure out exactly which request was causing the issue. What's missing now is to fully understand this problem, and for that, we'll have to look into the application logs emitted by the application which is responsible for what happened. So far, we used router logs, and they come in a very structured way, which means that we can already use a lot of information to figure things out. Usually, application logs do not provide this possibility out of the box, because each developer can write them as they want; there is no structure in there. So what we do is that our logging service offers the possibility to provide application logs as a language-agnostic JSON object. This would be comparable to a pilot's logbook, where you are told what kind of information to provide in order to later be able to analyze what happened. In our case, the fields that we expect to be provided in such a JSON object are the log level, a precise timestamp measured by the application itself, a correlation ID, which allows us to correlate different log messages, and also, of course, the log message itself that we want to have logged.
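As a minimal sketch, a log line with the fields just listed might be produced like this. The field names (`written_at`, `correlation_id`, `tenant_id`) are illustrative assumptions based on the talk, not the service's documented schema, and a real implementation would use a JSON library (or, better, the support libraries shown next) rather than string formatting, to get escaping right.

```java
import java.time.Instant;
import java.util.UUID;

// Sketch of the structured JSON log line described in the talk:
// level, timestamp, correlation ID, message, plus a custom field.
public class StructuredLog {
    static String logLine(String level, String correlationId, String msg,
                          String tenantId) {
        // Field names are assumptions for illustration only.
        return String.format(
            "{\"level\":\"%s\",\"written_at\":\"%s\","
            + "\"correlation_id\":\"%s\",\"msg\":\"%s\","
            + "\"tenant_id\":\"%s\"}",
            level, Instant.now(), correlationId, msg, tenantId);
    }

    public static void main(String[] args) {
        // Each request gets its own correlation ID (here freshly generated).
        String id = UUID.randomUUID().toString();
        System.out.println(logLine("INFO", id, "order created", "tenant-42"));
    }
}
```

Because every line is a self-describing JSON object, the pipeline can index each field, and custom fields like `tenant_id` become filterable later on.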
Plus, we've got the possibility to fill a full stack trace into such a field, which has the advantage that instead of having a log message for each stack trace line, we will have the complete stack trace within one log message. And finally, there is also the possibility to add custom fields, because each application is individual. So you might want to add some fields of your own, like, for example, a tenant ID or a session ID, or whatever you need. If you add these fields to this JSON object, then you will also be able to filter for these values later on. Now, this sounds like a hassle for an application developer, but actually it's not, because we provide some support libraries which will do this for you if you add them to your application. So you don't need to worry about that. Setting up such a library in the case of Java is simply done by adding the Maven dependency. Then you have to choose, for Java, which logging framework you want to use: you can choose between Logback and Log4j2, in combination with SLF4J. And you have to configure the framework, but we provide blueprints for this, so that's not a problem either. With this, we come to the next dashboard, which shows insights into the application logs themselves. That dashboard looks like this. On the upper right side, you can see a table which displays request logs again. If you've already narrowed an issue down to a specific request, you can choose the request here. And via the correlation ID that is displayed on the left side of this dashboard, you'll be able to filter for all application logs that belong to this request. So you get all the application logs that were emitted during this problematic request on one page. And as you can see here in the lower table, which shows the application logs, these are structured in a way that, for example, we've got the log level in one column, and other information as well.
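For reference, the Maven setup mentioned above might look like the following. The coordinates follow SAP's open-source cf-java-logging-support project for the Logback variant, but the exact group/artifact IDs and the current version should be verified against that project's README.

```xml
<!-- Coordinates as published by the cf-java-logging-support project;
     verify them and pick the current release version from its README. -->
<dependency>
    <groupId>com.sap.hcp.cf.logging</groupId>
    <artifactId>cf-java-logging-support-logback</artifactId>
    <version><!-- current release --></version>
</dependency>
```

A sibling artifact exists for the Log4j2 variant; both plug into SLF4J, so application code keeps logging through the familiar SLF4J API.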
The correlation ID that I already mentioned is important to get all the logs that belong together on one page. It's automatically generated if you don't provide one yourself; each incoming request will have its own unique correlation ID. In case you have a microservice architecture, you might want to propagate this correlation ID to downstream microservices, and you can do so by adding the correlation ID to the header of the downstream service request. That way, the logs that belong to another application will also show up on the same page when you look at one request that had an issue. The next thing that we've got in this library is the possibility to put a whole stack trace into a single log message. Stack traces can get really huge, depending on how deep the method is that went wrong. For this reason, we also analyze the stack trace for its size, and if it's too big, we will shorten it in a sensible way so it still fits into the logging pipeline that we have and doesn't get dropped at some point. And the last feature of this Java library that I would like to introduce to you is a feature that we call dynamic log levels. The use case for this feature is a scenario where we have an application running on a live landscape, which is heavily and successfully used by many customers. The vast majority of these customers are happy with the service; nothing goes wrong. But for some reason, there's one customer who's got a problem: for their specific case, the application doesn't work exactly as it is supposed to. And we would like to figure out what the problem is for this customer. What we could do is change the log level threshold for the whole application in order to get more logs, but then we would be drowned by the huge amount of logs that we get from the successful requests. So what this feature allows us to do is generate a JWT token for that customer. We give it to them.
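One plausible way to do the "sensible shortening" of oversized stack traces mentioned above is to keep the head and the tail of the trace and drop the middle, which preserves the exception type and the innermost frames. This is a sketch of that idea; the size limit and the omission marker are assumptions, not the library's actual values or algorithm.

```java
import java.util.Arrays;

// Sketch: shorten an oversized stack trace by keeping the first and last
// lines and replacing the middle with an omission marker, so the whole
// trace still fits into one log message in the pipeline.
public class StackTraceShortener {
    static String shorten(String stackTrace, int maxLines) {
        String[] lines = stackTrace.split("\n");
        if (lines.length <= maxLines) {
            return stackTrace; // small enough, keep as-is
        }
        int head = maxLines / 2;            // lines kept from the top
        int tail = maxLines - head;         // lines kept from the bottom
        String top = String.join("\n", Arrays.copyOfRange(lines, 0, head));
        String bottom = String.join("\n",
            Arrays.copyOfRange(lines, lines.length - tail, lines.length));
        return top + "\n... [" + (lines.length - maxLines)
             + " frames omitted] ...\n" + bottom;
    }

    public static void main(String[] args) {
        // Build an artificially deep stack trace and shorten it.
        StringBuilder sb = new StringBuilder("java.lang.RuntimeException: boom");
        for (int i = 0; i < 200; i++) {
            sb.append("\n\tat com.example.Deep.call(Deep.java:").append(i).append(")");
        }
        System.out.println(shorten(sb.toString(), 20));
    }
}
```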
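The dynamic-log-level idea, completed in the next step, is that the token travels in a request header and raises the log level for that single request only. This is a much-simplified sketch of the control flow: the real library verifies a signed JWT before trusting the level claim, whereas this stub takes the level at face value and tracks it per request thread.

```java
// Sketch: a per-request log level (taken from a token, verification
// stubbed out here) overrides the global threshold for just one request.
public class DynamicLogLevel {
    enum Level { ERROR, WARN, INFO, DEBUG }

    static final Level GLOBAL = Level.WARN;                 // normal threshold
    static final ThreadLocal<Level> REQUEST_LEVEL = new ThreadLocal<>();

    // Would normally validate the JWT from the request header and extract
    // its log-level claim; here we simply trust the value for illustration.
    static void onRequestStart(String tokenLevelClaim) {
        if (tokenLevelClaim != null) {
            REQUEST_LEVEL.set(Level.valueOf(tokenLevelClaim));
        }
    }

    static void onRequestEnd() {
        REQUEST_LEVEL.remove(); // never leak the elevated level to other requests
    }

    static boolean isLoggable(Level level) {
        Level threshold = REQUEST_LEVEL.get() != null ? REQUEST_LEVEL.get() : GLOBAL;
        return level.ordinal() <= threshold.ordinal();
    }

    public static void main(String[] args) {
        System.out.println(isLoggable(Level.DEBUG)); // false: global is WARN
        onRequestStart("DEBUG");                     // request carries a token
        System.out.println(isLoggable(Level.DEBUG)); // true for this request
        onRequestEnd();
    }
}
```

All other requests keep the global threshold, so only the one customer's requests produce debug output.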
They will add that token to the header of their requests. In that token, we specify some log level, like, for example, debug. And once a request with this token hits our application, the log level will be changed for just this request, so we get more information about the request and possibly why it went wrong. So we come to the last dashboard that we provide, which is the statistics dashboard. This dashboard is about resources. Such a logging service consumes resources, which are limited and cost money, and for that reason, we've got some quotas in place. There's a quota, for example, for the log volume that you can log per hour, and also a burst limitation in case you have many, many logs within a short time frame, in order to protect the service, but also the wallet of the customer: if their application goes wild, they're not charged too heavily for it. Depending on the quota plan that you've chosen when creating the service, you will hit a limitation at some point in time if your apps are too verbose. And that would look like this. Here, on the right side of this dashboard, we've got a table again, which displays the logs that have been shipped successfully. And here we've got an app that has reached its quota limitation, and for this app, logs are being dropped. You can see this also in this timeline, where we observe log drops at this point in time. In that case, you either have to change the service plan that you're using, or you have to find a way to reduce the number of logs that your applications are emitting. With this, I would like to summarize the presentation. In order to not fly blind on SAP's Cloud Foundry, what you have to do is get the right tools by creating a service instance of the logging service. You open your eyes by binding that service to your application. And then you can switch on the autopilot by using one of our support libraries.
And finally, you can enjoy your Cloud Foundry journey with visual flight conditions. And with that, I would like to thank you and invite you to give this a try on our SAP Cloud Platform. Thank you. Are there any questions? Seems not to be the case. So thanks again and enjoy the last afternoon of the summit.