A warm welcome from me to my presentation about the long and winding road to application logging. My name is Istvan Baloch and I work for SAP in Karlsruhe, Germany.

First, let me establish the context of my presentation and say a few words about the SAP Cloud Platform. You might ask yourself what the SAP Cloud Platform can do for you: it allows you to quickly extend existing cloud and on-premise applications, and it supports new kinds of services like IoT. This is not my area of expertise, though, so I would like to refer you to the expert talk by Ben Krani, which, if you missed it, you can check out in the recordings on YouTube, as well as the other talks in this polyglot in-memory tech track.

Recently Björn Goerke, our CTO and Cloud Platform president at SAP, who also gave a keynote here, wrote a blog post in which he summarized the innovations around the SAP Cloud Platform that were announced about a month ago at the SAPPHIRE NOW event in Florida. For me personally, one of the biggest announcements was that the Cloud Foundry environment in the SAP Cloud Platform is now generally available.

If you know the design principles of twelve-factor apps, you might expect a cloud platform to provide several services, one of them being the ability to collect and aggregate your application logs. The SAP Cloud Platform supports that: if you give it a try and check out the cockpit, there is one service in the DevOps category called Application Logs. As a Cloud Foundry developer, it works as you would expect it to work: you create a service instance of that service in the Cloud Foundry marketplace, and after you push your app you bind it to that service instance; then you can inspect your logs in pre-built, use-case-focused Kibana dashboards.

So much for the introduction. I work in a team at SAP in Karlsruhe, Germany, which is responsible for providing this service. Although I joined the team less than a year ago, I am here to tell you about our two-year-long journey to the robust and scalable application log ingestion pipeline that we have built. More specifically, the focus of my presentation is the interaction between the Cloud Foundry components for log handling and the log ingestion pipeline.

Two years ago we leveraged user-provided services and the syslog drain streaming feature to stream application logs to a third-party log ingestion pipeline. We faced some issues regarding usability:
It was not possible to filter application logs based on org, space, or app names. So about one and a half years ago we moved on to the Loggregator firehose. Using the firehose meant instantaneous adoption, because the firehose conveys all the logs from all the applications in the given landscape. It was a good lesson for us in improving the resilience of our stack, and especially in large-scale deployments we had to face some challenges. One of the lessons learned was to introduce quotas and burst limits. Just recently, about half a year ago, we figured that we could offer differentiated service levels for application developers if we provide different quotas.

So, in this presentation I would first like to talk about the ingestion pipeline, then about Loggregator, and then about the three stations along this road.

An introduction to the ingestion pipeline follows. We are using the logsearch BOSH release provided by the Cloud Foundry community to deploy the Elastic Stack in our landscapes with BOSH. The Elastic Stack is composed of Elasticsearch, a distributed search engine based on Apache Lucene; Logstash, a data processing pipeline that you can use to ingest log messages; and Kibana, a visualization platform that allows users to inspect the log messages indexed in Elasticsearch.

The architecture, as shown on the diagram, follows the path of the log messages. The log messages arrive at Logstash instances called ingestors, which forward them to a Redis queue as fast as they can; they do not do much processing, to avoid exerting back pressure on the syslog source. On the other side of the Redis queue, the Logstash parser nodes pick up the log messages, parse them, and write them to the Elasticsearch data nodes. In front of Kibana we have an open-source reverse proxy component, which performs user authentication with the Cloud Foundry UAA component and also restricts parts of the Kibana user interface; for example, our pre-built dashboards and visualizations are made read-only. This component is also responsible for dynamically creating Elasticsearch aliases to restrict access to the application logs that application developers are supposed to see in the given orgs and spaces.

As a next step, let's move on to a short overview of the Loggregator subsystem. Maybe you have heard the presentation by Adam Hevenor, the PM of Loggregator, so this will be just a short summary by comparison. The Loggregator subsystem provides a highly available and secure stream of logs and metrics for user applications and system components. It can stream application logs in the syslog format, and these can be forwarded to a third-party component such as the Elastic Stack shown before.

This is the standard diagram of the Loggregator components. I do not intend to go through all of these subcomponents, but rather show just the ones that are relevant for my talk. Each app is executed in app containers, and the collocated Metron agent picks up each line of the standard output and standard error streams of the app, packs it in an envelope, and forwards it to the Doppler servers. The Doppler servers can stream these logs to a syslog drain endpoint.
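To recap both halves, the path of a single log line through this first option looks roughly like this (my simplified sketch of the diagrams just described):

    app stdout/stderr
      -> Metron agent -> Doppler -> syslog drain
      -> Logstash ingestor -> Redis queue -> Logstash parser
      -> Elasticsearch data nodes -> Kibana (behind the reverse proxy)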
So this is one possibility to attach a log analysis pipeline. Furthermore, the application logs are collected by the Traffic Controller component, and together with component metrics they are exposed in the firehose. The firehose WebSocket endpoint streams all the logs from all the applications, and you can use the firehose-to-syslog nozzle to filter only for the application logs and forward them in the syslog format to a syslog endpoint. So in this diagram you can see that if you want to attach your Elastic Stack to the Cloud Foundry Loggregator subsystem, you have two possibilities: the Doppler syslog drain or the firehose-to-syslog nozzle.

About two years ago we started with user-provided services. What is a user-provided service? A user-provided service allows application developers to use services which are not available in the Cloud Foundry marketplace, and it can be used to trigger the streaming of application logs to a syslog-compatible consumer. In this case the architecture looks as follows: the stream provided by the Doppler server is conveniently in the syslog format, which can be plugged into the entry point of the Elastic Stack, the ingestor Logstash instances. This configuration is fairly straightforward, and it helped us two years ago to start fast and offer our service internally after a short development cycle.

The documentation on the Cloud Foundry website is quite clear, and I have this slide only because one of my colleagues took the Cloud Foundry certification exam recently, and he said that one of the 19 questions he faced was exactly this: how do you use user-provided services to forward logs to a syslog drain? You create a user-provided service and specify the syslog drain endpoint with the -l option, and after you bind your service instance to your app, the logs start to flow and can be inspected in Kibana; the commands are sketched below.

If you are not so familiar with the syslog structure, I think it is quite instructive to dump the syslog messages to standard out and inspect them with a small TCP server implementation in Ruby, like the one shown after this paragraph. If you did that today, you would see something like the sample frame below: the message length, the priority, the version, the date, and, in the hostname field of the syslog message, the concatenation of the org, space, and application names. This is actually a recent improvement in the syslog format. Unfortunately, due to the limitations of the syslog protocol and the hostname field, it is not guaranteed that the org, space, and application names are represented exactly as they were specified by the application developer, so we hope that in the future the better-suited structured-data field of the syslog protocol can be used for this.
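Creating and binding the user-provided service looks like this with the cf CLI; the drain host and the service and app names are placeholders:

    cf create-user-provided-service my-drain -l syslog://logs.example.com:5514
    cf bind-service my-app my-drain
    cf restart my-app   # on older CF versions the drain takes effect only after a restart

And here is a minimal sketch of the Ruby dump server I mentioned; it just prints whatever the drain sends, and the port is arbitrary:

    require "socket"

    # Accept TCP connections from the syslog drain and dump the raw
    # bytes to standard out, one thread per connection.
    server = TCPServer.new(5514)
    loop do
      Thread.start(server.accept) do |client|
        begin
          loop { print client.readpartial(4096) }
        rescue EOFError
          client.close
        end
      end
    end

A dumped frame then looks roughly like this (all values made up; the leading number is the octet count of the frame, followed by the priority, version, timestamp, and the org.space.app concatenation in the hostname field):

    96 <14>1 2017-06-20T07:36:47+00:00 myorg.myspace.myapp appid [APP/PROC/WEB/0] - - Hello from my app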
This brings me to the issue that we faced two years ago. This is a screenshot from early 2015, from our first dashboard. If you look closely, you can see a GUID in the search field, because at that time the log messages lacked human-readable metadata like the app name, org name, and space name. So we had to use a workaround: we fetched this metadata periodically from the Cloud Controller and created links in a portal app, which prepared filters in Kibana so that application developers could see the logs for a given application. This was the usability issue that motivated us to try the other approach, the firehose endpoint, to be able to drill down to application logs using org, space, and app names.

So about one and a half years ago we started to use the Loggregator firehose. In that case the architecture diagram looks like this: the first part is the same, the log messages are collected by the Traffic Controller, the firehose-to-syslog nozzle filters out the container metrics, and the application logs are forwarded in the syslog format to the entry point of our stack. Otherwise it is the same.

There is one convenient feature of the firehose-to-syslog nozzle: it talks to the Cloud Controller and fetches the metadata I was talking about, the org, space, and app names, and it annotates each log message with that metadata. Note that the firehose-to-syslog nozzle also uses a local database to store this metadata, which proved very valuable for us, because whenever the nozzle was restarted, or had to be restarted for some reason, fetching all that metadata for thousands of apps could take a considerable amount of time; it is better to reload that information from disk. With that, we could build dashboards where users could drill down to organizations, and human-readable names could be used.

Using the firehose also meant instantaneous mass adoption, because the firehose conveys the application logs of all the apps. It was a very valuable experience for us, kind of like a constant load test, and it helped us to understand the weak spots of our stack and strengthen them.

So what are the lessons that we learned from using the firehose in production, in large-scale deployments? First, keep an eye on the log format changes in the Cloud Foundry platform: the Gorouter logs, for example, have changed several times, and using integration tests helps to track these changes. Also, we had to purposefully reduce the maximum message size to protect the stack. In Elasticsearch there is a concept of field type mappings, and we figured that if the log documents used incompatible field types, it could cause errors in the Logstash parsers, which could even fill up the disks. So it was necessary to configure the field type mappings in a way that ignores incorrectly typed values.
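As a hedged illustration of that last point (the field name is made up and details vary by Elasticsearch version): Elasticsearch lets you mark numeric or date fields with ignore_malformed, so that a document carrying, say, a string in a number field still gets indexed and only the offending field is skipped, instead of the whole document being rejected:

    "response_time_ms": {
      "type": "float",
      "ignore_malformed": true
    }

There is also an index-level setting, index.mapping.ignore_malformed, that applies this as a default to the whole index.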
Furthermore, we have a set of pre-built dashboards, which I will show later, and to keep the Elasticsearch aggregations on those dashboards working, we had to make sure that the log fields are of the expected type: if Elasticsearch expects a string, it has to be a string; if it expects a number, it has to be a number. At first we supported parsing arbitrary application logs if they were formatted in JSON, but as we had issues with these Elasticsearch mappings, we limited application log parsing to logs that were written with our logging libraries. This brings me to mentioning that we have open-source logging libraries in Java and in JavaScript for Node.js, which help you to write structured logs as JSON documents.

We also faced the issue that there were some chatty apps, typically apps that were not using those logging libraries and were producing exception stack traces in a tight loop. Without the logging libraries, an exception stack trace can contain a couple of hundred lines, which are all interpreted as separate log messages. So we had cases with thousands of logs per second which were actually only garbage. We came to the conclusion that to protect our stack we should throttle these chatty apps, and we introduced quotas.

What are quotas? Quotas are just rate limiting: apps are allowed to log only a certain number of log messages per second, or per some longer interval, and if apps log more, we intentionally drop those log messages at the entry point of the pipeline; a minimal sketch of such a rate limiter follows below. To create a feedback loop for application developers, we emit artificial messages which convey this information, so a dashboard can show that for a given app so many logs were processed by us and so many logs were dropped by us. So please, application developers, check whether your logs really make sense for you; otherwise, fix the application.

This was a measure to protect our stack, but later, after we had a stable implementation of the quota tracking, we figured that this feature could also be sold as a new capability: we could offer differentiated service levels by offering different quotas. Another aspect was the retention period for logs. What we actually wanted to do was offer a logging service in the Cloud Foundry marketplace with different service plans, where each service plan is a combination of a quota and a retention period, and we couldn't easily do that with the firehose approach.
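Here is a minimal sketch of such per-app rate limiting as a token bucket, written in Ruby for brevity; all names and numbers are illustrative, not our production code. Each app gets a bucket with a sustained rate and a burst capacity, and messages that find the bucket empty are counted so that the artificial feedback message can be emitted later:

    # Token bucket: refills at `rate` tokens per second, up to `burst`.
    class TokenBucket
      def initialize(rate, burst)
        @rate = rate.to_f
        @burst = burst.to_f
        @tokens = @burst
        @last = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      end

      # True if one message may pass now, false if it must be dropped.
      def allow?
        now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        @tokens = [@burst, @tokens + (now - @last) * @rate].min
        @last = now
        return false if @tokens < 1.0
        @tokens -= 1.0
        true
      end
    end

    def forward(message)
      puts message # placeholder: hand the message to the next pipeline stage
    end

    buckets = Hash.new { |h, app| h[app] = TokenBucket.new(100, 500) }
    dropped = Hash.new(0)

    # For each incoming message: forward it, or count the drop so that an
    # artificial "N messages dropped" line can be emitted for that app later.
    handle_message = lambda do |app_guid, message|
      if buckets[app_guid].allow?
        forward(message)
      else
        dropped[app_guid] += 1
      end
    end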
So yet again we did something different, and this is what we call the syslog forwarder service. In this case the architecture looks like this. Instead of using the firehose, we consume the syslog drain from the Doppler servers only for the specific apps that the user bound to our service instances. For that we have a forwarder component and a service broker component. When the user creates a service instance via the Cloud Controller and later binds that service instance to an app, the service broker returns a syslog drain URL that points to our forwarder instance, so that the Cloud Controller can instruct Doppler to forward the logs to our forwarder endpoint. The service broker also informs the forwarder instance that there is a new binding, so that when the forwarder receives logs corresponding to that binding, it can track the quota usage according to the service plan, and it can also tag the log messages so that, farther down the pipeline, we can decide to write them to one or another Elasticsearch index, thereby offering the possibility of different retention periods. A sketch of the broker interaction follows.
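To make the binding flow concrete: a Cloud Foundry service broker that wants to receive application logs declares the syslog_drain requirement in its catalog and returns a syslog drain URL in its bind response, which the Cloud Controller hands to Loggregator. A minimal illustration of such a bind response, with a made-up host name, could look like this:

    {
      "credentials": {},
      "syslog_drain_url": "syslog://forwarder.example.com:6514"
    }

From the application developer's perspective it is the usual marketplace flow; the service and plan names below follow the marketplace entry I showed, though the exact identifiers in your landscape may differ, and the instance and app names are placeholders:

    cf marketplace
    cf create-service application-logs lite my-logs
    cf bind-service my-app my-logs
    cf restart my-app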
From the developer's side this is well known: if you check the available services in the Cloud Foundry marketplace, you see this Application Logs service, currently with the lite plan, which is available for every trial user; for production deployments we would have different plans. After you create a service instance and bind your app, the logs start to flow, with different quotas according to the plan and also with different retention periods.

I still have some time, so I can show you some of the pre-built dashboards we have, which are based on the structured logs. This is the overview dashboard, which helps to understand the evolution of logs and basic KPIs like the number of failures or the maximum response time. Here is another dashboard; you can notice that in the lower part of the dashboard you can always drill down to the org, space, and apps. On this network and load dashboard you can see the distribution of the payload sizes. This dashboard is probably the most useful one if you want to track down errors in the application, because it shows performance issues with the maximum response time distribution and also the number of the different HTTP status codes that are returned. And if you have such a long-running request, you can use the correlation IDs to drill down on that request; here you have the application logs and the request logs, and you can further investigate what went wrong in the application.

So what did I walk you through? In the past two years we started with the user-provided service and the application log streaming capability of Loggregator; this helped us to start fast, but we were missing the metadata I was talking about. Using the Loggregator firehose we had the metadata, but we got all the logs, we faced issues in large-scale deployments, and we were not able to offer differentiated services. So now we have gone back to the syslog drain feature and created a service broker and a forwarder component to offer differentiated services.

So what are the takeaways that you could remember from this presentation? To have useful dashboards you need structured log messages, and I suggest you go and check out our open-source logging libraries in Java and JavaScript, which for example help you to write exception stack traces in a single line so that they don't end up as separate log messages. To improve the resilience of the pipeline, we figured it is necessary to introduce quotas, that is, to rate-limit what the applications can do, and also to provide feedback to application developers about their missing logs. Then you might ask: how should the quotas be set? Our answer is: let the customer decide, and offer multiple combinations of quotas and retention periods in the form of service plans, which can then be selected by the user.

So this was my walkthrough of our history of providing a scalable and robust application logging pipeline. If you have questions, feel free to ask now or also offline. Thank you very much for your attention.

Okay, so there is one question. So, we drop these logs, and what we do is provide a dashboard for application developers that tells them they log too much. Normally this is the case when something goes wrong in the application. In production deployments we haven't had that, but if the application is in a very early phase of development and something is just plain wrong and it is constantly throwing exceptions, then this is where we end up. We try to scale the quotas in a way that normal usage is supported.

Yes, there is another question. In this case the log is dropped, so you can't see that particular exception, but probably you have already had hundreds of exceptions of the same type that you can diagnose. This is how we currently protect the stack, because what we've seen is that it is mostly apps producing garbage logs that are affected.

Okay, the second question. Yes, yes: the logging library that I mentioned is actually formatting the log message as a single-line JSON document, and the stack trace is just written as an array. If you don't use the logging library, then you end up having several log messages for the stack trace, which we wanted to avoid; this is why the documentation of the SAP Cloud Platform recommends using these logging libraries.

Are there any more questions? There is one more. So for application logs, besides cf logs, which they can of course use, this is the offering that we currently have on the cloud platform. What we get is all the logs that the Loggregator subsystem provides us, so we also have the logs from the router. We have a separate stack for the platform logs, but generally, if the application crashes, I think we wouldn't have those logs from after the crash; I'm not so sure about that at the moment. So, thank you very much.