So, a warm welcome here on day 3 of the 21st Gulasch Programmier Nacht, here in the Medientheater. Today we again have a very interesting talk, this time about observability with the LGTM stack. I always want to say LGBT stack, and that would probably be something with Rust and Blåhaj again or so. But today we're going to talk about the LGTM stack, and for that we have CD here for you. A big round of applause, please.

Hi again. It's my third talk, so I'm sorry for the guys and girls who watched all three of them; I hope the third one is as entertaining as the first two. This slide everyone has probably seen once or twice already this weekend: I'm CD, I'm an SRE, I work with big and large systems and do resiliency engineering, but I've got a life outside of just my computer, so I enjoy my spare time doing Brazilian Jiu-Jitsu.

So, I prepared this entire talk, and afterwards I noticed, well, I'm talking about a lot of products from Grafana Labs, a for-profit company, and I realized maybe people would interpret this as a sales talk. It's not. I don't work for them, and this talk is not sponsored by them, but if someone from Grafana is watching: send some swag.

Before we get started on observability itself, I want to clarify some terms, because we're going to use them a lot in this talk. There's a differentiation between observability, monitoring and telemetry. Telemetry is the way we get our monitoring data from the servers to our monitoring system. Monitoring is the continuous observation of our health signals; this can be anything from metrics to logs to events to traces, literally everything. And observability is what we make out of metrics and monitoring: it's looking into the data, understanding what the system is doing, understanding system behavior by looking at monitoring.

When we talk about observability, you often hear about the three pillars of observability: logs, metrics and traces, and we're going to talk through every single one of them. And neatly enough, Grafana Labs has a product placed in every one of these categories. For the observability frontend, probably everyone already uses Grafana dashboards, right? You have probably heard of them; they are pretty nice. Sorry. For logs, we're going to look into Grafana Loki, for metrics and long-term metrics storage we're going to look into Mimir, and for traces we're going to look into Tempo. And that's why it's called the LGTM stack: Loki, Grafana, Tempo and Mimir.

We do need all three of those pillars, because every single pillar answers a slightly different question that we might have. When we look at our observability platform, we want to get some answers. Metrics are really, really good at answering the question: do we have a problem? Is there something going on in our system? Logs, on the other hand, are really good at telling us what the actual problem is. And connecting the two is sometimes really, really hard, so we need traces to bridge them, because traces tell us where the problem is, or where it might be.

Before we go through the more interesting stuff like Mimir and traces, we're going to start with logs, because they're pretty easy to explain. I think everyone has seen logs. They look something like this: they're usually large text files. Sometimes they're easier to read, sometimes they're harder to read. But there are some very bad examples, like multi-line logs. Great for looking at, great for humans to understand.
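As a hypothetical sketch in Go (not something shown in the talk) of how such an entry comes about: a single log call that includes a stack trace ends up spread over many lines.

```go
package main

import (
	"log"
	"runtime/debug"
)

func main() {
	// Anti-pattern: the stack trace spans many lines, so a log shipper
	// sees dozens of "entries" instead of one record it can parse.
	log.Printf("error opening file:\n%s", debug.Stack())
}
```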
But we usually don't want to look at them ourselves. We want log aggregation systems looking at our logs, we want automatic analysis of our logs, and if your log is multi-line we can't really do that, because it's really hard to parse. So we don't do that. We try to have logs on a single line: every line represents one entry, which is much easier for a machine to parse. Another anti-pattern is not giving any context, just putting out something like "error". Yeah, nice. I know it's an error, otherwise it wouldn't have logged anything, right?

So what we should do, and what probably everyone already knows, is structured logging. Here's an example of a structured log message. It's structured because we have individual fields in our log line, and every field annotates some context. It's still easy for humans to read, because you can see the log format: it's just a single line, you can jump over the uninteresting labels and just read the last part, the error opening some file. Still easy to read, but now your log aggregation system can parse it automatically. And if you use structured logging, you should stick to one of the standard log formats. I know it's easy to invent one yourself; you can just print key-value pairs in your printf statements. Don't; just use some standard.

But what is log aggregation, and why do we need it? You can log into your server and just grep or tail, or tail and grep, something. That works great. It doesn't work so great if you have half a million servers; you can't log into every single server and try to figure out what's going on. You need one central place where you can go and tail your logs from there. Also, in the type of work environment that I'm in, I can't log into every server. I don't have permissions for those servers, so I get permission for the log aggregation system, and now I'm able to view the logs, because now my access is very granular, just for reading the logs. Also, if you have log files that are tens or hundreds of gigabytes, good luck with grep. It's not the slowest tool in the world, but it takes some time and you probably don't want to do it.

So Grafana came up with this amazing tool called Loki. It had some minor hiccups in Loki v1, but now that we have Loki v2, we're really, really good. And if you're somewhat familiar with metrics and Prometheus, it sounds kind of similar: instead of having an exporter that exports your metrics to Prometheus, you have Promtail, and Promtail collects all your log files. It automatically detects all the log files on the system, it monitors syslog and everything. Promtail takes those logs and sends them to Loki, so Promtail acts as the sender. Promtail is going to be installed on every server that we run, and Loki is the central place we go to. And of course we access Loki through Grafana.

And this is roughly how it looks; I prepared a small screenshot here. You can see at the top we have our LogQL query. That's the query language used by Loki, and it is very similar to what we are already used to with PromQL for metric querying. And it's really powerful too. You can run log queries that return your log entries, just as we've seen here; we have some log messages below. But you can also do metric queries: you can use LogQL to write a query that searches your log files but returns metrics. So that's also super useful, for example if you want to do graphs like the one that I show here.
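As a rough sketch of those two kinds of queries (the job label value and the "error" filter are placeholders, not taken from the demo): the first line is a plain log query that returns matching lines, the second is a metric query that turns the same selection into an error-log rate you can graph.

```logql
{job="myservice"} |= "error"

sum(rate({job="myservice"} |= "error" [5m]))
```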
Because now you see the log volume, and you can create dashboards that show you the log volume of error logs in your microservice. So that's really good.

I already spoke a little bit about why we need Loki, or why we need log aggregation, but there are other tools out there, like Elasticsearch with the Kibana frontend. But Loki is amazingly easy to use. I've run Elasticsearch clusters in the past, and nothing was as easy as Loki, to be honest. I know Heiko uses Splunk for this; it's kind of enterprisey, kind of expensive if you want to do it in your company, but Loki is open source. It's free. There is enterprise support that you can buy, but it's free, it's open source. And it scales very well, and that's going to be a major theme across all these tools: they scale very, very well.

On to the other pillar of our observability stack, the metrics: do we have a problem? And I think everyone already knows that metrics are really good for identifying whether we have a problem. Metrics are usually emitted as time series: you have some sort of time series name and a value, and you record it in a time series database so you can plot it, just like in the screenshot here. Probably the most commonly known tool for doing this is Prometheus. Prometheus was one of the earliest Cloud Native Computing Foundation projects. It was initially developed as sort of an outside-of-Google re-implementation of something that Google had internally, and it became the de facto standard for metrics and time series data.

When we talk about metrics, there are usually two different approaches: push and pull. The other big contender for first place as the greatest metrics tool is InfluxDB, and I actually know a lot of people using Influx instead of Prometheus, but it comes down to how your system behaves. Influx itself is passive: it's just there, it's a service that sits there, and you need an active component for the metrics, most of the time Telegraf, which runs on your server and then sends the data to Influx. That's why we call Influx a passive system. Prometheus, on the other side, is more active: Prometheus pulls the data from our targets, and I think I visualized this quite nicely here. Pull is, most of the time, a little bit better than push. I personally prefer it; it's way less overhead in your application. I don't have to worry about sending my telemetry data from my application; all I have to do is expose a metrics endpoint and I'm done.
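As a minimal sketch of that pull model with the Go client library (the metric name, the label and the port are made up for illustration), exposing a metrics endpoint is roughly this much code:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// One time series family with a label carrying context,
// instead of one metric name per query type.
var queriesTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myapp_queries_total",
		Help: "Queries handled, partitioned by query type.",
	},
	[]string{"type"},
)

func main() {
	http.HandleFunc("/query", func(w http.ResponseWriter, r *http.Request) {
		queriesTotal.WithLabelValues(r.URL.Query().Get("type")).Inc()
		w.Write([]byte("ok\n"))
	})
	// Prometheus scrapes this endpoint on its own schedule; the app never pushes.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

Prometheus then only needs a scrape config pointing at that endpoint.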
Prometheus also gives us a central config location, while with Telegraf the configuration lives on each server, and when we want to make a change we have to change a bunch of servers at the same time, which is sometimes hard to facilitate. Also, to Prometheus metrics, same as to logs, we tend to add labels so we can carry context, like which server a metric originates from, like the hostname in this example. Or, for a hypothetical example, we can just have one time series for DNS queries and then have both the hostname and the type of query as labels. That saves us some active time series, because the more active time series, the more cardinality, and the harder it gets. I can't do Prometheus justice here; I only have an hour, and I could probably talk an hour just about Prometheus, so go and watch those other amazing talks.

Prometheus itself is great, I love it, I use it on a daily basis, but it also has some problems. Prometheus isn't really a long-term solution: the default retention time configured in Prometheus is 15 days. And maybe it's just me, but I think a lot of people are like me in this regard: I want to know how my system behaved last year at the same time, so I can make accurate predictions about how it might look next year, so I can do capacity planning and network traffic forecasting and those things. And Prometheus isn't really good for that, to be honest. There is the option of federating your Prometheus clusters and only federating aggregated data, but it's really, really hard and I wouldn't recommend it.

So the solution is Mimir. Mimir kind of adds to Prometheus; it's not a one-to-one replacement for it. It is a long-term storage solution and a multi-tenant solution for Prometheus, as an extension. Mimir in itself is 100% compatible with Prometheus: all your Prometheus queries still work, you don't have to worry about that. The architecture behind Mimir is slightly different, and we will go into that in a little while, but for now I just want to mention that it's insanely scalable.

So let's say we do have a bunch of Prometheus servers and we want to add Mimir to them. Mimir is just this big box, but I want to show you why it is so scalable. From Prometheus we send all of our metrics to the distributor of our Mimir cluster, and we do this using the Prometheus remote write function, which is completely built into Prometheus. The distributor then sends the data to the ingester, and the ingester stores the data in an S3 bucket, so it's good to have S3 storage available, as in most cases nowadays. The nice part is that the distributor and the ingester can be horizontally scaled: you can have as many of those components as you want. If your workload increases, just deploy ten million distributors, whatever.

If you want to query the data, it's a little bit more complex. You write your query in Grafana, the query is sent to the query frontend, from the query frontend it's put into a worker queue, which is called the query scheduler, and picked up by the querier, the querier being the active worker. From there it queries both the ingesters, in case there are any time series that haven't been written to the S3 storage yet, and the S3 buckets directly. And at each step there we have a cache in between, so this improves scalability a lot, and if you run the same query twice, say you have a whole engineering team working an incident and everyone hitting refresh with the same query, the query result will be in the cache, so it's blazing fast. And last but not least we have a compactor, which is a separate microservice that you can enable so you're not piling up hundreds of gigabytes of metrics; it occasionally compacts the data and removes unnecessary labels. And as you can tell, this slide took a very long time to build.
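On the Prometheus side, hooking into that distributor is just a remote write block in the configuration; a minimal sketch, where the URL and the tenant ID are placeholders for your own setup:

```yaml
# prometheus.yml (sketch): forward everything this Prometheus scrapes to Mimir
remote_write:
  - url: http://mimir.example.internal/api/v1/push
    headers:
      # Mimir is multi-tenant; the tenant ID travels in this header.
      X-Scope-OrgID: team-a
```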
So, advantages of Mimir: it's massively scalable, and it is very cheap in terms of storage if you compare it to Prometheus. Prometheus storage usually depends on how big a disk you can put in your server, and if you have your Prometheus on a Hetzner cloud VM, you have to provision an even bigger server or attach additional drives to it; that gets expensive quite fast. So using S3 buckets is pretty cheap. Another advantage is that you can have a single Alertmanager sending alerts for all of your tenants.

The last pillar that I want to talk about is traces. A trace is, by definition, the recording of the whole path taken from the beginning to the end through the microservice architecture. So if we look at this example here, on the left we see that our request hits the edge service A, and then we branch into four different microservices to fulfill this request. When doing tracing, we pass along a context between microservices, so we have this context object passed along every time, and we can nicely visualize this in the waterfall diagram, where you see the time progressing and each block is called a span. A span is the representation of a unit of work. Using this diagram we can see that the sub-process is probably the part that takes the longest in our request.

The nice thing about spans is that, as I said, they represent an entire unit of work, and we can attach attributes to them, just like with metrics and logs, and we can also attach events to them. You will see this later in my short demo, and this really helps us understand where the problem is, because if there is an error somewhere along the request flow, we would see it annotated in our span: this span is the one that actually failed.

When we talk about tracing, it's hard to talk about it without talking about OpenTelemetry and Jaeger. I don't want to go into the details of Jaeger, because it's an entire ecosystem of its own; instead, there's OpenTelemetry, which was born as the child of OpenTracing and OpenCensus and is maintained by the Cloud Native Computing Foundation, like a lot of projects nowadays. One of the key things that I really like about the OpenTelemetry framework is the collector binary. It kind of works like a gateway: you can push data into it, you can push OpenTelemetry traces, you can also push Prometheus metrics, and then you can use the collector as sort of a gateway or a transformer and send them somewhere else. For example, you could send Jaeger data into the collector and export it as OpenTelemetry, which is kind of nice, and this works seamlessly with Grafana Tempo. The most common use case, or the way I use it in my scenarios, is that I take my Prometheus metrics and my OpenTelemetry data, pipe them into the collector, and from there I split the pipelines, so the logs go to Loki, the traces go to Tempo and the metrics go to Mimir, which is kind of nice. And I would recommend using Tempo instead of Jaeger, because it is very well integrated into the Grafana ecosystem and it is easy to jump from your metrics to your traces, because you can actually generate metrics from traces: every trace, as we have seen before, has a duration, so we can extract this duration and save it as a time series metric. The same goes for logs: you can jump from your traces to the logs.
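On the application side, emitting such spans with attributes and events looks roughly like this with the OpenTelemetry Go SDK; this is only a sketch, the service and operation names and the error are made up, and it assumes a tracer provider exporting to the collector (or straight to Tempo) is configured elsewhere.

```go
package main

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// Without a configured tracer provider this falls back to a no-op tracer.
var tracer = otel.Tracer("edge-service-a")

func handleRequest(ctx context.Context) error {
	// The returned ctx carries the trace context into downstream calls.
	ctx, span := tracer.Start(ctx, "handle-request")
	defer span.End()

	// Attributes annotate the span, just like labels on metrics and logs.
	span.SetAttributes(attribute.String("user.id", "42"))

	if err := subProcess(ctx); err != nil {
		// The failing span gets the error recorded on it, which is what
		// shows up annotated in the waterfall view.
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	span.AddEvent("sub-process finished")
	return nil
}

func subProcess(ctx context.Context) error {
	_, span := tracer.Start(ctx, "sub-process")
	defer span.End()
	return errors.New("open /some/file: no such file or directory")
}

func main() {
	_ = handleRequest(context.Background())
}
```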
One final piece of the puzzle here is exemplars in Grafana. They are part of the OpenMetrics spec, so I pulled this example screenshot from the OpenMetrics spec. You basically annotate your time series data with an example trace: you can see the foo bucket here, there are some requests that fall into this bucket, and one exemplar would be this trace ID here. You can just add this at the end of your time series. It's quite easy to do; I was able to integrate it into all of my services with relative ease.

So, let's look at how it all comes together. I prepared a short video for it; I am not doing a live demo, that always fails, so you have to watch the video with me. There's no sound, so I will talk you through it. I think everyone has seen Grafana dashboards at this point, it's pretty common, pretty straightforward. I want to highlight the latency and the dashboard latency here on the slides, because those two graphs actually contain exemplars. Let's look into it: we can jump from our Grafana dashboards into the traces really easily, and now I just select one of the traces and see the waterfall diagram here on the right. I see that there is an error, and I can look at all the error messages from there, or I click the "logs for this span" button and now I am in my logs. You see the LogQL query on top; here I add the JSON parser to it so I can actually parse the log message, and just clicking on the log message expands it and displays all the fields just like I want them. The other example that I want to show is the dashboard latency. In this example application I have a dashboard querying an API, and if I look at the trace here, I see the spans are in two different colors; that actually means that there are two different microservices involved, and I can even jump to the dependency graph here, see which microservice called which other microservice, and see all the spans in this sort of node graph.

Now, I was talking about all of this being open source, and sure, you can use it open source, you can run it on your own infrastructure, but that comes with some disadvantages. And it's not me trying to sell you a Grafana product, but there are some costs involved in running your own infrastructure. I mean, it's nice: on the pro side you control your data, your data is not leaving your premises, and sometimes that's exactly what you want. But it's also kind of expensive, because you quickly need an entire observability team. Sure, I can run it for myself, by myself, and even host some other stuff on it, but it gets messy quite fast; if I want to performance-optimize my LGTM deployment, I'm struggling. Also, if you run your own monitoring, you always run the risk of losing your monitoring system: if you have a data center outage, you also lose your monitoring infrastructure, because it's in the same data center. And if that happens, you're essentially in the dark; your engineering teams try to debug the problem, but they can't, because they can't reach your observability stack anymore, because it's down with the rest of the infrastructure, which is shit. So you need additional monitoring, you need outside monitoring, you need even more monitoring, and you go into a spiral of death really fast. So there is some advantage to buying a commercial product. I don't want to sound like the guy who says come into the WhatsApp group or something, but you might consider it: if you run it at home, yeah, run it yourself, sure; if you run it for your business, maybe consider buying it from Grafana Labs directly and letting them host it on their cloud.
Yes, it's kind of expensive, but at the same time you're funding open source development, and you make life easier for people who cannot afford buying from Grafana Labs. So try to figure out what you can afford: can you afford the observability team, or can you afford putting it into the cloud and not worrying about it?

Because I only had so much time, I just couldn't include everything that I wanted to talk about, so I recommend watching some other talks, reading some white papers and researching it for yourself. And that's about it. I was a lot faster than I anticipated, but that leaves more time for Q&A. Yeah, thank you.

Okay, great, so do we have any questions? Oh, come on. Okay, one in front, fantastic, so I have to walk over there. Do you have any experience migrating from an existing logging system, like Kibana, to Grafana? Somewhat, yes. I made the migration from Loki v1 to Elasticsearch and back to Loki v2, and the best way to describe it: set a cutoff date, let's say the 15th of next month, and then you ingest all your data from that date going forward only into the new logging system and keep the old data around. If someone needs to look into historical logs, send them to the old system, and after a year or two, or whatever your retention period is, turn the old service off. It is super painful to bring logs over to another log aggregation system.

Okay, more questions? I don't know. And it really does scale, right? I once had the problem that I had kind of like World of Warcraft logs in Prometheus, and at one point during a 40-person raid the server just did not collect metrics anymore. So, I showed you the architecture in more detail for how Mimir works; it works quite similarly for all three of the microservices. Loki works very similarly, and Tempo also works very similarly to that. They all store the data in S3 buckets, and they have roughly the same architecture with the distributors, the ingesters and then the query frontends, and everything ties together nicely. Looks good to me.

Okay, nice. You can always reach out on my socials, I put them here on the slide as well, feel free. Don't use Twitter, text me on Mastodon.