I like to start talking about observability with a few basics about how to think about data and how to think about observability, and then we get into the cloud-native part.

If you look at this, this is a screenshot from a Janitza interface, which shows you how power levels fluctuate over time. In the circled area, you see a few spikes in voltage. Going back a little, this is the Stahlschlüssel (the "Key to Steel") from Germany. Any property of any steel you can commercially buy on Earth, anywhere, and this includes Japan and everything, can be found in this one book. All of those chemical and physical and other properties have been distilled into numbers. Going back even further, this is a logbook from a whaler, where they logged where they were, where they went, and what they caught. This is the oldest letter we know of, 4,000 years old. It is a complaint letter, because the seller offered the wrong grade of copper, the buyer was still making provisional payment, and the seller was rude to the buyer's servant. And this is the oldest writing we know of, 6,000 years old. It tabulates who owned which slaves at what time.

All of those things have one thing in common. Humanity, at its core, took anything that was of interest, commercial or societal or whatever, wrote it down in a detailed account over time, figured out that this wasn't efficient enough, and distilled it into logs, into key events. And then, when those became too much, it started distilling those into numbers to do actual math with. You'll see why this is relevant in a bit.

Let's start from a different angle: observability and SRE. So who here has heard of observability? Interesting. Haven't heard of SRE? OK. Different crowd. So observability is, basically, a fancy term for monitoring, and at least in the cloud-native scene, within the CNCF and such, the term is everywhere. It's the obvious thing which you must be doing. SRE is Site Reliability Engineering, and it describes how Google runs their production services internally. Both of them, at least in the cloud-native world where I'm coming from, are absolute buzzwords, and everyone claims to do them, even though maybe they are not.

So let's talk about buzzwords for a bit. Buzzwords at some point become more or less meaningless. They lose the initial reason why they were useful, but they still have a kernel of truth. One thing which is super common in human culture, in human nature, is so-called cargo culting, where you observe behavior and want to replicate the success of that behavior without actually understanding what comes in between. The term comes from the Pacific Islanders during World War II. They observed that soldiers were building small lighthouses and landing strips and such, and they observed that the gods sent gifts from heaven, as in, planes came and dropped cargo. To this day, there are proto-religions in this region where they build small lighthouses and small landing strips as part of their local religion, because their ancestors' desire to replicate the success of the others, without actually understanding it, was so strong.

Also, just to check: who knows what monitoring is? I hope so. Raise your hand. OK, that's a tough crowd. Monitoring is an automated way to extract data from your systems to make certain that the actual system state matches the desired state. So, are my services up? Are they replying within X amount of time? This kind of thing.
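As a rough sketch, such basic checks might look like this in a Prometheus-style query language (we'll get to Prometheus later); the metric and threshold here are illustrative assumptions, not from the slides:

```promql
# Is the desired state met? "up" is a per-target metric Prometheus records on every scrape;
# a value of 0 means the target did not answer.
up == 0

# Are services replying within, say, half a second?
# (probe_duration_seconds is a typical blackbox-probe metric; the name is an assumption here.)
probe_duration_seconds > 0.5
```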
So I don't have to do this joke on how observable... whatever.

Both monitoring and observability are, at their core, about collecting data and making use of that data. In particular, observability emphasizes the usage of the data. The original definition of observability, from control theory, is: how much of a system's internal state can I deduce by just looking at its inputs and outputs? So if you cannot ask why something is not working, it's not going to be very useful. Just stating that something is the case is not completely useless, but observability is oftentimes about asking deeper questions. That joke bombs here, OK? If you can't ask new questions on the fly, it's not observability.

Another important thing to think about while you provision and run your services, in particular in the cloud-native world, is complexity. I personally like to distinguish between fake complexity and real, system-inherent complexity, and you should always try to get rid of fake complexity. Real, system-inherent complexity is something you cannot do away with. If you sell shoes online, you have to store who paid and what shoes you have; you have to deal with that complexity. Also, if you are subject to regulations, for example in finance, those are, for your intents and purposes, real, system-inherent complexity. You cannot just do away with this complexity, but you can move it around, you can compartmentalize it, and ideally you distill meaning from it, so that others who are not experts in that part can hopefully still get use out of it.

So, services. Services, to me, are compartmentalized complexity with interfaces to the outside world. They have different teams, different owners, and contracts define those interfaces. Why do I like the term contracts? Because it tells you that you actually need to sit down and agree, in writing. You cannot just say, "Oh, I didn't think my system would be doing this, it's not my fault." No, you need to agree upfront on what the properties of the system are. The internet's layered model is a super common example: without the different layers, physical, link, IP, TCP, UDP, and so on, the internet literally wouldn't exist in its current state. Other examples: even if you cook for yourself at home, you won't grow every last cucumber. You always have service interfaces to other services, which you consume and which you provide.

And now, again, I thought people here would know what SRE is. SRE stands for Site Reliability Engineering; that's the name for the people at Google who actually run the services. It's a hugely successful model. They have written books about it, and arguably it jump-started quite a large fraction of the cloud-native space. At its core, to me, this is about alignment of incentives. Normally, you have your software engineers, who are being paid by features shipped; you have the ops people, who are being paid by things not breaking; you have the product people, who need to ship new features. Everyone is fighting each other; they don't really pull together. Error budgets change this, because you introduce one common currency: is the service up, is it quick enough, is it returning errors? If any of those go wrong, everyone spends from the same error budget. So it doesn't matter if the developers are really aggressive in their feature development, if product is doing A/B testing, or if the thing is hard to run and the operations people have issues and it breaks all the time: everyone is all of a sudden talking about the same fundamental thing.
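To make that common currency a bit more concrete, here is a rough sketch of what such service-level indicators could look like as Prometheus-style queries; the metric names, job name, and thresholds are illustrative assumptions, not from the talk:

```promql
# Availability: fraction of scrapes over 30 days in which the service was up
# (the job name "shop-frontend" is purely illustrative).
avg_over_time(up{job="shop-frontend"}[30d])

# Latency: fraction of requests in the last 5 minutes answered within 300 ms,
# assuming a histogram metric named http_request_duration_seconds with a 0.3 s bucket.
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
  sum(rate(http_request_duration_seconds_count[5m]))
```

Whether ratios like these stay above their targets is then the same number for developers, operations, and product.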
SLI, SLO, SLA are, again, buzzwords in the cloud-native space. They are, respectively, what you measure, what you want to hit, and what you must hit. It's really that simple; a lot of people are confused about this one. And finally, another advantage of this common approach across the org is shared understanding, where everyone is using the same tools and dashboards, hopefully, which means you're looking at the same numbers. You don't have to have a fight when you meet with your manager and that other team and everyone has a discussion about whose numbers are the least bad, and one thing is green and the other is yellow and one is top-down. You don't have those fights if everyone is working with the same underlying data and the same dashboards.

Alerting. Who of you runs production workloads, as in, is responsible for production? OK. So those people are, hopefully, on call for their services. On call means that if something breaks, the computer automatically calls you and tells you about it, for you to then fix whatever is broken, or prevent whatever you can. And the thing is, a lot of the time people alert on causes, like "that database is down". That's not what your customers, your users, actually care about. They care about: is the service quick, is it slow, is it returning errors, can I buy this pair of shoes? Your operations people are humans, so ideally they get rest and they get sleep, because otherwise they won't be at their peak performance. So this slide argues that you should only be alerting on things which are currently or imminently customer-impacting. Everything else, you can raise a ticket, you can send email, you can deal with it during business hours. You don't have to wake someone up in the middle of the night because that one machine is down; as long as the end-user services are up and running and the money still flows, no one cares about that one machine.

So that's the base layer for getting into Prometheus and other technologies. Who here actually knows what Prometheus is? And who here knows what the Cloud Native Computing Foundation is? Yeah, that's a tough crowd, but that's good; spread the knowledge. Prometheus is a monitoring system. It's metrics-based. And again, building on what Google did, it is heavily inspired by Google's Borgmon, which is part of what allowed Google to grow in their early years; without this, Google would not be as large. Prometheus is an open-source re-implementation of Borgmon; we'll see a little more about this later. It's a time series database, and I'll explain what those are in a bit. Everything internally is 64-bit values, which makes it extremely efficient on modern architectures. There is a system of instrumentation and exporters: instrumentation means you put code into your applications and directly emit data about that application; exporters translate from MySQL, from Oracle, from whatever, into something which Prometheus can understand. It's not for event logging. And dashboarding is commonly done via Grafana. The main selling points: it has highly dynamic, built-in service discovery.
So if you run Kubernetes, or if you run on any cloud provider, or even if you just have a DNS zone or a file where all your services are listed, you can literally just give this to Prometheus and it starts scraping, it starts monitoring all those services, all those endpoints. You don't have to do manual configuration or anything; it's all fully automated. Again, if you use Kubernetes or any cloud provider, everything happens as if by magic.

It doesn't have a hierarchical data model; it has an n-dimensional one, so you can slice and dice your metrics as you need. Because if your hierarchy is region, country, data center, customer, and then you want to group by customer, everything is wrong, and you need to go back up the tree and handle all the special cases. You don't have to do any of this when everything is attached as key-value pairs to your data and you can just select by whatever key-value pair you want. There's one single language which you need to learn for everything: processing, graphing, alerting, everything. It's really simple to operate and it's highly efficient.

It's pull-based, so it takes the data from the systems. There's also the push-based model, where all the systems which are supposed to be alive send data to the monitoring system. Both are valid; they have some fine-grained distinctions, but that goes too far for here. You just need to know that it's pull-based. Blackbox monitoring means I look at a thing from the outside: does the HTTP server reply, yes or no, how quickly, and so on. Whitebox means I actually know about the inner workings of this or that application and emit data directly. Services ideally have their own metrics endpoints, so you don't have this one huge thing where everything comes from, where you do a partial upgrade and everything breaks and everyone is unhappy and no one likes you anymore. You have distinct endpoints per service.

Within Prometheus, we have extremely hard guarantees on stability. I mean, we know that you want to be asleep, ideally, and sleep soundly because you know your services are up and running. So we don't want to introduce breakage on our end; we are extremely stringent about not breaking anything. And every six weeks, there is a new release candidate.

So what are time series? Time series are just recorded values which change over time. You write down what the value was and when it was, you do this again and again and again, and you attach metadata about what it is. Individual events are usually merged into counters (how many people logged into my website, for example), gauges (how much RAM am I currently using), or histograms (how much latency do I have on this or that service, and what is the latency distribution; 99% of my users wait at most X, so I know they are happy because they don't have a lot of waiting time). It's really easy to emit, read, and parse those. This is how it looks. I know someone who literally printf's this from their C code, puts it into a file, serves that file from a web server, and that's how he did his instrumentation for Prometheus. And that has been working for years and years now.

So how can you use this? If you, for example, want to ask which partitions, out of all the ones which are not root, have at least 100 gigabytes of capacity: you take the node_exporter metric for total file system size in bytes (node_exporter is the exporter which exposes Unix systems to Prometheus), filter out everything mounted on slash, which is root, divide by 10 to the power of 9, so one billion, and then require the result to be over 100. That is the query.
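A rough sketch of that query in PromQL; the exact metric and label names are my reconstruction based on typical node_exporter naming, not taken verbatim from the slide:

```promql
# File systems that are not mounted at "/" and have more than 100 GB of total capacity.
node_filesystem_size_bytes{mountpoint!="/"} / 1e9 > 100
```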
And this is what a random system would return: you see this and that device, on this and that mount point, on this and that system, and it has this and that amount of gigabytes of capacity.

You can also do more complex things, for example: what is the ratio of failed requests? That's another one of those things. I don't care if I have one error, I don't care if I have 10,000 errors; I care whether I am above a certain percentage of errors. Because if I have one error per day but only one user per day, that's really bad; if I have 10,000 errors per second but 10 million requests per second, that's not that bad. So I always want the ratio, and this is how I get it: basically, I sum by path (path is a label here) the requests per second over the last five minutes which have a status of 500, which is an HTTP server error, and then I divide this by all the requests per second over the last five minutes. And then a random service might give me this as its response. So you can do really deep analysis. And as you can see, you don't need to know which systems exist or anything; you just run the query, and no matter whether you have one system or 10,000, all the computation happens automatically.

These are the main features of the last 12 months. Remote write receiver: you can also push data towards Prometheus. Trigonometric functions, so you can do sine and cosine and things like these; the use case we had was someone monitoring their wind turbines who needed to know at what angle those turbines stood to the wind, so that's why we added those. Agent mode: if you have corporate requirements to open very few ports on your systems, or something like that, you can use the agent, and it basically forwards data from many systems towards Prometheus without Prometheus needing to know about every last one of them. We have long-term support versions: again, we have new release candidates every six weeks, and the long-term support versions are usually supported for at least half a year, depending on user feedback. So if you don't want to upgrade as often, but still want to have all the security patches, you can have that. And out-of-order ingestion, where you can basically send data which is older than what Prometheus would normally expect; that's probably too in-depth for this crowd here. And the next highlight feature will be native histograms. With those you can put a lot more resolution into your histograms, so you can figure out in a lot more detail, for example, what your latency distributions are when customers access a web service.
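As a rough sketch, the two kinds of queries just described, the error ratio and a latency percentile from a histogram, might look like this in PromQL; the metric and label names are illustrative assumptions, not taken from the slides:

```promql
# Ratio of HTTP 500 responses to all responses, per path, over the last five minutes.
  sum by (path) (rate(http_requests_total{status="500"}[5m]))
/
  sum by (path) (rate(http_requests_total[5m]))

# A latency percentile from a (classic) histogram: the 99th percentile over five minutes.
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```

With native histograms, the same kind of latency question can be answered at much finer resolution.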
Prometheus and Kubernetes are the cloud-native defaults. Kubernetes is the open-source re-implementation of Google's Borg; Prometheus is the open-source re-implementation of Google's Borgmon. And both of those, again, are what actually allowed Google to grow in its early years; without them, they wouldn't have been able to. So even though they started in different areas, Prometheus and Kubernetes were written with each other in mind by the initial creators, because they knew how it was done at Google. They are also the two founding projects of the Cloud Native Computing Foundation, the two oldest ones, and so on; it's always those two.

A little bit about scaling, and this is just Prometheus. It's absolutely not a problem to ingest more than one million samples per second on this laptop; no problem, long-term, it just works. You come out at roughly 250,000 samples per second per core which you can ingest. We compress those 16 bytes (two times 64 bits, from earlier, if you remember) down to a consistent 1.36 bytes per sample, so there is very good compression built into the system. And a Prometheus server itself is reliable into the tens of millions of active series. The largest Prometheus I know of has 125 million active series, as in data which it actively ingests again and again, usually every 10 to 60 seconds.

So let's talk about other things. Mimir. Who knows what Grafana is? So Mimir is a project by Grafana Labs, based on Prometheus; most of the development for Mimir, or a lot of it, actually happens in Prometheus, and Mimir builds on it. Mimir, metrics: yes, we have this thing with the first letters. And the history, for those who care: there was Prometheus first, then Cortex split off from it, from that we created Grafana Enterprise Metrics, and based on that we have Mimir. Mimir actually scales to one billion active series, so at any point in time you are actively ingesting one billion different data points from different time series every 10 to 60 seconds, and it scales to that level. It's really, really fast in query performance. You have multi-tenancy if you need it, access control, three-way replication (Prometheus, of course, is a single instance, and if it goes away, it goes away), and it can ingest OpenTelemetry data, Graphite, Influx, and other protocols which you cannot ingest with Prometheus. For those one billion active series within one cluster, we needed 1,500 machines at a cost of 7,000 CPU cores and 30 terabytes of RAM, which is a lot, but for that scale it's actually not that much.

Loki is like Prometheus, but for logs. It uses the same fundamental metadata structure with key-value pairs, just for logs. And one thing you get with a lot of logging systems is, in my opinion, the wrong trade-off. Either they do full-text indexing and you pay for all of that full-text indexing, in licensing, in compute, in storage, in RAM, in network; you double your storage from the full-text index alone, and most of the things in your logs you will never, ever search for. Or, at the other extreme, you have a data lake, which to me is a euphemism for keeping your storage vendor happy, because you just keep buying more storage and no one ever actually looks at the data; you just keep it and keep it. What Loki does is similar to what Prometheus does: it only indexes the labels, those key-value pairs which we saw, and everything else is an opaque blob; you can put in whatever you want. I know someone who puts photos in there, because he wants to. It works at really high scale without all the massive cost which you have with more traditional systems, which I won't name, and you have a flexible schema on read.
So you don't have this thing from SQL where you are really tied to specific happy paths of reading, with your indexes and everything; you can actually be flexible in how you extract and mangle your data as you get it out and put it into your alerts, your dashboards, your reports, your further processing, whatever. And there's a really easy way to turn logs into metrics, and we'll see the savings from that at the end of this talk. A log entry looks basically the same: you have a timestamp, then you have your label set, and then whatever else at the end. The slides will also be uploaded right when I'm done with the talk.

At Grafana, as of two months ago, we are ingesting 160 terabytes per day in our largest cluster; we have multiple, but that's the largest one. Queries regularly see speeds of more than 80 gigabytes per second, just on standard cloud hardware, cloud instances, which means you can query terabytes of data in under a minute. And yes, this includes all the complex processing of the result set.

Tempo is for traces. Who here knows what distributed traces are? OK. So traces chart the path through your application, which on more traditional systems is relatively easy: I have one application and I'm running through this one thing. If you have ever used a debugger like GDB, that's what this is, or what you can do with it. Distributed tracing exists because, with more dynamic workloads in the cloud, you have more, smaller services. You don't have to have microservices, but even a modern web shop probably has two or three different things doing computation before someone can actually buy their thing. Distributed tracing charts this path through your system, so you actually know that this one request from that one user took this and that pathway, took really long here, and then errored out over there. Things like these are possible with distributed tracing. Traces are a lot more expensive than logs or metrics, which is why I started with how we always condense; we are basically back at the complete, detailed account of something happening. But they are very valuable. Usually you keep fewer, or a lot fewer, traces than other signals.

So Tempo is a system which allows you to do cloud-native distributed tracing, also at really large cloud scales. One of the tricks which Tempo has, which no one else has, comes from Google originally. Many years ago I had a meeting at Google where we were discussing merging certain projects; it didn't end up working out, but whatever. And they mentioned that searching for traces didn't scale. And when Google tells you that searching for something doesn't scale, they probably did the math. The thing is, the label approach which works on your logs and your metrics is actually too expensive to also apply to your traces. But you don't have to, of course. I have already distilled the meaning from my traces: I already know that this and that thing is a slow query, I already know that this and that thing is throwing errors. So I don't have to do this analysis, this in-depth search, on my traces. I can use exemplars. Exemplars are really simple: an exemplar is just a trace ID which I attach to my metrics and my logs when I record them.
And if I see that I have this and that error, and there is an exemplar attached to it, I can jump directly into that one trace. And I know it is relevant, because I know it happened with this and that error; I don't have to do the needle-in-a-haystack search of whether this trace is actually worthwhile out of the other five million traces. Or if I have that one slow query response, I see it in my histogram and I can jump directly into the trace. So that's what Tempo and exemplars are for. It does support searching by labels and by label sets, for those who need or want it. It only needs an object store; you don't need Elasticsearch or Cassandra or anything in the background, which makes it a lot cheaper than pretty much everything else you can find. It's 100% compatible with OpenTelemetry tracing, Zipkin, Jaeger, everything.

And at Grafana, internally, we don't do sampling. Sampling means I choose a number, say 1%, and I throw away 99% of my data in the hope that the 1% is something I can actually use, which is super frustrating if I happen to throw away something I would have needed: I see the reference, it should exist, but it's been thrown away. Super frustrating. This system is cheap enough that we don't have to do sampling; we don't have to throw away our data. Also, as of three months ago, at Grafana we are ingesting 2.2 million spans per second at a sustained 350 megabytes per second. We regularly peak at five million spans per second and then we re-instrument, because that's the happy spot for us, this amount of insight. And it's 14-day retention with three copies stored. Sorry, I forgot to put in the compute cost; send me email later and I can give you the numbers. And the latency of doing this analysis is super low: 99% of all requests come back within 2.5 seconds, and 50% come back within 1.6 seconds. By the way, those percentiles are exactly what you can do with histograms, like the service latencies from earlier. That goes too far for this talk, but again, send me email if you want to know more.

Phlare is for profiling. Who knows what profiling is? So profiling means I look at my application and I can see how much time, how much CPU time, how much memory, how much wall-clock time I spend in a certain part of my code. In this or that function, in this or that loop, am I actually doing computation or am I just waiting for someone else? Am I stuck in some loop? Am I losing memory, as in, do I have a memory leak? Things like these are really easy to discern with profiling. Again, it's a relatively expensive view of the world; metrics and logs are more efficient, of course. But it's extremely useful for optimization. So once your systems are up and running and they don't break all the time anymore, cost reduction, efficiency increases, and speed increases become really easy with this. Currently, there are two languages supported for sending data into the system. For Go, it's standard pprof: anyone who develops anything in Go basically gets this for free as part of the Go tooling; you can just hit the pprof endpoint, and if you don't know whether your Go application has a pprof endpoint, it most likely does. It's super useful information which you basically get for free. And for Java, we have a project where you can emit literally the same data in the same format, and we'll add more and more languages over time.
So I told you that traces and profiles are really expensive and that logs and metrics are much more efficient and cheap. But let's look at just the difference between logs and metrics. Again, this is data from our own clusters with extremely large data sets. If I were doing full-text indexing of 10 terabytes of data, I would come out at roughly 20 terabytes of data with the full-text index; that's not what we are doing, but that's the rough scale of what would be happening. If I ingest 10 terabytes of logs into Loki, I obviously have to store those 10 terabytes, but the index, the thing we actually have to search through, is only 200 megabytes of data, which is orders of magnitude less. It's insane, this difference. This is why we're not doing full-text indexing.

But even with this efficient approach, average logs at Grafana come out at roughly 600 bytes per log line, including index and everything. Metrics, as we saw earlier, come out at 1.36 bytes per sample. So just going on those two numbers, even against the already reduced index: when I can turn a log line into a counter, for example when I have an error and could write a log line about it, but instead I just increment my counter for this and that error, put my labels on it so I know where it happens, and bump that number, this gives me a roughly 99.8% reduction in storage size for the first log line alone (1.36 bytes instead of roughly 600). And afterwards, within the same storage block, which is usually two hours, it is basically free, because it doesn't matter to me whether I'm storing a one or a ten billion; in logs, it would matter. So you have incredible power to take large data sets and become really, really efficient with them. That's the main message. That's also why I started with how humanity did this again and again over the millennia: as humanity, we always ended up in the same place.

So, bringing all of this together: with this stack you can jump from logs to traces, from metrics to traces, from traces to logs, and all the other ways, because those systems are all interconnected on the back end. All of this is open source. You can run it yourself. If you want to pay for goods and services, Grafana Labs is happy to provide them; I like food and shelter. But all of what I talked about is completely open source. And those are a few screenshots for those who don't actually know what Grafana is: I can show different graphs, I can stack them, I can do calculations on them, as you can see in the graphs yourself. I can put key events in here: errors, or alerts, or deployments of new things. I can also do things like geographical overviews or heat maps, things like these.

Thank you, and we have five minutes for questions. We have three minutes for questions. No one? Be bold. Anything I should talk more about which fits into three minutes? Any slide I should jump back to? Okay, thank you very much.