Hi, and welcome to Observability with Prometheus and beyond. We could also call this a philosophy of observability; if you've seen talks from me before, you might have noticed that I tend to reach for philosophy to build a basic understanding of a few concepts. In case you haven't seen my talks before, which is likely since this is a new audience, I've been asked to tell you who I actually am. I'm the Director of Community at Grafana Labs. I'm a Prometheus team member with various roles within the Prometheus team, Prometheus being a CNCF graduated project. I founded OpenMetrics, which is a CNCF incubating project. I'm a co-chair of the observability group within the CNCF, I maintain the SNMP exporter, and I created the Modbus exporter. I built Europe's most modern data center, which is monitored mainly or only through SNMP and Modbus. I did tons of work on adapting existing systems to new cloud-native technologies. I also ran the backbone of an ISP for 11 years, which literally meant being alerted and being on call all the time for stupid reasons. I'm active in various standardization and networking groups, I've done a bunch of upgrades, quite a bit of community work, and I run conferences for fun.

Why am I telling you all of this? Partly to prove that what we're talking about is actually useful stuff: I have quite a bit of pain in my past, and I have strong opinions on good tools and bad tools.

If we look at today's observability space, or the observability truth of many people, you have a lot of disparate systems, and every time you need to jump from one system to the next, you have a mental break. You might have a different data model, a different UI; the color of a thing might change; the visualization is subtly or very much different; you might have a different click path. It's a lot of different jumps and just mental overhead, and that's friction. It makes automation harder, it makes training people harder, it makes showing someone who's not on your team that one thing which is currently important harder. You just have friction in the system. A lot of what I'm talking about today is trying to rethink a few bits and pieces and build up, from the ground, something which I tend to believe is a holistic thing.

We also need to look at a few buzzwords, and of course there are a lot. There are a few dirty secrets here. Cloud-native scale of today is basically what internet or ISP scale was two, maybe three decades ago. Which stands to reason: obviously the internet needs to be of a certain size before you can have a cloud offering of a certain size, so one must trail the other by definition. Many of those principles, like having super-high-resolution events and then distilling them into metrics, have been standard for ages; in power measurements, easily half a century, probably longer. A lot of those modern tools basically rely on good engineering practices with modern principles, but good engineering tends to map very well onto old good engineering. Which is another way of saying that a lot of what is cloud native and modern and buzzwordy is quite likely applicable to you as well, even if you're not yet, or not ever, cloud native. That's totally fine; there are a lot of industries where you don't need microservices or anything, where that one monolith, that one big machine, is all you need.
And that's completely fine. So, observability is a little bit of a buzzword; it has become almost meaningless. We're working on pushing it back into actual meaning, but because the term is so buzzwordy, you get the usual pitfall of cargo-culting. People observe effects, they want to emulate what others achieved, but they don't actually think about, or even realize, why certain behaviors exist or why certain things are designed or done in a certain way. Put differently: often, when you see cargo-culting, people just change the name of whatever they had. They have their old practices and they call it admin, then they call it DevOps, then they call it SRE, then they call it whatever. But it's about actually changing behavior. It's about understanding what aspect made the current buzzword successful versus whatever was there before.

Monitoring and observability: personally, I use them almost interchangeably. I tend to use observability more, for the simple reason that it has a little more focus on actually understanding the data. Monitoring has taken on this meaning of mainly collecting data, not so much using it. There are two extremes within the monitoring space, and to some extent also in the observability space. One extreme is that you full-text index everything. The other extreme is the data lake, which is a euphemism for "no one looks at it" unless you have batch processing and such — which is completely fine for analysis, for statistics, for what have you, but not for actually alerting that something is imminently customer-facing. Observability, on the other hand, has taken on the meaning of enabling humans to understand complex systems. A different definition would be that just by observing the inputs and the outputs of a system, you can discern the internal state of that system without needing to look at anything other than inputs and outputs; that's the control theory definition of observability. The summary is: it's about asking why something is happening or not working, not just factually stating that, yes, this is the case. And funnily enough, that book on the slide is actually about identifying different types of wood, yet this is my all-time favorite meme.

There's another important topic: complexity. I personally distinguish between two types of complexity. One is fake complexity, which is a nice euphemism for bad design, and you can and should reduce this complexity wherever possible. Sometimes you have constraints where you can't really, but by and large, if you can actually get rid of complexity, you probably should. On the other hand, you have real, system-inherent complexity, which you can only move. For example, whether you have a monolith, a client-server system, or microservices, that complexity is not going away; it's just shifted into different places. In particular state, like databases and such: everyone runs their stateless microservices, but the truth is you need state somewhere, obviously. You need to store that this person is attached to this account, that they have this and that much money in the bank, or what have you.
It becomes a little bit of a hot potato game where everyone tries to get rid of their own state and just be stateless, because that's a lot less overhead, but they can't really; they can move it around, but they cannot truly get rid of it. You should compartmentalize complexity wherever possible or feasible, and you should distill it meaningfully. We'll look at compartmentalization in a bit as well.

Let's look at SRE, which some argue is an instantiation of DevOps. There are differences, but it's not wrong to phrase it that way. At its core, to me, SRE is about aligning incentives across the org. Normally, you have your devs, your operations people, your product managers, your project managers, your managers, what have you, and they all fight each other because they have different and diverging incentives. Devs are paid for feature velocity. Ops people are paid for no downtime. Those don't mesh very well, because they are basically opposed. If you look into the SRE book, one of the things front and center there is the concept of error budgets: if all the owners of a product share an error budget, all of a sudden you are aligning those incentives. If the devs ship really crappy code, or many bugs, in a given quarter, they must stop shipping new features and refocus on stability. So they have an inherent incentive not to ship broken features, but to ship stuff which is actually stable. Same for the ops people; the devs would even support the ops people in making migrations and such easier, because that means less downtime, or less chance of downtime, and the error budget stays available for A/B testing, for new features, what have you. That obviously also maps to product managers, to everyone, because they share this one budget and they actually need to agree amongst each other how to spend it. If they have a superb product which is super stable, they can go all in on the testing and on the features. If it's icky and brittle, they need to slow down. So you have this shared pool of resources which forces them to talk to each other and align incentives, which obviously is great.

For measuring all of this, you have the three magic concepts: SLI, SLO, SLA. SLIs are the service level indicators, which is basically what you measure — the number. The objective is your internal target: what you must hit or must not hit, stay above or below. And the agreement is where you actually have a contract and you need to start paying someone, or someone picks up the phone and has a few words about why you broke that agreement.

Another way of building this shared incentive is building a shared understanding. If everyone is using the same tools, the same dashboards, the same everything, the nice thing is they will start using the same language. They will put their own domain-specific knowledge into the shared dashboards and the shared alerts, which means everyone is building on top of each other's work. You have an incident, and three different teams look at the literal same dashboard; all of a sudden you have this super nice situation where they use the same language and they have the same data underlying their thinking. You don't have one team firing an alert while the other team isn't even able to reproduce that error or see it in their logs.
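To make the error-budget arithmetic concrete, here is a minimal sketch — illustrative only, not from the talk, with the function name and numbers made up for the example. With a 99.9% availability SLO over a 30-day window, the budget comes out to roughly 43 minutes of downtime, and you can track how much of it has already been burned:

```go
package main

import (
	"fmt"
	"time"
)

// errorBudget returns the total allowed downtime for a given SLO target
// (e.g. 0.999) over a window, and how much of it is still left after the
// downtime already incurred.
func errorBudget(sloTarget float64, window, downtime time.Duration) (total, remaining time.Duration) {
	total = time.Duration((1 - sloTarget) * float64(window))
	remaining = total - downtime
	return total, remaining
}

func main() {
	window := 30 * 24 * time.Hour // a 30-day rolling window
	total, remaining := errorBudget(0.999, window, 12*time.Minute)

	// A 99.9% SLO over 30 days allows ~43m12s of downtime; after 12m of
	// incidents, roughly 31m is left for deploys, A/B tests and experiments.
	fmt.Printf("budget: %v, remaining: %v\n", total.Round(time.Second), remaining.Round(time.Second))
}
```

The point of the shared number is exactly what the talk describes: once the remaining budget is small, everyone who owns the product has the same reason to slow down.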
With everyone on the same dashboards, everyone is also using the same basis for information and for learning from errors, which is a super nice effect.

Now to this compartmentalizing of complexity: services. Services are inherently compartments of complexity, and obviously they have some sort of interface, because otherwise they'd just be sitting there in the corner doing nothing. Services usually have distinct owners or teams; sometimes one team owns several services, but by and large you can discern services along team lines, and contracts define the interfaces. Why contracts? I really like the term contract in this context because they are shared agreements, ideally in writing, which must not be broken; again, SLAs and such. It doesn't matter if the customer is internal or external. They rely on what you build and what you maintain, and you have to make a hard promise about what you can and cannot do. Of course, that's also what you get from the people who provide services to you. It's obviously a layer cake; it's turtles all the way down. And if you can rely on the services provided to you, and people can rely on the services you provide to them, no matter if they're internal or external, you'll all have a nicer, better time.

Coming from the networking space, another super common term would be layers. The internet as we know it would not exist without proper network layering. You can have a lot of innovation in one layer while keeping the other layers stable, or have them innovate independently of each other, because you have those clearly defined interfaces which just keep the thing working. And you can parallelize massively if you have those hard APIs, those hard interfaces, those wire formats. There are various other examples: CPUs are highly complex, and that complexity is just compartmentalized away; the same goes for hardware and compute nodes. Even your lunch: even if you cook from scratch, you will not grow every last cucumber which you're eating. There are interfaces in everything which you do.

Alerting is super important here. Customers don't really care if you have five or ten databases or nodes, as long as their database service is up and running. They care about the service being up, not about the individual components. You care about water coming out of the wall; you absolutely don't care where the pumping station is, or whether the pumping station is running on a genset. You care about water coming out of the wall. That's your service interface.

A really important point, and this is way too often overlooked: SLIs are useful in more ways than one. For my own services, if my SLIs come near or go over my SLOs or SLAs, or are imminently about to, obviously I need to alert on them myself. But all the services which I rely on ideally also have SLIs, and those are great for debugging my own services. If I have a web shop and that database, or whatever service provider, is down or degraded, I can use their SLIs to help debug my own stuff. But you need to be careful: it doesn't really make sense to alert on someone else's SLIs unless there's super tight coupling, because usually you can't do anything about it, and so you shouldn't be waking anyone up on your own team. To avoid pager fatigue, anything currently or imminently impacting customer-facing services must be alerted upon, and nothing else should be alerted upon. For everything else, raise a ticket and deal with it during business hours, after the weekend; that's fine.
If it does become imminently customer-facing, then yes, raise an alert; until then it's simply not that urgent.

So let's look at actual tooling; that was the philosophy part. Prometheus: you will most likely have heard the name, it has made a bit of a splash. The 101, in case you haven't: it's inspired by Google's Borgmon. It's a time series database with 64-bit values internally. There is a huge ecosystem of instrumentation libraries and exporters, literally thousands of them. It is not for event logging or anything like that, and dashboarding is done through Grafana.

The main selling points: Prometheus has highly dynamic built-in service discovery, which interfaces with a lot of different systems — Kubernetes, various cloud providers, various more traditional hosters — and you also have generic endpoints where you can toss your own stuff in. You don't have a hierarchical data model; you have an n-dimensional label set, which is quite nice. Say you have your region, then your data center, then the customer, and you need to select by customer: your hierarchy is already wrong, because you need to walk up that tree, then walk over, and go down again. If you just have label sets, which are basically key-value pairs like customer="xyz", you can simply select by customer equals this-and-that, and you don't have to go up and down.

Speaking of the language: PromQL is a functional language, not an imperative one, and it's the one language we use for all interaction with the data, in Prometheus and in compatible systems — for processing, graphing, alerting, exporting. Everything uses the same language, which is incredibly powerful, because you can literally take a dashboard definition, or the one query which gave you that table, and turn it into an alert; or your alert can simply link to the literal query which created it, and you see the live data or the historic data or what have you. You have one single interface to actually work with your data, and that's PromQL.

Prometheus is super simple to operate: you literally just run one statically linked binary, and that's it; and it's highly efficient. Prometheus is pull-based, which is a little bit of a religious discussion. You are able to also push to it, and with my Prometheus hat on, we are enabling more and more push-based scenarios. That being said, the recommended path for Prometheus proper is pull-based, and that will not change anytime soon. There are several nice properties about pull-based systems: you get more clarity on a few guarantees, like: are the systems really there? Can I reach them? Are they stale? Those kinds of things are just inherently easier in pull-based systems, but otherwise push and pull are largely equivalent.

Two super important terms: black-box monitoring and white-box monitoring. Black box means I cannot look into the thing, so I monitor from the outside: does it reply to ICMP ping, does it reply to HTTP requests, what have you. White-box monitoring means I instrument the code: I put statements into my code, I have this one counter for successful logins, and so on — I introspect my code from within the code base itself.
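As a minimal sketch of what that white-box instrumentation can look like with the Go client library (the metric name logins_total, the label, and the handler are made up for this example): you register a counter, increment it where logins succeed, and expose a /metrics endpoint for Prometheus to pull.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// One counter for successful logins, with a label we can later select on
// in PromQL, e.g. logins_total{method="password"}.
var loginsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "logins_total",
	Help: "Successful logins, by authentication method.",
}, []string{"method"})

func handleLogin(w http.ResponseWriter, r *http.Request) {
	// ... actual authentication would happen here ...
	loginsTotal.WithLabelValues("password").Inc()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/login", handleLogin)
	// The pull-based part: Prometheus scrapes this endpoint on its own schedule.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```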
Another point: every service should have its own metrics endpoint, which is different from more traditional monitoring models. There are ways, behind reverse proxies or with an agent or so, to tie this up into a single endpoint, but the recommendation, simply to be more flexible with upgrades and such, is to have one endpoint per service. We also have an incredibly hard commitment to API and wire format stability within major versions. Even stuff which is marked experimental tends to be stable basically from day one, which you could argue is sometimes even a little bit of overdoing it, but on the plus side this vast ecosystem was able to grow without breaking migrations all the time. Things tend to be super stable.

What is a time series? Time series are values which change over time and are simply recorded. The classic example would be a temperature which goes up and down. If you have individual events, you merge them into counters and/or histograms: for service latencies you would use histograms, where you say I have this-and-that many requests which are this-and-that slow or fast, and counters for questions like how often did I go into that one function. Changing values are recorded as gauges, which go up and down; I already gave you the usual examples. Emitting this data towards Prometheus is super easy; the slide shows literally how you could do it. I know people who just have printf in their C code, dump all of this into a file, put that on a web server, and that's how they instrument their own code, and it works. I've done similar myself: I've created these kinds of things even from shell scripts, and I just ingest them into Prometheus and it just works. It's one of those features which make adoption of Prometheus so easy, because it is really, really easy to produce the Prometheus format even if you don't use the client libraries.

Now to the scale of Prometheus, and here I'm only talking about Prometheus itself; we'll see larger numbers in a bit. Kubernetes, which you've heard of, is the equivalent of Borg, just as Prometheus is the equivalent of Borgmon; both stem from Google-internal systems, and outside of Google they would not be as large as they are today without this tight coupling, which you also get more or less by happenstance in the CNCF, where Kubernetes and Prometheus are super tightly coupled. You simply cannot run anything at that complexity and that scale, with everything changing all the time through automation, without a system which actually makes this understandable, grokkable, for humans.

Scaling of just Prometheus: ingesting more than a million samples per second is absolutely not a problem on current hardware; you can do this on desktops if you want. A rough guesstimate is 200k samples per second per core. We compress those samples — if you remember the two times 64 bits from earlier — down to 1.3 bytes per sample on average, which is quite nice. The largest Prometheus proper which we saw in production had 15 million active time series at the same time, an active time series being defined as any series which has been seen by Prometheus within the last five minutes. And that is one single binary running this, along with the alerting, the querying, everything. For long-term storage there is a plethora of different options. Two of them have actual Prometheus team members working on them: Thanos and Cortex.
Thanos is historically easier to run, but a little bit slower at querying and such; it started out by scaling storage horizontally. Cortex is historically harder to run, but it's becoming easier and easier — there has been a single-binary mode and all these bells and whistles for quite some time now — and it started out by scaling ingestion and queries horizontally. About a year ago, Cortex took in Thanos code to also scale storage horizontally, whereas Thanos is working on taking in Cortex code to scale ingestion and queries horizontally. There is quite a bit of overlap and collaboration between the two.

Looking at some numbers which are Grafana-internal: we have tenants with 700-plus million active series — again, active series defined as seen within the last five minutes. One customer is running two billion active series, which we honestly don't recommend, but they do it and it's not completely horrible. It's not something I would suggest — maybe half or a third of this, just looking at the slide — but they made it work to some extent; still, the 700-plus million is the better number. For Thanos and for Cortex you usually have push-based setups which use the Prometheus remote write protocol to push the data towards them. One important point here: the usual assumption of the Prometheus remote write protocol is that the data has already been cleaned up by a Prometheus; there are certain implicit assumptions in the backend design that this has happened.

I talked about hard API commitments earlier, and a little bit about wire formats. If you remember, I come from the networking space originally, so wire formats, and hard things you can rely on on the wire, are quite important to me. OpenMetrics is an effort to take the Prometheus exposition format and turn it into an official standard. Prometheus is the de facto standard in cloud-native metrics monitoring, and quite a bit beyond; if you look at ISPs and such, it has had tremendous uptake for half a decade now. The same is true for the Prometheus exposition format, because that's the way to get your data into Prometheus. You saw a few samples of this earlier; it's really easy to expose stuff towards Prometheus, which is why you have those thousands of exporters and integrations. The problem is that quite a few vendors and projects were not really happy about adopting something which has Prometheus in the name. That has gotten better over the years, but it's still a concern. Also, traditional networking vendors in particular prefer to support official standards. So, to reuse this installed base of Prometheus while keeping the focus and remaining opinionated about how to do metrics-based monitoring properly, we created OpenMetrics. Quite a few people have collaborated on OpenMetrics over the years: competitors, even companies which are not end users, helped create and shape it. And we are in the IETF process to get an actual RFC; I need to send an update — honestly, the draft has lapsed, but the Internet-Draft still exists.

If you read a little bit about observability, there is the concept of the three pillars: metrics, logs, and traces, each with its own focus. Metrics are perfect for dashboarding, AI/ML (even if that's an evil buzzword), alerting and such; logs are more for due diligence, debugging, and incident response; traces are for debugging and performance tuning.
For more than half a decade I have wanted a label-based logging format, and for pretty much exactly half a decade I have wanted exemplars in the open source world. I literally want to change how the open source world does observability; this is not a joke.

So, I said I have wanted label-based logging for quite some time. Loki, maybe you've heard of it, follows precisely this label-based system, like Prometheus does. It does not have a full-text index, which means you get quite some speed improvements over anything which requires a full-text index, and you can work at insane scale without the massive cost associated with full-text indexing. And you use, or can use, the literal same label sets as with your metrics. This means you can turn your logs into metrics on the fly, or as a continuous background process, to make it easier to work with them, to alert on them, and such. But you can also delve into the logs themselves, and you don't have this break between the two different types of data; you can switch back and forth seamlessly. If you have existing logs and such, there are tools to just ingest all of this.

You might remember this from just earlier: basically you have the same label set. You don't have a metric name, but you have the same label set in the curly braces. At the end you have just an opaque string; it can even be a blob. I know people who have pictures, literal photographs, in their storage; Loki doesn't care. And at the beginning you have a timestamp, because these are events, so you obviously need a timestamp. That's in contrast to Prometheus, where emitting timestamps yourself is possible but not recommended; that's something which is handled on the Prometheus side. Looking under the hood a little: we regularly see queries which exceed 40 gigabytes per second, which means you can query terabytes of data in under a minute, even including complex processing of those result sets, where you then do pattern matching and so on against the actual opaque strings. The way this works is that you select based on the label sets, and then you can run additional queries, matches, and processing, what have you, on the opaque strings within that result set.

Tempo is a tracing backend, and it is initially, or mostly, designed around exemplars. Exemplars are trace IDs which you attach to your logs and metrics. Usually, if you have traces, you have a needle-in-a-haystack problem: you need to search for your traces by whatever metadata is attached to them, and you need to find relevant traces through other means. Contrast this with exemplars, where I already know that this is a high-latency bucket, or I already know that this and that process had an error, or what have you. An exemplar attached to that high-latency bucket, to that error state, to what have you, allows me to jump directly from that log line or that metric into the relevant trace, and I keep all that mental state. I know that one thing took 2.5 seconds, whatever, and I jump into the trace or into the span and I know precisely what I'm looking for, because I have that context from the other signals. That being said, a lot of people like searching through traces, so you can also do indexing and searching by label sets. It's not the recommended path, but if you need it or want it, it's there. And you don't have any super expensive backends; it's literally just object storage.
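To make the exemplar idea concrete, here is a minimal sketch, assuming the Go client library and a trace ID you already have in hand (for instance from the request context); the metric and function names are made up for the example. The trace ID is attached to the histogram observation, so a slow bucket on a dashboard can link straight to the trace that landed in it:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "request_duration_seconds",
	Help:    "Request latency.",
	Buckets: prometheus.DefBuckets,
})

// observeWithTrace records a latency observation and attaches the current
// trace ID as an exemplar, so a high-latency bucket links to a concrete trace.
func observeWithTrace(d time.Duration, traceID string) {
	if o, ok := requestDuration.(prometheus.ExemplarObserver); ok {
		o.ObserveWithExemplar(d.Seconds(), prometheus.Labels{"trace_id": traceID})
		return
	}
	requestDuration.Observe(d.Seconds())
}

func main() {
	// e.g. a 2.5s request whose trace we want to jump into from the dashboard
	observeWithTrace(2500*time.Millisecond, "0123456789abcdef")
}
```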
Tempo is compatible with OpenTelemetry tracing, Zipkin, and Jaeger. At Grafana Labs, we keep 100% of traces; we don't do sampling, which is also different from other implementations, which tend to need sampling. Again, coming from networking, I really despise sampling; I had too many NetFlows where I didn't find what I needed. I like having 100%.

Some data here that's already a few months old: as of July, we were ingesting more than 2 million samples per second at a sustained 350 megabytes per second. At 14 days retention with three copies stored, that comes to a cost of 1,140 CPU cores, 450 gigs of RAM, and 132 terabytes of object storage. That's it for 2.2 million samples per second over 14 days. The P99s you can read for yourselves; I don't have to read them out to you.

So, bringing all of this together: you have tools which can actually contain all of this complexity, this ever-changing landscape of how your services are built and how they tie together, which is currently being scaled up, down, what have you. And if you don't have those problems today, you can still leverage these tools, which are built for that scale and for those use cases, to have more powerful ways to get into your data. You can jump from logs to traces, from metrics to traces, from traces to logs, and all the other ways between all of this. All of this is open source; you can run it yourself. I mean, honestly, I like the concept of food and shelter, so if you decide to buy from Grafana Labs, that's also completely fine, but all of this is open source — just take it and run it yourself. And of course, this means you don't have that plethora of disparate systems. You can actually jump between all those different metrics, logs, and traces. You maintain your mental state. You don't have a difference in that one color, you don't have a different click path, you don't have to change mental models from one hierarchy to a different one or anything. All of this goes hand in hand and is actually designed for each other. This is something a few of us have been working towards for more than half a decade, and there's quite some thought behind how all of this fits together, so that humans have it easier to understand those things. And obviously, or maybe not obviously, the nice thing is that where humans can understand things, you can usually also automate better. Thank you very much.