So, our next speaker is Richard, who has one of these bright yellow shirts, which means he's going to get an extra round of applause afterwards for all his volunteering work making this conference happen. For those who just entered the room: after the presentation there will be a Q&A, and please remain seated, otherwise it's not possible for the speaker to hear questions or to talk at all. So please respect the speaker and remain seated until the presentation is completely done. And that's it. Feel free. Thank you.

Mic test, can you hear me okay in the back? Cool. Okay. So, I'm Richard. I do a lot of things, and I came here to talk about observability. For those of you who were here when I opened the dev room: there are two different angles to this story, which Jana and I cover, but it's basically the same message from totally different angles.

So let's start with a few definitions. A buzzword is something which may have been useful at some point, and then everyone starts using it, and it loses its original meaning. It becomes intangible at the fringes of its usage. It may revert to usefulness, or it may just be put in the dustbin of IT history.

Then there's a very important concept called cargo culting. Who has ever heard of cargo culting? That's way too few. Basically, what happened is that during World War II, villagers observed that when the US Marines built runways for planes or built large fires, the gods dropped stuff from heaven. So obviously they wanted to have stuff from heaven as well; of course, it was nice stuff. So what they would do is start to build runways of their own. They would build little towers next to the runway. They would have large fires in the hope of pleasing the gods and making the gods drop more cargo. So they observed behavior and they tried to recreate the effects of this behavior, but they completely failed to understand why this behavior happened and what was underlying it. We have this way too often in our industry: we way too often just copy something someone else did; of course it kind of worked for them, but it doesn't work for us. So be aware of this.

Then we have monitoring. I mean, the dev room is called monitoring and observability. I like this word, but it's totally overused, and it came to mean the lowest common denominator of just collecting random stuff, because then you would be "doing monitoring", and as long as you collect, you don't have to actually do anything with it. And you can tell your boss, yeah, we do monitoring; stuff is still on fire, but who cares, I have monitoring. This is not the right way to approach this. Those of you who have heard of a data lake: that's basically just toss all data into one place, and if we need it, we can search it. Which works to some extent, but it would be nicer to actually have a purpose when doing this.

Then we have observability itself, which is basically the ability to observe, understand and act on how a system behaves. Or, if you want to go mathematical: you observe the inputs, you observe the outputs, and you can make deductions about the internal state of that black box just by looking at inputs and outputs, and that is what makes it observable in the mathematical sense. So we're done. Thank you very much for your time, and if you have questions... There's more to it. There's quite a bit more to it.
So we go through several things here, and these are the learnings which I hope to enable you to take away: the baseline of monitoring, what monitoring data you have, what types of complexity exist and how you contain it, what service contracts and all these things are, stacking services, and bringing it all together with a few BCPs, as in best current practices.

So, monitoring is the bedrock of everything in IT. You need power, you need cooling, you need network, and that's about it; and you need monitoring, or something that makes you able to debug your systems. Everything else you can fix, but this is what you need. And as the old SRE saying goes, hope is not a strategy. So you should have some kind of a plan. Actually, "hope is not a strategy" is even older; I think it also comes from World War II. If you just do stuff because it worked somewhere else and you hope it might also work for you, that is cargo culting. Some of you laughed about those silly little villagers who are uninformed and uneducated and just built those little fires, and that's funny, haha. But if you're not actually understanding what you do, this is exactly the same thing you're doing. Because you need to understand what you're doing and why you're doing it. If you look at ISO 9001 and 20000, those things are being ridiculed in the industry for many good reasons. But at the core, someone wants to make other people write good documentation, have processes which are repeatable, automatable, blah, blah, blah. All the things we talk about in IT, other people try to do in corporate. But it doesn't work to just say, okay, this is where I want to end up, have fun with it; I need to know how to get there. And now let's delve into how I can understand it.

If we talk about monitoring, if we talk about signals, if we talk about observability, we have roughly two types of things. We have metrics, which are more or less stuff changing over time and how it develops over time. And we have events, which are information at a specific point in time. There is tons more stuff, but really broadly speaking, this is a good concept to know which one you want to use in what situation.

Looking at metrics, this is numeric data, which is nice. You can have counters: things that go up, interface counters, network speeds, network transmitted packets, whatever. You can have gauges: the temperature in my data center or of my laptop. I can also have booleans, which are a special case of gauges; they are just alert, yes or no. And I can have histograms and percentiles, which give me distributions of what is happening. It might be a latency, it might be interface usage, it might be whatever. If you see a heat map, that is the tangible result of having those. Counters and histograms most often lose data, which is fine; it's just a valid engineering trade-off to lose a little bit of absolute precision to make it easier to handle. Generally speaking, metrics are really easy, or relatively easy, to handle at huge scales. And one of the very best things about them: you can do actual math. And if you have something which does vector math on it, even better.
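To make those metric types concrete, here is a minimal sketch using the Prometheus Python client library, the same client library that comes up again at the end of the talk; the metric names, values and port are made up for illustration.

```python
# Minimal sketch of the metric types above, using the Prometheus Python
# client (pip install prometheus_client). Names, values and the port are
# purely illustrative.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Counter: only ever goes up (transmitted packets, requests served, ...).
REQUESTS = Counter("demo_requests_total", "Requests handled")

# Gauge: goes up and down (temperature); a boolean alert state is just a
# gauge restricted to 0 and 1.
TEMPERATURE = Gauge("demo_room_temperature_celsius", "Room temperature")

# Histogram: bucketed observations, giving distributions and percentiles
# (latency, for example) at the cost of a little absolute precision.
LATENCY = Histogram("demo_request_latency_seconds", "Request latency")

if __name__ == "__main__":
    start_http_server(8000)  # exposes the metrics on /metrics for scraping
    while True:
        REQUESTS.inc()
        TEMPERATURE.set(20 + random.random() * 5)
        LATENCY.observe(random.random() / 10)
        time.sleep(1)
```

Once these numbers are in a time-series database, the "actual math" becomes things like rates over the counter, aggregations across instances, and percentile estimates from the histogram buckets.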
Then you've got logs. Classically speaking, syslog: you have text items. And I don't know how many of you know that there is this whole spec of how to define syslog, what fields you need. Someone really put work into this. If you look at syslog today, it's a total shit show; no one actually follows those principles. Most of you will not even know that they really exist. So this is in-line metadata at best, even if it's in there. And if you emit logs and you get more users on your service, those scale roughly linearly, which tends to suck unless you do something. And you can summarize them, which is doing something about it: you put this back into numbers you can work with. Tom will talk about this later, where he will show Loki, and you see all those little graphs which are auto-generated from events. This is just a really nice quick step to jump between stuff.

You've got traces, which basically follow the execution path through your program. It might be a single program, it might be 10,000 microservices, who cares. The point is, you see how you get through a program, where you jump to different functions or different microservices. You can see where you really spend a lot of time. You might have some hot paths which you want to optimize more, or something. Sometimes they're not that great if you have certain races and such; they might give you the wrong initial answer, so you need to be kind of careful. As Jana said, they tend to be expensive. So they're either disabled by default, or sampled: every 10,000th, every 100,000th, every millionth trace is taken, and everything else is thrown away and not even recorded.

Then you've got dumps. That is when your program hits the wall. And this is also really, really useful to have. Obviously, this is the last breath of your program, so you want to know what exploded, how it exploded, and how you can keep it from exploding in the future. And at scale, you will have things exploding left and right.

So now we've reiterated this, because most of what I've said you kind of knew, but it still makes sense to actively go through it, really think about it, and really make yourself aware of why you're doing something. Metrics are usually the first point of entry into your own observability story. By the way, you can take pictures as much as you want, but I'm also putting the slides on the internet; taking pictures is fine as well. So metrics are usually the first entry into your stuff, and you should be using them for alerts, for dashboards, for exploration of your data, to see: okay, this thing happened, how did the load on that other cluster change, or whatever.

Then you've got logs, which are really useful once you have a rough idea of what you're looking at. Obviously, you need good time synchronization in your network and in your services, but if you have good time, then you can establish a detailed order of events: what happened when, maybe something exploded and then another thing exploded, and this allows you to actually go in there and drill down. Logs are also really, really important if you have any legal requirements to do due diligence, user data access, all these kinds of things, like which of your co-workers accessed the private address of that other person, blah, blah, blah. For those you may need to persist them for ages; it depends on your regulations, but you should be thinking about this.

Traces and dumps are basically useful to understand either how the whole system works, or specific components which you're currently interested in. So this is roughly how your debugging story should usually go when you have some issue.
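On the cost of traces: the "take one in every N and throw the rest away" sampling mentioned above can be sketched in a few lines. Real tracing libraries ship their own samplers, so this is only the idea, with an invented rate and handler.

```python
# Hand-rolled sketch of head-based trace sampling: decide up front whether
# to record a trace for this request, and keep roughly one in N.
# Real tracing libraries ship their own samplers; this only shows the idea.
import random

SAMPLE_RATE = 1 / 10_000  # e.g. one trace per 10,000 requests


def should_sample() -> bool:
    """Decide once, at the start of a request, whether to record spans."""
    return random.random() < SAMPLE_RATE


def handle_request(request_id: int) -> None:
    if should_sample():
        # Record spans: which functions/microservices were hit, how long
        # each took, where the hot paths are.
        print(f"recording trace for request {request_id}")
    # ... actual request handling would go here ...


for i in range(50_000):
    handle_request(i)
```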
Obviously you need to have this data as well, and this is again what we're talking about: to make you think about what you need to solve those issues.

So, we have complexity, one of the great words of our business. There's a ton of fake complexity, like huge amounts of fake complexity. Let's face it: if you want to have, I don't know, a simple webpage where you can book appointments, chances are you can start with a monolith; you can even do CGI-bin. You do not need your own Kubernetes cluster with 10,000 microservices, blah, blah, blah. If you scale out, you need this complexity, but if it's a really simple use case, you might not. You might be over-engineering a process where you have 10,000 messages sent around for one single thing. So this is fake complexity. Fake complexity comes from people trying to feel important, or just not understanding what they do, or design by committee. All these things might happen, and you should do away with this complexity.

But then you also have system-inherent complexity. I'm wiggling my mouse. This is hugely complex. I mean, I have my brain, I have my body, blah, blah, blah. But here I have a touchpad which senses by electricity that I'm touching it. This goes into my hardware, it goes into my Unix, blah, blah, blah. At some point this projector says, okay, hey, I can show this. So this is really, really complex when you think about it. But this is system-inherent; you cannot do away with all of this complexity. You kind of need it for the system to function. So, as I just said, now I have this complexity which I cannot do away with. I have killed off all the rest, but I have my remaining complexity; this is what I actually want my system to do. So I cannot get rid of it, but I can make small packages of complexity, shove them into specific corners with distinct problems, and have defined interfaces between those problems.

So we are at services, another thing which people always talk about but don't really think about. A service in this definition is anything someone else really depends upon. It doesn't matter if it's even yourself: it might be a customer, it might be a different team, it might be your boss, again, it might be yourself. It does not matter, but this is something relatively well-contained which you can work with. There are tons of names for these delineations: you have layers, you have APIs, doesn't matter. These are basically service delineations between different chunks of service, different chunks of complexity. I like to call these interfaces contracts, for a very simple reason; we'll see it in two slides.

So I have my services. And even if I take a relatively simple example of just wanting an HTTP service to listen on the internet, I have tons of stuff underlying it. I have my network, I have my kubelets, I have my whatever, I have my microservices. All of these basically build that site. So if I were to now just pull something out from under this tower of services, things would just topple over. And this is why I like to speak of these as contracts: because contracts imply a firm, long-term commitment which has been agreed upon between all relevant parties, and they actually consciously agreed that yes, we will be doing this exactly this way, and we will not be changing it unless we talk about it first.
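One way to make such a contract explicit in code is to pin down the interface the dependent side is allowed to rely on and keep everything else behind it. A minimal sketch, reusing the appointments example from above; all names here are invented.

```python
# Sketch of a service "contract" as an explicit interface: consumers may
# depend on exactly this surface and nothing behind it. All names invented.
from typing import Protocol


class AppointmentService(Protocol):
    """The agreed-upon surface other teams may build on."""

    def book(self, user_id: str, slot: str) -> str:
        """Book a slot and return a confirmation id."""
        ...

    def cancel(self, confirmation_id: str) -> None:
        """Cancel a previously booked slot."""
        ...


class InMemoryAppointmentService:
    """One possible implementation; it can be swapped out freely as long
    as the contract above keeps being honoured."""

    def __init__(self) -> None:
        self._booked: dict[str, tuple[str, str]] = {}

    def book(self, user_id: str, slot: str) -> str:
        confirmation_id = f"{user_id}-{slot}"
        self._booked[confirmation_id] = (user_id, slot)
        return confirmation_id

    def cancel(self, confirmation_id: str) -> None:
        self._booked.pop(confirmation_id, None)


def reminder_feature(service: AppointmentService) -> None:
    """A consumer that only ever sees the contract, never the internals."""
    print(service.book("alice", "2024-01-01T10:00"))


reminder_feature(InMemoryAppointmentService())
```

As long as that surface stays stable, the tower of services built on top of it does not topple over.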
If you have your cell phone and you're paying for data, you do not expect your cell phone provider to just go, well, we cancel it, because reasons. This is a contract; they have to reliably provide that service. And this is why I like that term: it changes how you think about this. And again, this is what this talk is about.

So, we had layers in the previous talk just now, and that is another really good name for it. Imagine if someone simply changed how IP works. Imagine if someone took those 32-bit addresses and made 128-bit addresses out of them. I mean, you're at the conference where we were the first non-networking conference to default to IPv6 only, where we defaulted to having NAT64 to push all the developers to really get their stuff running on IPv6. IPv6 was specified in the 90s, so it's roughly three decades now, and we're still not done. This is how long things can run when you have a really well-working service delineation. You internalize this, but you don't really think about it.

So why do we agree? I got ahead of that slide: because we have internalized that it's good practice to contain this complexity. And a lot of the services and things you buy have this built in. So you should also be building this way and containing your complexity in manageable chunks. CPUs: we trust CPUs. Well, if you read the news, maybe not so much, but the thing is, we trust CPUs at a basic level, and they have a relatively well-defined interface. We just know we can compile code against our operating system, which is running on a certain CPU, blah, blah, blah, and it works. A CPU is incredibly complex. I'm certain that if I started studying now, I would not ever really fully understand what a CPU does, and neither would most of you. Still, we trust it, we use it, and we don't even think about it. Some of us, if you have a random cloud service, might be using tens of thousands of CPU cores, and you don't think about those compartments of complexity.

So, switching over to the actual topic of this talk, because all the other things were just a preamble to get you to think about what underlies this. Just as I don't care which gate I am at, as long as I have a working gate and a seat on my plane, customers care about their services being up; they don't care about other components. You have to discern between the primary stuff, which is service relevant, and everything else. You have to have signals for "this one thing is exploding", but unless it's customer impacting, or imminently customer impacting, you should not actually page people and wake someone up. Of course, it's a really useful signal once you start debugging; it's a really useful signal when you are there during business hours and trying to fix stuff, but it's not something which is truly urgent. Anything customer facing or impacting is really, really urgent. So you probably guessed that one: these service delineations are there to contain complexity.

So, a different definition, from Baron Schwartz: monitoring tells you whether something works or not, pretty much a binary state; observability allows you to ask questions about why it's not working. But it's all just names. The important stuff is the concept which you hopefully have in your head by now. So observability, in my mind, is nothing you can ever really achieve; it is always something you chase. It's a moonshot: you try to get there, to really, really have true observability, and everything you do on the way is already good. And again, it's about mindset.
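To make the paging rule from a moment ago mechanical, here is a minimal sketch; the alert model is invented purely for illustration.

```python
# Sketch of routing alerts by customer impact: page a human only when the
# customer-facing service is, or is about to be, affected; everything else
# becomes a ticket for business hours. The model is invented for this sketch.
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    customer_impacting: bool  # the service the customer depends on is broken
    imminent_impact: bool     # it will break shortly if nothing is done


def route(alert: Alert) -> str:
    if alert.customer_impacting or alert.imminent_impact:
        return "page"    # wake someone up
    return "ticket"      # useful debugging signal, handled during the day


print(route(Alert("checkout error rate high", True, False)))           # page
print(route(Alert("one replica out of three is down", False, False)))  # ticket
```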
So, best current practices. Every outage you have gets a blame-free, and I mean blame-free, post mortem. You can say an engineer did X. You can even say that person did X. But you can't say that person is stupid and we should beat them up because they did X. This is really, really important to establish trust, because if you don't have the trust that you will not be blamed, you will not be fully open about what happened during the outage. So this needs to be blame-free. And when you do a post mortem, it's always a good point to look at your SLIs and SLOs, as in service level indicators and service level objectives. Are they still useful? Should we have different ones to maybe be quicker next time? Can we even cut some data collection?

And once you have that, you link services in your dashboards, so it becomes really easy to jump between parts of your services to understand the whole picture. You can also build overviews. And if there is really, truly important stuff about the underlying services, surface it to the other service owners; of course they need this context of, hey, this other thing is exploding.

If possible, don't do only black-box stuff. I mean, also do black-box stuff, like "is this working for my customer?", but it should not be the end goal. The end goal should be that you actually manage to instrument your code, to really, really get into the code. And every single time you think, maybe I should put a debug statement here, just put a counter instead. In like two months, you can still look: okay, do I need this counter? Was it useful? Was it not useful? But by doing this, you're building up data while you're looking at your stuff and thinking, maybe that would be useful. And especially in the networking space, which is where I originally come from, this might mean that you need a condition in your PO and you just force your vendors: I need to have signal X out of that box, and unless you give me signal X, I will not be buying those boxes.

Now, if you collect tons and tons more data because you put in all those counters instead of debug statements, blah, blah, blah, you have a lot more substance to work with. You should avoid having data lakes. You should have meaningful metadata attached to your data, to your signals, as early as possible. And obviously your tools must be able to handle this load; if your backend explodes because you're emitting more data, that is not very useful. And it's really important to choose tools which allow you to really, really work with the data, to automate your processes, blah, blah, blah, and not just dump it somewhere.

Oh, one last thing. As you hopefully know best how your services work, you are the people who can build this observability story. You start with the user and you end with the user. You ask: what critical paths do my users have? What common and/or critical paths do I have while restoring service for that user? Of course, you yourself or the other team might be the user as well, while debugging your stuff to make it work again. What can you automate more? Does it make sense to introduce new service boundaries, like pulling things apart? Those things used to be called functions, then they were called objects, now they're called microservices; the underlying concept is the same. And again, you stop with the user: when is your user happy again? Because this is when shit stops being on fire, and you can still debug and fix, but it's not that urgent anymore.
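Coming back to the "put a counter instead of a debug statement" advice above: a minimal sketch with the Prometheus Python client, where the metric name and the retry example are made up.

```python
# Sketch: instead of a throwaway debug print, leave a cheap counter behind.
# Two months later you can still see whether this path is ever taken and how
# often. Metric name and retry example are illustrative only.
from prometheus_client import Counter

UPSTREAM_RETRIES = Counter("demo_upstream_retries_total",
                           "Retries against the upstream service")


def fetch_with_retry(fetch, attempts: int = 3):
    last_error = None
    for _ in range(attempts):
        try:
            return fetch()
        except Exception as err:  # broad catch is fine for a sketch
            UPSTREAM_RETRIES.inc()  # instead of: print("retrying...")
            last_error = err
    raise last_error


print(fetch_with_retry(lambda: "ok"))
```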
Another really good practice, sometimes called rubber-duck debugging: take someone, a rubber duck or some person, and explain your service to them. Explain why something about it is important. Why do the users care about this or that part of the service? How do you make it good again if it's broken? These are the things which you should be explaining to someone who's not from the field. Of course, this forces you to think about all those internalized things which, hopefully, you can now think better about. Thank you very much. Questions? And please remain seated.

Yes, you have a question? Well, you explained the big-picture principle very well, but in the real world you not only have layers, you have scope differences because, to make it more fun, any big service will have some form of multi-tenancy. So the scope of the service is not the same as the scope of the things built on top of the service. And you say, rightly, that you need to surface the good indicators from one service to the other services built on top of it. But there is no convention, no technical solution for doing this. Practically, that means that for your service you build, for example, graphs and a dashboard with your internal multi-tenant indicators and the external ones, what you want your clients to see. And then your clients build their own service on top of your service, and then they manually redo all this dashboarding, and so on through all the layers. I know what you're saying, I know what your question is. So what do you propose to avoid losing the signal between the lower-layer service and the ones above?

So first, everyone who's talking: it's really loud up here. I know you're not aware of it, but we keep telling you, please be quiet, thank you very much. The short version of the question is: whatever your services are, those might not map perfectly onto the services which you're actually running. And the simple answer is: you need to break things down into atomic services, until you can't really reasonably get any smaller for whatever you're doing. And then you build groups of services, and those again have a specific name: service. So you build services out of services, and you have different delineations for different groups; someone might need an API and someone might need Grafana, that's a different story. I mean, you need to talk to people, you need to work out what works in your organization, what works with your customers, blah, blah, blah. That's a different thing, and that's a human thing. So you just need to talk, get the intentions, get the requirements, and then work from there. But the generic answer is: you need to break down into atomic bits and pieces of services, where that makes sense, and then you build larger services from those atoms again. Doing this consistently allows you to arrive at something which is pretty good to debug and pretty good to handle. But it's not magic, obviously. Anyone else?

Thanks for your talk. There are lots of standards in the IT industry, like the OCI initiative and some web standards. Do you know about any standards for monitoring? Standards for monitoring, yes. I mean, there is a ton of stuff which kind of grew over time. I'm very biased, but I'm still giving the answer: I'm on the Prometheus team and I really like Prometheus, and Prometheus exploded in this space, and the Prometheus exposition format is something of a de facto standard.
And it's the first real one we've had ever since SNMP, which is why we took this and are trying to make it into a real standard, called OpenMetrics, to at least have something for one signal, for metrics. Once this is done, I'd like to also explore the other types, especially events, probably logs first, but I'm not quite certain. I don't know of anything else which is really a standard, except for SNMP, which sucks. Other questions? Way in the back, I think, or no. Questions? Also, just for the record, you're not so special that your chair does not bang. So if you enter or leave, just stay there for some time, because it's really, really loud. So, questions?

A quick one: you talked about OpenMetrics, when are we going to see it? Yeah, I know. So for the Internet-Draft, this is blocking on me. I do have a pull request, and this is on GitHub; you can look at it, and that's what we have in the way of an Internet-Draft. You can also look at the Prometheus Python client library, which already emits OpenMetrics and is standards-compliant, and current Prometheus also ingests standards-compliant OpenMetrics. So you actually have something to test against. It'll take a few more months. Other questions? Okay, thank you very much.