Awesome. Thank you very much, everyone. We're back from the break, and we'll be continuing with our next session from Richard. Richard is a very well-known personality within the open source industry. In this session, he'll be speaking to us about observability: when you are deploying your infrastructure, managing your services, or whatever you have in your enterprise, one thing that needs serious attention is observability. He will be introducing us to observability 101. We did a workshop on Grafana yesterday and last week, so this will give more context to understand it better, along with other things around observability. Over to you, Rich.

Thank you. So it seems you can see my screen. Perfect. Yes, so let's get started: Observability 101, with a focus on Prometheus and beyond.

Let's start with the buzzwords, because observability and also SRE are absolute buzzwords. As per usual, buzzwords tend to have a core of truth, a core of meaning, but often they're just applied to whatever you already have. That is understandable, but it is somewhat dangerous, because you need to actually understand why a term has become a buzzword and why there is so much industry attention on that term or concept. There is a concept of cargo culting, which is basically replicating what you perceive others to be doing without actually looking into the details, and then not getting the outcome you would like to be getting, which is obviously not what you want. So it's dangerous to just copy what others do. It's a lot about thinking through why something works for certain people and in certain situations, and then adapting it to your own problems and your own space.

As I said, there is this kernel of truth: observability is necessarily about changing the actual behavior, not just changing the name of whatever you were doing before. In this context, monitoring is more or less the old term, and it has taken on the meaning of collecting a lot of data but not necessarily using it. There are extremes: either you toss everything into a data lake and don't really use it, at least not in a monitoring and observability context, or you do full-text, fully indexed logging, which is hugely expensive. So it's about finding out why something is the way it is, not just accepting that it is the way it is.

Observability, to me, is about enabling humans to understand complex systems. Obviously, with cloud and such, you get ever more complex systems. You want to have the complex systems and the benefits of those complex systems, but you still want to enable humans to understand them. At the same time, you enable machines to understand those complex systems, which means you can automate a lot of things, like for example alerting. So again, this is much more about why something is broken and how I can fix it, not just "well, yes, it's broken" and starting my debugging from scratch.

We need to look at a few more terms to get to the depth of this. Complexity is one of the most important ones. I distinguish between two types of complexity.
One is fake complexity, which is just bad design, or legacy design, or what have you; maybe there were design constraints before which are not there anymore. It doesn't matter: often the things which are complex in a system are not inherent to the system, and that complexity can and should be reduced, because if you have complexity for complexity's sake, you're making your own life harder and making it more expensive to run your service, which again is not what you want.

On the other hand, we have real, system-inherent complexity, and that is complexity which is a necessity of actually doing what you want to achieve. Obviously, if you do complex things and good things, and you have lots of moving bits and pieces, there is some complexity which you cannot just reduce away; you must deal with it, because it is part of what makes your service a service your users want to use. You can move this complexity around. We had monolithic and mainframe designs, we had client-server, we have microservices. Prometheus itself, for example, is a monolith, so you can see that different decisions are possible, and even within the cloud native context it can make sense to run a monolithic service like Prometheus. But you cannot make this complexity go away; you can only move it.

It must be compartmentalized; a different name for this is service boundaries. Your hard drive is insanely complex, but it has a clearly defined interface, and your operating system and your mainboard can just address your hard drive. Same for your CPU and such: those are also super complex, but the complexity is compartmentalized away, so you don't have to deal with it at the level at which you are dealing with whatever service you have. Same for cloud instances and everything, of course. Ideally, it should be distilled in a meaningful way, so you can extract what you need from that complexity to understand what is happening where you need to understand it, and otherwise you can more or less ignore it, unless you're part of the team which is actually responsible for running that one thing.

SRE is another buzzword which often comes up in the context of observability. Another one would be DevOps; they are not precisely the same, but they go in the same direction. To me, and there are other definitions, the core meaning of SRE is to align incentives, because you want different people and different teams to actually work together and not against each other. A lot of what you see in the Google SRE book and such, if you distill it to its essence, is basically about making people think about the same things and aligning their incentives, so that without having to discuss and fight about it, they do the same thing, or at least go in the same direction.

Usually important here are SLI, SLO, and SLA: service level indicator, service level objective, and service level agreement. The indicator is just what you measure, the objective is the threshold you do not want to cross, and the agreement means that if you do cross it, you actually have to pay, or you break a contract, or whatever. As a specific example: if you have error budgets for your service, this allows your developers, your operations people, your product manager, everyone, to optimize for shared benefit (a small worked sketch of the arithmetic follows below).
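To make the error-budget arithmetic concrete, here is a minimal sketch in Go; the 99.9% SLO and the 30-day window are assumed example numbers, not figures from the talk:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical numbers: a 99.9% availability SLO over a 30-day window.
	slo := 0.999
	window := 30 * 24 * time.Hour

	// The error budget is the slice of the window you are allowed to be
	// out of SLO; everyone (dev, ops, product) spends from this one pool.
	budget := time.Duration((1 - slo) * float64(window))
	fmt.Printf("error budget at %.1f%% over 30 days: %s\n", slo*100, budget)
	// Prints roughly 43m12s: once that is spent, risky releases wait.
}
```

The point of expressing it as a single shared number is exactly the incentive alignment described above: everyone can see how much budget is left and plan accordingly.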
If the service is super stable, everyone gets to do their A/B testing, their new features, and everything. But if that error budget is used up, the operations people can say: okay, we cannot take any updates unless they are super well tested. That puts load on the developers, and they can't ship their new features, which they don't like, so obviously they will try to spend the shared error budget as carefully as possible; the same goes for operations, for the product manager, for everyone else. That is aligning the incentives of those people.

What can this mean in the specific? Everyone using the same tools and dashboards would be a good thing, of course; you have this shared incentive that everyone invests in the same tooling. Everyone works on the same dashboards, so they share a language: whether you have one dashboard or five, the terminology is the same, so they automatically share an understanding of how to look into the service, and all their tools work the same way. That also pools your institutional knowledge: one improvement to the one dashboard, or to an alert, made by that one person, benefits everyone else. You don't have ten different islands of data; everyone works on the same system and shares that system knowledge.

What is a service? Basically: you compartmentalize your complexity, you have your interfaces, services usually have distinct owners and/or teams, and contracts define the service interfaces. Why the term contract? I like this term because it is a shared agreement, in writing, which must not be broken. If it is being broken, you need to discuss this with whoever is invested in that service or in that contract, so automatically you have this control function, this forcing function. It doesn't matter whether those are external or internal customers. Some people in your org will care more about external customers, but I would argue that internal customers are just as important, because they in turn provide services to external customers. So treating teams and service owners within the org as each other's customers makes absolute sense in my opinion.

You could also call this a layer, and the internet wouldn't exist without proper layering, where you have your layer two, layer three, layer everything, and you can fully parallelize the work on those different layers. Different innovations can happen anywhere, and as long as the interfaces stay the same and compatible, you can do whatever. Wi-Fi was developed without TCP or IP having to be adjusted for it: it is a different layer with clean interfaces, so the work could happen on that one layer, and no one had to think about "could I ever have wireless?" back when they designed IP. It just keeps working. And I already talked about the CPU, the hard disk, and such. Even for your lunch: in the common case you won't be doing everything yourself, like growing all your own wheat and blacksmithing your own tools to grow that wheat. You will be buying certain bits and pieces. So no matter how much you cook yourself, you still have those service interfaces everywhere in your life.
Your customers don't really care about your internals. They don't care if half of your database nodes are down; they care about their database service being up and quick. That is how you need to think about those services: from the perspective of the paying customer, who doesn't care about any of your internals, only that the service works.

Something you will not see very often, but which I think is hugely important: you need to discern between different types of SLIs. The usual wisdom is that you only care about your own SLIs, which I disagree with. I think you also need to care about the SLIs of your underlying services. Basically, whatever your underlying services consider their primary, service-relevant SLIs, the ones they alert on and use to check that the contract is being upheld, you should treat as informational SLIs, to help you debug and understand what might be happening in the underlying services you rely on.

As to alerting: anything which is currently impacting a customer-facing service, or is about to, must be alerted upon, and nothing else should be. If it's just a disk which is half full or whatever, raise a ticket and handle it during business hours; don't wake someone up for it. If your customers are unable to access the system because of that half-full disk, that is the reason to alert, not the mere fact that some disk is filling up.

So let's look at tools. First and foremost, obviously, Prometheus. Many of you will know it, but still, let's walk through the 101. It's inspired by Google's Borgmon. It's a time series database which internally stores values as 64-bit floating point numbers. It has the concepts of instrumentation and exporters: instrumentation means modifying your own source code, or other people's source code, to emit metrics directly from within the system, while exporters are basically proxies which take, say, SNMP or a database or something and rewrite it into something Prometheus can understand. It is not meant for event logging, and dashboarding happens via Grafana.

The main selling points of Prometheus: it has highly dynamic, built-in service discovery. You can just point it at your Kubernetes cluster and everything happens as if by magic. You can do a DNS zone transfer, and Prometheus will start monitoring everything in that zone. There's integration with pretty much every major cloud provider, with more coming all the time, so you point Prometheus at your cloud provider's endpoint and it learns about the services and starts scraping them.

You don't have a hierarchical data model; you have an n-dimensional label set. With a hierarchy like region, country, customer, the moment you want to group by customer you break the hierarchical tree model. Here you can just select by the label customer="whatever" and you're done (a sketch of this follows below). There's a language, PromQL, which we use for everything: processing, graphing, alerting, exporting. You need to learn it, and it is a new language, but it is insanely powerful. Prometheus itself is quite simple to operate, and it's super efficient, most likely more efficient than anything older than Prometheus that you may have used. That's not so common anymore, but there are still people who see it as the new thing. Another selling point: it's pull-based, which gives you nicer properties for certain types of alerting and consistency checks.
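As a hedged illustration of that label-based selection and of PromQL, here is a minimal sketch using the Prometheus Go client's HTTP API; the server address and the metric name myapp_http_requests_total are hypothetical placeholders:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Hypothetical Prometheus address.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}

	// Per-customer request rate: no hierarchy to walk, just select and
	// group by the "customer" label across the n-dimensional label set.
	query := `sum by (customer) (rate(myapp_http_requests_total[5m]))`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := v1.NewAPI(client).Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```

The same expression works unchanged for graphing, alerting rules, and recording rules, which is what "we use it for everything" means in practice.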
We have the concept of black box monitoring, where you look at a system from the outside, versus white box monitoring, where the box is completely open and you can look into its inside. Usually, every service should have its own metrics endpoint which you can scrape directly; you can also probe things from the outside with agents and such, but with Prometheus you usually want that white box endpoint. And we have super hard commitments within major versions about what we treat as stable, so we don't just break stuff.

What are time series? Time series are values recorded over time, or which change over time. Individual events, like a function being called, are usually merged into counters and/or histograms, for latency and such. Changing values, like your temperature or your memory usage, are gauges, and they can go up and down. Typical examples, which you've probably already seen: the access rate to a web service would be a counter, temperature would be a gauge, service latency would be a histogram. The format is super easy to emit and parse. I know people who just printf in their C code, put that on a web endpoint, and that's it; that's how they emit data towards Prometheus (a sketch of this follows at the end of this section). Scaling is super easy too.

Kubernetes is roughly equivalent to Borg, which is what Google runs their services on, and Prometheus is basically equivalent to Borgmon, though the APIs are more Monarch-style. While Kubernetes and Prometheus were not started with each other in mind, they are inherently designed for each other because of that shared heritage, and if Kubernetes changes anything about kube-state-metrics and such, that's always agreed with the Prometheus team, because there is people overlap between the two projects.

Raw numbers: the highest ingestion we know of is 2.5 million samples per second into one Prometheus server; what you get out depends on how you tune it. In a recent test I got to 60k samples per second per core, and in tests before that we went to 100k. We can compress those 16 bytes per sample down to 1.36 bytes per sample, which says something about the efficiency, and the largest Prometheus we know of has 125 million active series.

There are two long-term storage options: one is Thanos, one is Cortex. Historically, Thanos was easier to run and scaled storage horizontally, whereas Cortex was harder to run, though it has become a lot easier, and it started by scaling the ingesters and queriers horizontally. Cortex took in code from Thanos to also scale its storage horizontally, and Thanos is working on taking in the horizontally scalable ingester and querier code, because those projects are quite close to each other. They experiment differently, but still they are close, and I hope that at some point they merge; probably they won't, but I would hope so.

The official format for Prometheus is called OpenMetrics. It's formally a standard independent of Prometheus, but Prometheus uses it as its official format. That is mainly for compatibility reasons, to give people, projects, and vendors something to support which is not called Prometheus; so partly for political reasons that name was chosen. It's also about putting all of this through the IETF, so you have a real, official, independent standard. There is a concept of three pillars: metrics, logs, and traces.
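To make the "just printf it" remark concrete, here is a minimal sketch of hand-rolling the text exposition format, in Go rather than C; the metric names are hypothetical, and a real service would normally use a client library, which also handles escaping and metadata:

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

var requests atomic.Int64

func main() {
	// The exposition format is plain text: TYPE metadata, then one
	// "name{labels} value" line per sample. That is all Prometheus needs.
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "# TYPE myapp_requests_total counter\n")
		fmt.Fprintf(w, "myapp_requests_total{path=\"/\"} %d\n", requests.Load())
		fmt.Fprintf(w, "# TYPE myapp_temperature_celsius gauge\n")
		fmt.Fprintf(w, "myapp_temperature_celsius 21.5\n")
	})
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requests.Add(1) // counter: only ever goes up (resets on restart)
		fmt.Fprintln(w, "ok")
	})
	http.ListenAndServe(":8080", nil)
}
```

Point a Prometheus scrape job at :8080/metrics and the counter and gauge show up like any other series.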
Of course, metrics and logs are the easiest and cheapest of these in many ways, and traces are where you go with your application monitoring, which is why they are super tightly coupled. In particular, tying metrics to traces, or logs to traces, is super easy with exemplars, which are a way to attach trace IDs directly to your metrics or your logs (a small code sketch follows at the end of this section). The reason this works is that you don't need the full label set; you can just use this one direct pointer, which has a few other nice properties. In particular, you already have all the context when you jump into your trace; you already know what's wrong. And yes, I'm absolutely serious about that one: I did start OpenMetrics to change how the world does observability, so that you have metrics, logs, and traces all with the same data model and the same underlying assumptions, which makes it much easier to jump between them.

Speaking of which: Loki. Loki is basically like Prometheus, but for logs. It has the same label-based system as Prometheus. You don't need a full-text index; you just index your labels, and everything else is an opaque string, which makes it super quick and super cheap to run, and it works at insane scales. Your logs have the same label sets as your metrics, as I already said, which makes it a lot easier to jump between the two, and you can also easily extract metrics from your logs. If that looks familiar, that's because it is: you have your timestamp, which is mandatory in logging, but otherwise you have the same label set, and then you just have your opaque string.

That leaves us with traces. Tempo is one of the implementations, and it's designed precisely for this exemplar-based world. It needs only an object store; you don't have to run any expensive services in the backend. It's fully compatible with OpenTelemetry tracing, Zipkin, Jaeger, all those things. Because it is so efficient, you don't need to sample your traces, so if you have an interesting trace ID, you can actually jump to it instead of having lost it to sampling. And Prometheus, Cortex, Thanos, and Loki all support exemplars, so you can do this jumping back and forth.

Some numbers on scaling: for what we run internally, we ingest one million samples per second and retain one hundred percent of those. With 14-day retention and three copies stored, that costs roughly 200 CPU cores, 300 gigs of RAM, and 40 terabytes of object storage. We did a 10x jump recently, and we already have plans for the next 10x jump; those numbers are a few weeks old, and I think we already have better ones now.

Bringing all of this together: you can jump from your logs to your traces directly, from your metrics to your traces, and from your traces to your logs. And all of this is open source, so you can run it yourself. We also have a cloud offering and such, obviously, but all of this is open source, so you can really run it yourself without having to pay anyone. You can just take the software, run it, and you're done.
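As a hedged sketch of how exemplars tie metrics to traces, here is a minimal example with the Prometheus Go client; the metric name and the trace ID are hypothetical, and note that exemplars are only exposed when the OpenMetrics format is enabled on the handler:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var latency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "myapp_request_duration_seconds", // hypothetical metric name
	Help:    "Request latency.",
	Buckets: prometheus.DefBuckets,
})

func main() {
	// Attach the trace ID of the current request as an exemplar, so a
	// dashboard can jump from a latency bucket straight to that trace.
	latency.(prometheus.ExemplarObserver).ObserveWithExemplar(
		0.42, prometheus.Labels{"trace_id": "abc123"},
	)

	// Exemplars only travel over the OpenMetrics format, so it has to be
	// enabled explicitly on the metrics handler.
	http.Handle("/metrics", promhttp.HandlerFor(
		prometheus.DefaultGatherer,
		promhttp.HandlerOpts{EnableOpenMetrics: true},
	))
	http.ListenAndServe(":8080", nil)
}
```

That single trace_id pointer is what replaces carrying the full label set on every span, as described above.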
Thank you, and now for questions.

Awesome, thank you very much, Rich. Let me check the chat. So far I have not seen any questions yet, but I would like to ask one for the benefit of those who are new to the cloud native ecosystem: is observability a major concern that someone who is new to all of this should worry about, a skill they should pick up at a very early stage?

Yeah, absolutely; it's a hard requirement in my opinion. If you look at previous systems, where you had one service running on one machine or some such, you basically had a lot of the same underlying complexity, but it was well hidden behind an operating system and behind more traditional tools which already allowed you to do all that debugging. That changed with cloud native. Part of the cost you pay for being so flexible and so scalable in a cloud native world is that you redistribute the system-inherent complexity. Before, if you had a service, you had your server, and when you got more users you bought a bigger server, so a lot of this was contained within that one system. Now, if you run everything in the cloud and a lot of users jump in, maybe you scale out to two, three, ten times the amount of whatever your service runs on, and this leads to an absolute explosion of information about your system as it is running. With this immense amount of data, you're no longer able, as a human, to just go through a few log lines and figure out what's happening; it's simply impossible, because there is so much going on at the same time. So you don't have a chance to run a service properly unless you have a chance to understand how that service runs. Observability, in large part, is just a code word for making it possible to understand all of this: not only what is happening, but in particular what is happening, and why, when something goes wrong.

Okay. How is it different from tracing?

Not at all; tracing is part of observability. There are different approaches to how you do tracing within observability, and if you want me to talk about them, I can easily do so, but the high-level reply is that tracing is one of the signals you need for proper observability. If you have access into your software, which is pretty common with cloud native, you should absolutely make tracing part of your observability story. If you run more traditional services, or even servers, machines, and network routers, you usually don't have access to those internals, so you simply cannot; but as soon as you do have access, you should.

Okay, awesome. Thank you very much. I think we still don't have any questions; I've checked the live stream as well, and there are none there either. But I believe the participants have seen your contact details: you can reach Richie on either Twitter or email if you have any questions or need more clarification on observability or tracing, and he can definitely point you in the right direction. Thank you very much, Rich. And thank you for having me. Yeah, sure.