Okay, today we are talking with Richard Hartmann of Grafana Labs: an intro to open source observability with Grafana, Prometheus, Loki, and Tempo. Everyone, please remember that during the webinar you're not able to speak as an attendee. You're using the chat already, I see, so say hello and thank you to Richard. We'll get to as many of those as we can at the end, and we can even stay on a little longer since we're running late. This is an official webinar of the CNCF and as such is subject to the CNCF Code of Conduct. Please do not add anything to the chat or questions that would be in violation of the Code of Conduct, and please be respectful of your fellow participants and presenters. Please also note that the recording and slides will be posted later today to the CNCF online programs page. They are also available via the registration link, which will take you to our online programs YouTube playlist. With that, I'm going to hand it over to Richard so we can kick things off. Thank you again.

Thank you, Libby. Thank you, everyone. Word of warning: I keep getting pop-ups from the platform that my internet connection is unstable, which I don't believe is the case, but something is broken-ish, so if I drop, I'll try to rejoin. So let's get started: intro to open source observability. A little bit of credentials first: most of my life I've worked in engineering, architecture, and operational roles, so I have strong opinions about the right tools, and about not-perfect or not-good-enough tools.

Oftentimes you have breaks between different media, breaks between different trains of thought: how to index your data, how the mental modeling works. Maybe one thing uses one color and the other thing uses another color, or one reads left to right and the other right to left. It doesn't matter; the point is you have breaks between your different systems too often, which in turn means that way too often you end up paying extra cost in mental overhead or in automation overhead. It's not seamless. You need to switch mental modes when you go from your logs to your traces or what have you, which is not nice. It just adds friction, and you don't really need that at five in the morning on a Sunday when you've gotten a page.

So let's try to rethink what we actually want to do here. I'm going to go through a little of the philosophy of observability in a few buzzwords, as a foundation for what we are then talking about. Cloud native scale is basically what internet scale was two decades ago, and that's important to keep in mind, because a lot of the issues we see in the cloud native world have already been solved in different contexts before us. It's always a good idea to look at what engineers before us did to solve problems: not the specific implementations, because those usually don't fit if they're too old, but the underlying concepts. For example, computer networks, the internet, also power networks: a lot of those tend to run on metrics, because metrics are already a distillation of what you care about as a domain expert. So it's always good to look back at what has been done before and what worked, not the specific implementation, but the engineering point of view. As always in tech, we have buzzwords.
Buzzwords usually have a kernel of truth, but by the time they are buzzwords they have lost most of that meaning, which is a pity, but it also explains why they were so successful. Of particular note is cargo culting. Cargo culting is observing behavior, observing the success or results of that behavior, and emulating the behavior without actually applying the underlying thought or fundamental engineering practices. The term comes from indigenous people who observed soldiers building runways, small control towers, and such, after which the gods sent goods from the heavens. That was basically just army logistics, but the perception was that just by building runways you could receive gifts from the gods, and to this day those things still echo in a few religions. So observed behavior becomes part of culture, but it's not actually doing anything; it's not actually pursuing the goals or the underlying rationale, and that's something you always need to worry about. It's not about just changing the name for a thing, where anyone who was a sysadmin yesterday is an SRE today and you're done. It's about actually changing the behavior, and actually understanding why something is successful, not just observing that it is successful.

Monitoring: while I personally use monitoring and observability more or less interchangeably, in the buzzwordy definition monitoring has taken on a meaning of collecting data without using it. You have two extremes here. One is full-text indexing, where in a vain attempt you go after everything you can find; the other is the data lake, which outside of batch analysis is often a euphemism for "no one is ever going to look at this." Observability tries to reframe that a little, towards being able to ask new questions: observe what inputs and outputs a system has, and deduce the internal state of that system from those inputs and outputs. As in, ask questions which you didn't know you wanted to ask before. That enables humans to understand complex systems, but it also allows you to automate a lot of this. So it's not just about determining that something is in a certain state; it's also about determining why it is in that state, and ideally how to get it out of that state. If you cannot ask new questions on the fly, it's just not observability.

Another super important concept is complexity. You have what I call fake complexity, a.k.a. bad design, which you can reduce, and should reduce in my opinion, unless you have other engineering constraints: money, go-to-market, maybe compliance reasons, what have you. Outside of actual reasons for complexity, you should always strive to get rid of it. But you also have real, system-inherent complexity, and that can be moved but cannot be made to go away. State is always somebody's problem: all your microservices are stateless, but someone has to maintain the database, so that complexity has to live somewhere.
So yeah, you can move it back and forth, and you can compartmentalize. In my opinion, strong opinion, you should compartmentalize it, and you should distill it meaningfully. There are two directions for distilling it: (a) the APIs towards whoever the consumer or user of the thing is, and (b) what you need to emit towards the observers, towards your operational teams, so they can look at the thing.

That brings us to SLIs, SLOs, and SLAs. Oftentimes people are confused about what they mean, but it's really simple: service level indicators are what you measure, objectives are what you need to hit, and agreements are when you need to start paying because someone broke a contract. A lot of SRE, to me, is about aligning incentives across the org. Devs want to ship code; they want to ship new releases ASAP. Operational people are paid for stuff not breaking. So you have diametrically opposed incentives, where one group wants to move super quickly and the other wants to move rather slowly and carefully, and so they always fight, they always have strife, because it's built literally into their compensation structure and their complete organizational structure. One of the main things of SRE, to me, is the concept of error budgets. Everyone shares a budget for how many errors a thing can have, and as long as you stay within that budget, it's fine. It doesn't matter whether the errors come from new features, A/B testing, a deployment the PM needed really urgently, or things just breaking. But if things break too often in operations, the devs don't have error budget left for their testing and deployment velocity. So you align those incentives.
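To make the error-budget idea concrete, here is a quick worked example of my own (it was not on the slides): say your SLO is 99.9% availability over a rolling 30 days. The budget is the 0.1% you are allowed to fail:

    (1 - 0.999) × 30 days × 24 h × 60 min = 43.2 minutes

Roughly 43 minutes of unavailability per month, shared by everyone: feature launches, A/B tests, and plain old outages all draw from the same pot.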
Another nice thing is if you're able to build a shared understanding, not just align incentives between people, and that's where dashboards come in. Ideally all those dashboards are shared between all the different teams, because then you have an incentive to invest in shared tooling: everyone improves it a little bit, and everyone else benefits. You pool all your institutional knowledge around a thing from a lot of different angles, and everyone works together on making it better. It also means you're building the same language and the same understanding. Everyone has the same dashboard. The PM doesn't need to fight the engineers about what that one metric means, because they literally look at the same data. They don't use different words for different aspects, because all of them use the same dashboards, the same alerts, the same reports, which in turn means they use the same language.

Services are another super important concept: they let you compartmentalize complexity. If you remember, just now I said one of those two abstraction layers would be an interface towards the user. Services usually have distinct owners and teams. Obviously a team can own more than one service, but by and large each service tends to have its own group of people responsible, and contracts define their interfaces. I like the term contract a lot, because as commonly defined it is a written agreement which must not be broken. You actually write it down, you agree to it, you sign it, and by writing things down and making things explicit, a lot of those implicit misunderstandings just go away. Once it's written down and agreed, and it's the basis for what you actually do and how you operate, a lot of people will take a second and third look and actually start negotiating details, instead of everyone going "yeah, whatever, it'll work," and then it breaks, everyone fights about why it broke, and you realize you had a lot of misunderstandings. It doesn't matter whether the customers or consumers are internal or external: treat them as if they were external, because they are depending on your thing.

For anyone coming from networking, like myself, layers are another way of thinking about this. The internet wouldn't exist without proper layering: I can literally rip out layer one and layer two, and instead of Ethernet I have Wi-Fi or what have you, and that wouldn't be possible without clean, long-term stable interfaces between the layers. The same goes for CPUs, hard disks, compute nodes, even your lunch: even if you cook from scratch, you will not grow every last cucumber yourself. You have certain interfaces where you buy other services and just consume them.

Alerting is also super important. Customers don't care if you have, I don't know, 20 database nodes. They don't care whether 15 of them are down, or five, or all of them are healthy. They care about the service they are consuming being healthy and responsive. So that's the perspective to take: define your SLAs, SLIs, and SLOs from the perspective of what is user-facing or user-visible. The nice thing, if you do this in depth, is that your provider's SLA and SLI are perfect for your own debugging: if that database is down, you don't need to debug why your webshop is not working; you kind of know. So again, you use the same language across the complete stack of what you're doing. And, important to avoid burnout: anything which is currently or imminently impacting customers must be alerted upon, and nothing else. Raise a ticket and handle it during business hours; if it's not customer-impacting, just don't page, or you'll burn out.

So that's the intro part; now we get to the tech part. Prometheus, if you don't know it, is inspired by Google's Borgmon. It's a time series database; internally it uses 64-bit values for pretty much everything that is relevant. There are thousands or tens of thousands of public instrumentations and exporters, and millions of installations of Prometheus; it is by no means run only by Grafana. Main selling points: built-in service discovery, which is well established by now. It's not impossible, but very uncommon, to run Kubernetes without a Prometheus of some sort, because they are literally designed for each other, going back to Google's Borg and Borgmon, and, more or less by a happy little accident, with Kubernetes and Prometheus within the CNCF. Lo and behold, those are the two founding projects of the CNCF; they go together. You have a non-hierarchical data model: you don't have your region, your city, your customer, where to select by customer you suddenly need to walk up and down your hierarchy. No, you have an n-dimensional label set, which you slice and dice as you need: you select by the label customer="x" and you're done. And there's PromQL, a functional language which allows you to do vector math on your data. It's highly efficient, and in particular the label matching usually does, more or less by magic, what you want. It's used for everything: processing, graphing, alerting, exporting data. Every way you work on the data is through PromQL, so it's a language you have to learn, but it's the one language, and then everything works.
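As a small illustration of that vector math (my own sketch; the metric and label names are made up): divide the 5xx rate by the total rate, per path, and the label matching lines the two vectors up for you.

    # Fraction of requests failing, per path, over the last 5 minutes
      sum by (path) (rate(http_requests_total{status=~"5.."}[5m]))
    /
      sum by (path) (rate(http_requests_total[5m]))

The same expression works unchanged in dashboards, alert rules, and recording rules; that's the one-language point.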
Simple operation: I probably don't need to convince you of that. Highly efficient. And it's pull-based, for good reason: it makes a lot of things easier to reason about, regarding correctness and the up-to-dateness of the state of the wider system. Push versus pull is a borderline religious debate, but in particular coming from the networking space, there are some properties of pull which are next to impossible or super hard to emulate in a push-based system, unless the push-based system has complete information about everything which should be sending data, at which point pulling is more efficient anyway. White-box versus black-box monitoring: black-box looks at the thing from the outside without further information, whereas white-box monitoring looks at all the innards; you instrument your code and emit data from the inside. Every service should have its own metrics endpoint. And with things like the Prometheus Agent, which we announced today (with my Prometheus team hat on; see the blog at prometheus.io/blog), we can also accumulate this data for you and even push it to other backends. The API commitments are also stronger than anything I've ever seen in my life, maybe except for the Linux kernel. Actually, most certainly except for the Linux kernel, at least defined as user interfaces which are never deprecated.

What are time series? Recorded values which change over time. For example, the temperature in your room: that's a time series. You usually merge individual events, say tens of thousands of people accessing a thing, into counters and histograms. Typical examples would be requests to a web server, temperatures, service latency, that kind of thing. It's super easy to emit, parse, and read; the sketch below is roughly how it looks on the wire. I know people who printf this format in their C code and just dump the file onto a web server, and that's how they instrument their code, and it works. There are easier ways, but for them that works, and it's totally fine.
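For those who haven't seen it, the Prometheus text exposition format looks roughly like this (a hand-written sample, not from the slides; metric and label names are made up):

    # HELP http_requests_total Total HTTP requests served.
    # TYPE http_requests_total counter
    http_requests_total{method="get",code="200"} 1027
    http_requests_total{method="get",code="500"} 3

Labels go in curly braces, one sample per line, which is why a printf genuinely is enough to emit it.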
On scaling: scale is kind of built in, because Prometheus and Kubernetes were designed and written with each other in mind, back at Borg and Borgmon. Just looking at Prometheus (and I see I have a typo on the slide, there's a "two" missing in that sentence): roughly one million samples per second is not a problem on current hardware, and 200k samples per second per core is roughly where we are, though that figure is already slightly dated. The single largest Prometheus instance which we, as in the Prometheus team, saw in production was 125 million active time series, and I know of someone who ran it at 700 million. So yeah, it's kind of scalable, but it's also painful at that point; you would probably want Cortex or Thanos or something.

Speaking of which, there are two projects with a high overlap of Prometheus team members: Thanos and Cortex. Historically, Thanos was easier to run and scales storage horizontally. Cortex is a lot easier to run these days, and it started with scaling ingesters and querying horizontally, then took code from Thanos to also scale storage horizontally. Guess what Thanos is working on for ingesting and querying. Some data from Grafana itself, from the largest single cluster (not all of Grafana, just one cluster), and this is already old data, we have higher numbers now: 65 million active series at a cost of 668 CPU cores and 3.4 terabytes of RAM. One customer is running at 3 billion, which is more than pushing it, but it did not completely die in a fire.

Loki is basically like Prometheus, but for logs. It follows all the same design principles, has the same label-based system and the same type of indexing, and it takes tons of code from Cortex, for obvious reasons. The nice thing is you don't need a full-text index, because usually when you work on logs you don't need every last bit and piece indexed. Most often you're able to extract a few relevant bits of information; you index those, you search on those, and the rest is just an opaque string which is stored without indexing. That means a lot less overhead in cost and storage, and in particular in indexing and in lookups, and you can work at significant scale.

One of the nice properties, initially non-obvious to a lot of users: since Loki uses literally the same label-based system as Prometheus, it's trivial to turn your logs into metrics, to extract metrics from your logs for alerting, graphing, and so on. Basically preprocessing logs into metrics. Again, remember internet scale two decades ago: it's literally the same trick, where a lot of singular events were turned into metrics and then just the metrics were exposed. In Loki you have that mechanism built in, which is super nice (there's a small sketch of it below), and, except for Google's mtail, which was kind of dead even when it was released, it's something we hadn't seen in the open-source world. Certain search engines and such have this internally, but nothing else did prior to Loki.

And you can pump basically any type of text-based information into Loki. One of the lead devs even puts his car telemetry and pictures from his dashcam into Loki, because he can and he likes to. Again, the content part is unindexed, which means you can put whatever in; it's just an opaque string, or blob, to be precise. You might remember the Prometheus exposition format, or the OpenMetrics format, which we saw earlier: this is literally the same, with the same labels. You just need a timestamp, because obviously an event is always at a specific point in time, so you need to emit that specific point in time. Metrics are handled differently on a conceptual level: you can emit precise timestamps, but usually, for mathematical reasons we're not going into here, it's better to have Prometheus or Cortex or Thanos handle the timestamping, whereas with Loki it's better to have the emitter handle the timestamping.

Some numbers: our queries at Grafana Labs regularly see 40 gigabytes per second, and I know we already see 80 gigabytes per second in pre-production, due to a new way we scale our queries. That means you can go through insane amounts of data within a super short time; we regularly query terabytes of data in under a minute. And ideally you then emit the result back into metrics, so you don't have to run those relatively expensive queries regularly. What you really care about, you emit into metrics, and again you reduce the total amount of information, and the computational complexity, by orders of magnitude.
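To give that logs-to-metrics idea a concrete shape, a small LogQL sketch (my own example, with made-up labels): the same label matchers as in PromQL, plus a line filter, wrapped in a rate() and summed per pod.

    sum by (pod) (rate({app="webshop"} |= "error" [5m]))

That's errors per second per pod, computed straight from the log stream. Put it in a recording rule and you pay for the scan once instead of on every dashboard refresh.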
Tempo is the last of the bunch. With OpenMetrics, there was another thing brought into the open which before that was basically limited to Google. With my OpenMetrics hat on: when we were talking, ages ago, about potentially merging OpenCensus and OpenMetrics, one of the things which stuck with me is the Googlers mentioning how searching for traces didn't scale. And when Google tells you that searching for something doesn't scale, you'd better listen, which I did. Exemplars are just an ID: you have an ID for a trace, and you attach that ID to a metric or to a log line. Now you know that a relevant metric or a relevant log line carries a trace with it, and you don't have that needle-in-a-haystack problem where you have to search through all your traces, or run live analysis on your traces, to deduce the properties of one particular trace. You already know that this one is relevant, because it came from that high-latency bucket, where, I don't know, your p99 was two seconds, whatever; it doesn't matter. You know you have high latency there, you know you had that one error, that one security exception, what have you, and you know that this one trace is relevant to the thing you're currently working on, which you saw in your logs or your metrics. So you don't need to search, and you don't need to switch mental context all the time trying to walk through a ton of traces or spans. You simply jump from your metrics or your logs, where you already know that something is relevant, straight into your traces. Super nice. And exemplars are built into pretty much everything we're talking about here; kind of obvious, but nice.
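On the wire, an exemplar simply hangs off a regular sample in the OpenMetrics text format, after a '#' (a hand-written snippet of my own; the trace ID is made up):

    # TYPE http_request_duration_seconds histogram
    http_request_duration_seconds_bucket{le="0.5"} 1438
    http_request_duration_seconds_bucket{le="2.0"} 1442 # {trace_id="KOO5S4vxi0o"} 1.94
    http_request_duration_seconds_bucket{le="+Inf"} 1442

The {trace_id=...} 1.94 part says: one concrete request in this latency bucket took 1.94 seconds, and here is the trace that explains why.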
Tempo also allows you to search, because some users and some use cases just require searching more or less raw traces and spans. My own personal opinion is that at some point it would be nice to optimize search away, but if you need to rely on search going forward, that's also completely doable; better would be to go through exemplars, because it's just so much more efficient. Tempo works on object storage alone: you don't need Cassandra, Elasticsearch, or anything expensive in the background. Give it an object store and you're done. It's compatible with all the things: OpenTelemetry tracing, Zipkin, Jaeger. And by default we do not sample; you can sample if you want to, but we don't. I also see I need to update that slide: as of four months ago, which is eons at this development velocity, we had over two million spans per second at 350 megabytes per second, with 14-day retention and three copies stored, at a cost of 240 CPU cores, 450 gigs of RAM, and 132 terabytes of object storage, with a p99 of 2.5 seconds. It's better already; the point is that Tempo scales, and it scales insanely high.

Bringing all of this together: this more holistic approach allows you to jump from logs to traces, from metrics to traces, from traces to logs, and all the other ways, because it's literally designed for each other. They are all distinct projects, and you're not forced to use all of them to reap benefits, but if you so choose, you get the most bang for your buck, because (a) those things have been designed for each other, and (b) personally speaking, since at least 2015 I have been working towards having those three things, metrics, logs, and traces, as one holistic whole, so there is a long-running underlying design.

As to the bang for the buck: all of this is open source, and you can run it yourself. I like food and shelter, so you're also more than welcome to go to Grafana Cloud, or buy Enterprise, or what have you, and there are some more features there. A rough sniff test: if the intended user has more money than time, it tends to be a paid feature; if they have more time than money, it tends to be open source. That's roughly the sniff test for our monetization strategy. Again, most of, or rather everything, we talked about right now is completely open source, and you can run it yourself.

A few screenshots: most of you know how Grafana looks, but still. Those blue lines are relatively new and super nice: you can have events, you can have your alerts, things like this, which give you a lot more context. You can also have exemplars visualized, and tons of other visualizations. Just last week we had ObservabilityCon 2021, online obviously, and a lot of what we just talked about you can find there in more depth, without this rush to cover as many questions as possible, at this location (grafana.com/go/…); that's also part of the slides, and it's even clickable. Thank you very much. You can find past talks on GitHub, all of them, for the last decade, and email and Twitter are there for your perusal.

Let's see what we have as questions. Do we get curated questions that are read out, or how does it work? I honestly don't know, sorry.

Anyone with a question, just drop it into the chat, and Richard can take a look at them as they come in.

Sounds good, we'll go from there. Good, good. So there are currently no questions, which means I wouldn't have had to hurry as much; I can also ad-lib and go into more detail on other stuff, but do ask questions if you have any.

"How to orchestrate apps to integrate with Grafana Cloud?" Can you expand on what you mean by orchestrate? I think you may be mixing, on the one hand, your own orchestration of applications with, on the other, how to emit data towards Grafana Cloud. I can try a partial reply to the second part of the question, as I understand it. The easiest way, for most things, is the Grafana Agent, which is what the Prometheus Agent released today is based upon. It allows you to channel all your signals towards Grafana Cloud, and if you have any of the other common interfaces, they're all supported. Ideally you put things somehow into Prometheus remote write to emit towards Grafana Cloud if it's metrics; for traces, OpenTelemetry tracing is the gold standard, so you should absolutely use that. If you have non-Prometheus things, there's an exporter for pretty much, or probably, everything on the market to get data into Prometheus format, and then you can use the agent or other mechanisms to push towards Grafana Cloud if you want to. The OpenTelemetry Collector also supports Prometheus remote write, so you can use that as well. Pretty much everything on the market is supported, and Promtail and such for Loki are all built into the Grafana Agent. If you just want the bare-bones OpenMetrics-to-Prometheus-remote-write pipeline, the Prometheus Agent is better; if you want built-in exporters, Promtail, OpenTelemetry tracing, and all those things in a single binary, the Grafana Agent is better. It depends on your trade-offs: some deployment models like to have a single huge binary which does pretty much everything, while other deployment models mandate tons of smaller services. Both are valid, and both are covered.
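For the metrics leg of that, the relevant configuration is a remote_write block in Prometheus or the agent. A minimal sketch; the URL and credentials are placeholders, and your Grafana Cloud stack shows the real ones:

    remote_write:
      - url: https://prometheus-<region>.grafana.net/api/prom/push
        basic_auth:
          username: "<your instance id>"
          password: "<your api key>"

Anything that speaks this protocol, whether Prometheus, the agents, or the OpenTelemetry Collector, can ship metrics the same way.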
"Are there Docker images available?" Yes, for everything, as far as I'm aware. If not, poke us on the CNCF Slack or the Grafana community Slack, or shoot me a message, but I would be surprised if we don't have up-to-date Docker images for everything. I'm certain we do.

"Do you have an off-the-shelf Helm chart for getting this whole setup?" I think we do. There's tons of work happening in our integrations crew, and we are hiring like crazy for the integrations crew, where all of this is being made more seamless. Internally we use Tanka, which is Jsonnet that is then compiled into Helm charts and other formats, and which is also able to ingest Helm charts. That means you don't have the common problem of super-static, brittle Helm charts which are hard to change and hard to track, particularly if you have both upstream changes and your own local changes, where you functionally need to fork pretty much everything and carry your own forks if you need anything more than really baseline changes. I suggest you look at Tanka and Jsonnet (I can drop the URL into the chat in a bit), which is a lot more malleable and also allows you to define other things, like alerts and such, and you have it all in one language, Jsonnet, which is quite nice.

"How to integrate apps to send metrics or emit data to Grafana Cloud?" It depends on the type of, well, okay, you said metrics, and then data. For metrics, the Prometheus client libraries are the gold standard for emitting metrics as of today. For data defined as traces, OpenTelemetry tracing is the gold standard. For logs, it doesn't really matter, because logs are historically kind of a mess, as most of you will probably agree, and Promtail can ingest pretty much everything and hammer it into shape for Loki to consume. Again, all of this is built into the Grafana Agent. But for your own applications, when you need to emit the actual raw data from your own code: for metrics, the Prometheus client libraries; for traces, OpenTelemetry tracing; and for logs it doesn't really matter, because Promtail eats it all.

"How does correlation happen between Loki logs and Tempo traces?" Going from your logs to your traces, the ideal case is that you have an exemplar on your logs, so you know the ID for that trace, or that span, or both. Exemplars support free-form text, so, as per the W3C Trace Context standard, we support both span and trace IDs. That modeling also comes in large part from how Google did it internally (a lot of this has its history there), so it tends to work nicely together. You just toss it in, and once Loki is aware that yes, this is an exemplar, you can jump straight to your trace storage. There's also an inverse index where you can look up trace IDs or exemplars: if you have one and you need to see that one log line, you can go the other way, which is of particular interest if you came to your trace or span through a search within Tempo; then that exemplar is like the shortcut back into your logs or metrics.
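On the Grafana side, that logs-to-traces jump is wired up on the Loki data source via derived fields. Here's a provisioning sketch; the URL, regex, and data source UID are assumptions for illustration:

    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        url: http://loki:3100
        jsonData:
          derivedFields:
            # Pull a trace ID out of each log line and link it to Tempo
            - name: TraceID
              matcherRegex: 'traceID=(\w+)'
              url: '$${__value.raw}'
              datasourceUid: tempo

With that in place, matching log lines in Explore get a link that jumps straight into the trace view.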
"Should Kubernetes applications or services be designed in any particular way to use these tools? What is a good starting point to integrate these tools with custom Kubernetes services running in a cluster?" Great question, and the answer is basically: not at all. For Prometheus and the others it's super simple. Prometheus (I touched on this but didn't go in depth) has a thing called service discovery, which is an interface through which Prometheus understands how other services run their things. First and foremost Kubernetes, but there are also things like file-based discovery, where you just write YAML and populate your service discovery that way. For anyone more on the networking side, zone transfers are possible: you have your BIND or Unbound or whatever DNS server allow zone transfers by Prometheus, and it ingests the complete zone and just starts monitoring, or scraping, everything defined in that zone. And the same is the case for Kubernetes: you literally point your Prometheus at your Kubernetes, you tell your Kubernetes that yes, this thing may get the data, and Prometheus automatically gets all the data from that Kubernetes cluster and from the pods. The services' internals might be different depending on your precise setup, maybe with a sidecar in between, the usual, but for the pods themselves and such, all of that is emitted automatically, which is super nice: it's literally one thing to set up (a minimal sketch follows below), and automatically you have all that data in your local Prometheus. If you don't want local storage, or you have issues with state, which was the reason we created the Prometheus Operator ages ago, to handle state within Kubernetes, you can also just run the Grafana Agent or the Prometheus Agent and shove all that data into, for example, Grafana Cloud or one of the other Prometheus-compatible offerings.
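Here's a minimal scrape config for the Kubernetes case (a common pattern rather than an officially blessed snippet): discover every pod in the cluster and keep only the ones annotated for scraping.

    scrape_configs:
      - job_name: "kubernetes-pods"
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Keep only pods annotated with prometheus.io/scrape: "true"
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"

Prometheus watches the Kubernetes API, so pods enter and leave the scrape targets as they are scheduled, with no config reloads needed.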
Speaking of Prometheus compatibility: also on the Prometheus blog, again prometheus.io/blog, and with my Prometheus hat on, we did start a Prometheus compliance, or rather conformance, program. If you are compliant with the relevant APIs and interfaces, you get certified as Prometheus-compatible, which means users actually know that a thing is Prometheus-compatible and can just use it without fear of something breaking. Prometheus, Cortex, and Grafana Cloud are Prometheus-compatible.

"Do you have any best-practice blueprints for a self-managed Grafana / Prometheus / Loki setup? Any best practices to optimize performance?" It depends a little bit on your scale. If you're working at a huge company, or you run a team with who-knows-how-many users, this is less applicable; but if you have a normal-sized amount of data, it's pretty easy, because you just start a Prometheus, or a Cortex, or a Thanos. Cortex and Prometheus have single-binary modes where you just start the binary and you're done; in that case I would recommend Prometheus myself if you're getting started. Loki also has a single-binary mode, and Tempo as well, so you just start those binaries and you can start ingesting data into those systems. For Prometheus, I would suggest the documentation on prometheus.io; for Grafana Cloud, Loki, and Tempo, the documentation on grafana.com. Those are the best ones. DigitalOcean also has quite a few super nice Prometheus tutorials; they're I think four years old, but they're super nicely written. We're also extending the tutorials section on prometheus.io.

"Does Prometheus integrate with tools like Istio?" I think I know the answer, but I don't want to give a wrong answer, so I can follow up: shoot me an email or something, and I'll get you the authoritative answer from Joe before I say something wrong. Thirteen more minutes and no questions; this is your chance. Any other questions? Have we stalled out? Seems like it.

Do you want to include a Slack channel or something in the chat, Richard, just for any follow-up questions, anything like that?

Yeah, we have, I mean, we have to split this between Cortex and Prometheus; you have the CNCF Slack. Let me put ours in.

And I'll add online programs, and then if anybody has any other questions, you can hit each other up there. Okay, well, if there are no other questions, I want to thank you, Richard, and thank everyone for hanging in there with us as we got things started. A little bit of a rough start, but I think this was a great one, we got tons of great questions, and let's keep those conversations rolling. Thank you again, and the recordings will be up a little later this afternoon.

Thanks for having me. Thanks, everyone.

Thanks so much, everyone.