Hello everyone! Glad to be here at Open Source Summit 2021, and today I'd like to talk to you about something quite relevant for this forum: open source for better observability. I'm going to talk about what observability is, about the three pillars, the signals: logs, metrics, traces and so on. We'll talk about the role of open source and its successes and challenges in this domain. Then I'll cover the leading open source tools for logs, metrics and traces, and to wrap up I'll discuss OpenTelemetry and the unified vision for observability.

A word about myself: my name is Dotan Horovits. I'm a developer advocate at Logz.io. Logz.io is a SaaS platform for cloud native observability that's based on popular open source tools such as Prometheus, Elasticsearch, Kibana and Jaeger, some of which we'll talk about in this conversation. I'm also an advocate of open source and communities in general, and the CNCF in particular. I co-organize a local CNCF chapter in Tel Aviv, so if you're around Tel Aviv do join one of our monthly meetups. I also run a podcast called OpenObservability Talks, so check it out. In general you can find me everywhere at @horovits, H-O-R-O-V-I-T-S, so if you tweet something interesting out of this talk, do tag me and I'll be happy to share.

Let's start with a quick question: what do you think this is? No, that's not coronavirus imaging. Here's a hint: these are the microservice architecture diagrams for Amazon and Netflix. Now imagine what it would take to monitor something like that. And indeed, monitoring cloud native systems is hard. It's hard because we're talking about highly distributed applications spanning tens and hundreds of nodes, services and instances; systems that are very dynamic and ephemeral, spinning up and down and scaling in and out. We have additional layers and dimensions: it's not just bare metal and an operating system. You have the guest OS and host OS and container and pod and namespace and version and deployment and the Kubernetes control plane and so on. And it's not just more layers; it's also additional dimensions by which we monitor our system, so it's a high cardinality system. On top of all that, any typical system these days uses many third party frameworks, whether open source or cloud services, for its SQL database and NoSQL database and API gateway and message broker and HTTP proxy, you name it. And while we didn't write them, we still need to monitor all of these. So this is the landscape for monitoring today's systems.

And indeed, when you go and ask people what difficulties they encounter running in production, this problem comes up. This is, for example, from the recent DevOps Pulse from last year, the yearly DevOps survey that we run at Logz.io, where we asked specifically about running Kubernetes in production. The number one issue that kept coming up was monitoring and troubleshooting. So the problem is definitely there, and it's hard. And the way to deal with it is with observability. In fact, the very definition of cloud native systems talks about systems that are observable; you can see here the definition by the CNCF, the Cloud Native Computing Foundation. But what is observability anyway? The formal definition, taken from control theory, talks about the measure of how well internal states of a system can be inferred from knowledge of its external outputs.
And in software systems, these external outputs are essentially the telemetry that our systems emit, namely the metrics, the traces and the logs. People often refer to them as the three pillars of observability. By the way, these are not the only ones; there are other signals people talk about, like profiling and others, but these are the three main signals for observability. One important point from the definition: saying that a system needs to be observable makes it very clear that observability is a property of the system. It's not something you bolt on in the aftermath; it should be there from day one, as part of the system design, as a first class citizen. That's important to note, and we'll talk about it more later.

But let's go back to observability in plain English. A much more useful definition that I like for observability in software systems is the capability to allow a human to ask and answer questions, which essentially means that the more questions we can ask and answer about the system, the more observable it is. The reason I like this definition much better is that it makes it very clear that observability is, in essence, a data analysis problem. We have lots of data of many types and from many sources. We talked about different telemetry types such as metrics, traces, logs, profiling and others, but there are also different sources: you have your frontend, maybe Node.js or Python code; you have your backend, say Java or C# or something else; and you have the third parties you use in your system, the SQL database and the Kafka and the Redis and whatever. And what we need is essentially the fusion of all this telemetry, across types and across sources, which is what actually gives us observability. That's a very essential point that people tend to overlook, so it's something you'll probably hear me say more than once.

Now that we understand what observability is and why we need it, let's look at the three pillars and see how they tell us the what, the why and the where about our system. Then we'll talk about the leading open source tools around them.

Let's start with metrics. Metrics are typically the entry point, because metrics tell us the what: what happened, and they help us detect issues. For example, they could tell us that a service is down, or that a certain endpoint is suddenly very slow to respond. Metrics are essentially numerical data points, be they counters, gauges or others, which are of course very efficient to collect, process, store and aggregate. On the other hand, they lack context, more information around what happened. Now, the way it works, which is also what we'd be looking for in monitoring tools, as we'll see later, is that systems emit metrics, and then there's a backend that collects these metrics, whether in pull or push mode, calculates all sorts of aggregations (averages, P99, P95 and so on), stores them in a time series database, and exposes a query language for time series data, so that we can draw these cool graphs that you can see here on the screen. And typically, metrics are combined with alerting on events, so we can define things that are exceptional in our system. For example, if a user logs in three or more times in a row within five minutes, alert me on that. Or if an endpoint takes longer than 20 milliseconds to respond over a certain period of time, alert me on that. So we can define the events on which we want to be alerted. You can have alerts on logs and other telemetry types as well, but for metrics, that's the classic alerting.
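To make the aggregation part concrete, here's a minimal sketch in Python of the kind of computation a metrics backend precomputes over raw data points; the latency samples are, of course, made up for illustration:

```python
import statistics

# Made-up latency samples (in ms) collected over some scrape window.
latency_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 300, 14, 13]

avg = statistics.mean(latency_ms)

# statistics.quantiles returns n-1 cut points; the 95th and 99th
# cut points approximate the P95 and P99 latencies.
cuts = statistics.quantiles(latency_ms, n=100)
p95, p99 = cuts[94], cuts[98]

print(f"avg={avg:.1f}ms  p95={p95:.1f}ms  p99={p99:.1f}ms")
```

Notice how the average hides the outliers that P99 exposes; that gap is exactly why backends precompute percentiles rather than just averages, and why alerting rules are typically defined over these aggregations.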
So metrics tell us the what, and then logs tell us the why; they help us diagnose the issue. And that is thanks to the fact that logs are very rich in context: essentially, the developer who wrote that piece of code writes out what's going on in it. So it's very verbose, very rich in information. On the other hand, it's textual and very verbose, which means it takes up more storage space and is more costly in terms of storage. So in a way, it's the opposite of metrics: it has context, but it's less efficient to store. It requires parsing the incoming logs, and you need full text indexing, with storage and a query language that support ad hoc queries by any field that might appear in the logs. That's how it works, and again, that's what we'd be looking for in tooling around log analytics and management.

So metrics tell us the what, logs tell us the why, and then traces tell us the where. That's the new kid on the block, and the question "where" is actually rather new; we were fine with logs and metrics for a long time. But today, with microservice architectures, every request coming into our system flows through a chain of interacting microservices, and then the questions arise: where did the error occur? Where did the performance degradation occur? Where's the critical path? These are the sorts of questions that distributed tracing comes to answer, helping us isolate issues and improve the performance of our system. The way it works is that each call in our system, each service call in the chain, creates and emits a span for that service and operation. You can think of spans as structured logs that include context such as the start time, the duration, the parent span and so on. This context propagates between the calls in the chain through our system, and then there's a backend that collects these spans that my app emits, reconstructs the trace according to causality, and visualizes it for analysis. Typically, what you can see here on the screen is the famous timeline view, or Gantt chart, that shows us how service A calls service B and so on, the sequence of calls. That's distributed tracing; see the sketch below for what these span records look like.
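Since spans are essentially structured logs with causal context, here's a minimal sketch of what the backend works with and how it reconstructs the trace; the field names and values are illustrative, not any specific tracing wire format:

```python
# Illustrative span records, as if emitted by three services in a chain.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None,
     "service": "frontend", "operation": "GET /checkout",
     "start_us": 0, "duration_us": 5400},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a",
     "service": "cart", "operation": "GetCart",
     "start_us": 200, "duration_us": 1800},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a",
     "service": "payment", "operation": "Charge",
     "start_us": 2100, "duration_us": 3100},
]

# Reconstruct the trace by causality: index spans by their parent,
# then walk the tree from the root, ordered by start time.
children = {}
for span in spans:
    children.setdefault(span["parent_id"], []).append(span)

def print_trace(parent_id=None, depth=0):
    for span in sorted(children.get(parent_id, []), key=lambda s: s["start_us"]):
        print("  " * depth + f"{span['service']}:{span['operation']}"
              f" ({span['duration_us']}us)")
        print_trace(span["span_id"], depth + 1)

print_trace()
# frontend:GET /checkout (5400us)
#   cart:GetCart (1800us)
#   payment:Charge (3100us)
```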
So just to sum up: we saw that observability is essentially the ability to ask and answer questions about our system, and that the three pillars, metrics, logs and traces, tell us the what, the why and the where, respectively.

Now that we understand what observability is, let's see the role of open source in observability. After all, we're at Open Source Summit; that's what we're here for. And let's see which open source projects lead the domain. But before we get to the specific projects, I'd like to share with you three important insights about the role of open source in observability.

The first insight is a very happy one: open source is today the preferred path; you might even say it's the new norm. You can see several analyst firms here on the screen indicating that 60% of organizations use open source monitoring tools, that by 2025 70% will use open source instrumentation, and so on. So it's definitely happening: organizations choose open source, and the most commonly adopted tools are open source. That was actually the number one conclusion from the CNCF End User Technology Radar on observability: if you look at all the tools in use, not just the open source ones, the top three are still open source tools. And that's not surprising: open source is key to observability, because the same open source community that created Kubernetes and cloud native has delivered the open source tools and standards to monitor them. And it's very important that it's not just open source tools; open standards and open formats play an important role, things such as OpenMetrics, OpenTracing and OpenTelemetry that are emerging to converge the industry and prevent vendor lock-in. We'll talk about these as well. So that's insight number one.

Insight number two is less favorable, and it's that there is no consolidation in the observability space. In the same CNCF End User Technology Radar that I mentioned, it was very clear that half of the companies are using five or more tools; it's astonishing. A third of them had experience with ten or more. So this is a serious problem of tool sprawl, essentially. And it's a challenge not just for operating and managing many tools, as in other fields, but also for observability itself. Because, as we said before, observability is a data analytics problem, and many tools create many data silos. We then find ourselves very limited in our ability to ask and answer questions when they require correlation across different tools and different data silos. So it's a serious problem for the observability space.

The last insight that I'd like to share with you is the re-licensing of popular open source projects, which is changing the landscape. In the past year alone, we've witnessed several re-licensing moves in which leading open source projects moved to a more restrictive license: a copyleft license such as AGPL, or even a non-open-source, non-OSI-compliant license such as SSPL. Typically this happens with a project controlled by a vendor, not by a foundation; we don't see it under the Linux Foundation, but when a vendor is involved, it tends to happen. For the end user, it means the source code may very well still be available and accessible, but you're restricted in your usage or modification of the open source, or you may even need to open-source your own code in some cases. This, of course, pushes some users to look for alternatives: for example, other open source projects under the Linux Foundation, which can't consume these restrictive licenses, or even commercial companies such as Google, in the example here on the screen, which bans use of AGPL although it's an open source license; Google's open source guidance says very explicitly about AGPL that the risks heavily outweigh the benefits. So the re-licensing moves are definitely changing the landscape. Most of them, certainly the ones from the past year, are very fresh, and the industry and the community are still trying to process what's going on. But it's definitely something that may change the landscape, and by next year's talk I may present a different landscape to you on this same topic.

So just to summarize the three insights: we see that open source is key to observability, which is very good news. On the other hand, we see a challenge with tool sprawl and the resulting issues with data silos.
And we see the re-licensing of open source projects that is changing the landscape.

Now that we understand observability and we've seen the role of open source, let's look at the leading open source tools. Funny enough, many open source projects around observability are called Open Something, Open X, which causes quite a bit of confusion; I even wrote a quick reference guide to address this confusion. I'd like to go over the open source projects according to the pillars of observability, the signals, and cover at least the leading ones.

We'll start, as before, with metrics. The leading open source tool for metrics monitoring is by far Prometheus, especially for cloud native systems. Prometheus is a project under the CNCF; it's in fact the second project to have graduated, after Kubernetes, so it's quite mature. Prometheus provides quite a bit of the functionality we talked about before for metrics monitoring tools. First, it provides service auto-discovery, which means it can detect the different services and components in your system. As we saw before, today's systems have so many of these, with microservice architectures and many third parties involved, that being able to automatically discover these "targets," in Prometheus terms, is an exceptional benefit. Then Prometheus performs metrics scraping: it pulls metrics off of these targets, and pull mode is much more friendly at this scale of distributed systems. Prometheus also has a time series database for storing the time series, and it exposes a query language called PromQL for querying the time series data. So as you can see, it's quite comprehensive. And being a sibling of Kubernetes under the CNCF, it offers native integration with Kubernetes, so anyone running on Kubernetes has a very seamless, plug-and-play experience with Prometheus.

But it's more than just Kubernetes: Prometheus offers a vast ecosystem, both within the CNCF and among other projects and tools that support Prometheus. In fact, a key factor here is OpenMetrics, another open source project under the CNCF that has spun out of Prometheus. OpenMetrics is the exposition format, the format for transmitting metrics off of systems, that has become very popular. As I said, it's the standard under the CNCF, and it's based on the Prometheus format, so it's pretty mature; it's also now proposed as a formal standard under the IETF. But the most important thing is that, de facto, it's been adopted by many, many common tools and frameworks that expose metrics out of the box using this format. So whether you use Kafka or RabbitMQ or MongoDB or MySQL or Apache or Nginx or Jenkins or GitHub, or even cloud services from AWS, Azure and others, you're more than likely to find it out of the box; typically you just need to go to /metrics and you can see it, or maybe turn it on in the configuration. And this large ecosystem is key, as we said before, to dealing with systems as diverse and dynamic as we see today. See the sketch below for how little it takes to expose metrics this way from your own code.
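As a concrete example, here's a minimal sketch using the official Prometheus Python client; the metric and endpoint names are hypothetical:

```python
# pip install prometheus-client
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical application metrics: a counter and a latency histogram.
REQUESTS = Counter("app_requests_total", "Total requests served", ["endpoint"])
LATENCY = Histogram("app_request_seconds", "Request latency", ["endpoint"])

# Serve the metrics in Prometheus exposition format on
# http://localhost:8000/metrics, ready for Prometheus to scrape (pull)
# once it discovers this target.
start_http_server(8000)

while True:
    with LATENCY.labels(endpoint="/checkout").time():
        time.sleep(random.uniform(0.01, 0.1))  # simulated work
    REQUESTS.labels(endpoint="/checkout").inc()
```

A PromQL query such as rate(app_requests_total[5m]) would then chart the request rate over the stored time series, and a visualization panel can be built on top of exactly that query.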
Another open source project that is important to note in this context is Grafana. Grafana is not under the CNCF; it's run by Grafana Labs. Grafana is a visualization tool that has very good integration with Prometheus. It can, by the way, work with other data sources, not just Prometheus, but it's very popular in combination with Prometheus: Prometheus plus Grafana is a very common pairing, offering very powerful visualization of the Prometheus data. One important update from this year, from April: Grafana has been re-licensed by Grafana Labs from the Apache 2.0 license to AGPLv3, a copyleft type of license, which ties into the re-licensing moves we talked about. AGPL is a copyleft, "infectious" license in a way, so it's something that may change the landscape. It's still rather fresh, with many discussions around it, but definitely something to be aware of in the open source landscape.

So we talked about metrics; now let's talk about logs. The open source stack that has been most popular for log analytics and log management for many years is what's known as the ELK Stack, or Elastic Stack, which stands for Elasticsearch, Logstash and Kibana. Elasticsearch is the core: it's the distributed textual data store with full text indexing. It's based on the Apache Lucene text search engine Java library, and it also exposes the Lucene query syntax. Then you have Kibana, which is the visualization tool that goes along with Elasticsearch and allows for creating dashboards, ad hoc querying with Lucene and so on; see the sketch below for what such an ad hoc query looks like. And Logstash is there for parsing the incoming data and ingesting it into Elasticsearch.

An important update about that, from February: both Elasticsearch and Kibana have been re-licensed from Apache 2.0 to SSPL, a non-open-source license, by Elastic, the company behind the projects. This again stirred up a lot of discussion in the community, even more than what we saw before with Grafana, because this is a non-open-source license. Several initiatives have started; the most significant one is OpenSearch, a new project that has spun out of Elasticsearch and Kibana, essentially a fork of Elasticsearch and Kibana meant to keep them open source under the Apache 2.0 license and keep them run by a community. There are several companies backing it; the most significant one is AWS, which has put a lot of effort into this open source project and also contributed its Open Distro for Elasticsearch open source plugins, which are now being converted into plugins for OpenSearch. Again, it's a fairly young project; it started, obviously, right after Elasticsearch's re-licensing move, and it went GA less than half a year ago, so it's fairly young and has yet to prove that a community builds up around it, but it's definitely looking promising. And I'm glad to be part of this project: my company Logz.io sees a lot of importance in promoting it and keeping Elasticsearch and Kibana open source, essentially.
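To illustrate that ad hoc querying, here's a minimal sketch using the elasticsearch Python client (assuming a recent 8.x client and a hypothetical app-logs index); the Lucene query string syntax is the same one Kibana's search bar exposes:

```python
# pip install elasticsearch
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Ad hoc query over arbitrary fields of the parsed logs, using Lucene
# query string syntax. Index name and field names are hypothetical.
resp = es.search(
    index="app-logs-*",
    query={"query_string": {"query": "level:ERROR AND service:checkout"}},
    size=10,
)

for hit in resp["hits"]["hits"]:
    print(hit["_source"].get("message"))
```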
Another open source project that is relevant in this context and is quite popular is Loki. Loki is run by Grafana Labs, same as Grafana and other projects, and Loki is a bit different: it doesn't parse the incoming logs or do full text indexing like most other log analytics solutions. Instead, it indexes and groups log streams using the same labels that are already used for Prometheus. Because it's designed to work with Grafana and integrate with Prometheus, the emphasis is on easy correlation with the metrics data coming from Prometheus. So it's obviously more efficient to scale and more tuned for Prometheus users, but more limited in its text search: you can't arbitrarily query any field within your log data. That's about Loki, and as we said for Grafana, the same update applies: Grafana Labs has re-licensed Loki as well, from Apache 2.0 to AGPLv3. So again, a fairly fresh update that may impact the landscape.

And now to traces, the third signal. Here the most dominant open source tool is Jaeger. Jaeger is a project under the CNCF; it's a graduated project, quite mature, with a few dozen organizations using it in production. The full Jaeger stack offers tracer SDK libraries, the Jaeger collector, a backend analytics tool and a UI for querying, visualizing and analyzing the data. In production, it typically works with some backend data store, which could be Elasticsearch or Cassandra or something else, but that's the typical setup. We'll talk later about OpenTelemetry, which may take some of these pieces, the tracers and the collector, off of Jaeger, but at least for now Jaeger offers the whole suite of capabilities as part of the Jaeger project. Another open source project worth mentioning here is Zipkin; it's a bit older, more veteran, but still common in organizations. There are also others, like SkyWalking and Tempo, but Jaeger is definitely the leading open source project in the tracing domain.

I'd like to pause here and make an important note: merely having metrics, logs and traces does not guarantee observability. Many people think that's all it takes, and it's definitely not. As we said before, observability is a property of the system; it's not something you bolt on in the aftermath; it should be there as a first class citizen. So I'd like to go over three common mistakes that I keep seeing, and how to correct them to make our systems observable.

The first one is using unstructured, ad hoc logging: plain text, where essentially the developer who wrote that piece of code, the business logic, just threw out a string containing all the information that he or she knows about what's going on in the business logic. It may very well be understood by that developer, and it may be understood by peer developers reading this log. But as we said, observability is a data analytics problem: we need to work at scale, and we need tooling to analyze, correlate and cross-reference. We can't rely on humans reading massive amounts of data line by line and analyzing it. For that, we need to make sure our logs are structured: we need to move away from plain text to some JSON format, something machine readable, with each and every piece of information in its own designated field, and with more metadata, like request or transaction identifiers, to correlate traces to logs; see the sketch below. We also need to prioritize the mass of information and index the valuable data. For metrics, by the way, there's a proposed correlation as well: we'd like to include, for example, exemplars with metrics, to provide more context and enable drill-down from metrics to traces. So this is the first best practice I would recommend.
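Here's a minimal sketch of that fix using only Python's standard library; the context fields such as trace_id are illustrative:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object, each piece of information
    in its own field, instead of a free-form string."""

    # Illustrative correlation fields that might be passed via `extra`.
    CONTEXT_FIELDS = ("trace_id", "request_id", "user_id")

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The trace_id lets the log pipeline correlate this line with a trace.
logger.info("payment authorized",
            extra={"trace_id": "4bf92f35", "request_id": "r-1138"})
```

Every field is now queryable on its own, which is exactly what log analytics tooling needs in order to correlate at scale.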
The second thing I see is that even in organizations that have metrics, logs and traces, the mindset is still the classic sysadmin mindset, with a relatively reactive monitoring-and-maintenance mode of work. Again, for a data analytics problem we need a data analyst's mindset: proactively querying and getting the right insights from the system. And the third problem is that people have metrics, logs and traces, but each one in its own silo. As we said before, these silos create broken troubleshooting workflows and the tool sprawl issue we talked about. If we want to have observability, we need the fusion of all these signals, and that is what gives us true observability.

So we understand that we have the tools to analyze the different signals, and we understand that we need this fusion of all the signals. But how do we actually go about generating and capturing this telemetry? The vision is nice, but the practice is that each programming language has multiple libraries for logging, for metrics and for traces. We saw that organizations use five tools or more, and each of these tools and vendors has its proprietary agents for instrumentation and collection: you need the client libraries for Jaeger and for Datadog and for Splunk and for Zipkin and so on. So how do we go about realizing this vision of capturing telemetry data across all the signals?

This is where the OpenTelemetry project comes into the picture. It's a project under the CNCF; in fact, it's the result of a merge of OpenTracing and OpenCensus, so if you use either of those, this is the future path for those projects. OpenTelemetry is a framework for generating and capturing telemetry data, especially from cloud native systems, across traces, logs and metrics. And OpenTelemetry, nicknamed OTel by the way, has been adopted by all the major vendors, all the monitoring tools, all the cloud providers, so it's definitely converging the industry around it.

So what does OTel provide us? OpenTelemetry provides a unified set of vendor-agnostic APIs, SDKs and tools for generating and collecting telemetry data from your systems, and then exporting it to any backend tool of your choice. You want to work with Prometheus, with Jaeger, with Zipkin or any other? Just export to whichever tool you want. Essentially, for instrumenting your application, OpenTelemetry offers a single API and SDK per language, based on a unified specification across the languages. That's how the application generates and emits the telemetry data: logs, traces, metrics; see the sketch below for what this looks like in code. Then, for collecting the telemetry, OTel provides a collector component that can receive multiple protocols, processes and aggregates the data, does all sorts of on-the-fly calculations, optimizations, sampling and so on, and then exports it to your backend of choice. Another important element that OpenTelemetry offers is OTLP, the OTel protocol for transmitting telemetry, which aims to create a unified protocol and format across the different signals. It's important to note that the collector can receive multiple formats, not just OTLP: if you have the Prometheus format, the Jaeger format, the Zipkin format, whatever, it can receive these. Nonetheless, OpenTelemetry as a project does also offer OTLP as a unified protocol.
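To give a feel for the instrumentation side, here's a minimal sketch of manual tracing with the OpenTelemetry Python SDK, using the console exporter for simplicity; in a real setup you'd typically swap in an OTLP exporter pointed at a collector, and the service and attribute names here are made up:

```python
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire the vendor-agnostic API to the SDK; spans are exported to the
# console here, but an OTLP exporter to a collector is the typical setup.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # name is illustrative

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "o-12345")  # illustrative attribute
    with tracer.start_as_current_span("charge_card"):
        pass  # the child span picks up the parent context automatically
```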
Another important note I'd like to make is that OpenTelemetry is not a single, monolithic project. In fact, it's made up of multiple projects and working groups that work on the different vectors of this humongous and very ambitious project. So, for example, the specification for distributed tracing is already GA, but the specifications for metrics and logs are still pre-GA. And if you look at the SDKs, you may find that the Java SDK is already GA while another language's is not. So it's important, when you go about starting your journey with OpenTelemetry, to first understand which stack you're interested in, which programming languages and which signals your systems emit, and what exactly you need out of OpenTelemetry, and then check the status of the specific components you actually need. I wrote a basic guide to OpenTelemetry; you have the URL here on the screen. If you're taking your first steps with OpenTelemetry, I hope this guide will give you an overview of the different components and the state of things, and obviously the links to the up-to-date statuses and answers, so you can drill down into the specific pieces you're interested in. So if you find it useful, definitely go to logz.io/learn/opentelemetry-guide, and give me feedback if you're missing something; I'll be glad to enhance it.

So that's OpenTelemetry, and now I'd like to summarize what we've seen. Modern systems are hard to monitor, and we need observability; we understand that very well by now. Open source is, gladly, the new norm in observability: we see open source tools such as Prometheus, Elasticsearch, Jaeger and others that have become de facto standards in the industry. We also saw the re-licensing of popular open source tools that is changing the landscape, so things may very well change in the coming year based on that. And we saw that there's a severe tool sprawl problem: companies are running multiple tools, which tends to create data silos and broken workflows, essentially. What we need is to bring all of that telemetry data to one place, take down the silos, and be able to effectively slice and dice and correlate the data and gain insight into our system, which is, of course, the essence of observability. We also saw that open standards such as OpenTelemetry and OpenMetrics are converging the industry, helping prevent vendor lock-in and bringing us a step closer to a unified observability vision. We've seen that especially on the side of generating, collecting and emitting the data, with the OpenTelemetry project, and I personally expect that we'll be seeing more such efforts for unified observability, addressing the parts beyond generating and collecting data: data storage, querying, correlation, exemplars and the like. So the future looks bright for the open source movement in observability.

And with that, I'd like to thank you very much for listening. If you have any feedback or any questions, feel free to catch me in the Q&A or the office hours, or just reach out to me at @horovits, H-O-R-O-V-I-T-S. I'd be more than happy to follow up, answer questions, and hear your thoughts on the topic. I'm Dotan Horovits, and thank you very much for listening.