All right. I'd like to thank everyone for joining us today and welcome you to today's CNCF webinar, Managing Observability of Modern Applications. My name is Matt Baldwin. I'm the director of cloud native and Kubernetes software engineering here at NetApp, and I'm also your cloud native ambassador today. I'm going to be moderating today's webinar. I'd like to welcome today's presenter, Ran Ribenzaft, the chief technology officer at Epsagon. Before we get going, I have a few housekeeping items to go over. During the webinar, you're not able to talk as an attendee. There is a Q&A box at the bottom of the screen. Feel free to drop any questions you have into that box and we will cover those either at the end or in the middle of the presentation. We'll try to get through as many as we can, but we do want to stop at the top of the hour, so feel free to drop them in as the presentation is going and I will moderate at the end. This is an official webinar of the CNCF and as such is subject to the CNCF's code of conduct, so please don't add anything to the chat or questions that would be in violation of the code of conduct. Basically, be respectful of your fellow participants and the presenter. The recording and slides will be posted later today on the CNCF webinar page; you can get to that at cncf.io/webinars. With that, I'd like to hand it over to Ran to kick off today's presentation.

Perfect, Matt. Thank you very much for that intro. Again, thank you everyone for joining. I really appreciate it. I'm going to talk today about managing observability. I'll cover several subjects, and we're going to review the agenda soon, but feel free to add questions in the Q&A along the way. I would love to answer them as soon as possible rather than wait for the end. Just a few words about myself: currently, I'm the CTO at Epsagon.
I'm also one of the co-founders, and I'll say just a few words about what Epsagon is. I'm also an AWS Serverless Hero, which means that practically I can say a lot of things about serverless and you're going to believe me. But that's not the main focus of this presentation. In the image, you can see me looking for whales in Hawaii; unfortunately, I couldn't find any. You're more than welcome to follow me on Twitter. I talk a lot about serverless, cloud native, and things like that. Just once and for all: I'm part of Epsagon. We're building an automated and agentless observability tool for microservices. I'm going to cover most of these topics during the session, but if you're interested in learning more, you're more than welcome to check out the website.

Now that we've got that out of the way, let's talk about what's in today's session. I'm going to talk about the whole flow of observability: how it started with the initial steps of monitoring and logging, and then how, all of a sudden, we're constantly hearing about observability. Within that, we've got a burning topic, distributed tracing: what it is, why it works, and so many things to discuss. So let's kick it off.

Let's talk first about why we monitor. It's pretty standard; it makes sense why we need to do it. Most people will say: because we need to make sure our application works. But that's not the bottom line of why we monitor, and when we understand that, it impacts all of our monitoring decisions. We do it to make sure our business is working properly. Ultimately, in most cases, and I'd guess for most of the attendees here, your application represents your business. It's not just "works or doesn't work", good or bad. It's mostly about how your business performs, because the way your application performs is the way your business is working.
Now, when we ask what we should monitor, there are four golden signals that have become a de facto standard. There is a good piece by Google called the SRE Book (I'd love to share the link afterwards if you want) that covers the top golden signals for knowing that your service is working well in a distributed setting. First, latency: how fast does this service reply? If it's something customer-facing, it's super important to reply fast, or at least within our SLAs. Traffic: how much traffic are we handling? Whether we're serving a single customer or thousands of customers is something we need to take into consideration. Also errors: obviously we don't want any errors, but every application has some, so it's crucial to monitor them and watch the trends. And last but not least, saturation: how full is our system? Do we have a lot of headroom for more traffic, or are we on the edge and about to run out? These are the golden signals. We're going to talk about tons of different things to monitor, but this is the baseline that we monitor in every service.

In the old-school kind of monitoring, we used to have a service like the one on the right side: something exposed through a REST API, some business logic, all deployed on the same server, which probably also had some database we communicated with. In order to monitor and gather the metrics we just talked about, the old-school approach was usually to get an agent, put it on your web server, and have it ship out metrics. These metrics are only host-level metrics. They're not application-level metrics, which means I can't see inside my application.
I can see the processes, the CPU, the memory, everything about the environment, but nothing about the internals, the core insides of my application. And it collects just metrics; it doesn't collect any payload or information. It will just tell me, hey, you had an error. But wait, what's the error? What should I do next? What's the message? So it's limited to collecting metrics, at least with the old-school kind of monitoring agents.

Once we have these agents, we need to take the next step. Let's say, as I mentioned before, we're getting some error and we need to troubleshoot the problem. In most cases, metrics won't help us troubleshoot; we need more debug data, and debug data is logs. Every time there is a problem, every engineer will dive into the logs to explore them and understand what's going on. So we need logging. And again, the old-school kind of logging is another agent placed on our server. So now we've got two of them, two friends talking to one another, consuming CPU and memory just to get me information. These logs could either be dumped locally on the same server, so I would need to SSH or RDP into that server to get more information, or, in the more reasonable and modern case, be shipped somewhere remote, like Elasticsearch, Splunk, or any other service that lets me ingest this kind of data. One thing to note: it collects only the log data. That sounds obvious, but it means that if you haven't logged information somewhere inside your code or your service, it's not there. It's not going to be logged. So it requires some manual work to make it useful; otherwise, it's simply not logged. Now that we've covered that, those were the pretty old kinds of solutions.
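Before moving on, the four golden signals mentioned earlier can be made concrete with a toy sketch. This is a hypothetical, in-memory tracker of my own invention, not any real monitoring library; real setups would export these as time-series metrics, but the idea is the same:

```python
import statistics

class GoldenSignals:
    """Toy in-memory tracker for the four golden signals of one service."""

    def __init__(self, capacity):
        self.durations = []       # latency: per-request durations in seconds
        self.requests = 0         # traffic: total requests seen
        self.errors = 0           # errors: failed requests
        self.capacity = capacity  # saturation: max requests we can hold in flight
        self.in_flight = 0

    def record(self, duration, ok=True):
        """Record one finished request."""
        self.requests += 1
        self.durations.append(duration)
        if not ok:
            self.errors += 1

    def snapshot(self):
        """Summarize the four signals at this moment."""
        return {
            "latency_p50": statistics.median(self.durations) if self.durations else 0.0,
            "traffic": self.requests,
            "error_rate": self.errors / self.requests if self.requests else 0.0,
            "saturation": self.in_flight / self.capacity,
        }

signals = GoldenSignals(capacity=100)
signals.record(0.120)
signals.record(0.340)
signals.record(2.500, ok=False)  # a slow, failed request
print(signals.snapshot())
```

In a real service, these values would be scraped or pushed to a metrics backend and alerted on, but the four fields map one-to-one onto the signals above.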
Let's fast forward to today. I think everyone, especially here in the CNCF, agrees that we've got a market that is rapidly growing both in cloud and microservices: private or public clouds, microservices with or without containers, with or without orchestrators. This is growing super fast, and this growth gives us a different perspective on how we should monitor and how we develop and operate our applications. I'm taking you from the left-hand side to the right-hand side. We used to be host-based; we owned the host. I still remember the days when I used to set up my own server somewhere, physically. We had a monolithic application that took care of the user interface, business logic, and data access layer, all running on the same server. For those scenarios, the agents running on the server were good, because they probably covered everything I needed. But then we moved towards more distributed applications. That introduced a lot of benefits: being able to scale only the services that matter, and letting every team develop its service independently. But it did introduce some problems. Now it's more about the communication between the services rather than the internal infrastructure. If we take another step into the future, even the host gets abstracted. In most cases today, when we think about cloud, we think about managed databases, managed message queues, managed web servers, and so on. And if we take one step further, we think about third-party APIs. For example, if I need to charge my user's credit card, in most cases I'm going to use a service that exists for that. The same applies if I need to send them an SMS or an alert. All of these services exist; I don't need to reinvent them.
Now, if we look at the third column, the one on the right, things change a bit, because all of a sudden I'm not always interested in CPU and memory. I'm more interested in the application level, and first and foremost, I'm mostly interested in the communication between these services. So let's talk about what challenges these environments bring and why we're talking about observability at all. If we look at this kind of landscape for engineering and ops, I think the top problem today is troubleshooting. How do you approach it? Do you look at the monitoring, at the metrics, at the logs? In which tool? How do you find the root cause? How do you correlate between these events? This gets more and more complex as the application grows. Monitoring itself is also a challenge: being able to observe such a large architecture, composed of lots of different resources, becomes difficult. It becomes very hard to understand flows from end to end and to visualize architectures; it's not just a single server, it can be hundreds of running services. And that impacts our development, especially development velocity, because now, to test my changes, I need to integrate with different services, check the logs, and look at the metrics. It becomes harder and harder to investigate.

Now, when we talk about observability, and this is probably the main part of these slides, the definition breaks down into three topics. Topic number one is metrics. Metrics are aggregatable: I can create a chart that shows me trends, spikes, any kind of metric information that helps me understand changes in my services, applications, infrastructure, and so on. Second comes logging. With metrics alone, I can't say for sure which user was affected or exactly what kind of problem I had.
Logging helps me do that. Logging will tell me: hey, this is the event you're looking at with the problem, and this is exactly what happened. The third pillar of observability, and today the most important one, is tracing, or distributed tracing, and we're going to drill into that in much more detail. Just as a quick note, and I'll also come back to this at the end: today you'll find different tools for different parts of observability. You probably have one solution that fits your metrics and alerting needs, and another solution where you aggregate all your logs, but they don't correlate with one another. Distributed tracing is still fairly early in the market, so in most cases it's yet another tool, or you don't have it at all, or you try to build it on your own. What happens is that you have three different tools, all used by engineering and DevOps, but they need to integrate and correlate with one another, and that becomes a pain when you're running a massive application in production.

Before we jump into tracing, I want to cover some things that I think are best practices in both monitoring and logging, because I could speak about either one at length, but I'd rather focus on the distributed tracing part. For monitoring, the first and foremost recommendation: aggregate all metrics into a unified dashboard. I can't tell you how painful it is when some metrics live in this dashboard and other metrics in that one, infrastructure metrics here and application-level metrics there. It just becomes a mess, and it can grow very easily, especially if you have tens or hundreds of engineers. Second, you need to define your own critical metrics. Don't just look at CPU and memory; look at what's really important for you, and that usually boils down to application-level metrics.
For example, how many 500 errors do I get? When you're getting so many errors every day, you become blind to them; it creates alert fatigue. So think about what should actually wake you up at 4 a.m. to troubleshoot, and for that, you need to define thresholds. A single error probably shouldn't wake me up, but if it's repetitive and some problem occurs over and over again, I want to troubleshoot that. Also, use custom business metrics. Don't settle for what comes out of the box, because there are business metrics that will help engineering, ops, and product teams evolve the application much faster, because they'll be data-driven. For example, say we're handling checkouts: let's track how many items there are in each checkout, so we can plot it and understand it. These don't necessarily have to be on the same dashboard, but they always come from your application.

Now, some examples of what to monitor at the application level. Again, I stress this because in modern applications, infrastructure becomes more and more of a commodity, and the application is the thing you really need to look at. For example, the average duration of an external HTTP call is something you need to monitor. Say I'm using a third-party API like Stripe for billing, and all of a sudden it becomes very slow. It doesn't only impact me as software; it can impact cost and performance, which means it probably impacts my customers as well. If it's an emailing or SMS service, it impacts my users too. So this is something you want to monitor. Another great example is the minimum number of calls to a message queue. Message queues are usually used in data pipelines.
In data pipelines, sometimes you look at the CPU and memory and everything looks good, and then you realize that for the last couple of hours no data was coming in, and nobody was monitoring how much data comes in. Being able to monitor that at the application level and ask how many messages I'm getting per minute in a queue helps me understand how my traffic and saturation are doing. And obviously the number of 500 or 400 errors, which in most cases would be thresholded, but at least you'll understand how many errors your users are getting.

In terms of logging best practices: we used to log just a single textual, unstructured line, like "now I'm calling the database with this specific query". The best way to do logs today is to print them in a structured way, as a JSON dictionary or whatever is natural in your programming language, and to include some metadata: which service am I, which stage is it running in, and other environment metadata that helps you understand what you're actually doing, whether it was an error, and what kind of error. Then, when you filter logs, all modern logging solutions let you index those fields and look for exact logs: for example, all logs that match production and this specific service where the log level was error, instead of searching for the exact line you wrote. Also, try to automate the logging process as much as possible using middlewares or other instrumentation methods. Make it as smooth as possible so that, for example, every call to your database is automatically logged, and you won't need to add extra annotations and lines after every call, because that makes the code uglier and less maintainable. And as I mentioned, in Elasticsearch and other services, index the fields you're actually using.
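The structured-logging advice above can be sketched with just the standard library. The field names here (service, env, and the fields key) are illustrative, my own choices rather than any standard schema:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so log backends can index fields."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "checkout",   # which service am I (illustrative)
            "env": "production",     # which stage is it running in (illustrative)
        }
        # Carry any structured fields passed via the `extra=` parameter.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One structured line instead of free text: every key becomes searchable.
logger.error("db call failed", extra={"fields": {"query": "SELECT 1", "error": "timeout"}})
```

With output like this, a backend such as Elasticsearch can answer "all error-level logs from the checkout service in production" by filtering indexed fields rather than grepping raw text.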
You can index the fields you're interested in to be able to ask more meaningful questions and plot charts that make more sense. Now, right before I dive into the third part, which is distributed tracing, I think something is still missing, even if we're doing the best monitoring and the best logging out there: how do we correlate between metrics and logs? Let me give you an example. Say it's 4 a.m. and the operations team tells me the database seems to be misbehaving; something is very slow or not working properly. As an engineer, or even as an architect, I know there are maybe six or seven services communicating with this database. Now I need to figure out, based on these metrics, say CPU going up or response time going up, where in the logs I'm going to find the relevant lines that will help me troubleshoot this problem. And that's not easy unless you've really worked on integrating these systems; it takes time to correlate between what you see in the metrics and where the logs for those specific metrics are. The second problem that's starting to come up, and this is exactly distributed tracing, is how to correlate metrics and logs between different services. If I have a service A that communicates with service B, say via a message queue or HTTP, how do I correlate between the logs printed in A and the logs printed in B? I want some correlation between them. So I think that's a super great segue into the distributed tracing part, which is the main part I'm going to discuss. I would really love to hear any questions you have, because it's going to be technical from now on. So, distributed tracing, just to get a better understanding of what we're talking about, in case you're not aware or not sure what it is.
Distributed tracing tells you the story of how a trace, or an event, propagates through your system. It's like looking at a workflow or a transaction, not a DB transaction, but a transaction in the sense of end to end through your distributed system. As you can see on the right side, we've got a client calling us; the request goes through the load balancer to the authentication service, the billing service, and some other resource, and then it comes back to the client. If I had to look at the logs of this specific request in five different places, it would be a nightmare, hard to follow. And that's just a very small example; this can grow ten times bigger. So that's what distributed tracing does. Distributed tracing has two main parts. The first part is generating the traces: producing the data about the trace and being able to correlate it as it moves between services. The second part is what to do with the traces once they're collected: the front end, the client, the UI that ingests this information and presents it in a meaningful way that helps me accomplish good observability. For both of these roles, generating traces and ingesting them, we have leading standards today. OpenTracing, which was the leading standard, merged with OpenCensus into OpenTelemetry, the new standard for tracing in modern environments. After that comes Jaeger, which helps us ingest and visualize the traces coming from OpenTracing or OpenTelemetry. We've also got OpenZipkin, which is outside the CNCF but still open source and pretty similar to Jaeger; you can compare them and see which fits your needs better. Let's talk about the first process: collecting the traces. The first step is instrumentation.
Instrumentation, in other words, is the method of collecting information about the calls I'm making. Every call I make, for example to an HTTP endpoint, to AWS SDK resources, to a Postgres database, or the calls coming into my Spring, Flask, or Express application, I want to instrument. It can be manual: for example, every time I get a request, I add "hey, I got a request to my Spring application". Or it can be automated instrumentation, which means I hook myself in, as a middleware or in some other way, so that every time there is an incoming request to any endpoint, I capture all the information. Once I'm instrumented, I can start working with the OpenTracing terminology, and these are the bolded words. First, I can create a span. A span is the basic unit in tracing: an event that has a start time and a duration. So I can talk about a span, an event, or an operation that happened in the code, and I need to create the span for the request and the response; I'll show an example soon. I also need to add some context to the span, because once I've started it, it's not enough to say I had a POST operation that started and ended in this time frame. I need more information, for example which URL I'm calling, or what the status code in the response was. Ultimately, I need to inject and extract IDs. Without that, I'm just doing better logging. I need to make sure this trace, or this span, propagates to different services. Say I'm calling some HTTP service, and this service is also mine, so it's service A calling service B. I'm going to inject into the HTTP headers an ID that says: hey, I'm part of this trace and this span.
So whoever is on the other end knows they're continuing the same trace. On the other end, in service B, for every incoming call I first need to extract and check: is there an ID telling me I'm part of an existing trace? If yes, continue that trace; otherwise, start a new trace, because nobody told me I'm part of something. Here is a quick example from OpenTracing using Python, handling incoming requests in your web framework; it could be Django, Flask, or any similar one. In the request handler, we instrument and start, or activate, our tracer. Then we extract the IDs, as mentioned, we start a span, and we add some more context using set_tag. And this is just for a single kind of request; what happens when we need to do it for so many requests? That's something to bear in mind, and we'll discuss it. Next comes the ingestion part, the client side that actually gives us the value and helps us do something with these traces. First of all, when we think about such a solution, we need to figure out our scale: do we need to ingest millions of events per day, or billions? This will really affect the environment where we run Jaeger, Zipkin, or any other solution. We also need to make sure we index context and tags for search. For example, if I'm adding a tag with the URL endpoint, I'd like to ask: show me all traces into my web server for that specific endpoint. If it's not indexed, I just can't do that.
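Stepping back, the inject/extract flow just described can be illustrated without any tracing library at all. This is a deliberately simplified sketch of the idea, with header names I made up for illustration; real OpenTracing/OpenTelemetry carriers use standardized headers such as the W3C traceparent:

```python
import uuid

TRACE_HEADER = "x-trace-id"        # illustrative header names, not a standard
SPAN_HEADER = "x-parent-span-id"

def inject(trace_id, span_id, headers):
    """Service A: stamp outgoing HTTP headers with the current trace context."""
    headers[TRACE_HEADER] = trace_id
    headers[SPAN_HEADER] = span_id
    return headers

def extract(headers):
    """Service B: continue the incoming trace, or start a new one if none exists."""
    trace_id = headers.get(TRACE_HEADER)
    parent_span_id = headers.get(SPAN_HEADER)
    if trace_id is None:
        # Nobody told me I'm part of something: start a fresh trace.
        trace_id = uuid.uuid4().hex
    new_span_id = uuid.uuid4().hex
    return {"trace_id": trace_id, "parent": parent_span_id, "span_id": new_span_id}

# Service A makes an outgoing call...
headers = inject("trace-123", "span-a", {"content-type": "application/json"})
# ...and service B picks the same trace up on the other end.
span = extract(headers)
print(span["trace_id"], span["parent"])
```

The real libraries do much more (sampling, baggage, wire formats), but propagating an ID across the process boundary is the core trick that turns isolated spans into one distributed trace.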
Also, when thinking about such a client, we should think about a way to visualize traces. It can be a timeline, which we're going to see soon, or a graph, something that might be more appealing or easier to understand when we're troubleshooting or examining a specific scenario. Ultimately, we need to set up alerts: being able to get alerted, for example, if we're getting a 500 response code from some third-party API, or if we're seeing slow performance. And this is just the most basic information about what we want from such a client; there are obviously many more requirements. As an example, this is Jaeger. Jaeger visualizes all of your traces as waterfall charts, so you can see exactly the path from your front end all the way to your backend, including all of the calls that were made along the way, with calls that had some error or issue marked. It makes it really easy to understand, all of a sudden, what it takes from the very beginning to the very end of your user's experience.

Let's talk about some best practices in distributed tracing, or tracing in general. I think tagging is a crucial point in tracing today, and it will help you with both search and aggregations. Let's give some examples. You can tag identifiers onto your traces: a user ID, a customer ID, a purchase ID, a device ID, any kind of identifier that helps you. For example, a customer calls you and says: hey, I had a problem with your website, I couldn't check out. You can easily go to the traces and say: dear tracer, show me all traces that match user ID 123 and had some error. And all of a sudden you've found the issue they were talking about.
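That "show me all traces for user 123 with an error" query boils down to filtering on tags. Here is a toy in-memory version of my own; real backends like Jaeger answer the same kind of query by indexing these tag fields:

```python
def search(traces, **tags):
    """Return every trace whose tags match all the given key/value pairs."""
    return [t for t in traces if all(t["tags"].get(k) == v for k, v in tags.items())]

# Hypothetical collected traces, each carrying the identifier tags discussed above.
traces = [
    {"name": "checkout", "tags": {"user_id": "123", "error": True}},
    {"name": "checkout", "tags": {"user_id": "456", "error": False}},
    {"name": "login",    "tags": {"user_id": "123", "error": False}},
]

# "Dear tracer, show me all traces that match user ID 123 and had some error."
matches = search(traces, user_id="123", error=True)
print(matches)
```

The point is that tags turn a vague customer complaint into a precise query, which is exactly what makes tagging worth the small amount of instrumentation effort.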
We can also tag flow control, for example an event type or handling type, and this is especially useful for aggregations: how many times am I handling a type A event versus a type B event? Because I want to understand the saturation of each of them. And ultimately, we can add business metrics to traces, just as we talked about for monitoring; this is another good place to use them. For example, we can tag items-in-cart, which tells me how many items were in this specific purchase, but also lets me plot a chart with aggregated data about how many items are in carts in general. Another thing that can be super helpful is tracing with payloads. When you have the payload inside your trace, it almost acts like logging, except you don't need to search for it the way you do with logs. Things you can put into the payloads include a user ID, but it doesn't have to be a manual tag; it can come, for example, from an HTTP header, or be a key in a NoSQL database, because when I'm troubleshooting, I want to know which key or hash key this update operation touched, or even the response payload from an HTTP call. Say I'm calling some service like Stripe and I'm getting a 500 error. I want to see the response to understand why exactly I got this problem, because it will help me troubleshoot and investigate whatever issue I'm having. One more thing worth mentioning is that tracing can act as glue, because as we've seen so far, we have one solution for tracing, another for monitoring and metrics, and another for logging. If we do it right, and it's definitely not easy, a trace can also point me to where the log is and, in the good scenario, integrate with that log.
A trace can also tell me where I'm running. I'm not just talking about which host; it can tell me whether I'm running as part of a specific pod, a specific function of the service, a specific cluster or service, everything about the environment I'm running in. So once I have a problem in a specific environment, I can go directly to the traces, or from the trace to system metrics about the environment at the time the trace happened. This really aggregates all three pillars of observability under the same tool.

Before we finish, the last main topic I'd love to cover, along with an example of one customer we walked through the journey of observability, is some best practices in observability. When I think about good observability, I'm talking about something fully automated with zero maintenance, because what we saw with OpenTracing is great, but it means somebody has to get trained on it and maintain it somewhere. I also want something that visualizes everything for me: something that shows me what's going on and what it looks like, and plots all the charts, timelines, and graphs. I want it to support any environment, because I don't want to maintain different tracers for different environments. Whether it's running on a Kubernetes cluster, in the cloud or on-prem, as a function as a service, on some other orchestrator, or on IoT devices, I want this tracer to be able to run anywhere. I want this tracer to connect all of the requests, whether they come through HTTP calls, gRPC, or a message queue like Kafka or RabbitMQ; whatever the scenario, I want it connected for me. And I want this tracing solution to correlate all of the data and help me search and analyze anything that happens, just as we discussed with the tagging and context that we add to the traces.
Now, one example from one of our customers that we took through this, and I think this is a good way to start if you're sitting there saying: I need to gain more visibility, or observability, into my services, and I need to start thinking about distributed tracing, because it seems more and more critical today. First, try to understand your business goals. Is there something that bothers you, or are you just interested in learning more? These are two different approaches. Also understand your current architecture. If your architecture is very simple, distributed tracing probably wouldn't give you a lot. But if your architecture is composed of tens of services or microservices, or even hundreds of them, you definitely need something in place. Once you've figured that out, determine your approach. You can do it yourself; I just showed a great example of tools you can build on. You'll have to implement it on your own, do all of the heavy lifting, and do constant maintenance. It might make sense for some use cases, but you need to take that into consideration before doing so. The other way is to pick a managed solution that does this for you. Obviously there are costs for managed solutions, but they reduce everything you would otherwise need to do manually and maintain over time. Also, whether or not you already know which way you want to go, try before you go all in. You need to understand what the process looks like and what kind of value it will bring you, and compare different solutions, because we're engineers: we want to pick the right solution that fits our needs. So make sure you try at least a few of the services out there. And make sure the service integrates with your ecosystem.
Whether you're running on Cloud A or Cloud B or on-prem, whether you can communicate with the outside world or not, whether you have a logging solution or not, think about what integrates best with your environment, so it isn't a completely new tool you need to learn over time. Now, the last thing is to evaluate the tool and understand how to communicate the benefits to the decision makers in order to have an impact. Usually the top level will say, "Why do we need such a thing? Let's just continue the way we're doing it today." But the benefits, which can be faster time to ship features (faster development velocity), reduced downtime, or simply better observability and monitoring of our system, should be communicated well to the decision makers. I'm seeing some questions that I'd love to answer in the meantime. One question is: can we set latency benchmarks like mean, max, or even percentiles for a given trace? You can definitely do that. Think of the trace as a structured log, and you can record in it the duration of handling each specific request. If you have a lot of traces, you can plot them on a chart to see how performance has behaved over time, and once you do that, you'll be able to understand whether you have trends or not. Just be aware that today, Jaeger does not support that. I know they have a plan to add more analytics that will make it possible, but as of today it isn't doable, so you'd need to build something on top of it or do your own client-side analysis. That was the question from Murata. Just before jumping into the other questions, I'm going to do a quick summary, and then we'll go through all of the questions.
So first, I think you all understand that observability might be a buzzword, but it has a lot of meaning for our modern applications, and they require more than just monitoring or just logging. You've seen that we can't solve modern problems with the old kinds of tools, so we need to understand what our application really needs. We also understand that within observability, distributed tracing plays a much bigger part than it used to, because our applications are becoming distributed, so we need distributed tracing to address those needs. And just as a quick tip from me, as a big believer in serverless and managed environments: stop implementing your own internal solutions unless you really need to and unless it's really the focus of your business. If your business is to create the best e-commerce website, stop trying to invent your own security tools, deployment tools, monitoring tools, and logging tools. Try to use something that exists, because building such a tool is somebody else's responsibility, and they'll probably do it better than you. There are some exceptions, if you're a really big company, or you need something internal, or you have security constraints, but in many cases the out-of-the-box solutions out there will be able to meet your needs. So that's it. Thank you very much for joining. Matt, do you want to help me with questions now? Yeah, I've got a couple of questions. What kind of basic troubleshooting is needed in a typical enterprise, though? Can you maybe repeat or clarify that? Yeah, what kind of basic troubleshooting is needed in a typical enterprise? Got it. That's a great question. I think all of us here are engineers, and when you're troubleshooting a problem, you need what I call payload information.
Payload information can come in two ways. One of them is logs. If I print the data of my incoming request, or of every call that I make, it helps me investigate the problem, because if I only see that a response is returning a 500 error or getting slow, I won't be able to troubleshoot it; that's just a symptom. I need to get into the payload data. This can be consumed as logs, but logs, as we mentioned, have to be written manually, are hard to search, and are hard to correlate. Or you can do it via traces, with context. To the context itself you can add meaningful information, like what was in the headers of the HTTP call or what status code I sent back in the response. So the trace helps me troubleshoot and understand exactly what happened in a specific scenario. If you're talking about actual tools, it can be any kind of logging solution; I briefly mentioned some that I think do a great job, like Elastic, which is open source, and also Splunk. For tracing, there are not many solutions today that do this, especially not in a way that also includes the payload, but you'll be able to find some out there. In general, where do you see troubleshooting going? Where do I see troubleshooting going? Hopefully, I think this process will become more automated and based on similar scenarios. We're probably all building the same kinds of applications with the same kinds of frameworks, and many things are becoming more and more commoditized. So being able to respond to some problems, even before they happen, would be amazing. It can be AIOps or anything similar that helps me understand a problem before it even happens. This is super critical.
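Here is a hypothetical sketch of what attaching payload information to a span looks like; the span structure and field names are invented for illustration, not taken from any real tracing product.

```python
# Hypothetical sketch of attaching payload information to a trace span.
# The span structure and field names are invented for illustration.

def record_http_span(method, path, request_headers, status_code, body_excerpt):
    """Capture the payload details needed to troubleshoot, not just the symptom."""
    return {
        "operation": f"{method} {path}",
        "context": {
            "request.headers": dict(request_headers),
            "response.status_code": status_code,
            # A short excerpt is usually enough; full bodies can be large
            # and may contain sensitive data.
            "response.body_excerpt": body_excerpt[:200],
        },
    }

span = record_http_span(
    "POST", "/orders",
    {"content-type": "application/json", "x-request-id": "abc123"},
    500,
    '{"error": "inventory service timeout"}',
)

# A 500 alone is a symptom; the attached payload points at the cause.
print(span["context"]["response.status_code"])   # -> 500
print(span["context"]["response.body_excerpt"])
```

With the payload attached, the trace answers "why did this request fail" instead of only "this request failed."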
This can reduce a lot of downtime for a lot of companies out there. The second thing is to have common knowledge sharing about problems, because I'm pretty sure that the problem I faced today with something I deployed to production, many other engineers faced too. Usually what we do today is search for the problem on Stack Overflow and bang our heads against the wall trying to understand why it's happening to us, and why nobody wrote about it in exactly the same way we hit it. But somebody will, I think, roll up their sleeves and be able to correlate between similar problems, and that will mostly be based on logs and traces combined. What about customers in mixed environments? So you're in a mixed container and serverless environment; what trends are you seeing there? This is actually a super great question. I think with the rise of serverless, serverless has been talked about everywhere for maybe three, four, or five years now, the same as containers, where all of a sudden everybody was talking about how to do containers and microservices in their environment. It's still on that slope; it's not fully adopted by everyone. And I think the best approach is to mix these kinds of hybrid environments, because there are some things serverless is good for. I'm not taking into consideration business needs, whether you can run in the cloud, or which function-as-a-service you're using. You do have a lot of benefits, but choosing the right environment is just like choosing the right framework or the right tool. We're engineers; we need to evaluate when a serverless function is better and when a container environment is better. As a quick tip, you'll probably need to choose both of them in different scenarios, and it's okay to have both.
You just need to make sure that all of your tools, your monitoring, troubleshooting, and tracing tools, fit this kind of hybrid environment. What would you say are some of the problems that users are starting to run into in these mixed environments? That's also a great question, and I think I spoke about it earlier in the conversation. When you're moving to such environments, it's easy to get started. Today, with serverless on some of the cloud vendors, you can get started in almost no time, and in no time you're in production. All of a sudden you'll find yourself with hundreds of pieces, containers, serverless functions, and so on, and resources out there. And you find yourself not ready, in monitoring, tracing, security, and so many other aspects, that it feels like you jumped out of the plane without your parachute: you're falling and you're going to hit something. So coming into such environments prepared, with the right tooling, is the key element of making sure you succeed. Well, one last question and then we'll let you go. So you're an engineer, or you have a team of engineers, or you're an engineering manager, and you want to do observability. Now you have to go pitch that to the business. What's some advice on how you would present the value of observability to the business? Engineering teams tend to get into the weeds quite a bit, so what are, say, the top five points you'd want to make to the business about observability to get them to write a check? Yeah. First, I would start with understanding for myself, whether I'm an engineer or a team leader, what it brings me.
Like, what are my current problems? Understanding my current problems, whether it's a long time to develop, or troubleshooting that takes very long or is very inefficient. So: understand the problem, then evaluate some tools and understand what kind of benefits you're going to get from choosing a tool, because you might find you have problems that no tool is going to solve. Then actually evaluate them; try them out. I know that most companies have privacy concerns and the like, and they can't even evaluate a tool before going through the legal process, but try to evaluate it in some dev environment or playground to see what it actually gives you. So when you come to your decision makers, you can show them: look, I'm not just talking fluff or quoting marketing from a website. I tried it, look how nice it is, and that was on something pretty similar to our environment, and it really helped me troubleshoot and understand our complex environment. Then it's much easier to get buy-in from the decision makers, because usually decision makers are driven, or I hope they're driven, by data. If you come with the right data, "Here are our problems, here is what we need, and look, there is a solution that's going to solve this for us, and I already tested it," that's a good way to get buy-in. And then you move to production, and if it works well, you've got the tool. Well, thank you. I want to thank everybody for attending. And again, our webinar will be posted at cncf.io slash webinars. I think there is just one more question, from Manish. Yeah: when it comes to observability, what good open source tools are available in the market? So I'm going to talk about some of the open source tools. I think a good way to get started is to go to opentracing.io and just read about what tracing is.
It goes much more in depth than what I mentioned before. Now, you have two options. Do it yourself: you can build on logs, or use Jaeger or Zipkin; these are great tools for that. For a managed solution, I'm a bit biased, but I think Epsagon is a good way to understand what such a tool can provide. Everything I've talked about obviously comes with Epsagon, so it can give you a good comparison. And the rest of the usual APMs, the application performance management or monitoring solutions out there, can also give you that, but to some limitation or to some extent. Sorry, I missed that one. So, this was great. Thank you for the presentation today. Thank you to the attendees for joining us. And I want to let everybody know we will have this posted very soon to the CNCF website, cncf.io slash webinars. And with that, have a great day. Perfect. Thank you very much, Matt. Thank you very much, everyone. Thanks, everybody. Bye.