Hello, everyone. I'm audible, right? It's clear? OK, cool. Thanks for joining in today. I'm Prishant Sarkul Trisha, and today we are going to talk about observability, in particular end-to-end observability. I work as a software engineer at Postman. For those who may not know, Postman is an API platform where you can design, develop, build, and test your APIs, and share them publicly or within your team. I work in the reliability team at Postman, and whatever I'm about to discuss in this session relates to my experiences in that team. So let's just begin.

At a glance, in this session we'll discuss what observability is, what end-to-end observability is, where it fits, where it doesn't, what the pitfalls are, and how we can overcome them. I'll be sharing my own experiences as well, and towards the end there's a small demo. I'll share some references too. I hope you enjoy it. So without wasting any more time, let's dive in.

OK, so the first point of interest should be understanding observability. Why is there such hype around this term? With this session, we'll try to understand all that, and to make it easy for ourselves, we'll take it up in this order: why, what, where, and how.

As Simon Sinek says, start with why. So we'll do the same and ask ourselves: why do we need observability? If you take a look at this slide, this particular diagram, it shows development on one single machine. It could be your local system, this laptop right here. All the code and all the configuration are in this one system, and there's just one developer. If I'm writing the code, I have complete context about it, and if there's any bug or any sort of failure, I know where to look and how to fix it. In a flash, I can get my application running again. It's manageable. We can call it a golden scenario, but not a very real one.

So let's gradually move towards practicality. This is a much bigger scenario than what we just saw. There's a team of individuals, and the application is more complex; we can see it has multiple services. But everything is present on one machine, a monolith. The context is still central within the team, and things are manageable. A bit worse than our previous situation, but still manageable.

But now look here: this is what a microservice architecture usually looks like, and it's widely adopted in the industry. There are multiple services running on different instances. They are interacting with different databases and with each other, and a lot of teams are involved; different teams manage different services. The mesh is really complex, and as your business grows and you add more features, this mesh is bound to become more and more complex. This microservice architecture was introduced in the industry on purpose: we wanted faster deployment times and less time to market, so we decided to make our teams and our services loosely coupled. It brought a lot of advantages, but it came with its own set of disadvantages. And if you look at this diagram, say service D goes down for some reason; it blows up.
It's going to take service A and service C down with it, along with all the services those services are calling, and the errors may propagate all the way to the front end. Now my users know that something is wrong. In this situation, the context is not central, and if you want to find the origin of the failure, debug it, and fix it, it gets really chaotic. Yes, there are those super engineers who know everything about the architecture and are willing to fix everything, but they're difficult to find and retain. In my opinion, we should not rely on super engineers; instead, we should focus on adding observability to our service meshes, so that when things blow up, as they are bound to, we are able to detect and fix them faster.

That could be a perfectly valid why. If you are a developer or a software engineer like me, you'd feel: I need to be fast in triaging, I need to fix my issues faster. That's a perfectly valid why for me. But for other personas, for example a business leader in finance or sales, it may be a bit trickier. So let's ask ourselves multiple levels of why, so that we're clear about why we need observability. Let's begin that exercise.

As established, we want observability for a better debugging and fixing experience. Why do we need that? So that our engineers get a peaceful night of sleep. Yes, we need that, and it's required so that our engineers are more productive: if you're not fighting fires all the time, you'll be more productive and less burnt out, and that's something we really desire. More productive engineers lead to a better quality product, which leads to a better user experience. And with a better UX comes customer satisfaction, which we all desire when we're building a business or a product. Needless to say, all this boils down to business value. It's all about money, and it's all about reputation.

Think about it: if you have a good product, you're going to get customers, sure. But if it fails often and you take a long time to fix it, your customers will get frustrated, they'll eventually churn, and the trust in your product will decrease. Now consider the reverse scenario: you have a good product with observability integrated. You detect problems faster, ideally before they reach your customers, and you fix them faster. Your customers get a more consistent experience, the trust in your product rises, they're satisfied, and you get new customers. More importantly, you also retain the old ones, which is equally important. When you think about it, a small client today could become your biggest client tomorrow, just because you thought about something as simple as adding observability across your entire product chain. So this should be clear now: we need observability because of the business value it creates.

With that firmly established, we can move to the next section, which discusses what observability is. There's an amazing definition that comes from control theory. Most people think observability is monitoring, or that it was introduced by the IT industry, but it really wasn't.
Observability is also not a tool you can buy or build; it's a concept, and we need to understand that concept so we can apply it across our product chain. As per control theory, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. Let's look at it diagrammatically, because I love diagrams. We have a system here with multiple states: working, degraded, down, et cetera. It's sending some sort of signals to an external system. In our IT landscape, this could be our monitoring system, which collects all those signals, evaluates them, determines the state, and can take certain actions on them: it could generate alerts, or it could trigger our self-healing systems. In this picture, again a golden scenario, the system is observable.

But again, let's move towards reality like we did earlier. Now we have multiple systems communicating with each other in multiple ways, and the communication broke for some reason; one system is down and not working as expected. In this scenario, is this entire chain still observable? I would say yes, but only if we know about the communication structure: we know which systems interact with each other, and when that interaction fails, we get to know about it, along with the reason behind it. So if we refine our definition, we can say that, at least in IT terms, a good observability system should also reveal the communication structure between components. That's our final "what" for observability: external outputs should map to the internal state of a system, and we should know about the communication structure.

With that clear, we can move ahead to the where and the how, and to understand that properly, we should talk about end-to-end observability. Let's see what that means. If you look at this slide, this is what the usual chain looks like when we develop and deploy an application. In the beginning, you have your developers coding in multiple IDEs, and they push their changes to a repository. You have your CI/CD system, where you run your tests and build images; things get deployed on servers, and finally your clients reach your servers via a network. This is the whole chain where observability could be required, but let's simplify it; let's break it down one by one and see what happens.

Level one is where your developers are coding in multiple IDEs. Here you want your code to be standardized and consistent in terms of formatting, so there are things to consider. But for these purposes, do we actually need observability? Do we need to collect data about it? I would say that a better way to accomplish all that would be formatters and linters, plus some pre-commit hooks and pre-commit checks. I added this particular example to highlight the fact that just because we can collect some data and add some observability, we should not necessarily do it. There should be a clear purpose behind it, and if there are better ways to tackle something, we should definitely look for them and go with them. So with the first thing checked off, let's move on.
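Before we do, to make the "better ways" point concrete: a pre-commit check can be as small as the following sketch. This is a hypothetical script of my own, not from any particular repo; it assumes Prettier and ESLint are already dev dependencies, and you'd wire it up via .git/hooks/pre-commit or a tool like husky.

```js
// scripts/pre-commit.js -- hypothetical pre-commit check (assumes Prettier
// and ESLint are installed as dev dependencies of the repo).
const { execSync } = require('node:child_process');

try {
  // Block the commit if formatting or lint rules are violated.
  execSync('npx prettier --check .', { stdio: 'inherit' });
  execSync('npx eslint .', { stdio: 'inherit' });
} catch {
  console.error('Pre-commit checks failed; fix formatting/lint issues and retry.');
  process.exit(1);
}
```

No telemetry, no dashboards; the problem is solved at the source, which is exactly the point of this level.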
The next level is the remote repository, and this is the place where your tests run: unit tests, integration tests, E2E tests, et cetera. Test-driven development is a widely adopted approach in the industry now, and usually people generate test reports and pipe the data to a central system. The most common thing they look for is test coverage, how much of the code base is covered by tests, which I feel is a really good starting point, but a lot more useful insights can come from this test data, these test reports. For example, suppose a unit test typically runs in five milliseconds, but after some changes it starts completing in 50 milliseconds, a 10x increase. That could indicate trouble with the changes, and you may have to look into it; that could be a really valuable insight. And then there are flaky tests, which we all know about, tests that fail intermittently. If we collect some data about them, we can actually ask ourselves: is this a poorly written test, or a poorly designed component? All these insights can be valuable for the development and quality teams. So the first level where we should consider adding observability is here, at the test level, where we start generating test reports.
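As an illustration of the kind of insight I mean, here's a sketch that flags duration regressions and failures from a test report. The report shape and file names are hypothetical, purely to show the idea; real runners (Jest, JUnit, etc.) each have their own report formats.

```js
// analyze-test-report.js -- a sketch, not from the talk's repo; assumes a
// hypothetical JSON report shaped like [{ name, durationMs, passed }] and a
// baseline file mapping test name -> typical duration in milliseconds.
const fs = require('node:fs');

const report = JSON.parse(fs.readFileSync('test-report.json', 'utf8'));
const baseline = JSON.parse(fs.readFileSync('baseline-durations.json', 'utf8'));

for (const test of report) {
  const typical = baseline[test.name];
  // Flag a 10x duration jump, like the 5 ms -> 50 ms example above.
  if (typical && test.durationMs >= typical * 10) {
    console.warn(`Possible regression: ${test.name} went ${typical} ms -> ${test.durationMs} ms`);
  }
  // Counting failures per test across many runs is one way to surface flaky tests.
  if (!test.passed) {
    console.warn(`Failed: ${test.name}`);
  }
}
```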
Next would be our external systems: the CI/CD system and the image registry. If you are an application developer, you usually use these systems but only care when your build fails; otherwise you don't think much about them, and I guess that's how it should be. The responsibility for observing this stuff should lie with the team providing these services, either via an external tool or an in-house solution. And if you are part of a team handling these kinds of services, then whatever we discuss next applies equally to these two. So we'll consider these checked off as well.

Cool, so: deployed on servers. With this, we've come to the most important part of our entire chain, where our code is already deployed on servers. Let's zoom in on point 6 a bit. This is what the structure looks like after the application is deployed. You have your application, maybe running in multiple containers. Those containers are hosted on certain instances. There could be a load balancer distributing the traffic. Your application could be interacting with a database, and if this is a client-facing application, your clients connect to it via a network. Just from one glance, we can see that there are multiple components involved and multiple connections involved, and we would want to add observability in all of these places. So let's see what we can do here.

At the first level, the application level, you want to know whether your application is healthy, whether it's up, and whether your users are able to reach it. Your application is hosted in certain containers, so you may need some insight about those containers as well. Then those containers run on some instances; data needs to be collected at that level too, because you may want to know about resource utilization. Further, the database needs its own observability. And there are a lot of connections, and as we clearly established earlier, we need to know about the communication structure. Our application is interacting with the database, and it could be interacting with other applications in the entire ecosystem too. So there are a lot of connections and a lot of network involved, and DNS is involved as well, of course, so you may require DNS-level observability for things like DNS lookup time, which you'd want to reduce if it's adding too much latency inside your network.

I've summarized everything in this list: a lot of levels of observability are required. And this diagram and this particular list can confuse us; it feels too difficult to comprehend, and you wonder: is it even worth it? Should we take on this much headache just to add observability to our ecosystem? Our entire purpose was to make things easier for ourselves, but if it involves so much complexity, is it worth it? Yes, it's worth it. It's confusing, but we have to simplify it, we have to make it simple for ourselves, and we owe it to our engineers to make things more observable for them. So let's see how we can simplify all this, and that we'll do in our next section: the how.

In order to understand how to add observability to our systems, we should be clear on our purposes, and for that we have these questions. First: what is the state of my system? What is not working? Why is it not working? And if something's not working, is the failure or degradation in my system or in a dependency? These are taken directly from the "what" of observability we established, and we'll try to answer them one by one so that we know how to add observability to our systems.

OK: metrics, logs, and traces. These are often touted as the three pillars of observability, and I guess no discussion of observability is ever complete without discussing these three. So let's take a look at them one by one and see what we get. First, metrics. A metric is a measurable value; we collect metrics periodically, and they are generally aggregatable. When we talk about monitoring, we are generally talking about collecting these metrics in real time and analyzing them to derive some sort of meaningful information. We can then visualize this data in whatever format we like: graphs, charts, et cetera. The whole notion behind monitoring is that if we know what normal looks like, if we know how our service normally responds and what those values are, then when it gets degraded and those values fluctuate or change, we'll be able to detect it, fix it, improve it, et cetera. That's the whole reason we should think about monitoring.

But what should we consider monitoring? There are a lot of data points and a lot of metrics that can be captured, but should we even consider taking everything in? Is that practical? I think these four metrics, generally called the golden signals, apply equally to application-level and infrastructure-level components. The first is traffic, the number of requests your system is handling at a time. This can be useful when you want to make scaling decisions, perhaps: when the system is under load, you may want to scale up. And if all of a sudden you get a mysterious or unexpected spike in traffic, it could be an indicator that your system is under attack, a DDoS attack perhaps.
Second is latency, the time taken by the system to respond to a particular client request, the processing delay and so on. If you see an increase in latency, there could be a degradation in your system which you may need to check; if it persists for a prolonged period, then yes, the system is probably degraded for some reason. Third is errors, or error rate. Generally, the 4XX and 5XX responses are counted as errors or failed requests. We want to check how many of our requests are failing, because that could indicate something is failing or degraded at some level in our system, and the error rate is a good indicator of that. The final one is saturation, which is about resource utilization. Resources include things like CPU, memory, and network bandwidth. If your resources are over-utilized, client requests could get dropped or time out, and you'd want to know that, because your customers may be getting frustrated at that point, and we don't want that to happen.

With these four signals, I feel we can find the internal state of our system: whether it's failing, degraded, under attack, or over-utilized. In a way, I'm able to infer the internal state of my system from the external outputs, which are these signals. So we can say that with monitoring, with the metrics, we're able to answer the first question: what is the state of my system? But the rest are still unanswered.
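As a quick illustration of the metrics pillar: here's a minimal sketch of capturing the golden signals in an Express service using the prom-client library. The talk doesn't prescribe a metrics library, so take prom-client as one common option; the metric names, labels, and buckets are my own choices.

```js
// metrics.js -- a minimal golden-signals sketch for an Express service.
const express = require('express');
const client = require('prom-client');

const app = express();

// One histogram covers traffic (count), latency (duration), and errors
// (status label); saturation signals like CPU, memory, and event-loop lag
// come from prom-client's default metrics below.
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.005, 0.05, 0.1, 0.5, 1, 5],
});
client.collectDefaultMetrics();

app.use((req, res, next) => {
  const endTimer = httpDuration.startTimer();
  res.on('finish', () => {
    // Querying this metric with status >= 400 gives you the error rate.
    endTimer({ method: req.method, route: req.path, status: res.statusCode });
  });
  next();
});

// A scraper such as Prometheus pulls from this endpoint periodically.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000);
```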
For the rest, we'll have to move to our next pillar: logging. It's quite common while developing applications for developers to add logs in places where there are known errors, so that it's easy to find out what happened, when, and where. Logs are records of activity performed in the system while it's running. They have a timestamp and more information about the error or exception; they could contain stack traces, et cetera. If you want to go back in time to a particular event to see why a failure happened, logs can be really helpful. They help in detecting where the application is failing, why it's failing, and what the reason behind it could be. So with logging, we can answer two more questions, but there's one last question still left.

For that, we move to our final pillar, tracing, quite an important one. A trace records the trail of a client request as it moves through our ecosystem. If a particular request has touched a number of services in your ecosystem, a trace helps you find out what that path looks like. So if one of your dependencies is giving inaccurate results, or if it's slow, a trace can help you catch that. It reveals the communication structure, and that's pretty important. At every level, for every service the request touches, the trace shows how that service impacted the request: what latency it added, what errors it added. That reveals really granular insights. And since tracing lets you find out whether the problem is in your system or in a dependency, we can say the final question gets answered; we've added double ticks on two of the questions because tracing gives you more granular data there. So all our questions are answered now.

We've found out how we can add observability, so let's see what the conclusions are: we need monitoring for some important metrics, we need logging, and we need tracing. Looks pretty straightforward on paper. It feels like there are three checkpoints, I have a checklist, I can just do these items and my work is done. But when we add these things in production and run them for some time, we hit a lot of problems, and those problems are worth discussing. So let's see what the pitfalls are and how we can overcome them.

OK, case one: expensive logging. When your traffic volume increases, or you have a lot of services, the volume of logs also increases, and that impacts your storage costs. Generally, people prefer to move older logs to cheaper storage tiers to deal with expensive logging, and that's working reasonably well for our industry as of now, so I won't spend much time on it.

Case two is expensive tracing, and this is really important. If you have a lot of users, a lot of incoming traffic, or a lot of services, you are going to generate a lot of trace data. And if you are piping these traces to an external provider, you'll be charged not only for the storage and ingestion costs but also the network transfer cost, the egress cost. It can prove really, really heavy on your pocket, and we have to do something about it. The general approach is sampling. It's not a silver bullet, not a 100% foolproof strategy, but it gives you enough breathing room. The first way is to sample just a percentage of traces, but a much smarter way is to sample the traces that have errors or that are slow, and sample a few normal traces as well. This is usually called tail-based sampling.
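To show the idea, here's a toy sketch of the tail-based sampling decision only, under the assumption that something upstream has already assembled complete traces. In practice this usually runs in a collector (the OpenTelemetry Collector, for example, ships a tail-sampling processor), not in your application code; the thresholds and rate here are made-up numbers to tune per service.

```js
// tail-sample.js -- toy sketch of a tail-based sampling decision.
const NORMAL_SAMPLE_RATE = 0.05; // keep ~5% of healthy traces
const SLOW_THRESHOLD_MS = 1000;  // assumed "slow" cutoff, tune per service

// A finished trace here is just { traceId, hasError, durationMs, spans }.
function shouldKeep(trace) {
  if (trace.hasError) return true;                        // always keep errors
  if (trace.durationMs > SLOW_THRESHOLD_MS) return true;  // always keep slow traces
  return Math.random() < NORMAL_SAMPLE_RATE;              // plus a few normal ones
}

function exportSampled(finishedTraces, exportFn) {
  for (const trace of finishedTraces) {
    if (shouldKeep(trace)) exportFn(trace);
  }
}

module.exports = { shouldKeep, exportSampled };
```

The key difference from plain percentage (head-based) sampling is that the decision happens after the trace completes, so the interesting traces, the slow and failing ones, are never thrown away.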
The final pitfall, I feel, is superficial metrics. We said the golden signals can be applied to all sorts of application and infrastructure components, that they capture certain metrics and give you performance indicators. But do they actually tell us what our users are experiencing? If you think about it, our users care about what they're experiencing in your product; they don't care about what you're observing in your product. There's a huge gap there, and we need to bridge it. That's what superficial metrics is all about: bridging that gap. So let's see how we've tried to bridge it.

The first thing is availability, and when we talk about availability, we're talking about user-perceived availability. The first point of interest: are my users able to reach my system? It's possible that within my own network I can reach my system and interact with the product, but my users cannot. A simple approach is setting up health checks that run external to our ecosystem. There are a lot of providers that help you do this; they set up health checks and trigger them from multiple regions all around the globe, so you know which users and which regions are able to reach your system.

A second, more important point of user-perceived availability: are my users able to do what they want to do? If I'm running an e-commerce service, for example, a typical user of mine would be adding a product to their cart, making a payment, browsing, adding things to their wish list, or canceling an order. These are the flows that matter. If we've only set up simple health check endpoints, endpoints that just send a request and report that everything's running if the system responds, then we may miss problems in these flows. And that's not good, because if I can't understand what my users are facing, all the monitoring has no purpose. So we need to know what our business-critical flows are and set up health checks that are transactional, E2E health checks you could say: they traverse your whole flow and only then report up or down status. That helps determine whether your users are able to do what they want to do. In addition, if you have a front-end application and want to measure its performance, you can use something like real user monitoring, where you add some JavaScript to your front-end app and it collects performance metrics for you, things like load time, so you can see how your front end is performing for a particular set of users or for users around the globe.

With availability out of the picture, let's move to the performance part, where we get to the Application Performance Index, usually called Apdex. This is an industry standard that involves computing a score between zero and one; the closer the score is to one, the better your application is performing. Let's look at the formula. Apdex can be calculated by external providers, but using this formula you can do it yourself in an in-house solution too. Suppose your application generally responds in T milliseconds. If a request is fulfilled in T time or less, we say the customer is satisfied. If it responds between T and 4T, the customer is tolerating things a bit. And if it's greater than 4T, the customer is frustrated; they're going to abandon your product, which is something we don't want and something we want to measure. In the formula, the denominator is the total request count, so a higher frustrated count inflates the denominator without adding to the numerator, which leads to a lower Apdex, closer to zero, indicating that your application is not performing well. So on paper it looks like it works fine, but some concerns usually come up for people who use Apdex, and we faced these situations too.
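Before we get to those concerns, here's the standard Apdex arithmetic as a small sketch; the example values are made up.

```js
// apdex.js -- standard Apdex calculation: T is your target response time,
// 4T is the tolerating ceiling, anything slower counts as frustrated.
function apdex(responseTimesMs, T) {
  let satisfied = 0;
  let tolerating = 0;
  for (const t of responseTimesMs) {
    if (t <= T) satisfied++;
    else if (t <= 4 * T) tolerating++;
    // > 4T is frustrated and contributes nothing to the numerator
  }
  return (satisfied + tolerating / 2) / responseTimesMs.length;
}

// With T = 100 ms: [50, 80, 120, 450] -> (2 + 1/2) / 4 = 0.625
console.log(apdex([50, 80, 120, 450], 100));
```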
The first concern: 4T can be a really long time in some cases. If you have a website that responds in five milliseconds, your user is definitely not going to tolerate 20 milliseconds, the 4T mark; they're going to abandon it. Second, the score only talks about response times. And the T value is the main thing here: if you set it wrong, you're going to end up with false results. Another important thing: even if I find a proper T value that works fine for me, not all endpoints in my system will respond within that T. There could be some routes that are slow, and if I set my T based on those slow routes, then when the faster routes degrade, we won't be able to catch it. Again, false results, which is something we do not want.

We tried to overcome these situations in the following way. First, count your error requests in the frustrated count. That way it's not only about response time anymore; you account for errors as well, and saturation and traffic get covered too, because if your resources are heavily utilized, latency and response times are going to increase. Then, ignore the expected errors. If you're running a login service, 4XX errors like unauthorized are expected; they're going to happen, and you don't want to get alerted on that error rate, so filter them out. Of course, you'll have to monitor your application's performance in production for a certain period so that you know what normal looks like; you'll want to know what a sensible T threshold is, and then you can set a proper Apdex target and find ways to improve towards it. You find these settings over time; it takes time, but once things are set up, they can really work for you. And the final thing: identify your business-critical routes, which you'll do anyway for the transactional uptime checks, and monitor them separately with their own Apdex T, so Apdex at the individual endpoint level. If those routes need special attention, that gives you more granular data, and you exclude them from the application-wide score, so that a slow route doesn't drag down the score for the entire application. So that was Apdex and how it can be used.

The final thing I wanted to highlight here is observability-driven development. We talk about test-driven development a lot, and it's widely adopted in the industry, but observability usually takes a back seat, and that should not happen. If a developer is thinking about end-to-end ownership, if they want their application to always be up, running, and healthy for their users, observability is a necessary part of that. With some external providers, it's practically a one-time job: they have agents, you install them, and they generate all sorts of telemetry data. You can also do it yourself by configuring tracing via OpenTelemetry; that's also pretty much a one-time job, so it's not much of a burden on the developer. Most of the integrations can be done in one go, and I have a demo to show exactly that, because I really wanted to highlight this. I'll be using OpenTelemetry to generate some trace data and send it to a Jaeger backend. I've prepared a GitHub repo as well; if anyone wants to check out the code and see how tracing is integrated, that's a good place to look. There's a README too, which you can use to set up the entire chain and try it out. So let's move to the demo now.
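For reference while following along: the tracing bootstrap in each service looks roughly like this. It's a minimal sketch assembled from the public OpenTelemetry JS docs; the repo's actual tracing file, package versions, and options may differ.

```js
// tracing.js -- minimal OpenTelemetry Node bootstrap sketch. Load it before
// any application code, e.g. `node -r ./tracing.js index.js`.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');

const sdk = new NodeSDK({
  serviceName: 'books', // 'orders' and 'customers' in the other services
  traceExporter: new JaegerExporter({
    endpoint: 'http://localhost:14268/api/traces', // Jaeger collector in Docker
  }),
  // Auto-instruments HTTP, Express, MongoDB, and more without code changes.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

Loading it before the application code matters because the auto-instrumentation patches modules as they are required; that's the "initialize tracing first" point you'll see in a moment.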
OK, so I have three services here: books, orders, and customers, and as you can see there's a tracing file in each of them. These are Node.js services, and I've used simple OpenTelemetry integrations: the Node auto-instrumentation for the tracing part, which is easily available and described pretty well in the OpenTelemetry documentation. So what I've done here is add the Node auto-instrumentation, and since I'm using MongoDB as well, I've added instrumentation for that too, so that we're able to capture those spans, which could be pretty interesting. For the exporter, where we're going to send this tracing data, we're using Jaeger, so a Jaeger client is initialized as well.

Let's see: I have three processes running, and as you can see, tracing is initialized for each of them. You need to initialize tracing before you run your application code, so that everything is instrumented. For the demo, I've used Docker containers: there's a Jaeger backend running and a MongoDB container running. Now I'm going to interact with my application, and for that I'm using Postman. I'll send some requests to my different services and see how that appears in Jaeger. I've already added some books for the demo. So yes, these are working: I have some books. Let's check customers; yes, I have some customers as well. And let's see if there are orders generated. Cool, there are orders too.

So let's see how this appears in Jaeger. I've sent three requests now. For the book service, let's find some traces. A few seconds ago, at 5:38, we have certain traces coming in. We made a request to the GET endpoint of books, and that's over here, and it shows all the spans. It shows that MongoDB was queried, and whatever information we need about each span is here too: the time it took, the duration, the start time. This gives you really granular data.

Next, let's fetch an order by a certain ID. Cool, so I send a request again, and now I'll check the order service. Here's my order service; find traces, and yes, this is from a few seconds ago, 5:39, and earlier we had fetched orders using a GET method, so that's also shown. Let's look at this particular trace. You can see that a lot of spans were created: your order service did a MongoDB search, and it also contacted two other services for this, your book service and your customer service, to fetch the details about the customers and the books. That can also be seen over here, and if you go to those customer or book services, you'll see that yes, the order service had called them. Let's just verify that once. Yes, it was called by the order service, and this was the whole chain. So with tracing, we get much more granular data; we can see the whole communication structure, and at each level, what latency there was and what other metadata is there, all from the Jaeger UI.

And one thing which I really like in such tools is the visibility into the system architecture. Let me just go to that. OK, so I'm able to see the architecture here.
My order service is calling my customer service and my book service, and this can be pretty useful when you just want to look at how your entire ecosystem is laid out and how the services interact with each other. Some APM tools even let you see the DB interaction calls. I'm not sure whether that's possible with Jaeger, because I haven't tried it, but it could certainly be possible, and as I mentioned, other providers do have those integrations. There are some good initiatives in this area too, such as showing each service's health and so on; if a link is broken or something like that, it would also be reflected on your system architecture graph. So that was the end of the demo, and we'll come back to our presentation now.

OK, so here are some references, some really good material that I found, which you can go through as well. These are some amazing blogs and videos you can have a look at. And that's about it. Thank you for your time, and if you have any questions, please let me know.