Hello, everyone. My name is Pius, and my colleague Kiran from IBM and I will be covering a short beginner's guide to observability, specifically observability with OpenTelemetry. In this presentation we'll discuss how developers and companies are currently modernizing their applications, then quickly move our focus to what observability is and the three pillars of observability, followed by the OpenTelemetry specification. Then we'll cover some of the commonly used terminologies in this area, like span, span context, and the data model of tracing. We'll end the presentation with a short demo that shows how tracing with Jaeger works with a small microservice application, and conclude with a short summary. So we'll quickly come to modernizing applications. There's been wider adoption of microservices, where our application services communicate with counterpart services running on different platforms like private cloud, public cloud, et cetera. It has become immensely important to track requests flowing through the different components of our application. Similarly, there has been increased adoption of DevOps practices. DevOps is being adopted by almost all organizations now because it increases the ability to deliver applications at a much higher velocity than otherwise. But with this velocity, it also becomes important to trace the performance of our applications. With the new DevOps practices, the microservice style, and hybrid cloud all coming in, it becomes very important to introduce observability into our applications. Observability is a very critical component of any application; it helps us understand the bottlenecks of our application, product, or even services running across hybrid clouds. Now we'll quickly cover what observability basically is. Observability is not a new concept; it's been with us in different forms.
So in physics and control theory, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. There's also a notion that the evolution of monitoring, alongside microservices and the various other components we work with, has led us to observability. That's because new technologies like Docker, Kubernetes, containers, OpenShift, et cetera, and various new practices have helped us enhance the performance of our software products, and they have considerably reduced the friction between the dev, pre-prod, and prod environments. But at the same time, greater complexity has been introduced into our products, and to handle that we have introduced observability. Observability helps us understand and debug the rare or infrequent paths that a request takes. There are some infrequent paths taken by requests that are mostly not encountered in the normal running environment. Similarly, it also helps us gain insight across the application: if there are any hidden patterns or hidden paths being followed, we can easily determine that through observability. There's also the question of why we need observability at all, and one big reason that often comes up is the introduction of multi-layered architectures. Observability helps the developer understand a multi-layered architecture with shared information on what's slow, what's broken, and what needs to be done to improve the overall performance, and which particular component is most likely contributing to the latency in our application right now. It is not just about knowing that there is a problem, but also about why the problem is happening. What change has been introduced in our existing application, why was it changed, and what was the reason behind it? Is that feature really worth having?
Because it's impacting the performance of our application with a good amount of latency. These are some of the most important questions answered by observability, and it helps us determine what is contributing to the overall latency of our application. Observability is defined as a measure of internal state derived from the outputs, and the outputs in our case are logs, events, and traces. These are often known as the three pillars of observability. A log is a record of some activity that happened somewhere at some point in time; logs are immutable, timestamped records of discrete events that have happened over time. We all know logs; we've been using them widely in our applications, and they are one of the most essential attributes of any application: we add the right log statements whenever an action is initiated by a user or by a process. Similarly, events are focused on periodic measurements of information that allow you to have an idea of the overall state. This could be disk space, CPU usage, or the number of processes on our system. In fact, with events you can harness mathematical modeling and prediction to derive knowledge of the behavior of a system over an interval of time. For example, you can use events to generate a weekly, monthly, or quarterly report to see how well your CPU and various processes are being used, and you can make predictions based on the events, like if X amount of resources has been used during certain time intervals. With that prediction we can accordingly increase or decrease our CPU and memory so that we have sufficient resources available to meet demand. Now, the third point, which is interesting here, is traces. Distributed tracing enables you to analyze performance throughout your microservice architecture, all in one view.
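To make the three pillars concrete, here is a minimal toy sketch (not a real telemetry client; all names are illustrative) of the three signals one operation might emit: a log line, a metric sample, and a trace span.

```python
import time

# Toy illustration of the three observability signals for one request.
# None of this uses a real library; it only shows the shape of each signal.

def handle_request(request_id: str) -> dict:
    start = time.time()
    # Log: an immutable, timestamped record of a discrete event.
    log_line = {"ts": start, "level": "INFO",
                "msg": f"request {request_id} started"}
    # ... imagine some work happening here ...
    finish = time.time()
    # Metric (event): a periodic numeric measurement you can aggregate.
    metric_sample = {"name": "request_duration_seconds",
                     "value": finish - start, "ts": finish}
    # Trace span: a named, timed operation with identifiers that let a
    # tracing backend stitch the request's path across services.
    span = {"trace_id": "abc123", "span_id": "def456",
            "operation": "handle_request",
            "start": start, "finish": finish}
    return {"log": log_line, "metric": metric_sample, "span": span}

signals = handle_request("8102-1")
print(sorted(signals))
```

The point of the sketch is only that the three signals answer different questions about the same request: the log says what happened, the metric says how much, and the span says where in the flow.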
And this is accomplished by tracing each request from the very beginning, from the moment the initial web request is initiated from the web UI until it completes its processing at the very end. Traces are used to identify the amount of work that has been done at each different layer, and that's where traces play a crucial part for us. To summarize in short: logs are about what happened and when it happened, events are about collecting information periodically, and traces are about identifying the amount of work done at each layer. Now, it's always good to have all three in our components, but the challenge lies in integrating all three across the different components that we have. There will be different services used in different components of an application, and it becomes quite challenging for the different service providers and their applications to agree on a single tracing vendor. Everyone has their own choices, and everybody wants to use their own tracing tooling. But the use case of tracing is that, from the very beginning to the very end, we have a single flow tracked by the tracing application. To overcome this, that's where OpenTelemetry comes in. OpenTelemetry provides the libraries, agents, and other components that we need to capture telemetry from our services so that we can observe, manage, and debug the requests that are going on. It's basically a semantic specification providing a vendor-neutral API for distributed tracing, so that any applications that follow the specification can interoperate. One such example in our case is Jaeger. Jaeger follows the OpenTracing specification, and that's what we've been using for our demo purposes as well. So there are three major components, or you can say three critical aspects, of the OpenTracing specification.
They are named Tracer, Span, and Span Context. OpenTracing, as we mentioned, defines an API through which application instrumentation can send data to a pluggable tracer. OpenTracing lays out standards like standardized span management, how processes communicate when they cross service boundaries, how active span management is done, and how inter- and intra-process propagation is handled. These are some of the guidelines, some of the specifications laid out by OpenTracing. Beyond that, there are three critical and interrelated types in the OpenTracing specification. The first one is the trace: the description of a transaction as it moves through a distributed system is what is handled by a trace. Basically, whenever a transaction goes from one layer to another, it is traced out by the tracer. The second one is the span, and this is one of the most widely used terms, one of the most important concepts, that you'll come across many times while implementing OpenTracing for your application. It's a named, timed operation representing a piece of the workflow. It accepts key-value pairs, it has timestamps, it has structured logs attached to it; a particular span instance carries a lot of information. We'll cover it in more detail later on. The next important part is the span context. The span context is the trace information that accompanies a distributed transaction, including when it passes from one service to another over a network, over a message bus, or in any other architecture. The span context contains the trace identifier, the span identifier, and any other data that the tracing system needs to propagate to the downstream services.
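As a rough illustration of how a span context propagates across a network hop, here is a toy inject/extract sketch in the spirit of the OpenTracing carrier API. The header name mirrors the `uber-trace-id` format that Jaeger clients use, but this is an assumption-laden sketch, not the real client library.

```python
# Toy inject/extract of a span context into HTTP-style headers.
# Illustrative only; real tracers implement this via Tracer.inject/extract.

class SpanContext:
    def __init__(self, trace_id, span_id, parent_id=0, flags=1):
        self.trace_id, self.span_id = trace_id, span_id
        self.parent_id, self.flags = parent_id, flags

def inject(ctx: SpanContext, headers: dict) -> None:
    # Serialize the context the way Jaeger's header format does:
    # {trace-id}:{span-id}:{parent-span-id}:{flags}, hex-encoded.
    headers["uber-trace-id"] = (
        f"{ctx.trace_id:x}:{ctx.span_id:x}:{ctx.parent_id:x}:{ctx.flags:x}")

def extract(headers: dict) -> SpanContext:
    trace_id, span_id, parent_id, flags = (
        int(part, 16) for part in headers["uber-trace-id"].split(":"))
    return SpanContext(trace_id, span_id, parent_id, flags)

# Service A injects its context into the outgoing request headers...
headers = {}
inject(SpanContext(trace_id=0xabc123, span_id=0x1), headers)
# ...and service B extracts it so its spans join the same trace.
ctx = extract(headers)
```

The design point is that the downstream service never needs to know who called it; everything it needs to continue the trace travels in the carrier.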
There might be one span which initiates a few more spans, with a parent-child relationship between them, or a follows-from relationship where one process triggers another process's action. That's one of the places where the span context comes in. Like I was saying, span is one of the most widely used terms that you'll come across a lot in this tracing world, so we'll look at what the span is in much more detail. As we've been saying, it's the primary building block of a distributed trace, representing an individual unit of work done in a distributed system. One of the good things about the span is that it encapsulates the following state inside it: an operation name, a start timestamp, a finish timestamp, and key-value pairs in the form of span tags. A set of zero or more key-value tags can be added, and they become very useful if you want some additional information attached to the span. Each span has a certain operation to perform, and the operation name identifies it. In the diagram shown here, there is a DB query that has been fired: at time t = 0 the DB query was fired, and at time t = x the DB query finished. This helps us understand how much time this particular span, this particular operation, took to complete. This is mostly how the span contributes to tracing: understanding the flow, and understanding how much time a particular action of the user or the program took to complete. And like we were saying previously for the span context, each span context encapsulates state such as where the span originated, so we can trace the origin of the span context. We'll cover it in a bit more detail. The references between spans are where the span context comes into the picture: a span may reference zero or more span contexts that are causally related to it.
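The states a span encapsulates can be sketched as a tiny class. This is a toy model, not a real tracer API; real clients such as the Jaeger libraries provide the equivalent, and the tag and log values here are made up for illustration.

```python
import time

# Minimal toy Span holding the states listed above: operation name,
# start/finish timestamps, key-value tags, and structured logs.

class Span:
    def __init__(self, operation_name: str):
        self.operation_name = operation_name
        self.start_time = time.time()   # t = 0 in the slide's diagram
        self.finish_time = None
        self.tags = {}                  # zero or more key-value tags
        self.logs = []                  # timestamped structured logs

    def set_tag(self, key, value):
        self.tags[key] = value

    def log_event(self, message):
        self.logs.append({"ts": time.time(), "event": message})

    def finish(self):
        self.finish_time = time.time()  # t = x in the slide's diagram

    def duration(self) -> float:
        return self.finish_time - self.start_time

span = Span("db.query")
span.set_tag("db.statement", "SELECT * FROM customer WHERE customer_id=731")
span.log_event("query issued")
span.finish()
print(f"{span.operation_name} took {span.duration():.6f}s")
```

The duration, start minus finish, is exactly the "how much time did this operation take" answer the span exists to provide.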
So these are relationships like child-of and follows-from. In layman's terms, it's like having one process which triggers two child processes. It could be an ORM which initiates a couple of SQL queries in the database, and then inside the DB a couple of transaction logs are written to make sure the database upholds all the ACID guarantees that a normal RDBMS has. So this is how the child-of relationship looks: a span may be the child of a parent span, and in a child-of reference the parent span depends on the child span in some capacity. In our example, the ORM operation depends on the SQL query, which is the child span, in some capacity. Unless the child is finished, the parent will wait for that operation to terminate, so if the time span of a child span gets extended, it will also impact the timing of the parent span. This contributes to how the entire data model of tracing looks. The follows-from reference, by contrast, does not express a dependency the way child-of does; it's more of a loose relationship that the child has with the parent, causal in a casual sense, where the parent merely triggers the child. As more development happens in this area, you can expect a clearer definition of the follows-from reference. The child-of reference is the one mostly used by everyone, and it's quite popular everywhere. That brings us to the OpenTracing data model and how it looks; hopefully this will give better clarity on how these references look when they're at work. In the first figure we have the causal relationships between the spans. Span A is the root span, and it then triggers two child spans, span B and span C. Span C is a child of span A.
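The two reference types can be sketched as data. This is a toy model of the OpenTracing reference concept, not a real API; the operation names (`orm.save`, `sql.insert`, `audit.write`) are hypothetical.

```python
# Toy sketch of the two OpenTracing reference types. In child_of the
# parent depends on (and typically waits for) the child, like the ORM
# waiting on its SQL query; in follows_from the parent merely triggers
# the child and does not wait for it.

class Span:
    def __init__(self, operation, references=None):
        self.operation = operation
        self.references = references or []  # list of (ref_type, parent_span)

def child_of(parent):
    return ("child_of", parent)

def follows_from(parent):
    return ("follows_from", parent)

orm = Span("orm.save")
sql = Span("sql.insert", [child_of(orm)])         # orm waits for sql
audit = Span("audit.write", [follows_from(sql)])  # fire-and-forget

for span in (sql, audit):
    for ref_type, parent in span.references:
        print(f"{span.operation} is {ref_type} {parent.operation}")
```

Because the reference carries the parent's context, a tracing backend can rebuild the whole causal graph from nothing but a flat list of spans.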
Now you can see that span C further initiated two child spans, span E and span F; span G was initiated by span F, and span H by span G, following a follows-from relationship. So span B and span C are children of span A, which was the one relationship, and span H follows from span G, which is the other reference we were talking about. If you want to visualize it, you can think of it as a process tree where a main application concurrently triggers two processes: the main program A concurrently starts process B and process C, and they continue to serve their application, their microservices, with the different features they have. The second figure is more of a time diagram, a timeline showing how much time span A took to complete. Span B was a child of span A, so it definitely contributed to the overall duration of span A; span D was then a child of span B, and so on and so forth. Span C ran in parallel to span B, and you can see that B and C are children of A. The temporal view of the spans is not there to show what the causal relationships among the different spans are; rather, it's there to show how much time each span took and which particular operation was more efficient compared to the others, and that's the one major reason for using the temporal view. So figure one shows the causal relationships between the spans in a single trace, and figure two visualizes the trace along a time axis. In both diagrams you can see how the different spans are linked to each other, and how they help us understand the operation name, the start and finish times, the duration of each operation, and how each span contributed to its parent.
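The two figures can be reconstructed from the same span data. In this sketch the tree shape follows the description above (B and C are children of A, D of B, E and F of C, with G following from F and H from G), but every start time and duration is invented purely to illustrate the time-axis view.

```python
# Rebuilding the trace from the figures as plain data. The "parent"
# field records causality (child_of, or follows_from for G and H);
# all numbers are made up for illustration.

spans = {
    "A": {"parent": None, "start": 0,  "duration": 100},
    "B": {"parent": "A",  "start": 5,  "duration": 40},
    "C": {"parent": "A",  "start": 10, "duration": 80},
    "D": {"parent": "B",  "start": 10, "duration": 20},
    "E": {"parent": "C",  "start": 15, "duration": 30},
    "F": {"parent": "C",  "start": 50, "duration": 10},
    "G": {"parent": "F",  "start": 62, "duration": 10},  # follows_from F
    "H": {"parent": "G",  "start": 74, "duration": 10},  # follows_from G
}

def depth(name: str) -> int:
    # Distance from the root span A, i.e. figure 1's tree depth.
    d = 0
    while spans[name]["parent"] is not None:
        name, d = spans[name]["parent"], d + 1
    return d

# Crude Gantt-style rendering of figure 2: one row per span,
# indented by tree depth, bars placed on a shared time axis.
for name, s in spans.items():
    bar = " " * s["start"] + "#" * s["duration"]
    print(f"{'  ' * depth(name)}{name:>2} |{bar}")
```

Rendering both views from one dataset mirrors what Jaeger's UI does: the DAG view shows who called whom, and the timeline view shows who spent the time.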
So now we'll quickly move on to the next section: a short demo that shows how Jaeger interacts with the application that we currently have in place. For the demo we'll be using two Docker container images: one has the Jaeger server running as the backend, and the other has the HotROD application. HotROD is basically a rides-on-demand service. I have both Docker images locally available on my machine. I'll start the container for the Jaeger backend first, and then I'll start the container for the HotROD application, which is based on microservices. You can see that both servers are up and running at their respective addresses: port 16686 is for the Jaeger backend, and 8081 is for HotROD. This is how the URL for Jaeger looks, and we'll cover the Jaeger UI in much more detail later on. Meanwhile, we can quickly hop over to the HotROD rides-on-demand application. You can see there's a web client ID, 8102, which gets generated; it's a unique ID. For every request that comes in there's a unique client ID, which is basically mapped to the customer, and with this application we can order cabs. We'll select the Japanese Desserts customer, and we can see that a car has been allocated to us and should be arriving in two minutes. There is a request, 8102-1, mapped to the client ID, and a latency of 728 milliseconds. This latency is the overall time it took from the moment I clicked on Japanese Desserts to book a cab: it did some backend processing and came back with the car that was selected. That's the overall time taken for the complete frontend and backend round trip. So we have made exactly one request, and this one request should now have been stored in the Jaeger backend.
So we'll head over to the frontend and refresh the page, and you can see that some data has been uploaded; we can now see seven services there. Now we'll see how Jaeger can help us follow the request as it flows along its path. We'll go to the system architecture, then the DAG, and we come to a graph which shows that the frontend goes to the driver service, the frontend makes another request to the customer service, and the frontend UI made a couple of calls to the route service. Overall there have been some 25 or 26 RPC calls, and they are all depicted in this diagram. You can see the frontend made one request to the driver service, which internally made 14 requests to the Redis server; then one request was made to the customer service, which ultimately landed in the MySQL server, and MySQL did its own query operation and gave back the results. This is how you can get the overall flow of how your request passes through our rides-on-demand application, which is based on a microservice architecture. It also shows the number of requests made: one for the driver, one for the customer, and ten for the route. We'll now go back to the main Jaeger UI and quickly show how it looks. Here you can see one request with 51 spans and three errors: one span for the customer service, one for the driver, 24 for the frontend, one for MySQL, and 14 for Redis. This is the overall summary of what happened when we tried to book a single cab ride. If you click on it, you see much more detail on how much time each individual request took to complete. You can see a "find nearest driver" span that I'm currently hovering over, and I'll go back to the first section, which is the frontend.
So when we click on the frontend, we see the first request goes to the customer service, and the second is made to find the nearest driver. When the customer endpoint was called, it ultimately landed in an SQL SELECT, because MySQL made a query to retrieve the data about the customer; similarly, the second request was made to find the nearest driver, so we get the details of the driver, and we can see the whole bunch of requests made for that particular service. These are some of the time frames it took for this service to complete; you can see a whole range starting from 31.5 milliseconds onward. If you want more detail on what happened in the SQL SELECT query, you can go ahead and click on it and it will show you much more detail. We'll get back to that; meanwhile, you can see all the different traces of all the different requests being traced out, and these timestamps can be used to find out what performance issues might have occurred if there is a lag in one of the requests. So we'll head to the SQL SELECT, where you can see the time taken by the SQL SELECT, or you can say the time it took for the processing; you just click on it, and it shows on the right-hand side that the duration was 291.3 milliseconds. The request shown there, 8102, is the same one we saw while booking a cab in the HotROD application UI, and it's the same one that appears in the SQL SELECT span's tags, along with the web client ID that we've been using.
If you go to the tags for the request, we can also see the exact SQL query that was used for the processing: it says SELECT * FROM customer WHERE customer_id=731. So we have all this extra information that we get while using traces, and it's not only the tags; the span also carries logs. If you go to the logs, you see that when an event has happened, it says something like "acquired lock with N transactions waiting behind". So it's not only about the traces and the events; it's also about the logs, where you get some more detailed information. Then we have the time graph, which helps us determine how much time each individual request has taken. There's also some information about the hostname, the IP address, and the Jaeger version used in the application, which helps us understand the overall behavior and architecture of the application as it was developed. In this particular application there was only one request made, but the more requests you make, the more data it will have, and the more useful it will be for you to understand everything that has been happening in your application. From the single request that we made came this whole bunch of requests, the 51 spans we've been talking about. The scatter plot that you can see has only one dot, which is the request we have just explained; if more requests are made, you will see a lot more dots, and you can explore any of those requests by just clicking on them. That's how performance is tracked through open tracing: we can go into any of the microservices, expand it, go deep into it, and try to determine what latency issues have been happening because of it. With that, I will quickly move on to the next
section that we have, which is the summary of all we have done. As part of this entire introduction to OpenTelemetry, we have seen how useful it is: by introducing open tracing as another attribute, another feature, we are able to identify the performance bottlenecks of our application. It's not only about the logs, not only about the events, not only about the metrics; it's also about how well everything works together when we incorporate observability into our application end to end. Like we've been saying, observability is not just a nice-to-have feature; it's like any other attribute that our system has. Just as we have usability and stability, and just as we want our system to be highly reliable and highly available, the system must similarly have the observability aspect covered properly. The goal of observability is to ensure that the services running in production are able to detect any unusual or undesirable behavior, such as errors or slow responses. It makes it easier for the developers and the SREs to build their distributed applications, because as applications get used across the different hybrid cloud formats, it's getting more and more difficult for developers to debug them. With this, we can easily locate where our application is spending more time and where the bottlenecks are that cause the latency to increase. It also helps us collect the errors and exceptions which are not otherwise handled by our application, so it's not only about the things we can anticipate; it also helps us locate some of the unknown behavior of our application that we were previously unaware of. Logs, events, and traces each serve their own unique purpose, and they are complementary to each other; none of them can replace another. They increase our visibility into our complicated distributed applications. We can add a log
at every major entry and exit point of a request, and a trace at every decision point of a request. It also makes sense to have all three semantically linked, so that at debugging time it becomes possible to reconstruct the path taken by the request. Such insights, obtained from the combination of the different observability signals, become a must-have feature for debugging the application. These are some of the references that you can use to explore more. It's an open-source project, so you can go ahead and try contributing to it. The demo that we have used is also available on GitHub under the OpenTracing repository, so you can go ahead and tweak it as you want; you can also try a sample demo to get a better understanding. And since it's open source, you can go ahead and raise your issues, have your own features built in, and contribute to it. With that, we'll conclude the presentation; that's what there is on open tracing in this presentation. Feel free to drop any question that you have. Thank you from Pius and Kiran. Hello, yes, this is Kiran. So there's one question: any thoughts on how compliance and auditing are handled with OpenTelemetry? For auditing, you can take care of it by storing the trail of the logs; you can probably use something like Elasticsearch, or a database where the responses can be stored, and then have a UI where the logs and responses can be viewed. And the compliance side can be built on top of the tracing API, just to make sure that the different services that take part in our application actually comply with the standards that we mentioned before. So that's how I can comment on some of those parts. Please let us know if there are any further questions; we are here to answer. All right, well, thank you very much for coming to this session and listening to us on this talk. We hope it was useful to you. Thank you very much.