Okay, hello everybody, welcome to this Enabling Observability with OpenTelemetry talk. My name is Mauricio, and I work as a software engineer at Kinvolk. This is my social network information, in case you want to reach out. Kinvolk is a small company based in Berlin. We offer services around Kubernetes, containers and process management, and also for Linux, both user space and kernel. Those are the company's social network handles and email address, in case you want to know more about it. In this presentation I will show you the concept of observability and give you some details about distributed tracing. I will give an introduction to OpenTelemetry and present how to use it for instrumenting an application. As part of that, I will present the OpenTelemetry API and explain what context propagation is. I will also show you how to use the instrumentation libraries and how to set up the different exporters. Then I will give you a demo that combines everything and shows how it all works together. Finally, I will show you the concept of automatic instrumentation and present a demo of that as well. So, let's get started with observability. Observability is a mechanism that allows us to answer the question: how is our system behaving? There is a formal definition that comes from control theory, which says that observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. That definition doesn't map directly to software, so for the software world we could say that observability is about getting information about the state of a system, without changing its code, by observing its outputs: traces, logs and metrics. So there is a system that generates some debugging information, and that information consists of the traces, logs and metrics. With this information we should be able to infer how the system is behaving.
Whether there is a problem or not, whether everything is going well. Observability is traditionally based on three pillars: logging, metrics and tracing. Logs are a series of timestamped event messages. I think most of us are familiar with that; the most typical example is a server that generates a file with the events that happened and when they happened. So with logging we are able to answer the question of what happened and when. This is the more traditional way of performing debugging. Metrics are a numerical representation of some data. They allow us to answer the question: how much? Examples of metrics could be the number of users connected to a system, the amount of RAM an application is using, the speed of a transfer and so on. Metrics are anything you can represent numerically. Tracing is the last of these three pillars. A trace is a representation of an end-to-end transaction. So tracing shows how a transaction is performed, how it is processed by all the systems from beginning to end, and it allows us to get a general overview of what happened. It allows us to answer the question: how did it happen? It offers this complete information about how a request is processed by the different systems. In this talk I will concentrate on tracing: I will present the distributed tracing concepts, and all the OpenTelemetry concepts I will be talking about will be specific to tracing. Okay, so what is distributed tracing? Distributed tracing is a mechanism that allows us to see how a transaction, a request, is processed from beginning to end. It shows the different systems that process the transaction. An example of a transaction could be a client opening a page in a web browser: the request goes to an HTTP server, then this HTTP server uses a database, and so on.
So distributed tracing allows us to get the whole picture of how a transaction is processed by the different systems. Distributed tracing gives information about the different services the request traverses, for instance the HTTP server, the database, an authentication service and so on. We get information about the time spent in each service; for instance, we can see which service is causing the biggest impact on the time required to process the request. We also have information about the parameters of a specific request, for instance the URL, the query performed against the DB and so on, and about the parameters of the service processing the request. Distributed tracing is composed of three different elements: the trace, the span and the attributes. A trace is the representation of the whole end-to-end transaction, and it is composed of one or more spans. Again, an example of this would be the client that opens our website and everything that happens from there. A span represents a single unit of work. It contains some metadata about the operation, as well as information about when the operation started and when it finished. An example of a span could be a query to the DB. The attributes are the information contained in the span, and they give the span its context. Let's say the span represents a generic operation, like a query to the DB; the attributes define what that operation actually is. For instance, in the attributes we can have the URL of an HTTP request, or the query that was performed against the DB, and so on. The attributes give the specific information about what the span is. This is an example of a trace. In this case we have a trace that is composed of four different spans. Generally, in trace representations, time goes from left to right, and we can see that span A represents the main operation.
It is the first operation in this trace, and we can see how span A invokes another operation, span B, which in turn invokes span C; after that, there is also a fourth operation, span D. So each span represents a different operation, and the whole trace represents the whole transaction. We can see here that the operation that takes the most time is span A, and we can see the relationships between the different operations: span A is the parent of span B, and span B is the parent of span C, which allows us to keep track of which operation is calling which. Okay, let's continue with OpenTelemetry. OpenTelemetry is a set of libraries, agents and other components that are used for generating and collecting telemetry. OpenTelemetry supports metrics, traces and logging; the logging part is still being developed. OpenTelemetry is the union of the OpenTracing and OpenCensus projects. As a bit of history, there were two similar projects with overlapping functionality, and they were creating some confusion because of their shared scope, so the communities of those projects decided to join forces and create a single new project. OpenTelemetry is going to be the next major release of OpenTracing and OpenCensus, which means there are not going to be any more releases of OpenTracing and OpenCensus; they are going to be deprecated once OpenTelemetry is available. So, why OpenTelemetry? There was no clear standard for observability: there were many different projects and tools, but none of them was a standard. OpenTelemetry tries to be a standard for observability, and it tries to do that by being an open source project; it is a sandbox project of the CNCF, and it is supported by most of the market-leading companies. OpenTelemetry also provides a vendor-neutral API, so it avoids the vendor lock-in problem.
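[Editor's note] To make the parent-child idea from the trace example above concrete, here is a tiny toy model in plain Python. This is an editorial sketch, not OpenTelemetry code; only the span names A-D come from the talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Toy model (NOT OpenTelemetry code): just enough structure to show how a
# trace is reconstructed from the parent/child links between spans.
@dataclass
class Span:
    name: str
    parent: Optional["Span"] = None
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str) -> "Span":
        # Starting an operation inside another one creates a child span.
        s = Span(name, parent=self)
        self.children.append(s)
        return s


def render(span: Span, depth: int = 0) -> str:
    """Render the trace tree, root span first, children indented."""
    lines = ["  " * depth + span.name]
    for c in span.children:
        lines.extend(render(c, depth + 1).splitlines())
    return "\n".join(lines)


# The structure from the slide: A calls B, B calls C, and A also calls D.
a = Span("span A")
b = a.child("span B")
c = b.child("span C")
d = a.child("span D")

print(render(a))
```

Walking the parent links in the other direction is exactly what a tracing backend does when it reassembles a trace from individually reported spans.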
The architecture of OpenTelemetry is composed of an API, an SDK, the exporters and the bridges. The API follows the OpenTelemetry specification; the specification is what defines how the API should work across the different language implementations. The API is the component that applications and third-party libraries using OpenTelemetry should interact with. We can see here that there is an application, there are some libraries, and those libraries interact with the OpenTelemetry API. So the OpenTelemetry API is what allows applications and libraries to generate telemetry data. It can be used without any implementation. What that means is that if you are using a library that has support for OpenTelemetry, but in your application you don't want to enable that support, the library should still continue to work. What happens is that the API falls back to a minimal implementation that does nothing: the library continues to make calls through the OpenTelemetry API, but those calls are no-ops. The SDK is the ready-to-use implementation provided by OpenTelemetry. So OpenTelemetry provides the API and a default SDK implementation, but different implementations are supported: if a company or anyone else wants to create a different implementation of the SDK, they can do that, and the API allows you to plug in different implementations. So here we can see that we can use either the SDK implementation or any other available implementation. The only thing people writing other implementations have to worry about is that they have to be compatible: they have to implement the full OpenTelemetry API. The SDK includes the concept of exporters.
Exporters are components for sending the traces, and also the metrics and the logs, to another system for processing and storage. The bridges are compatibility layers with OpenCensus and OpenTracing. These bridges exist to make the transition from OpenCensus and OpenTracing to OpenTelemetry easier: if you already have an application that is using OpenCensus or OpenTracing, you can plug that application into a bridge and the application will be using OpenTelemetry. What that means is that you don't have to modify the instrumentation in your application; you can continue to use your application almost without any modification. The architecture of OpenTelemetry is designed with a separation of concerns. Library developers only have to depend on the API, so they only have to worry about importing and using the API. Application developers, on the other hand, have to worry about importing and using the API and also an implementation. What is important here is that library developers don't know anything about the actual implementation; they only deal with the API, and it is the application that chooses which implementation to use. Also, the different monitoring vendors maintain their own exporters. They don't have to worry about the API and so on; there is a clearly defined interface for the exporters, so they only have to worry about implementing their exporter. Okay, so what is the status of the OpenTelemetry project? It is currently in beta, and it is going to be generally available quite soon; by the end of the year it should reach that milestone. Once OpenTelemetry is GA, OpenCensus and OpenTracing will be deprecated, so there will be no new features there and they will be in deprecation mode. The SDK is implemented for the most important programming languages; you can see the list there. We have support for many of the existing programming languages.
The support in some programming languages is more mature than in others, but the goal is to reach general availability by the end of this year. There are also exporters implemented for most of the vendors out there. Okay, so what does instrumenting an application or instrumenting a library mean? When somebody says they are instrumenting an application, it means they are using OpenTelemetry to generate telemetry data in that application or library. So what do you need to understand to generate telemetry data with OpenTelemetry? You need to understand the OpenTelemetry API, the context propagation concept, what the instrumentation libraries are, and the exporters. The last two are only important if you are instrumenting an application. I will walk you through these concepts. Let's start with the OpenTelemetry API. The OpenTelemetry API is the piece that allows us to generate telemetry data in the libraries and in the applications. The main concepts of the OpenTelemetry API regarding tracing are the tracer provider, the tracer and the span. The tracer provider is an object that allows us to get a tracer, and the tracer is another object that allows us to create spans. Once we create a span, we have a span object, and this object allows us to start the span. Starting the span means that an operation is starting: when we call start on a span, we are capturing the start time of that operation. Once we have started the span, we can set some attributes on it. The attributes are information about the specific operation being executed, for instance the URL of an HTTP request. The events are also information about what is happening while executing that operation. The difference between events and attributes is that events are timestamped, so you can see exactly when an event happened, whereas attributes don't have a timestamp.
The other operation you can perform on a span is to end it: when you end a span you save the end timestamp, which is to say that the operation is complete. This is just a quick summary; you can find the full list of the different elements and their documentation in the OpenTelemetry specification. So this is an example of how the OpenTelemetry API can be used, in this case in Python. The first thing we have to do is create an instance of a tracer. Once we have the tracer instance, we can create a span; for instance, we are going to create this my-operation span here. We start the span, and then we can set some attributes on it; this is just an example. And here is the actual code that performs the operation. After the operation is complete, we end the span. What is important to notice here is that the span is started before doing the operation and ended after the operation is done, so we are able to measure the duration of the operation. Okay, let's continue with context propagation. Context propagation is the mechanism that keeps track of the currently active span. When you have an operation and you start a span, this span should be marked as the active span. What that means is that if you create another span, the new span should be a child of the already existing span. This gives us the parent-child relationship, and we are able to reconstruct the full trace based on the information contained in the spans. There are two different kinds of context propagation: the local context is used when the operations are performed in the same process, and it can be either manual or automatic; the distributed context is used when the operations are performed in different applications, in different processes.
For the local context, there is the manual case. In this case we have an object that holds the current span, and it is the responsibility of the user to pass this object around to the different functions being called. This approach is used in languages that don't support the automatic context, and we can see an example here. We have two different operations, foo and bar. In the foo operation we create an instance of the context, and when we start a span we pass the context to that call. What happens here is that this call returns a new context that contains the foo span as active. Then, when there is a call to the bar operation, we pass the context along to that operation. In the bar operation we start another span, and in this case we also pass the context. What happens here is that the implementation of this call understands that the context has foo as the active span, so it records that bar is a child of foo, and later on we are able to reconstruct the full trace using this parent-child relationship. The other case is the automatic context. In this case the context is implicit and handled by OpenTelemetry, which means the user doesn't have to worry about passing it around. Unfortunately, this is only available in some languages, because it requires some special language features. It is implemented as something like a global variable that is local and independent for each execution unit: if you think of an application with different threads, each thread should have its own context. Different languages have features to support this, and Python is one of them. So this is the same example as before, but in this case you can see that no context is created or passed around.
We start the foo span, and inside it we call the bar function, but we are not passing the context. What happens here is that when the foo span is created and started, it is automatically marked as active by OpenTelemetry. So when the bar span is created, OpenTelemetry already knows that there is another span that is active, and it records that foo is the parent of bar. Everything is done automatically; the user doesn't have to worry about it. What is nice about this approach is that you don't have to modify the function signatures to add this parameter; let me go back a bit: in the manual case, on the other hand, you do have to modify the functions to include the context parameter. The other kind of context is the distributed context. In this case, even if the operations are performed in different applications, they have to be aware of the currently active span in each other. There are different implementations of the distributed context depending on the transport protocol being used: for instance, if you are using HTTP, the context is sent over the HTTP headers, and there are different specifications for that; if you are using gRPC, there are implementations for that too, and so on. Users can configure which specification, which implementation, they want to use. So, to compare the two: in the local case, the context is saved in a variable that lives in memory; for the distributed context, we have to send this context over the wire, maybe over HTTP or gRPC, to share it with another process, and this is done by these different implementations. The OpenTelemetry instrumentation libraries do this automatically. What that means is that if you are using an HTTP library that is compatible with OpenTelemetry, that HTTP library is going to take care of this and propagate the context automatically.
On the other hand, if you are using an HTTP library that is not compatible with OpenTelemetry, you have to take care of propagating the context yourself. Okay, so we can see here how it works. There is a client performing a foo operation, which creates a foo span before calling the server, and there is a server with a bar operation, which creates a bar span. So the question is: the client is going to call this bar function, so how do we relate these two spans? How do we record that foo is the parent of the bar span? It works like this: when the client performs the request to the server, the information about the current span is included in the headers of that request. In this case we include the trace ID and the span ID, so the server is able to understand that this operation is part of this trace and that the parent of bar is this span. Then, after the operations are completed, both the client and the server report their spans to a remote collector, and the collector is able to reconstruct the full trace based on the information provided by the client and the server. In this case, the collector understands that foo is the parent of bar and is able to construct the trace.
The other thing you have to worry about when instrumenting an application is the instrumentation libraries. Unfortunately, not all libraries have built-in support for OpenTelemetry: OpenTelemetry is a new project and many libraries don't support it yet, and there are also some libraries that will never support it. So there is this concept of an instrumentation library: an instrumentation library is a wrapper around another library that makes it generate telemetry data. The OpenTelemetry community is trying to implement instrumentation libraries for the most popular libraries in the different languages, so users are able to get tracing information from those libraries. Basically, what an instrumentation library does is wrap the original library and start the spans before invoking the original implementation. This is a Python example. These instrumentation libraries are available as packages in the different languages. For instance, we are going to use requests, which is a library for performing HTTP requests, and there is an instrumentation library for it called opentelemetry-instrumentation-requests. We import the instrumentor for that library and then we instrument it; here we enable the instrumentation in the requests library. Then, if we perform any operation using that library, we will get telemetry data coming from those operations. Okay, so I have already shown you how to use the OpenTelemetry API to create and start spans, and how to use the instrumentation libraries to enable instrumentation in third-party libraries. So we have all these spans, all this information, but somehow we have to export it to a remote system for processing and analysis. The exporters are the components that send the traces to a remote system for processing, analysis or storage. The OpenTelemetry specification requires the different implementations to have support for Jaeger and the Open
Telemetry Collector. Those are the exporters that are mandatory, but of course there are many other exporters available as vendor-specific packages: there are exporters for Azure, GCP, Datadog, Lightstep and so on. Most of the important companies have support for this, and this is really amazing about OpenTelemetry: there is already compatibility with many different vendors. If you are using one vendor and you want to change vendors, the only thing you have to do is change the exporter. What I was saying about vendor lock-in is exactly this: if you already have the instrumentation done with OpenTelemetry, you can change the vendor just by swapping the exporter; you don't have to worry about changing the instrumentation code and so on. This is an example, in Python, of how to use the Jaeger exporter. We import the Jaeger exporter, and we import some glue code called a span processor. We create an instance of the Jaeger exporter; in this case we only need to pass a service name, and the host name and port where the agent is running. If you are using a paid service, you will probably also have to pass a token here; it depends on each particular exporter. Then, once we have the exporter instance, we attach it to OpenTelemetry using a span processor. Okay, let me show you a demonstration of how to instrument an application using OpenTelemetry. I am going to show you a simple demo composed of two different applications: a client application and a server application. What this application does is: the user passes a country, and the application replies with the continent that country is in. So the client sends the request to an HTTP server, and the HTTP server uses a database to get this information. The only important point about this architecture is that I want to show you how the different components interact together, and how to enable the instrumentation libraries for the
HTTP client and HTTP server libraries, and also for the DB library. Okay, so let me switch to the demo. I have Jaeger running locally; if we go here, there are no traces yet. Let me show you the code of the application first. This is the client application. Here we have the imports for the OpenTelemetry components: here we are importing the OpenTelemetry API, and here we are importing the SDK. We are also performing the imports related to the exporters, and we are importing the different instrumentation libraries here. In the client we are only going to use the requests library, so we import the instrumentor for that library. Here we set up the OpenTelemetry framework. The first thing we have to do is tell the API which SDK implementation we are going to use; in this case we are going to use the standard, default OpenTelemetry SDK implementation. We create the Jaeger exporter instance, we connect it to OpenTelemetry, and then we enable the instrumentation in the requests library. Later we get an instance of a tracer, and that is all the code needed to set up OpenTelemetry. Here is the actual code of the application. What we do here is create a new span that we call query-http-service; we create the span and then we perform the request to the server. For those who are not familiar with Python: once this code block ends, the span is automatically ended, so we don't have to worry about calling the start and end functions; this is done automatically by the OpenTelemetry Python implementation. The server code is more or less the same. We have the same imports for the API and for the SDK, and also for the exporter. What is different here is that we have imports for the libraries used in the server: we have the Flask library, which is an HTTP server, and we have the library to connect to the database. All the setup is the same; in this case we call instrument to enable the instrumentation for
these two libraries too. Here we create an instance of the HTTP server, here we create a connection to the DB, and this is the code of the application. This is the function that handles the HTTP request: we create a new span, and in this case we set an attribute on the span; we save the country the user is asking about in the span. Then we execute the query against the DB and return the result to the user. So let's take a look at how it works. I'm going to start the server; the server is running, and I'm going to perform some calls to it. Let's do this with Colombia, Italy and so on. I performed a couple of different calls, so let's go to Jaeger. If we refresh the page and click here, we can see that we now have traces for three different services. For instance, we select the client service, and here we have the different traces. Each of these traces represents an operation; in this case, each one is for a different country. If we go here, we can see the full representation of the transaction. We can see that one color refers to the client and the other to the server, and we can see the different spans here. For instance, this query-http-service span is the one we were generating manually, and we can see here some information about the span; some of this information is automatically added by OpenTelemetry. This HTTP GET span is the one created automatically by the requests library, because we enabled the instrumentation for that library; in this case we can get information about the HTTP request itself, for instance the method used, the URL, the status code and so on. If we go to this query-db span, we can see the country attribute we set, so we can see how this information is exported to Jaeger. Another useful thing is that we can see the time the different operations took: we can see that the full operation took something like 14 milliseconds, and we can see that getting the
continent from the server took 5 milliseconds, the query-db span took 1 millisecond, and the internal query in the DB also took almost 1 millisecond. So let me switch back to the presentation and tell you about automatic instrumentation. It is not possible to have all applications and all libraries instrumented. One reason is that old applications don't include support for OpenTelemetry: there are old applications that are not updated anymore, so they don't support OpenTelemetry. There is also the case of new applications that don't support OpenTelemetry because the developer decided not to support it. And the third, maybe the most compelling reason, is that it is difficult and costly to add manual instrumentation to a large application: you have to go through the whole code of the application to create the instrumentation, which is very time consuming and costly. So the idea of automatic instrumentation is to do this automatically instead of manually. How? If we instrument the most popular libraries in the different programming languages, and we enable the instrumentation of those libraries at run time automatically, the application will generate some telemetry coming from those libraries. The idea is that there is no instrumentation code in the application itself, but there is instrumentation code that is enabled automatically in the libraries the application is using, so we are able to get some telemetry data from those libraries. OpenTelemetry has this concept of agents. An agent is an application that automatically enables instrumentation on the libraries at run time. It is an application that is executed before the real application, and depending on the programming language it uses different techniques to enable the instrumentation at run time. For instance, in Python we have opentelemetry-instrument, which is the agent in Python, and then we pass it the
program. So in this case the agent is going to enable the instrumentation for the libraries the program uses, and the program will generate some telemetry data. Let me show you a demo of that. This is the same code as the previous application, but in this case there is no instrumentation in it. The only thing we have to do here is configure the exporters; we still have to configure the exporters in the code of the application. This, for sure, is not the ideal solution, but it is something we are working on improving, so in the future you should be able to do this without even configuring the exporter in the code of the application. If you look here, the only thing we are doing related to OpenTelemetry is configuring the exporter, and we don't have any span information here: we are not creating any spans, nor enabling the instrumentation for the libraries we are using. The same happens in the server case: we are only configuring the exporters, but we are not creating any spans. So let me show that. I'm going to switch to another folder with the code of the application without instrumentation, and I'm going to execute it. It is executed almost in the same way; the only thing I have to do is prepend this opentelemetry-instrument command. There is a typo here; here we go. Okay, so what happens here is that this command is enabling the instrumentation libraries that my server is importing and using. Let me do the same here once more. Okay, let's go to Jaeger. If we refresh the Jaeger page, we are able to see that there are three new traces that were generated seconds ago. If we go to one of the traces, we are able to get a general overview of what is going on. What is the difference? In this case we have fewer spans: in the previous demo we had five, and in this one we have three, because in this case the application itself is not generating any
spans; all of the spans are coming from the libraries that the application is using. So anyway, we are able to get some information about what is going on and what operations are performed by those libraries. This is just the same as before; we have the same information as before. The interesting point here is that we didn't have to include any instrumentation in the application, and we are still able to get information from the libraries. Okay, so if you want to get more information about OpenTelemetry, observability and instrumentation, you can go to these links; I will share a link to these slides. I think this is all. Thank you very much for your attention. If you want to get these slides, they are available at this link, and if you want to get the code for the demonstrations, it is also available in this repo. Thank you.