So, let's start the last day of the conference. Remember to rate the sessions that you like; you can do that on the website or through the mobile application. And remember that at the end of the day we'll have a closing session with some prizes, so don't run away; we'll announce it at the end. And now we're going to start the first presentation of the day in this room, in the microservices track, with Pavol Loffay on distributed tracing and OpenTracing.
So, hello everybody, my name is Pavol Loffay, and my presentation is going to be about keeping distributed applications under control. But first I would like to start with who I am. Two years ago, I joined Red Hat as an intern. I was working on my master's thesis, which was about a time series prediction engine for the Hawkular project. Hawkular is known as the Swiss Army knife of monitoring solutions. Why a Swiss Army knife? Because it consists of several modules that can be used independently, like alerting, a time series database, and so on. After graduation, literally the next day, I took a full-time position in the same team, but on a different sub-project. The sub-project is Hawkular APM, which stands for Application Performance Management. That is when I was introduced to distributed tracing. There is a link to my GitHub account, where you will find the demo application for my presentation, and also the slides after the presentation.
So, what is on today's agenda? I would like to talk about why we should care about distributed tracing and why it is important, then talk about the concepts and terminology of OpenTracing, and continue with a demo and live coding about best practices for instrumenting your applications. At the very end, I will tell you how to start with OpenTracing and what the ways to instrument your applications are. So, why tracing? Why should we do tracing?
We used to have these big monolithic applications, which were easy to monitor, because you were just collecting some logs and metrics, like how many records you have, and so on. It was pretty easy. But nowadays the trend is to split this big application into small ones, which run in separate processes, in separate images, and on separate machines. So it's pretty hard to correlate logs from all the machines and all the processes in one place and see what is happening in your environment. You can also SSH to all the machines, download the logs, and figure out what is happening, but that's not a good or easy way to do it. So the question is how your monitoring can tell the story of what is happening in your system, and OpenTracing should solve this. With OpenTracing and distributed tracing you can see all the logs in one place, in one system. And yes, we heard yesterday that we love distributed systems, but they break, and they break often. So having a tool to debug them is a key point.
So, why OpenTracing? Who of you has heard of Zipkin? And who of you has heard of OpenTracing? And who of you would like to start with tracing in production? Yeah, that's cool. So, why OpenTracing? Well, first of all, it's a young but fast-growing project. It started in 2015. The API for instrumentation is defined in eight languages. You can see Go, Java, JavaScript, Python, Ruby. The first version was announced in 2016. It is also important to mention that it joined the Cloud Native Computing Foundation.
You are familiar with Zipkin, right? It's a big project. I like Zipkin. It has a very big ecosystem, and you can use it in many languages. But there is a problem when you use Zipkin: you are locked in to one specific solution. You cannot change that. OpenTracing, in contrast, is just an API to instrument your applications.
So, if you would like to start, and for example you have your applications written in Java, Python, and JavaScript, you will spend a lot of time on instrumentation. So it's important to be able to switch between backends, so that you are not locked in to one specific solution.
So, again, why OpenTracing? There are three important points. First, it's explicit. Monkey patching wouldn't work well, because you could change a bit of code or update the version of a dependency and the instrumentation would break. The second point is that it's decoupled. As I said, it's vendor-neutral, so you can change between multiple implementations in basically one line of code. The third point is that it's consistent. It means that it uses the same words and nouns across all the different languages. So you use the same terminology in Python as in Java; not the same API, but the same terminology. It does not mean, however, that instrumentation code in Java looks the same as in JavaScript. OpenTracing tries to honor the idioms of each language.
So, what do we have in terms of open source implementations? I'm working on the Hawkular APM project. We currently support Java and JavaScript, and more libraries are coming soon. There is Zipkin, which supports Java and Go. There is Jaeger, which is a very similar project to Zipkin because they use more or less the same model, so you can use Jaeger with Zipkin in some way. There is also Appdash for Go, Python, and Ruby, and additional smaller projects, mostly written in Go and supporting only Go. What is important to say is that you cannot take, for example, the Go instrumentation from Appdash and use it with Zipkin. It wouldn't work. You have to use instrumentation libraries from the same provider across your environment. So if you use the instrumentation library from Hawkular APM, you cannot send the data to Zipkin, because the model is different.
I would like to explain the concepts of distributed tracing.
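The "decoupled" point above can be illustrated with a small sketch. This is a toy model in plain Java, not the OpenTracing API: `ToyTracer`, `BackendA`, and `BackendB` are hypothetical names invented here. It only shows the idea that application code depends on an interface, so switching the tracing backend touches one line.

```java
import java.util.ArrayList;
import java.util.List;

class SwapDemo {
    // Toy stand-in for a vendor-neutral tracing API (not the real OpenTracing interface).
    interface ToyTracer {
        void record(String operationName);
        List<String> recorded();
    }

    // Two hypothetical backends behind the same interface.
    static class BackendA implements ToyTracer {
        private final List<String> out = new ArrayList<>();
        public void record(String operationName) { out.add("A:" + operationName); }
        public List<String> recorded() { return out; }
    }

    static class BackendB implements ToyTracer {
        private final List<String> out = new ArrayList<>();
        public void record(String operationName) { out.add("B:" + operationName); }
        public List<String> recorded() { return out; }
    }

    // Application code sees only the interface, never a concrete backend.
    static String getEmail(ToyTracer tracer, String username) {
        tracer.record("getEmail");
        return username + "@example.com";
    }

    public static void main(String[] args) {
        ToyTracer tracer = new BackendA(); // the only line to change when switching backends
        System.out.println(getEmail(tracer, "pavol"));
        System.out.println(tracer.recorded());
    }
}
```

Replacing `new BackendA()` with `new BackendB()` changes where the data goes without touching `getEmail`.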
There is this key thing which is called a span, and a span basically represents one unit of work. What is important is that you have to start a span and you have to end it. That implies there is some duration, so you are able to calculate the duration of some method invocation in your system. You can also add some data to it with tags, and you can add logs. Logs are different from tags because they have a timestamp, so you know when something happened. A tag is basically just a key-value pair of strings.
So it's cool to have a span, because you can calculate the duration of one method. But it's not enough, so we have a trace. A trace is basically a list of spans, because usually you have more than one operation in your system. As I said, a trace is a list of spans, and it has a tree-like structure. It's a directed acyclic graph, as you can see in the picture. The edges represent causal references between spans. You can see that span B and span C are children of span A, and span X follows from span F. There is a difference between child-of and follows-from. For instance, with follows-from, span X can start after span F finishes. So you can see the semantics of the operations, which is a pretty cool thing. Currently, OpenTracing defines only child-of and follows-from, but you can define your own references.
So, let's build a trace. You have a bunch of happy users calling your public API in Node.js. The first request comes in, and you have to start a span. So you generate some ID, let's say in this case it's 1, and also a trace ID, because you would like to identify all the spans for this request. The request goes out to Java, so we generate the next span. Then the request arrives in Java, so we generate the next span. What is the difference between span 2 and span 4 in terms of duration? Well, span 2 also contains the network latency, so its duration is longer.
Then we would also like to call our Python process, so we generate the next span and the next span. The question is, how is this information represented in a tracing system UI? I think the best way to do it is to use a kind of timeline view, because you can see a lot of information on this graph. Each line represents one span, so this first one is this one, and it's usually the longest one, but it does not have to be. There is some high-level information, like how many services were called, how many spans were generated, and what the total duration is. You can see that we were calling Java from JavaScript and also Python from JavaScript. And you can see it happened at the same time, which means it happened in parallel. By the way, in Python there were some more operations to fulfill this business rule, so we had to generate two more spans, and you can see these two spans were done serially. So it shows you what is happening in your system in terms of time. You can also see that for the request from JavaScript to Java the network latency is this one, but from Java to JavaScript it is a bit longer.
So let's jump into the OpenTracing API. We have said that a span is represented by some ID and a trace ID. This is encapsulated in the span context, which is, I would say, implementation-specific, because in Hawkular we store different IDs than in Zipkin, so you cannot read these span IDs. However, what you can do is set some baggage on a span, and this baggage is propagated to the children. So, for example, in your UI you set something as baggage, and it will be propagated to your backend service, which is a pretty cool and powerful feature.
When there is a span, you can set a key-value tag. You can log data, which is basically like a tag, but with a timestamp. You can set and read baggage. You can then read tags and logs. And as the last operation, you have to finish the span.
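The span concepts described so far (a start and a finish that give a duration, tags, timestamped logs, a child-of reference, and baggage that flows to children) can be sketched with a toy model. This is plain Java written for illustration only; the class and field names are assumptions and do not match the OpenTracing Java API. Timestamps are passed in by hand so the example stays deterministic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ToySpans {
    // Toy span model; names are illustrative and not the OpenTracing API.
    static class Span {
        final String traceId;        // shared by every span of one request
        final long spanId;
        final Long parentId;         // null for the root, otherwise a "child of" reference
        final String operationName;
        final long startMicros;
        long finishMicros = -1;      // -1 until finish() is called
        final Map<String, String> tags = new LinkedHashMap<>();
        final List<String> logs = new ArrayList<>();
        final Map<String, String> baggage = new LinkedHashMap<>();

        Span(String traceId, long spanId, Span parent, String operationName, long startMicros) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentId = (parent == null) ? null : Long.valueOf(parent.spanId);
            this.operationName = operationName;
            this.startMicros = startMicros;
            if (parent != null) {
                baggage.putAll(parent.baggage); // baggage set on the parent reaches its children
            }
        }

        // A log is like a tag, but it carries a timestamp.
        void log(long timestampMicros, String event) { logs.add(timestampMicros + " " + event); }

        void finish(long timestampMicros) { finishMicros = timestampMicros; }

        long durationMicros() { return finishMicros - startMicros; }
    }

    public static void main(String[] args) {
        Span root = new Span("trace-1", 1, null, "GET /email", 0);
        root.baggage.put("user", "pavol");           // set before the child is created
        Span child = new Span("trace-1", 2, root, "db-query", 100);
        child.tags.put("db.type", "sql");
        child.log(150, "query sent");
        child.finish(400);
        root.finish(1000);
        System.out.println(child.operationName + " took " + child.durationMicros()
                + " us, baggage=" + child.baggage);
    }
}
```

In this sketch the child copies the parent's baggage at creation time; a real tracer carries baggage inside the span context, including across process boundaries.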
When the span is finished, it is sent to the tracing system. The last important interface is the tracer, and it's the key point for starting spans. You just have to provide an operation name. Then there are extract and inject. We have said that the span context is propagated between services, right? But we have to have a way to extract the span context from a request. So you basically provide a carrier, for example HTTP headers, and it extracts the span context. When you are sending a request, you would like to inject that context into the carrier. So you call tracer.inject with the context and the HTTP headers.
So, talk is cheap, show me some code, right? I will show you a simple JAX-RS application. I hope you are familiar with JAX-RS and Java. This is a very simple application. Let's look at the dependencies. We can see the OpenTracing API, then the Hawkular APM implementation with a REST recorder, the Zipkin implementation, and a Zipkin recorder. We are going to instrument this REST endpoint. It basically just returns an email for a given username. As the first thing, we should start a span, because we would like to get the duration of this invocation. So let's call tracer.buildSpan. Now we have to provide an operation name. Let's go with the method name, so it's getEmail. And then there is a parameter, so we can add the parameter. And now we can start the span. As the last thing, we have to finish the span. So now we will get the duration of this method invocation. But we can also add some important data to the span. So we can call Tags. For example, let's add the HTTP status code: we call set, we provide the span, and in this case we are returning 200, so let's put 200. You can also add a URL and more information.
Can anybody tell me what is wrong with the operation name? Imagine you are calling this method with getEmail and then, for example, my name, Pavol. Then you are calling this method with getEmail and, for example, Joseph.
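The inject and extract steps described above can be sketched with a plain map standing in for HTTP headers. This is a toy model, not the OpenTracing API: the `Context` class and the header names `x-toy-trace-id` and `x-toy-span-id` are invented for this example, and each real tracer defines its own wire format.

```java
import java.util.HashMap;
import java.util.Map;

class ToyPropagation {
    // Toy span context: only the two IDs. Real tracers keep implementation-specific data here.
    static class Context {
        final String traceId;
        final String spanId;
        Context(String traceId, String spanId) {
            this.traceId = traceId;
            this.spanId = spanId;
        }
    }

    // Header names invented for this sketch.
    static final String TRACE_HEADER = "x-toy-trace-id";
    static final String SPAN_HEADER = "x-toy-span-id";

    // Client side: write the context into the outgoing carrier (here, HTTP headers).
    static void inject(Context ctx, Map<String, String> headers) {
        headers.put(TRACE_HEADER, ctx.traceId);
        headers.put(SPAN_HEADER, ctx.spanId);
    }

    // Server side: rebuild the context from the incoming carrier, or null if none was sent.
    static Context extract(Map<String, String> headers) {
        String traceId = headers.get(TRACE_HEADER);
        String spanId = headers.get(SPAN_HEADER);
        if (traceId == null || spanId == null) {
            return null;
        }
        return new Context(traceId, spanId);
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        inject(new Context("trace-1", "2"), headers);   // caller, before sending the request
        Context parent = extract(headers);              // callee, on receiving the request
        System.out.println("continue trace " + parent.traceId
                + " as child of span " + parent.spanId);
    }
}
```

The callee would use the extracted context as the parent reference for the span it starts, which is how spans from separate processes end up in one trace.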
So we would have two spans with different operation names representing one method invocation, right? That is wrong, because you wouldn't be able to find this span in your tracing system. So how do we fix that? We can do something like this, which is good, but if you use it in every REST method, you would get a lot of spans with the same GET name, which is also not good. So the best way to fix it is to use a unique operation name that can represent only this span. So, for example, getEmail with something like this. And this would work.
Can anybody tell me what is wrong with this one right now? Will we get the right timing data, in particular in JAX-RS? Can anything go wrong? "Well, that wouldn't take so long." It can take long. In Java we have, for example, filters, which can access your database and call other services, and that can cause latency problems. So imagine that a filter is calling the database, which takes, I don't know, one second, and this invocation takes three milliseconds. You would see three milliseconds in your tracing system, but in reality the duration is much longer, because there is this filter calling the database. To solve this, it's better to implement a tracing filter and put it first in the invocation chain.
So, okay, we have instrumented this method, and there is some business logic. It's this method with the span context and the username, which basically waits for some random time, starts the next span, finishes it, and returns just some string.
So let's go to the command line and start Zipkin, and also start Hawkular APM. Now let's go to the configuration. You can see that this REST handler expects a tracer, which is just the OpenTracing interface. It's not any specific implementation. And in the configuration, I have prepared two options.
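The operation-name problem discussed above can be made concrete with a small count. This is a plain Java sketch with assumed names: the template `getEmail/{username}` is a hypothetical example of a stable name, not the name used in the demo. It shows that putting the username into the operation name yields one name per user, while a fixed template groups all invocations of the method under one name.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class OperationNames {
    // Counts spans per operation name for a batch of getEmail requests.
    // nameWithUser=true puts the username into the name (the mistake discussed above).
    static Map<String, Integer> countByName(List<String> usernames, boolean nameWithUser) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String user : usernames) {
            String name = nameWithUser ? "getEmail/" + user : "getEmail/{username}";
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> users = List.of("pavol", "joseph", "pavol");
        System.out.println("name includes user: " + countByName(users, true));
        System.out.println("fixed template:     " + countByName(users, false));
    }
}
```

With the fixed template, the username would be recorded as a tag on the span instead, so it stays searchable without fragmenting the operation name.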
We can go with Zipkin and create a Zipkin tracer with Brave, which is basically the Zipkin instrumentation for Java, or we can construct the Hawkular tracer and provide a username, password, URL, and service name. We have seen on the chart with the spans that there is a service name, but in OpenTracing there is no such concept. However, a lot of systems implement it, and you have to provide it when you are constructing your tracer. For example, in this case, for the APM tracer there is some deployment metadata: this is the service name, and this is basically the build number.
So, okay, let's start this application. As you can see, we are using the Zipkin implementation. Let's generate some requests. So it returns. Okay, let's go to the Zipkin UI. This is basically the service name. Let's find traces. We can see we were calling our API three times and it generated two spans. Let's look at this one. We can see this one is for the REST method instrumentation, and this one is for the database instrumentation.
Now I will change the backend. Let's switch to Hawkular. We have to restart and generate some requests. We can see it's a different UI. We currently don't have a timeline view, because we are focusing more on high-level aggregation of data. But you can still navigate to the trace instances and see what is happening in your system. We show you the duration, and you can see all the tags.
Okay. As you can see in this code, if you are going to instrument your business logic, you have to propagate the parent span context, which is not so cool. There will probably be some changes in the OpenTracing standard API, so that it will find the parent context for you and you don't have to propagate it manually.
Okay, I will go back to the slides. So, instrumentation. What we have seen is explicit instrumentation, where we call the OpenTracing API directly. But there are also framework integrations, for example a JAX-RS server filter and so on.
The third option is to use something like a Java agent, where you run some code before the actual application is loaded into the JVM. It's kind of cool because you don't have to change your application code. But I think it's the hardest thing to maintain and to develop. I would recommend going with explicit instrumentation together with framework integrations, because it's not trivial to properly instrument, for example, JAX-RS or Express in JavaScript.
So, to conclude, we have talked about many things, but remember that instrumentation is hard. If you would like to do it, then go with OpenTracing, because you are not locked in to one specific solution. The other thing is to start spans at the right places, because there can be some code, a filter for example, which is invoked before you start your span, and that code may be causing your latency problems. So this was all. If you have any questions, I'm happy to answer them.
Yes, there is a Java filter, JAX-RS, Spring Boot, but we are going to do a lot more.
Exactly. And you will also be able to access that parent span, so you can instrument your business logic. It's a relatively new project, but this instrumentation will come this year.
Exactly. You provide the implementation, yes, and a filter. Any other questions?
Is there some way to... In Zipkin? That's a good question. In Zipkin, there is this annotation query, where you can put a tag's name, and I think it takes a value, and it would find the spans matching this query.
Does it... Yes. In Zipkin, there is no such way to do it. In Hawkular, we have some high-level... I will start a microservice application which consists of four different applications and generate some requests. So in Hawkular APM, we do some high-level aggregations, so you can see, for example, the average response time for each application. You can see, for example, your database duration and this kind of stuff. But you have to write a proper query and provide some properties saying which transactions you are interested in.
It won't show you, like, the exact... "This is the problem in your infrastructure", yeah? Uh-huh. [inaudible] I think here is the duration, so you can specify the duration and you will see the longest spans. In Hawkular APM we also provide integration with alerts, so you can alert on a specific duration. So if some operation is taking long, you would get notified. Other questions? Thank you. I wasn't looking at you, so... Yeah, okay. Thank you.