Okay. Hi, guys. So, I'm going to start so we can finish and go eat lunch. My name is Natasha. I'm really happy to be here. Today I'll talk about context propagation in OpenTelemetry, and I'll show some examples that hopefully will be helpful to you. A little bit about myself: I'm a senior software engineer at a really cool startup called Helios, more on that in a second. I was part of the core team, and I joined two years ago. Before that, I was the data team lead at Oribi, which was acquired by LinkedIn. A little bit about Helios: we have a really cool product that is based on OpenTelemetry, which I will talk about today. Basically, what we give you as a developer is end-to-end visibility into your distributed system, so you can find, troubleshoot, and fix your issues. If you want your systems to look like that, you're more than welcome to check us out. So, today I'll talk about distributed tracing, like I said, and its implementation in OpenTelemetry. I'll start with a basic example, and I'll move on to some more complex examples that hopefully will help you understand a bit better how to do this properly. I've found that a lot of the time, the getting-started examples are lacking, and they can leave us as developers thinking: okay, this is very basic, but how do I take that and apply it to my more complex system? I'll end with a real-world example that I hope you'll agree is really complex and not at all trivial. I hope that by the end of this talk you'll have the energy to implement distributed tracing in your own system. So, I'll start my talk with a story about Maya. Maya is a software engineer. She's working at a really cool startup. Her team is working with microservices, as is very popular, and I'm sure all of you are too. So, this is an overview of her system.
You can see that there are multiple microservices: you have HTTP communication, you have communication via the Kafka messaging system, where one service writes messages and another consumes them, you have writing to and reading from a document DB, and then a final service that has some sort of interaction with a Kubernetes cluster. More on that later. So, on the last day of the sprint, Maya starts working on a bug in her system where one of the flows is broken and data is not being saved to the database. She moves the ticket to in progress and starts investigating. She knows her system pretty well, and she knows that in a distributed system composed of multiple microservices, the first thing you want to do when you troubleshoot is to find where the flow is broken. So, she starts looking at some logs. That is not very helpful, because the logs don't contain any indicative errors that can assist her. At this point, she's not really sure how to proceed. She's already annoyed because it's 12:30 and the food is nowhere in sight. But she also understands that because the logs are specific to each of the services, it's very hard to connect the dots and see something coherent across the whole flow. So, at this point she doesn't really know what to do, and she's very, very frustrated, like I said. And then she remembers that last year she went to KubeCon and heard a talk about distributed tracing. She searches on Google and sees that OpenTelemetry is a project that is an implementation of distributed tracing. She sees that it's the second largest open source project after Kubernetes, with a lot of big companies invested in it, which makes it a real industry standard.
She goes through the Getting Started page and sees something about how everything works automatically, and that she doesn't need to do much more than install the SDKs. So, her eyes light up. In the meantime, her food arrives, and she decides that after lunch she's going to install these SDKs on her services and see if that helps her find where the issue is. So, after lunch, she installs the SDKs. The installation is very easy and simple. She runs the services and starts seeing some data. What you can see here, by the way, is the Jaeger UI, which is also an open source project that gives us a visualization of distributed tracing data. But as you can see, only part of the flow appears here, and the rest of the flow, starting from the write to Kafka, is missing. So, she starts getting frustrated again. It's already 3 p.m. She's dying for coffee. She's waiting for happy hour, which is only at 4, because she wants cupcakes with her coffee. But she's also annoyed because she has spent real time on this task, on installing and running everything, and it takes time, as you know. And she's not really sure what the problem is and why she isn't seeing the additional data. So, she goes back to the documentation. She tries to understand if she's missing something. And no matter what she does, it's not working; the additional data does not appear. She goes home at the end of the day feeling really annoyed and frustrated. And it is annoying to install something that's supposed to be helpful, only to discover that it's not as easy as advertised on the Getting Started page, that the examples they show you there are really simple and easy, and that the minute you need to do something a bit more complex, it just doesn't work.
So, the next day, Gavin, the CEO, announces a reorg in the company. Maya is moved to another team, and her task and the bug go to the Jira graveyard, along with the OpenTelemetry integration. I assume this sounds familiar to at least some of you, whether you've tried to install some observability solution, maybe even OpenTelemetry, in your system, and it wasn't the smooth ride you expected. So, let's try to understand what happened in Maya's system, and before that, let's go over some very basic concepts in distributed tracing, in case some of you don't know them or don't remember. Distributed tracing is the ability to track requests and flows through your system, through various components and microservices in the cloud. Service instrumentation is the act of measuring a service and the actions within it. When you instrument a service, an object called a span is created with a unique identifier for each action, and this object contains information about the action: when it happened, how long it took, and any other additional properties you decide on. OpenTelemetry, like I said, is an implementation of distributed tracing, and it enables instrumentation in two ways: manual and automatic. Manual means that OpenTelemetry has an API you can use to wrap calls and parts of your code, and that way instrument your service and create these spans. Automatic means that for various libraries, OpenTelemetry does this for us with its own implementation, so we as developers don't need to do anything; this is true for various database reads and writes, and for HTTP calls. Another very important concept, maybe the most important concept in distributed tracing, is context propagation.
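To make the idea of a span concrete, here is a toy sketch of the kind of object instrumentation creates. This is not the real OpenTelemetry API (that lives in the `@opentelemetry/api` package); it is only an illustration of the information a span holds: a name, timing, and custom attributes.

```javascript
// Minimal sketch of a span: an object recording what happened and when.
// The real OpenTelemetry span API is richer (status, events, links, etc.).
function startSpan(name) {
  const span = {
    name,
    startTime: Date.now(),
    attributes: {},
    setAttribute(key, value) { this.attributes[key] = value; },
    end() { this.durationMs = Date.now() - this.startTime; },
  };
  return span;
}

// Wrap an action: start a span, annotate it, end it when the action is done.
const span = startSpan('save-user');
span.setAttribute('user.id', 42);
span.end();
console.log(span.name, span.attributes['user.id'], span.durationMs >= 0);
```

In the real SDK you would call something like `tracer.startActiveSpan(...)` instead, and the SDK would export the finished span to a backend such as Jaeger.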
So, we understand that in order to piece a bunch of actions that are part of the same flow into one coherent flow, we need some identifying information that passes between them. In OpenTelemetry, this identifying information is called the context, and it contains, among other things, the trace ID, which is the identifier for this trace, for this flow. The act of transferring it between actions in the same flow is called propagation. And again, OpenTelemetry allows us to implement context propagation in two ways, manual and automatic, where automatic means that OpenTelemetry does it for us, and we as developers don't need to take any action to make it happen. So, let's go back to Maya's system. If you remember, we did see the data from the first and second services. And if you remember what I said, since this is HTTP communication, this is something that is automatically implemented in OpenTelemetry, and we as developers don't need to do anything to make it happen. Before we move on, let's think about how we would implement it ourselves. If you had to pass data through HTTP requests, what would you use? I hope the answer pops immediately to mind: you would use the request headers, right? And this is exactly what OpenTelemetry does. It injects the context into the headers, in a specific header called traceparent. And this is an example of exactly what it looks like. Another thing to note is that it's not enough to inject the context. In the service that accepts the request, the server, you also need to extract the context and then apply it to the rest of the code's execution. And again, OpenTelemetry does this automatically for us, so we don't need to do anything. And this is why Maya saw this transition. Okay, but we didn't see the rest of the flow.
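As a sketch of what that traceparent header carries, here is its W3C Trace Context shape built and parsed by hand: version, trace ID, parent span ID, and trace flags, joined by dashes in lowercase hex. The IDs below are made-up example values.

```javascript
// Build a W3C traceparent header: 00-<32 hex trace id>-<16 hex span id>-<flags>.
function buildTraceparent(traceId, spanId, sampled) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

// Parse it back into its parts, as a receiving server would.
function parseTraceparent(header) {
  const [version, traceId, spanId, flags] = header.split('-');
  return { version, traceId, spanId, sampled: flags === '01' };
}

const header = buildTraceparent('0af7651916cd43dd8448eb211c80319c', 'b7ad6b7169203331', true);
console.log(header);
const parsed = parseTraceparent(header);
console.log(parsed.traceId === '0af7651916cd43dd8448eb211c80319c');
```

In practice you never build or parse this string yourself; OpenTelemetry's HTTP instrumentation injects it on the client side and extracts it on the server side automatically.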
Okay, here you have a service writing to Kafka and a service reading from Kafka. So, we can assume that something is wrong with context propagation, and sure enough, OpenTelemetry does not support this automatically. I should say as a side note that this depends on the language, and there are some open source solutions that do implement it, but for the sake of this example, let's assume it's not supported automatically. Okay, so what can we do? I already said, and I hope you remember, that this is also supported manually. OpenTelemetry exposes an API that we as developers can use to propagate context; we just need to do it ourselves. What we need to do is somehow inject the context before sending the message, then extract the context before reading it, and then apply it to the rest of the code. So, let's think about how we would do that in Kafka. The simplest way would be to just insert it into the message itself and then read it in all the consumers. But this can be a bit annoying, because each of the consumers would need to adapt its code to the changed message structure. But Kafka also has something called message headers, and we can use that. It's very similar to headers in HTTP. What we can do is inject the context into the message headers and then read it in each of the consumers, and that way avoid any pitfalls that come with changing the structure of the message. So, here you can see an example, specifically in Node, and you can see how easy and straightforward this is. You inject the active context into the message headers and then send the message. And that way you're done; you've injected the context. And like I said, it's very important to remember that on the receiving end, the reading end, we also need to extract it and apply it to the rest of the code. Here again you can see how easy this is and how simple the API is.
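The producer side can be sketched like this. To keep the snippet self-contained, `inject` below is a stand-in for what `propagation.inject(context.active(), headers)` from `@opentelemetry/api` does with a headers carrier; the message shape roughly follows kafkajs conventions, and the IDs are made-up example values.

```javascript
// A stand-in for the active OpenTelemetry context on the producer side.
const activeContext = {
  traceId: '0af7651916cd43dd8448eb211c80319c',
  spanId: 'b7ad6b7169203331',
};

// Stand-in for propagation.inject: write the context into the carrier,
// which here is the Kafka message's headers object.
function inject(ctx, carrier) {
  carrier['traceparent'] = `00-${ctx.traceId}-${ctx.spanId}-01`;
}

const message = {
  key: 'task-123',
  value: JSON.stringify({ action: 'save' }),
  headers: {},
};
inject(activeContext, message.headers);
// producer.send({ topic: 'tasks', messages: [message] }) would go here.
console.log(message.headers['traceparent']);
```

The point is that the message payload itself is untouched; only the headers change, so consumers that don't care about tracing keep working as before.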
We simply extract the context from the headers and then run the rest of the code with this context. I should say that when it comes to Kafka, and specifically messaging mechanisms that allow batch processing, there are some additional things you should probably consider when you implement distributed tracing, but that is out of the scope of this talk. If you have any questions, I'd be happy to talk about it later. So, cool. If we run our code again, we should be able to see the data. And we do indeed see the data, because we managed to propagate the context. Now let's see how we move on. This is almost becoming a sort of puzzle, right? Where should we inject the context, and how? The next stage is writing to a document DB and then reading the document from it. Let's think about how we would do it. It depends on the database you're using, and maybe there's a mechanism similar to headers in your specific database. But here, since this is a NoSQL database, the easiest thing would be to just inject the context into the document itself under some predefined hierarchy, right? And this is done very easily, like I showed you before. Trust me, this works. Okay. So if you're still not convinced, and you're saying to yourself, these are also very basic examples like HTTP, you didn't show us anything too complex, this is not very helpful, then I hope I can now show you a scenario that we came across with one of our customers, and hopefully deliver with it. Let's assume that what the task handler service does is read a document from our document DB, and then make a call to a Kubernetes cluster and run some logic as part of a task in a job, depending on the settings it read from that document. So let's think about what we want to achieve here.
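The document DB step can be sketched the same way: store the context inside the document under a predefined field when writing, and pull it back out when reading. The `_tracing` field name and the helper functions here are assumptions for illustration, not anything OpenTelemetry prescribes; in real code the inject and extract calls would come from `@opentelemetry/api` with the document as the carrier.

```javascript
// Write side: embed the context in the document under a predefined field.
function injectIntoDocument(ctx, doc) {
  return { ...doc, _tracing: { traceparent: `00-${ctx.traceId}-${ctx.spanId}-01` } };
}

// Read side: recover the context from that field, if present.
function extractFromDocument(doc) {
  const header = doc._tracing && doc._tracing.traceparent;
  if (!header) return null;
  const [, traceId, spanId] = header.split('-');
  return { traceId, spanId };
}

const ctx = { traceId: '0af7651916cd43dd8448eb211c80319c', spanId: 'b7ad6b7169203331' };
const saved = injectIntoDocument(ctx, { taskId: 7, settings: { retries: 3 } });
// ...collection.insertOne(saved) and a later findOne would go here...
const restored = extractFromDocument(saved);
console.log(restored.traceId === ctx.traceId);
```

Since the database is schemaless, adding one extra field is cheap; the main design decision is agreeing on the field name so every reader knows where to look.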
So what we would want is basically to see the flow starting from our API gateway, through the filtering service, then through Kafka, then through the document DB, and all the way to the code that runs within the job in the Kubernetes cluster. That would be super cool to see. I hope you can agree this is not trivial. First of all, we're using a very specific API here to communicate with Kubernetes. It has a very specific set of objects that allow us to do that, so we need to understand and think about how we can propagate the context there. In order to achieve this, we would start scrolling through the Kubernetes API and see which objects we have when running a task. We'd go through a bunch of things, and finally we'd see that there is an object called V1EnvVar, and these are the environment variables that we can define for the task. And that is an amazing solution for us; this is something we can definitely use. We can inject the context as an environment variable, and then have the code that runs within the task extract it and apply it to the code execution. I showed you before a code snippet that's pretty simple, and you're probably wondering, okay, so can I just use anything I want as a carrier? The answer is that you can use almost anything you want, as long as it implements a very simple API: get and set. And this is something we can easily implement over an array of environment variables. Doing that makes it work, and we should see the flow all the way from the gateway down to the specific code that runs within the task in the Kubernetes cluster. And there are a bunch of other examples you can think of; try to think about your own system and any scenario, and I'm sure you can find something that would work there.
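The get/set carrier over environment variables can be sketched like this. The env vars are plain `{ name, value }` objects, mirroring the shape of the Kubernetes API's V1EnvVar; the `envVarCarrier` helper and the uppercase naming convention are assumptions for illustration, and in real code this carrier would be handed to OpenTelemetry's inject and extract calls rather than used directly.

```javascript
// A custom carrier over a list of { name, value } env var objects.
// All a carrier needs is get and set, so an env var array qualifies.
function envVarCarrier(envVars) {
  return {
    get(key) {
      const entry = envVars.find(e => e.name === key.toUpperCase());
      return entry ? entry.value : undefined;
    },
    set(key, value) {
      envVars.push({ name: key.toUpperCase(), value });
    },
  };
}

const jobEnv = [{ name: 'TASK_ID', value: '7' }];
const carrier = envVarCarrier(jobEnv);

// What an inject call would do with this carrier, before creating the job:
carrier.set('traceparent', '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01');
// Inside the job, what an extract call would do (the env var is now set):
console.log(carrier.get('traceparent'));
```

The job spec then carries `TRACEPARENT` alongside its normal env vars, and the code running in the pod reads it from its environment to continue the trace.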
So I want to summarize now, and if there are three things I want you to take from this talk, the first is that OpenTelemetry gives you both a manual and an automatic way to instrument a service and also to propagate context. So whenever you come across a specific scenario in your system where you see traces break, you should ask: is the context propagated? If it's not automatic, don't despair; you can probably do it manually. And make sure you check the language you're using and the supported instrumented libraries, because this is something that is updated constantly in OpenTelemetry. The second thing I want you to remember is that when it comes to context propagation, you need to both inject and, don't forget, extract the context and apply it to the rest of the code, because otherwise you're breaking your flow. And the last thing is that you should always try to find the right carrier for your specific scenario. Like I said, usually you will be able to find one. We came across some really specific scenarios using SNS, SQS, and Lambda, and using Databricks. You can find a solution for almost anything, so think of it as a challenge. And that's it. I hope this talk got you a bit motivated and energized to implement distributed tracing in your system. This is definitely something that can boost your development process. And I hope you're not feeling too bad for Maya, because she's here today, and tomorrow her day will be amazing. So thank you very much for listening. I'm here for any questions or things you want to know more about. Thank you.