Hello, everybody. My name is Michael Haberman. I am the co-founder and CTO at Aspecto, and today I'm going to talk with you about how to get end-to-end visibility into complex transactions using OpenTelemetry. That is a big statement, so let's break it down. We are focusing here on distributed applications, microservices applications. We're talking about the scenario where you have services communicating through some kind of message broker, such as Kafka, Redis Pub/Sub, SQS, or any queue of that kind. And we are trying to make sure that we are able to visualize an entire process. The way we are going to do that today is by first understanding what end-to-end visibility is, what it means, and what our expectations are when we say end-to-end visibility. Then we are going to do a live demo. In this live demo we are going to show an application that uses KafkaJS and Node.js. The demo is going to have two services communicating through Kafka, and we are going to implement OpenTelemetry to gain end-to-end visibility. This is going to help you, of course, understand what OpenTelemetry is, how it relates to end-to-end visibility, and eventually have the whole picture of how to achieve it. So why am I speaking to you about this? Just because I have some experience with it: I've been working with microservices and distributed applications for the past five or almost six years, most of the time as an independent consultant helping startups break their monolith into microservices. The microservices grew, they had tons of issues, and I came to help. I took this knowledge into Aspecto, again focusing on microservices. Cool, so that's about me. Let's dive right into it. Let me give you an example of what I mean when I say end-to-end visibility. Everything starts with having a service, right? I have a microservice, and this service needs to communicate with the rest of the world.
You may start, or in some cases you do start, using HTTP when service A wants to communicate with service B. That's fine and that's great, at least for some cases, but in other cases you're going to find out that if service B is not available, or if service B is under a lot of load, you may find yourself losing data, because if service A sends an API call to service B and service B isn't there, that's a problem: we lost the data. So what we want to do, or what we can do, is have service A send a message through some message broker. The message broker is going to persist the data, then service B is going to consume that message and handle it. That's all great. So service A is going to produce a message, in our case to Kafka, but again any message broker applies here. That's how the message persists. When service B is available, it's going to consume that message and then do something with it, let's say persist it to a database. For me, that's the basis of end-to-end visibility. I want to understand, for a specific message, how it propagates throughout my system, what services took part, and what they did with that message. Those are the most basic things. It usually starts to get more complicated when I have more than one consumer, because maybe service C is also consuming this message. And it's very, very hard for a developer, or for a DevOps or SRE, to understand for a specific message who all the consumers are and who the producer is. That can be very, very complex, and this is what I want to get from visibility. To give it some kind of definition: I want to visualize a message and I want to answer three questions. Who is the producer? Who are the consumers? And, for each consumer, what did it do with this message? That means I can take every message in my system, track it, know why it was produced, understand who consumed it, and understand what they did with it.
It can be very, very beneficial when you have errors or issues in production and you're trying to understand what happened. It can be beneficial when you're working locally and trying to figure out how things work. That's what I want from having end-to-end visibility. End-to-end visibility, basically, we can also call it traces. Maybe you're familiar with distributed traces, maybe not; I'll take you through it right now. Distributed tracing is the ability to collect the data that gives us end-to-end visibility. There is a project by the CNCF called OpenTelemetry. This project knows how to collect traces; I'll show you in a second what that looks like. OpenTelemetry is responsible for the collection part: it knows how to collect the data and it knows how to ship it, and then you need to figure out what to do with it, how to visualize it. So you implement an SDK within your code (there are a lot of programming languages supporting OpenTelemetry), and once you send the data out, you need to visualize it, either with a vendor or with an open-source tool, as you will see in a second. So, to give you an example of how a trace looks: this is a trace. We're seeing here Jaeger UI. Jaeger UI is an open-source tool that knows how to display traces. The trace you can see here is quite a simple one. I wouldn't say there are tons of end-to-end insights here, because everything is synchronous and HTTP based. Well, almost. But let's try to figure out this UI and what we can see here. On the left-hand side, we have a preview. In this preview, you can see that the orders service is getting an API call to /purchase-order. Then it communicates with the user service and sends an API call to /user. Then the orders service sends an API call to the stock service to update the stock. The stock service runs a DB query against its database and then updates an item. And then the orders service publishes a Kafka message called new-order.
So basically I told you a story, and unlike logs, I have the context of what happened here. When I'm looking at each line here, at each event in my trace (and these events are called spans), I can know who the parent is. I can know what caused it. I have the context of why it happened. Every line here represents a part of the story, and you know the previous thing that happened that caused it. This is very powerful. This takes you to a whole new level when talking about distributed applications, because when something happens and you ask why, just look one row above and you have the answer. On the right side we have a timeline. On the timeline we can see, for each span, how long it took. This is why it's called a span: it spans over time; it has a start point and an end point. And it's not only telling you how long something took, it's also telling you, in that story, what ran in parallel and what was synchronous. You can see that only when the user service call ended did the update-stock call start. Basically, it's telling you the story: hey, most probably this update-stock can't start until /verify is complete. So this is a trace. This is what we're trying to achieve. It's easier when talking about synchronous stuff; it gets a bit more complex with asynchronous stuff, and that's what we're going to achieve today. So, I want to take you through OpenTelemetry and how it works in concept just before we dive into the code, so you'll have an understanding of how it works in theory, and then how it works in real life, in actual code. We have this setup: a browser sends an API call to service A, service A calls service B, and then there is a DB call. In each service we have OpenTelemetry implemented within the code.
When we said earlier that we have context, that we have a tree view: to have a tree view, you need a parent-child relation, and service A's interactions are the parent of service B's interactions. When service B reports "I ran a DB query", it should report that the parent that caused it is service A. That means that the context between service A and service B needs to pass with the HTTP call we have between those two. So when OpenTelemetry sends out an HTTP call, it's going to inject the trace ID into the headers. Service A injects the trace ID, and also the parent span ID, into the headers of the call to service B, and that causes whatever service B reports about what happened to be reported under the context of service A. For the purpose of keeping things simple, let's say that OpenTelemetry reports directly to a traces database; of course, it's a bit more complex than that. So service A reports: hey, I'm sending an API call to service B, that's trace ID number one. Then service B reports: I got an HTTP call from service A, that's trace ID number one, and it's the child of the span that service A sent. And when service B reports: I ran a DB query, that's also trace ID one, and it's the child of the span that happened in service B. That's the theory of how it works, and that's what enables us to create this beautiful visualization that we saw earlier in Jaeger. Cool. So let's dive into how this thing looks. As we said, we are going to have a project in Node.js, we are going to use Kafka, and then we'll have end-to-end visibility. Let's jump right to the code. What you can see here is a very, very simple project. We have two services. The consumer service creates a Kafka consumer, connects to Kafka, then subscribes to a topic called test-topic.
On each message it receives, it just logs it to the console. Very straightforward, very simple. The producer service creates a producer, connects to Kafka, and sends a Kafka message to test-topic with some value, and it does that only when we send a GET HTTP call to /produce. Just to make sure everything is working, let's run those two, and open /produce in the browser. We ran it, we got an OK. To make sure everything works, we expect the consumer to log "cool" to the console, because that's the value we sent, and we can see that it wrote the "cool" message. That's cool, but it's very, very simple, and we wanted to see how OpenTelemetry works. So let's do it: check out tag two, and we are progressing to the next level. Now we added, both in the producer and the consumer, in the first line of code, an import of a file called tracer, and we are initializing it with some service name. Let's open the tracer to understand what it is and what it's doing. What you can see here is a basic implementation of OpenTelemetry. The first thing we do is define where to send the traces, and as you saw earlier, we are using Jaeger, so we will be able to visualize them. Then we initialize the Node SDK with the service name that we got: here it was consumer-service, here it's producer-service. And we're asking OpenTelemetry to do all the instrumentations automatically. Now, some things work and some things don't: Kafka doesn't work out of the box. We'll start with the slightly complex route of manually instrumenting Kafka, and then we'll see a far easier solution. So let's run the consumer, let's run the producer, and let's see what happens when we implement OpenTelemetry. Okay, so I'll change the value to "cool number two". We sent cool number two, we expect to see cool number two, and we have it.
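The tracer file described a moment ago might look roughly like this. This is a sketch, not the exact demo code: the package names (`@opentelemetry/sdk-node`, `@opentelemetry/exporter-jaeger`, `@opentelemetry/auto-instrumentations-node`), the Jaeger endpoint, and the service names are assumptions to verify against your OpenTelemetry version.

```javascript
// tracer.js - a minimal OpenTelemetry setup for Node.js (sketch).
// Exact package names and options are assumptions; check the docs for your version.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

module.exports = (serviceName) => {
  const sdk = new NodeSDK({
    serviceName, // e.g. 'producer-service' or 'consumer-service'
    // Where to send the traces: here, a local Jaeger collector.
    traceExporter: new JaegerExporter({
      endpoint: 'http://localhost:14268/api/traces',
    }),
    // Ask OpenTelemetry to instrument what it can automatically (HTTP, Express, ...).
    instrumentations: [getNodeAutoInstrumentations()],
  });
  sdk.start();
  return sdk;
};
```

Each service would then start with something like `require('./tracer')('producer-service')` as its very first line, so the SDK is initialized before any instrumented library is loaded.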
And we also have Jaeger available. Looking at Jaeger, we have the producer service; that means we got some traffic. Looking at the producer service, we can see that we have a GET to /produce. Within that GET to /produce we had some middleware and a request handler for /produce, but we don't have anything related to Kafka. Also, we can't see the consumer service here, even though we should have seen it. That's not a bug, that's expected, because, as we said, Kafka is not supported by default, and we're going to add it now, first the hard way and then the easy way. So let's start with adding it: let's check out tag three, and I'll spin up both services, but first let's review the changes that I implemented in the code. The changes I implemented are fairly simple. I grabbed my tracer from OpenTelemetry, which allows me to create spans myself, and within the implementation of my API handler, I'm starting a new span called produce-message. After I complete sending the data, I close that span. The same goes for my consumer: in my consumer, I create a new span called consume-message, and then I close it. So basically what I did is manual instrumentation: I manually went and changed my code, my application-layer code, in order to have end-to-end visibility. In some cases that will be required; in others we will be able to avoid it. But I want you to see how it really works, so you'll have a good understanding of OpenTelemetry before you get the cool stuff where everything just works out of the box. Okay, so we did that. What we expect to see now, after sending an API call, is those spans in our trace. So let's go back to our test page, run cool number three, and validate that it's still working. Yes, it is. Let's go to Jaeger and refresh. First things first, we can see that we now have both the producer and the consumer. That's super cool.
Let's see what we have in our producer. We used to have four spans, and now we have five spans. Cool, that means we have one new span. We still have the GET /produce, and now we also have produce-message; that's exactly the name I wrote right here. That means we are able to mark that we sent the data to Kafka. However, we don't see the consumer side, and that's what we wanted: we wanted to make sure the consume side is also available here. Looking at the search page, we did see the consumer service, so let's see what we have in the consumer. In our consumer we have the consume-message span, and we can actually even see the value that I wrote into the span, with cool number three. So the data is here, but for some reason the two aren't stitched together: the story is broken into two separate traces. This is a bad thing; we need to fix it. The reason is that, if you recall, the slide explained that OpenTelemetry automatically injects the context into the HTTP headers, and we didn't do that here. Happily, Kafka allows us to have headers. So when I check out code tag four, you can see that I am injecting the context of the currently active span into the headers of the message. I don't know if you knew, but Kafka lets you have headers; most pub/sub message brokers allow some kind of metadata to be sent along with your messages. So in the producer I injected it, and in my consumer I extracted it. And the code here becomes a bit complex, right? This used to be only this portion of code, and now things start to look a bit awkward. Again, this is for the purpose of showing you how to integrate it, how to be OpenTelemetry experts. The last thing we will do today will be to simplify it all over again. But let's see if everything works. So I am doing yarn consume, I'm doing yarn produce, everything is initializing; looks like it's going to work.
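Conceptually, the inject/extract pair writes and reads the W3C Trace Context `traceparent` header. The real code uses `propagation.inject()` and `propagation.extract()` from `@opentelemetry/api` with the Kafka message headers as the carrier; the simplified, dependency-free sketch below only shows what ends up inside that header (the function names and the sample IDs are illustrative).

```javascript
// Simplified sketch of W3C Trace Context propagation over message headers.
// The traceparent layout is: version "00" - 16-byte trace-id - 8-byte parent
// (span) id - trace flags. Real code uses propagation.inject()/extract()
// from @opentelemetry/api rather than building the string by hand.
function inject(span, headers) {
  headers.traceparent = `00-${span.traceId}-${span.spanId}-01`;
  return headers;
}

function extract(headers) {
  const [version, traceId, parentSpanId, flags] = headers.traceparent.split('-');
  return { version, traceId, parentSpanId, flags };
}

// Producer: attach the active span's context to the message headers before sending.
const headers = inject(
  { traceId: '0af7651916cd43dd8448eb211c80319c', spanId: 'b7ad6b7169203331' },
  {}
);

// Consumer: recover the context so its spans join the same trace,
// as children of the producer's span.
const ctx = extract(headers);
```

This is why the traceparent carries more than just the trace ID: the consumer needs the parent span ID too, so its consume-message span lands in the right place in the tree.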
So let's go back to our test page and do cool number four. We got an OK, we can see here that we got cool number four, and we can see that I logged the headers. In the headers you can see the traceparent. When I said earlier that we send the trace ID, that was a simplification: the traceparent is an encoding of more than just the trace ID. Our eventual goal was Jaeger, so let's go to Jaeger, go to the producer, and, as you can see, the /produce trace now has six spans. Now we can see both services, the consumer and the producer, all together. The producer is here, and the consumer is here as a child of the producer. We were able to get end-to-end visibility. Anything that happens in the consumer service will also be available right here. If I have multiple consumers, they'll be right here. It took a lot of work, but we got it working. To ease your mind that it could be far simpler, let me just check out tag five. You can see that we are back to the very, very simple code that just logs the message. Same goes for the producer, nothing special here, nothing that relates to OpenTelemetry. I did do one thing: in the tracer file, I added the OpenTelemetry instrumentation for KafkaJS as a new instrumentation. That means this thing can be auto-instrumented. Let's just see that everything works, and then we'll be super happy. Just as we could instrument HTTP automatically, we are able to instrument Kafka automatically as well. So let's do cool number five. We still see cool number five. We don't have the log of the traceparent as we had before, since that code is gone. In Jaeger, let's search for the last trace, and everything looks the same. We can see that everything works as expected. And that's all great. So, every time you're implementing OpenTelemetry and you are missing some kind of data, most likely you're missing an instrumentation.
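The change in that final step is small. Assuming the KafkaJS instrumentation package published by Aspecto (`opentelemetry-instrumentation-kafkajs`) and its `KafkaJsInstrumentation` export, the tracer setup might gain just one entry in the instrumentations list; treat the package and class names here as assumptions to verify in the registry.

```javascript
// tracer.js with KafkaJS auto-instrumentation added (sketch).
// Package and export names are assumptions; confirm them in the
// OpenTelemetry registry before using.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { KafkaJsInstrumentation } = require('opentelemetry-instrumentation-kafkajs');

const sdk = new NodeSDK({
  serviceName: 'producer-service',
  instrumentations: [
    getNodeAutoInstrumentations(),   // HTTP, Express, and friends
    new KafkaJsInstrumentation(),    // produce/consume spans plus header propagation
  ],
});
sdk.start();
```

With this in place, the application code stays as plain KafkaJS calls; the instrumentation patches the producer and consumer to create the spans and do the traceparent inject/extract for you.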
You can pass here an array, a list of instrumentations; some of them come automatically, some of them you will need to add manually. And if you are wondering where to search, you can go to opentelemetry.io, and on opentelemetry.io you have the registry. The registry is where you can find all the different instrumentations. So if you're interested, for instance, in JavaScript, you look for instrumentations and you can see the whole list of them. If you're using MongoDB, you have that. If you're using MySQL, you have it. Or GraphQL, whatever. So if you're missing some data, don't start to manually instrument it yourself. As you saw, the code was rather complex, and integrating OpenTelemetry into the application code can make it harder to read, because every developer suddenly needs to understand OpenTelemetry. If you have an auto-instrumentation, it keeps the code very, very clean and very, very simple. So I would urge you to first look for an instrumentation before you start to make changes yourself. So let's look at what we learned. Basically, we learned OpenTelemetry, the basics of doing instrumentation ourselves. If you want to be able to collect this data, you need to understand how OpenTelemetry works, and you only need to understand very little. You can use the SDK: just grab the code and put it in your own code. In different languages the implementation will look a bit different, but overall it's a simple process. If you're doing distributed systems, I think you have to use it. It will make your debugging and troubleshooting life way easier. So, you know, just give the SDK a try. Always look for an existing instrumentation, as I showed you in the registry. And if one is lacking and you don't have it,
I think it's worth the effort to either do a manual instrumentation as I showed you, or maybe write your own instrumentation and publish it to the registry. If you have the time and energy, that is a very fun thing to do and you will learn a lot from it. So thank you very much for listening to this talk. If you have any questions about distributed applications, OpenTelemetry, or message brokers, feel free to reach out. And thank you. I hope you learned something new today.