My name is Johannes. I am a software engineer at Grafana Labs. The mission of the team I'm working on is to make Grafana users successful with OpenTelemetry instrumentation. I've been involved in the OpenTelemetry project for several years now. I started out as a contributor to the OpenTelemetry C++ project. Currently, I'm mostly focusing on OpenTelemetry semantic conventions, and in particular, semantic conventions for messaging. That's also what prompted me to give this talk.

There is a lot of pain around monitoring asynchronous workflows, way too much to put into this single talk; it would be hard to bear. So what we're going to do here is focus on a single pain point, the centerpiece of this talk. First, we will look at where this pain comes from and try to understand its source. Then, at the end, we will look for hope: is there any hope to alleviate this pain? We will talk mostly about distributed tracing, because the pain point I picked relates to distributed tracing, and also because, in my experience, it is the area that causes the most pain when it comes to monitoring asynchronous workflows.

To define what we mean by asynchronous workflows in the context of this talk: an asynchronous workflow is a workflow that involves asynchronous communication between different services. There can also be asynchronous communication inside a single service, inside a single process, and while many of the things we will see here also apply to those scenarios, they are not what we are primarily focusing on. For asynchronous communication between different services, the terms messaging and eventing are also used. I will use the term messaging a lot here, mainly because it rolls off the tongue much more easily than asynchronous workflows.

So let's dive in. Let's talk about this disintegration, and let's look at some characteristics of asynchronous workflows. What do we see here? Firstly, we see that apparently I didn't go to art school, but let's leave that aside for a moment. We see a little comic here, a little workflow, a little story. There are two actors involved. Somebody pushes their bike to a bike mechanic and says: I need my bike repaired. The bike mechanic says: sure, I'll text you when I'm done. The next day, the owner of the bike receives a text message saying the bike is repaired. He says: great, I will go and pick it up right away. He goes there, gets his bike, and pays for the service he received. Quite a boring story, you might say, not much happening, not very spectacular.

Let's try to visualize what's going on here. Let's put it all on a time axis, which already reminds us a lot of the Gantt charts we are used to when we look at distributed traces. We see black boxes on this time axis: those are operations done by the owner of the bike. We see gray boxes: those are operations done by the bike mechanic. So we start out here: the owner of the bike comes and requests a fix for the bike, and the bike mechanic accepts the bike. Then, for quite a long period of time, nothing happens. After a day, the mechanic fixes the bike and sends a message. Again, for a short period of time, nothing happens. Then the owner of the bike receives the message, picks up the bike, and a payment is processed. What strikes us here is that there are quite long periods of time where nothing happens.
Actually, during most of the duration of this workflow, nothing related to it happens at all. Why is that? It is because of a phenomenon we call temporal decoupling. Here we see two pairs of tasks connected with a dotted orange line. Those are tasks that are logically related but temporally decoupled. What does that mean? Fixing the bike is a task that logically depends on accepting the bike: we can only fix a bike that was accepted before. The same holds for send message and receive message: we can only receive a message that was sent before. However, those tasks are temporally decoupled. Their durations don't overlap; they are disconnected on our time axis.

Sending a text message is a very good example of this. If we look at the characteristics of temporal decoupling, we see that producers and consumers of messages aren't restricted by each other's availability, and they don't have to run concurrently. When we send somebody a text message, we are not restricted by the other person's availability. If they are in a meeting or have their phone turned off, they will simply receive the message later. This is an example of asynchronous communication; the parties are temporally decoupled. If we call somebody instead, we are restricted by the other person's availability: if they are in a meeting or their phone is turned off, they don't receive the call, and the communication doesn't take place. That is an example of synchronous communication, where both parties have to be available at the same time.

Temporal decoupling allows us to increase the reliability and resilience of our workflows. I will not go much deeper into that here, that would be a talk of its own, so just take my word for it. What is important for us to see here is that temporal decoupling causes disintegration. Temporal decoupling is the reason why our workflow is basically three disconnected chunks, three chunks that are disconnected on our time axis.

With that in mind, let's move on and see what pain this causes us. The pain comes in when we ask: how can we model those asynchronous workflows? The way to get end-to-end insight into our workflows is to use distributed traces. Here we see our workflow as a distributed trace. We put it all into a single trace; each of the operations we saw before is modeled by a span, and all spans are part of a single trace. Again, we see these big gaps in our trace, but we already know about temporal decoupling: it's a feature, not a bug, so we know what's going on here.

However, let's increase the complexity just a little bit. Let's assume we have a really good bike mechanic who can fix three bikes at once. We see that depicted here: we have an operation, fix three bikes in one batch. During this operation, three messages are sent out, one as each bike is fixed, and we see the owners of the bikes receive the messages and pick up their bikes. This is still not a very complicated workflow, far less complicated than the workflows many of you probably encounter out there. But we already see that this picture gets quite confusing; it gets harder to see what's going on. We can try to help ourselves a bit by again connecting the logically related but temporally decoupled operations. That gives us a somewhat better overview, but it still requires that we already know quite a bit about what's going on here.
And often, when we look at traces, the traces should actually tell us what's going on. So we have a problem here with complex workflows. What alternatives do we have?

At this point, I want to talk a bit about the ways we have to correlate spans in a distributed trace. We essentially have two: we can correlate spans via parent-child relationships, and we can correlate spans via span links. A parent-child relationship is a one-to-many relationship: one span can only have one parent, but one parent can have several children. Parent-child relationships are also what constitutes a trace; all the spans in a single trace are directly or indirectly related by parent-child relationships. Span links, on the other hand, are many-to-many relationships: one span can have any number of links, and a span can be linked to by any number of other spans. Also, span links can go across traces, so a link can connect spans from different traces and thus connect different traces together.

So let's try to utilize span links and see if they help us with our modeling challenges. Here, we split our initial trace in two. The first trace starts with requesting the fix for the bike and ends with sending the message that the bike is fixed. The second trace starts with receiving the message and ends with the payment being processed. The send message and receive message spans are connected via a span link. We can push this further and split our trace another time, so that we now have three traces. And here we start to see something really interesting: each single trace now models a synchronous part of our workflow. With this modeling option, we got rid of those awkward time periods where nothing happens. Each trace represents a synchronous chunk of our workflow, and those traces are connected with span links.

Let's again bring in our talented bike mechanic and see how this solution scales. Well, it looks a bit better, maybe. We again have the fix-three-bikes operation; three messages are sent, and the send message spans link to receive message spans, each of which is part of a separate trace. We also see something else here: the upstream link to the accept bike operation. This is something we couldn't model with the single-trace solution, because, as we said, a single span can only have one parent; it cannot have three parents. We cannot model those fan-in operations with parent-child relationships, but with span links we can. So we can zoom into a single trace and look at a synchronous part of our workflow, and we can zoom out and look at all the parts of the workflow connected together.

So where does the pain come in? We had quite a simple scenario here, but we already came up with three different ways to model it. That in itself is the first pain point: there is no established best practice. There are lots of different options, so people get confused; they don't know which option to pick. If they go for the single-trace solution, they can end up with large and confusing traces for complex scenarios, and, as we saw with those fan-in workflows, certain workflows are impossible to model at all. If they go with multiple traces, some might say that's overkill: if I can put everything into a single trace, I should do that. Another pain point is that we have varying levels of support for span links from different tools and vendors. Some tools have great support.
They allow you to visualize linked traces in one Gantt chart, and you can expand and collapse traces. Other tools allow you to navigate from one end of a span link to the other via a hyperlink, in both directions. Yet other tools allow you to follow a link in only one direction, and you cannot go back. This causes pain because we get different user experiences across different tools and vendors.

And very importantly, if we break up our traces in this way and use span links, we don't just break up our traces, we also break existing and established sampling solutions. Those sampling solutions are built on the premise that a trace contains our complete workflow, and they aim at giving us a sample of complete traces. They don't give us a sample of complete linked trace clusters. So with this approach, with multiple linked traces, we are much less likely to get a complete picture of our workflow.

And last but not least, we could say: let's just embrace the plurality and use both options, sometimes this, sometimes that, whatever fits best. That also comes with problems, because the lack of consistency will impede vendors from implementing good support for messaging scenarios. We also run into scalability challenges: most workflows start out simple, so we might tend to use the single-trace modeling solution. However, as our workflows mature and scale, they get more complex, and those simple solutions may break at some point. We then have to rewrite our instrumentation, and in the worst case, we even break our existing observability workflows.

Lots of pain. Is there any hope out there? Let's call standardization to the rescue. Where does this standardization happen? It happens in OpenTelemetry semantic conventions; you might have heard about those. What are semantic conventions? They basically define span names, attribute names, and, very importantly for what we look at here, span relationships, and they define the meaning of those names, attributes, and relationships. Semantic conventions don't just cover traces, they cover all observability signals, but for our purposes here we will focus on traces. There is also an OpenTelemetry messaging workgroup that I'm part of, as I mentioned before. This workgroup aims at coming up with stable semantic conventions for messaging, for traces and for metrics. It's our goal to provide stable semantic conventions for messaging this year, 2024; we also put that on the OpenTelemetry roadmap for 2024. So we are under a bit of pressure to deliver something here.

How can those semantic conventions now help us with the problem we picked before? We again see a part of our workflow here, with some small changes in terminology. We don't say send message anymore; we say publish, because that conforms to the semantic conventions. We also see span kinds in brackets: our publish span is of kind producer, our receive span is of kind consumer, and both of them are connected with a span link. This is exactly what the semantic conventions for messaging require for modeling such scenarios: a producer span, a consumer span, and a span link between them. What does that give us? Well, if we have compliant instrumentation, then whenever we look at the resulting traces and see a producer span, a consumer span, and a link between them, we know that a message was passed from one to the other via an asynchronous communication channel.
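To make this invariant concrete, here is a minimal sketch in Python using the OpenTelemetry API. The broker calls and the in-process hand-off of the span context are hypothetical placeholders (across real services you would propagate the context with the message, as discussed later), and the attribute names follow a draft of the messaging semantic conventions, so they may differ from the final stable version:

```python
from opentelemetry import trace
from opentelemetry.trace import Link, SpanKind

tracer = trace.get_tracer("bike-shop-demo")

def send_to_queue(message): ...  # hypothetical broker call (stub)
def handle(message): ...         # hypothetical processing (stub)

def publish(message):
    # Producer span: one per published message.
    with tracer.start_as_current_span(
        "publish bike-notifications",
        kind=SpanKind.PRODUCER,
        attributes={
            "messaging.system": "rabbitmq",
            "messaging.operation": "publish",
            "messaging.destination.name": "bike-notifications",
        },
    ) as span:
        # In-process shorthand: remember the producer's span context
        # so the consumer can link back to it.
        message["producer_ctx"] = span.get_span_context()
        send_to_queue(message)

def receive(message):
    # Consumer span: linked to the producer span, not parented to it.
    # Producer span + consumer span + span link is exactly the
    # invariant the semantic conventions require.
    with tracer.start_as_current_span(
        "receive bike-notifications",
        kind=SpanKind.CONSUMER,
        links=[Link(message["producer_ctx"])],
        attributes={
            "messaging.system": "rabbitmq",
            "messaging.operation": "receive",
            "messaging.destination.name": "bike-notifications",
        },
    ):
        handle(message)
```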
We can even make sense of the time between the publish and the receive span, because this is the time the message spent in the queue, which is quite an important metric in the messaging world. So this is our invariant; this is what the semantic conventions require from compliant instrumentation. However, the semantic conventions don't require putting all of this into a single trace, nor do they require modeling it with multiple traces. We can put everything into a single trace, and then we end up with a span link inside a single trace. We can also model it with multiple traces, and then we have a span link that connects different traces together. What do we achieve by this? We have an invariant that is the same across all scenarios, and on top of that invariant we preserve lots of flexibility for instrumentation authors and for users to find the right modeling option for their scenario. Simple scenarios can still go into a single trace; complex scenarios we can split up; and we always have the invariant that tells us when asynchronous communication happens.

Let's look at our pain points again and see which of them this solution solves. We've flattened out the list, and we'll go through the points one by one.

No established best practice? Not really, anymore: we have OpenTelemetry semantic conventions, with conventions encoded there, so we have a place to go to look for established best practices. We can check this off.

Large and confusing traces for complex scenarios? That doesn't need to happen; we can split our traces up, as we saw.

Impossible to model certain workflows? Not really. We saw that we can even model those fan-in scenarios; span links are a very powerful tool here.

Varying levels of support from tools and vendors? This is a pain point that persists. In the messaging workgroup, we deliberated for quite some time about whether we should prescribe the usage of span links, for exactly this reason. In the end, we decided to embrace span links, because we think that stable semantic conventions that prescribe span links, plus instrumentation out there that uses them, can act as a catalyst for tools and vendors to implement better support. We have a bit of a chicken-and-egg problem here: there's no good support for span links because they are not widely used, and they are not widely used because there's no good support. We want to break that cycle.

Overkill for simple workflows? Not really; as we saw, we can still put everything into a single trace.

Breaks established sampling solutions? I have to say this is perhaps the biggest pain point that remains, because we don't have a solution for it yet. If we pack everything into a single trace, we can use established sampling solutions. However, if we use span links, we are unlikely to get a complete picture of our workflow. Of course, we could sample 100%, but we don't want to do that either. So here is a pain point that remains.

Lack of consistency? Not really; we have our invariant, which is the same across all scenarios, and we even preserve some flexibility on top of it.

Scalability challenges? They will always be there, but what we can now do is switch from single-trace modeling to multi-trace modeling without breaking our invariant. So if we implement observability workflows that depend on this invariant, we can flexibly change our modeling.
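To make that flexibility concrete, here is a hedged sketch of the two consumer-side options. Both keep the invariant (producer span, consumer span, and a link between them); they differ only in whether the consumer span joins the producer's trace or starts a new one. The processing helper is a hypothetical stub:

```python
from opentelemetry import trace
from opentelemetry.context import Context
from opentelemetry.trace import (
    Link, NonRecordingSpan, SpanKind, set_span_in_context,
)

tracer = trace.get_tracer("bike-shop-demo")

def handle(message): ...  # hypothetical processing (stub)

def consume_single_trace(producer_ctx, message):
    # Option 1: single trace. The producer's span context is set as
    # the parent, so the consumer span joins the producer's trace.
    # The span link invariant still holds.
    parent = set_span_in_context(NonRecordingSpan(producer_ctx))
    with tracer.start_as_current_span(
        "receive", context=parent, kind=SpanKind.CONSUMER,
        links=[Link(producer_ctx)],
    ):
        handle(message)

def consume_multi_trace(producer_ctx, message):
    # Option 2: multiple traces. An empty Context() forces the
    # consumer span to become the root of a fresh trace; only the
    # link connects the two traces.
    with tracer.start_as_current_span(
        "receive", context=Context(), kind=SpanKind.CONSUMER,
        links=[Link(producer_ctx)],
    ):
        handle(message)
```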
Well, now you might say: that's all nice theoretical information, but what does it mean for me? I'm in pain, what should I do? So, here is the advice I would give to you as a practitioner.

My number one advice: rely on OpenTelemetry-compliant instrumentation libraries. The way you use instrumentation libraries differs from language to language, but in all cases there is a way to rely on instrumentation that somebody else provides to you. Why should you do that? Because authors of OTel-compliant instrumentation libraries go through great pains to implement and conform to the semantic conventions, and they do it so that you don't need to. So benefit from the work of others if you can.

If for some reason you don't want to, or cannot, use instrumentation libraries, then write instrumentation that is based on standards, even if those standards are still drafts. Why? Because you benefit from standardization: other people have put lots of thought into those standards, and vendors and tools are very likely to support them. Draft and experimental standards in particular also benefit from you using them, because the more people adopt them, the more impetus there is for vendors and tool developers to support them.

If you want to avoid any possible churn, then I would advise you to wait for the stable semantic conventions for messaging. They are coming this year, and we are now in the last phase before the declaration of stability. That is also our last opportunity to actually break things. We don't do that light-heartedly, but if we see something that will really hurt us in the future, we will change and break it now. So if you want to avoid this churn, wait a little bit.

And if the pain is really getting to you and you say this all goes much too slowly, then come and help us. Help us define and establish an industry standard in this field. The OpenTelemetry messaging workgroup is a very active group, but it's also quite a small group, and there is lots of work to be done. We already talked about instrumentation standards for traces; we are currently also working on a set of standard metrics for messaging scenarios. Then we want to push for instrumentation libraries that implement those instrumentation standards in the form of semantic conventions, and at that point we can declare initial stability of the semantic conventions. But then we're not done; there's lots of other work waiting for us.

For example, context propagation for messaging is an area in dire need of standardization. We have something here; I listed three examples. There are first public working drafts in the W3C space for AMQP and MQTT, two very popular messaging protocols. However, those emerging standards have somewhat stalled in their emergence: the drafts are two years old, and there has not been much activity in those two years. We in the messaging workgroup recognize the value of those standards and want to push them. There's also something in the space of CloudEvents. CloudEvents, for those who don't know it, is a CNCF project, a standard for describing events. There is a distributed tracing extension for CloudEvents which describes context propagation with CloudEvents, and members of the messaging workgroup are involved there and in touch with the CloudEvents folks to push this forward.
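As a sketch of what that extension looks like in practice: the W3C trace context is carried as a `traceparent` attribute inside the event itself rather than as a transport header. This is a hedged illustration using the OpenTelemetry Python propagation API on a hand-built, CloudEvents-style dict; the event type and source are made up for the example:

```python
import json

from opentelemetry.propagate import inject

def make_cloudevent(data):
    event = {
        "specversion": "1.0",
        "type": "com.example.bike.fixed",  # hypothetical event type
        "source": "/bike-shop/mechanic",   # hypothetical source
        "id": "a234-1234-1234",
        "data": data,
    }
    # inject() writes `traceparent` (and `tracestate`, if present)
    # from the current context into the dict, in W3C trace context
    # format, so the context travels with the event payload.
    inject(event)
    return json.dumps(event)
```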
A big point left here: I mentioned before the biggest pain point, sampling. We in the messaging workgroup want to work together with the OpenTelemetry sampling SIG to come to a good solution for link-based sampling, because I think that is what will still hurt many practitioners the most. And last but not least, broker instrumentation comes up again and again. I'm personally not sure how far we can standardize there, but what we definitely want is a good way to seamlessly integrate broker instrumentation with producer and consumer instrumentation.

If you want to get in touch with us, with the workgroup, the best way is through the #otel-messaging channel on CNCF Slack; that's where the folks of the workgroup hang out. Johannes Tax, that's me. Slack, by the way, is asynchronous communication. If you prefer synchronous communication, we also meet weekly: our workgroup meets each Thursday, 5 p.m. Central European Time, though not this week because of KubeCon. We always welcome people who come by to give us their opinion or just tell us their story. And last but not least, you can also open issues on GitHub. With that, thanks everybody for coming, and thanks for going through cycles of disintegration and pain with me. I think we have some time left for questions, if there are any.

I think the microphone is on, it works. Thanks Johannes, great talk. I have a quick question for you. I actually wrote down: for that to work, we need to ensure context propagation. Do you have any ideas or insights on how to implement that in cases where context propagation is not possible? For instance, in Brazil there is a payment process called Pix, where the transaction goes through a central bank and then comes back to the bank that sent the request. This central bank doesn't allow any specific OTel headers, so there's no way of sending them. The only way would be linking the transaction ID somehow from the caller to the rest of the call. Any ideas on how to do that without relying on the backend to query this transaction ID?

Great question. So the question is about context propagation: how do you propagate context if your broker or protocol doesn't support it? I think the long-term solution will be to push brokers, and also this bank you mentioned, to give you some way to propagate context, because observability is important. The short-term solution, and I see people doing this, is to integrate the context into the message itself. CloudEvents, for example, does that. CloudEvents defines a payload; the whole CloudEvents envelope is part of the message payload that you send, a JSON payload. And the CloudEvents specification simply requires putting the context inside this payload. So you put it inside the payload, it's part of the message, you send your message to your broker or intermediary, and the consumer receives the message and can then unpack and interpret this context. It's not an ideal solution, I think, but it's as good as we can do in those scenarios right now. The long-term solution is to increase awareness and to have native ways to support this, and I think those AMQP and MQTT standard drafts are a first step in that direction.
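Sketched on the consuming side, that workaround could look like this: the context travels inside the JSON payload, so it survives a broker or intermediary that strips all headers. This is a hedged illustration; the processing helper is a hypothetical stub:

```python
import json

from opentelemetry import trace
from opentelemetry.propagate import extract
from opentelemetry.trace import Link, SpanKind

tracer = trace.get_tracer("bike-shop-demo")

def process(data): ...  # hypothetical processing (stub)

def on_message(raw_payload):
    event = json.loads(raw_payload)
    # extract() reads `traceparent`/`tracestate` out of the payload
    # dict and rebuilds the remote context the broker never saw.
    producer_ctx = trace.get_current_span(
        extract(event)).get_span_context()
    with tracer.start_as_current_span(
        "receive", kind=SpanKind.CONSUMER, links=[Link(producer_ctx)],
    ):
        process(event["data"])
```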
Hi, thank you for the talk. I understand you didn't want to define whether we should only use links or put everything into one single trace. But what are the instrumentation libraries out there supposed to do? Are consumers going to have to define it per library, or how are they supposed to handle this?

On the semantic conventions side, we don't make any assumptions about this; it is something we leave to instrumentation authors to decide. I think we cannot prescribe one solution here, because requirements are very different; this differs between messaging systems and between use cases. So maybe instrumentation libraries will provide you with some configuration options around this. Maybe some, where it makes sense, will by default put everything into a single trace when no batching is involved, and thus give you that experience, and only break things up when batching is involved. I don't have an answer that covers all the cases. I can only say this will be instrumentation-library dependent, and we trust instrumentation authors to come up with good and solid solutions that will hopefully help everybody.

Okay, thank you. Now it's my turn. Thank you for your talk and your work. The way we usually implement consumers, we read a batch of events, not just one event; we read 20 or 50 events at a time to optimize network usage and other things. And each event has a different trace ID. We usually have an operation like read batch of events. The question is: what trace ID should we assign to this operation? Because then we want to visualize our data, and we have, say, 20 traces with publish and consume events, plus this synchronization operation where we read a batch. How does that work with span links?

Well, I think in such a scenario we have a fan-in situation. Here we have our fix three bikes, and here the batch is just small, it's three. But with span links, you can link to all the upstream traces that constitute the batch. On the modeling level, this looks pretty good. However, we still have challenges, and maybe that's what you're alluding to, when it comes to visualizing it. Span links are a great modeling tool on the abstract level, but how we visualize this is a question that we in the semantic conventions workgroup cannot solve. We rely on tools and vendors to implement good support for this, so that you as a user can seamlessly see what kinds of messages are coming in and where they are coming from. It's a usability experience that we in semantic conventions cannot provide; we can just make it consistently modeled, and then observability tools and vendors need to provide a good solution for you.
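As a hedged sketch of that answer, assuming messages whose headers carry W3C trace context: the batch-receive span starts its own trace and attaches one link per incoming message, so no single upstream trace has to be picked as the parent. The broker call and the message layout are hypothetical:

```python
from opentelemetry import trace
from opentelemetry.propagate import extract
from opentelemetry.trace import Link, SpanKind

tracer = trace.get_tracer("bike-shop-demo")

def poll_batch(max_messages): return []  # hypothetical broker call (stub)
def process(message): ...                # hypothetical processing (stub)

def receive_batch():
    messages = poll_batch(max_messages=20)
    # One link per message: each upstream trace stays reachable.
    links = [
        Link(trace.get_current_span(
            extract(msg["headers"])).get_span_context())
        for msg in messages
    ]
    with tracer.start_as_current_span(
        "receive batch", kind=SpanKind.CONSUMER, links=links,
    ):
        for msg in messages:
            process(msg)
```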
Thank you for a great talk. Each span link is often associated with a common dependency. Sorry, I'm over here. A common dependency: in your example, it was a bike. There's something that is producing a bike and something else that's consuming the bike. In my use case, these bikes could be very huge data sets, and an unknown number of them. So somehow the context propagation needs to persist for a long time, maybe a whole day or a whole week, and it has to carry some metadata that refers to that bike, that data. And all of this is, I guess, going to be part of the spec. Is that right?

Okay, and I also wanted to say that this all sounds similar to another project called OpenLineage, which is about data lineage and working out the dependencies between data sets with processes in between them. I've seen that project going on independently of OpenTelemetry, but I think maybe this is starting to bring the two worlds of data and processes together.

Yes, OpenLineage is a very interesting project, and there is definitely overlap between what we do here and what OpenLineage does. There is overlap in the requirements regarding context propagation: both OpenLineage and these semantic conventions require context to be propagated so that you can make the connections. However, the use cases are quite different. With OpenTelemetry semantic conventions, we strictly focus on telemetry data; we want to make sense of logs, traces, metrics, maybe profiles in the future. OpenLineage has a different focus: there you want to understand the history of the data, how it changed, how things fit together. So there is overlap, and maybe we can isolate this overlap by coming up with unified context propagation. We can try to use the same terms as OpenLineage so that we don't operate in two different worlds, but I think we benefit from keeping the use cases apart and staying aware that our focus here is telemetry. Otherwise, we easily run into the danger of coming up with a solution that seems to fix everything but then does nothing really well.

Yes, I've heard the OpenTelemetry team talk about potentially thinking about new signals, so not just metrics, logs, traces, and now profiling, but maybe other signals. I'm not sure whether you think there might be potential for new signals in this scenario.

There could be potential for new signals. If you have any ideas there, please join our group and talk to us. We are grateful for any kind of creative and new ideas in that space. Thanks.

Hello, hi Johannes, over on this side. I think that's the last question we will be able to take. Great talk, Johannes, thanks for that. Quick question: I think you already touched on the fact that sampling is a problem, right? But do you have any recommendation for how we can handle it in the meantime, until there is a permanent solution? Whatever your experience has been with the sampling part.

Yes, as I said before, this is the biggest current pain point. If you really need to sample and you want to see your complete workflow, then currently the only option you have is to try to put everything into a single trace; then established sampling solutions will work. If that does not work for you and you need to break up your traces, there are prototypes of link-based samplers out there that could be used. However, they are very immature, and they will not work for all use cases. They will also not give you a consistent sampling rate as output.
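To make concrete what such a prototype might do, here is a naive, purely illustrative sketch of a link-based sampler against the OpenTelemetry Python SDK interfaces: it samples a span whenever at least one linked upstream context was sampled, and falls back to a delegate otherwise. This is not an established solution, and the explanation that follows shows exactly where this rule goes wrong:

```python
from opentelemetry.sdk.trace.sampling import (
    Decision, Sampler, SamplingResult,
)

class NaiveLinkBasedSampler(Sampler):
    """Sample if any linked upstream span was sampled; else delegate.

    Usage sketch: NaiveLinkBasedSampler(TraceIdRatioBased(0.1))
    """

    def __init__(self, delegate: Sampler):
        self._delegate = delegate

    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None,
                      trace_state=None):
        for link in links or ():
            if link.context.trace_flags.sampled:
                # Any sampled upstream link forces sampling here,
                # which inflates the effective rate under fan-in.
                return SamplingResult(
                    Decision.RECORD_AND_SAMPLE, attributes, trace_state)
        return self._delegate.should_sample(
            parent_context, trace_id, name, kind, attributes, links,
            trace_state)

    def get_description(self):
        return f"NaiveLinkBasedSampler({self._delegate.get_description()})"
```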
Because when you have fan-in scenarios, like here where we have three links for fix three bikes, I get three upstream contexts in, and maybe two of them are not sampled and only one of them is sampled. So what do you do then? If you decide to sample, you will end up with an actual sampling rate much larger than what you specified upstream, because with three incoming contexts you effectively multiply the sampling rate by three. So unfortunately, the only advice I can give you now, if you say you need sampling, is: try to keep everything in a single trace.

Here we saw that fan-in works, it's supported by the spec. Is fan-out also supported, where we have one producer and multiple consumers?

Yes, fan-out is also supported. Fan-out scenarios you can even model with just a single trace. We can see that, give me a moment here. So here, basically, we have one producer creating three messages, and three different consumers pick these messages up. So fan-out scenarios are supported both with the single-trace solution and with the multiple-trace solution. Thank you.

Okay, I think we're out of time. Thanks everybody.