Welcome to our talk about OpenTelemetry and how we use it at Adobe. Before I begin, may I know who here uses the OpenTelemetry Collector in one way or another in their organization? Okay, awesome. I'm glad to see people using it in production today. My name is Shubhanshu Surana, and it's a pleasure to talk in front of you and share what we have learned over the past three years. I'm part of the observability team at Adobe, and I primarily work on the tracing side of the infrastructure as well as some metrics. When I'm not working, you can find me hiking and exploring nature, and I'm excited to be in Vancouver to explore all the amazing things around here. With that, I'll hand over to Chris.

Hey everybody, I'm Chris Featherstone. I was an SRE for a long time before I decided to start working in PowerPoint and going to meetings all the time. I've been at Adobe for about six years; I'm a senior engineering manager there. Our team runs the metrics and tracing platforms at Adobe. When I'm not working, I'm riding dirt bikes; I've been riding a lot of electric dirt bikes lately, actually. I'm from Utah, and this photo is near the salt flats.

I wanted to start by sharing some statistics around observability at Adobe, across metrics, traces, and logs. For metrics, we scrape 330 million unique series every single day. The intervals vary, but typically they're about 30 seconds apart, so every 30 seconds we're hitting 330 million series. On the span data, we're ingesting 3.6 terabytes every day, and we keep retention right now at five days. That's a sampled number; we'll talk more about tracing later in the presentation. Our peak is about 150,000 spans per second. It goes like a roller coaster depending on whether it's a weekend or a weekday, but at our high points right now we see about 150,000 spans per second. And for logs, we're doing north of a petabyte: 1.2, 1.3 petabytes of logs every day. Now, not all of these signals flow through our team. I'd say this is the majority of the observability data at Adobe, but there are a lot of acquisitions and things like that, so this is a pretty good chunk. And not everything here flows through the collector today, though a lot of it does; we'll talk more about that as we go.

So how it started: how did we begin looking at the collector? Our initial driver was distributed tracing. Adobe is largely made up of acquisitions; I don't know how many people know that, but almost everything was an acquisition at one point. And every time we do an acquisition, somebody comes in who prefers this cloud, prefers that tool, thinks Emacs is the best text editor (which it isn't); they have all of their opinions. With distributed tracing specifically, that becomes a huge challenge: imagine trying to stitch a trace across clouds, vendors, and open source. Eventually that's what led us to the collector, but first we tried to build a distributed tracing platform based on Jaeger agents. So, back in the time machine, the first box: in November of 2019, we deployed the Jaeger agent in all of our Kubernetes clusters as a DaemonSet. Developers could just spin up an application, the Jaeger agent was there, and they could send traces to it. As we were rolling that out, we started looking at the OpenTelemetry Collector. It had become super popular.
It was the buzz in the community, so we started looking at it while we were still in the middle of rolling out the Jaeger agent. We did an evaluation to see: would the collector meet our scale? Could it do the sampling we needed? Was it feature-compatible with the Jaeger agent we had? Eventually we got through all that testing, we were two thumbs up on the collector, and we started rolling it out in April 2020. It began replacing all the Jaeger agents we had deployed, and eventually we got it out everywhere. Originally the collector was there just to ingest traces; later, in September 2021, we stood up another set with different configurations to also bring in metrics. We'll talk more about this later, but we're trying to get logs into that flow as well. So really, tracing is what motivated us to get there.

So that's how it started. How's it going now? We've got this kind of crazy diagram with a bunch of arrows. We're going to talk about all of these boxes in the presentation, so don't get too hung up here. Going left to right through the boxes: the OpenTelemetry Collector sits in a number of critical paths. In the upper left we have applications. We instrument our applications using OpenTelemetry libraries, primarily auto-instrumentation and primarily Java, which I'll talk about in a second. We do some application enrichment: we bring in Adobe-specific data and enrich our pipelines as data flows through the collector. We have some custom extensions and processors that we run, which we'll talk about. We do configuration via GitOps where possible, the upper three boxes on the right. The collector is very dynamic at sending one set of data to multiple destinations, which we'll also cover; this was huge for us. And in the lower right, sometimes we send collector data to other collectors for further processing. It's the Swiss Army knife of observability.

So the first box: integrating the OpenTelemetry libraries into our applications. The organization we fall under at Adobe is called Developer Productivity, where we have the charter of helping developers write better code faster; that's kind of our mission statement. We try to do a lot of things to reduce friction around that idea. For the Java services in particular, we have what we call a blessed base container: if you're using a Java image, you should go use this one. It has a number of quality-of-life features already rolled into it, including the OpenTelemetry Java instrumentation jar. So we roll that into the image and configure it. The box down at the lower right I pulled from our docs; this is exactly how we configure it for Java. We set the Jaeger endpoint to the local DaemonSet collector, we set the metrics exporter to Prometheus, we set the propagators, we set some extra resource attributes, we set the traces exporter to Jaeger, and we set the trace sampler to parentbased_always_off. We're not going to dive really deep into what we're doing on tracing in this talk, but we're happy to talk more later about why we've set these flags the way we have, because I'm sure a few of those are raising some eyebrows. Point is, this is all rolled into that Java image that we have.
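As a rough sketch of what those settings look like in practice (this is illustrative, assuming the standard OpenTelemetry Java agent environment variables; the endpoint and attribute values are placeholders, not Adobe's actual configuration):

```yaml
# Illustrative pod-spec fragment for a Java service using the agent baked
# into the base image. Values below are placeholders.
env:
  - name: NODE_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: OTEL_TRACES_EXPORTER
    value: "jaeger"                        # traces exporter set to Jaeger
  - name: OTEL_EXPORTER_JAEGER_ENDPOINT
    value: "http://$(NODE_IP):14250"       # the node-local DaemonSet collector
  - name: OTEL_METRICS_EXPORTER
    value: "prometheus"                    # metrics exposed for Prometheus
  - name: OTEL_PROPAGATORS
    value: "tracecontext,baggage"          # propagator list is an assumption
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "service.name=frontend-service" # plus Adobe-specific attributes
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_always_off"        # sample only if the parent sampled
```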
So at this point, with these configurations, any Java service that spins up in Kubernetes at Adobe is already participating in tracing, and you can see the metrics exporter is already set for Prometheus. Everyone's participating in tracing just by spinning this up. For the metrics, we've tried to reduce the friction, but people still need to somehow get those metrics out of that exporter; we've made that pretty easy, but it's not automatic. And we have a lot of Java services at Adobe. I'd say 75% of what we run is probably Java. We're trying to take the same concept to some of our other images as well: we're doing it right now for Node and for Python, with similar developer images that we encourage people to use where possible. The important thing to call out on this slide, which maybe isn't a collector slide, is that everything this sets up passes through the collector. So let me hand it to Shubhanshu; he can talk about where the data goes once it's in the collector.

Thank you, Chris. I want to talk about how we do data enrichment on the OpenTelemetry Collector. We do a lot of enrichment, as well as ensuring no secrets are sent as part of our tracing or metrics data. We make use of multiple processors for that, specifically the redaction processor, plus a custom processor we have built into the collector. Both of these allow us to get rid of certain fields we don't want sent to the backend, which could be personally identifiable information, user details, and things like that. We also use these processors to enrich the data, because adding fields such as service identifiers, Kubernetes cluster, and region helps us search better.

To talk a little more about service identification: Adobe is built out of acquisitions, and we run multiple products in different ecosystems, so there is a high possibility of service names colliding across products, or of similar microservice names. We wanted to ensure that doesn't happen. For that, we use an Adobe-specific service registry where every service has a unique ID; we get that ID from the registry and attach it to the service name, which gives us a lot of benefits. To give a quick example of how it works: say the service name is frontend-service. We get the details from the registry for that particular service, maybe the ID is 1234 or something else, and we append it to the service name. Now, the benefits. First, it allows any engineer at Adobe to uniquely identify a service in the tracing backend, because everything flows to a single backend. Second, it allows engineers to quickly search for things even when they don't know the service or who owns it: they can look in the service registry, find the engineering contact for that product or team, and get on a call to resolve the issue they're facing. We also send the data to another set of OTel collectors, which I'm going to talk about in a couple of slides.
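To make the enrichment concrete, here is a minimal sketch using the contrib redaction and attributes processors; the keys, values, and regex are hypothetical, and the actual registry lookup that rewrites the service name happens in our custom processor, which isn't shown:

```yaml
processors:
  # Strip anything that looks like a secret before data leaves the cluster.
  redaction:
    allow_all_keys: true
    blocked_values:
      - "eyJ[A-Za-z0-9_-]{10,}"   # e.g. JWT-shaped values (illustrative)
  # Enrich spans with deployment context for better search in the backend.
  attributes/enrich:
    actions:
      - key: k8s.cluster.name
        value: "example-cluster"   # placeholder
        action: insert
      - key: cloud.region
        value: "us-east-1"         # placeholder
        action: insert
      - key: adobe.service.id      # hypothetical attribute name
        value: "1234"              # ID fetched from the service registry
        action: insert
```

In production the registry lookup also appends the ID to the service name itself, so frontend-service becomes something like frontend-service-1234.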
Next, export destinations. This is probably the most common use case, and I bet most of you are already doing it, but I cannot stress enough that we feel this is one of the biggest use cases, and we have been very happy with the OpenTelemetry Collector here. Before the collector, engineering teams at Adobe were using different processors and different libraries in different formats, sending to various vendor products and open source projects. It was very hard for us to get engineering teams to change their backend, or to make even a small change in their backend or application code, because engineers have their own product features and requests they're working on. With the introduction of the OpenTelemetry Collector and the OTLP format, this became super easy. We are able to send data to multiple vendors and multiple tools with just a few changes on our side. In the last year, we were able to send our tracing data to three different backends at the same time to test out one engineering-specific use case, which was huge for us; earlier that would have taken multiple years, so we are super happy about that.

The final block is sending the data to another set of OTel collectors at the edge. There we can do a lot of transformations, as well as tail-based sampling, rule-based sampling, and throughput-based sampling. Anyone here who has worked with the OpenTelemetry Collector, or with the tracing world generally, knows that sampling is the key to ensuring we don't send a ton of data, and we're always looking into ways to enrich what we keep. On our side we're also looking at implementing this for logging: can we send just the last 50 lines of a particular log, so that we don't send a ton of data to the backend? We're always trying and playing with this configuration so that we can get richer data in our backends.

This entire configuration is managed via Git. We make use of the OpenTelemetry Operator and its Helm charts, primarily for our infrastructure use case. Shout out to Alex Burka from Adobe's observability team, who is one of the maintainers of this project. The operator provides a lot of benefits for us. It moves responsibility from the application engineers to the subject matter experts, who know what they're doing, and it makes the configuration super easy. We also get the added advantage of easier debuggability for the infrastructure engineers who are responsible for maintaining the OpenTelemetry collectors. Additionally, the operator supports different deployment models: teams can just add a few annotations on top of their deployment and get the OTel collector as a sidecar or a deployment. We run the DaemonSets, but this gives teams a lot more flexibility in what they want to do. We also support auto-instrumentation via the operator. It allows engineers to pass in a couple of annotations to instrument their service automatically, for all three signals, without writing a single line of code, and today at Adobe a few teams are already doing that: they send all three signals to three different backends with just a few lines of configuration. This is huge for us.
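That opt-in really is just a couple of pod annotations. A minimal sketch, assuming the upstream OpenTelemetry Operator's annotation names (the Deployment itself is a placeholder, and Java injection assumes an Instrumentation resource exists in the namespace):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend-service   # placeholder workload
spec:
  selector:
    matchLabels: {app: frontend-service}
  template:
    metadata:
      labels: {app: frontend-service}
      annotations:
        # Have the operator inject the Java auto-instrumentation agent
        # (requires an Instrumentation CR in this namespace).
        instrumentation.opentelemetry.io/inject-java: "true"
        # Or run a collector next to the pod instead of using the DaemonSet.
        sidecar.opentelemetry.io/inject: "true"
    spec:
      containers:
        - name: app
          image: example/frontend-service:latest   # placeholder image
```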
This improves developer productivity; it takes it to the next level. And again, as I mentioned, using the operator lets us maintain our configs in a GitOps model: we store everything in Git, reference it back, and deploy it using our default Kubernetes CD model. I would mention that the operator is not the only way we do infrastructure deployment; we use other mechanisms depending on the need and use case.

Let's talk a little about what we have built on top of the OpenTelemetry Collector. As I mentioned, security was one of the key aspects of the collector at Adobe, and we wanted to keep security in every single step, wherever we use it. We have made use of the collector's bring-your-own-plugin capability to embed custom processors and custom extensions. We built a custom extension on top of the collector, and I'd like to thank the OpenTelemetry community for writing some wonderful blog posts for people who are new to this journey; they cover how you can integrate your own internal enterprise authentication system with the collector. When we started, we had two key requirements for this authentication system. We wanted a single system, secured and verified by Adobe's security team, to send the data to the different backends with a single authorization key. And we wanted it to be able to send to open source tooling as well as vendor products. When the collector standardized on OTLP, this was again huge for us. Rather than adding OTLP support to our authentication system, or decoupling it from the auth system behind a reverse proxy and making things complicated, why not just add an extension on top of the OTel collector? We were able to do that using the custom authenticator interface available in the OpenTelemetry config, extending and modifying it for our use case. Here's a simple example of how we're using it in production today: as you can see in the authorization section, users just need to configure their secret and they're all set to go.
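Since the slide itself isn't in this transcript, here is a hedged reconstruction of what that pattern looks like. The extension name and its fields are hypothetical stand-ins for the internal authenticator, but the auth/authenticator wiring is the collector's standard configauth mechanism:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

extensions:
  # Hypothetical custom extension implementing the collector's configauth
  # client interface against an internal enterprise auth system.
  adobeauth:
    secret: ${env:ADOBE_AUTH_SECRET}   # the single authorization key

exporters:
  otlp:
    endpoint: backend.example.com:4317 # placeholder backend
    auth:
      authenticator: adobeauth         # attach credentials to outgoing OTLP

service:
  extensions: [adobeauth]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```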
Now, building a data processing pipeline. I talked about extensions; the OpenTelemetry Collector also comes with a rich set of processors, which we use at Adobe. I'm not going to talk about all of them, but here are a few key ones. First is the attributes processor, one of the most powerful processors. It allows you to add attributes on top of your trace, log, and metric data, and to transform, enrich, or modify the data in transit without the application engineers doing anything. We use it to make our data richer and more queryable, and to improve the search capabilities in our backends. We also make use of the memory limiter processor, which ensures the collectors we run never run out of memory, and keeps in check the amount of memory we give the collector for keeping things in state. Then we use the span metrics processor and the service graph processor to generate data out of our traces. Between them, these two processors let us build RED metrics dashboards on the fly and create a service dependency graph on the backend, which anyone at Adobe can see and visualize; the span metrics processor gives us dashboards for RED metrics, request durations, and parameters like that. Here is a screenshot from our production setup for this particular use case.

Now, as I mentioned, we are not the only ones running the OpenTelemetry Collector at Adobe. We run the majority of the infrastructure, but there are teams at Adobe using it in their own ways, some of which we don't know about. We've spoken with some of them, not all, but it all fits together really well. And we still have a lot more to go and a lot more to do with it. I'm going to hand it over to Chris, who is going to talk about what lies ahead for us.

So, a couple of things that we are still in the middle of doing. The first, I'd say, is improving data quality. This takes different forms depending on which signal we're talking about, so let's break them down. For tracing, we're looking at getting rid of known junk. I'd put into this bucket things like /health, /whatever: things that no one is ever going to look at, traces that might have one span or two spans, nothing that's going to be valuable. The collector gives us the ability at the edge to create rules to drop stuff like that. We want to keep only the stuff that's going to be valuable and actually looked at; there's no point in having it come through the whole system if no one is ever going to look at it. For metrics, imagine we had the ability to aggregate right in the collector itself. Maybe we don't need 15-second granularity; let's dumb that down to five minutes and then send it off. We're looking at similar kinds of things, and another might be sending some metrics to be stored long-term while sending others on to be further processed in some operational data lake. We have the ability to pivot right in the collector and do all kinds of things. For logs, we're looking at some kind of sampling strategy there as well. This is still taking shape, but I imagine something like: what if we could grab a stack trace and go plus or minus 50 lines from it, and send that off? What if it has debug in it? Let's just drop that, depending on whether it's stage, prod, whatever. We have all kinds of levers in the collector that will let us keep just the stuff we actually want to keep. Again, there's no point in traversing clouds and data centers and storage if no one is ever going to look at it. So improving data quality will be a big initiative for us, probably ongoing forever really, but we're doing quite a bit of it right now.
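For the tracing side, a minimal sketch of that kind of drop rule, using the contrib filter processor (the attribute name and path are illustrative, not our actual rules):

```yaml
processors:
  filter/drop-junk:
    error_mode: ignore
    traces:
      span:
        # Drop health-check spans that no one will ever look at.
        - 'attributes["http.target"] == "/health"'
```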
The second one is rate limiting spans at the edge; I've got this image of an overflowing dam. One of our edges is taking something like 60 billion hits per day, and we're trying to do tracing on that. That becomes a lot of data when you're talking about piping it all the way down to somewhere to be stored. We're trying to figure out the right places to implement rate limiting, in which collectors and at what levels, just to prevent unknown bursts of traffic, that kind of thing. For example, I mentioned earlier that we have parentbased_always_off set for most services. People can override that, but the problem is somebody spins up and sets it to something different, say 100% sampling, while everybody downstream from them is set to parent-based always-off. We could get these internal denial-of-service attacks where all of a sudden we're sending crazy amounts of spans. And somewhat recently, exactly this happened, and it was like: hey, what are our options for rate limiting? Let's go explore that some more. So we're doing a lot of rate limiting, trying to protect ourselves where we can right now.

The third one here is maybe not related specifically to the collector. We've talked a lot about tracing, you keep hearing us say that, but we're trying to pivot to trace-first troubleshooting. We have so many east-west services that troubleshooting through logs, trying to pull up the right log index for whatever team, wondering whether you even have access to it, is so slow and so hard that we're trying to shift how people troubleshoot within Adobe. We've made a lot of effort to make these traces pretty complete, so we're questioning how people troubleshoot incidents: we sit down with them and say, show us how you troubleshoot an incident, and let's see whether the tools that are available are still the right way to be doing what you're doing. That, again, will take forever and we'll never be done with it. But these are some of the things we're in the middle of.

And here are a few things we're keeping an eye on as they mature; features we'll be interested in evaluating once they're ready, let's say. First, the metrics aggregation processor; I have a couple of links up here. We'd love to be able to aggregate in the collector and do some of what I was talking about before. Second, integrating the OpenTelemetry logging libraries. I mentioned the pre-baked images we give developers; as the log spec becomes stable and everything is good to go, we plan on just shoving that into our images, and we're definitely looking forward to it. Third, running OTel collectors as a sidecar to send metrics, traces, and logs. Imagine an information superhighway of observability: ideally we want all of these signals to flow through the collector. Right now we're doing a lot of Fluent Bit and all kinds of other stuff; a lot of it goes through the collector, but not everything. It would be so much more convenient if it all came through the collector, and I think we'll get pretty close to that soon; we're doing a lot of it already. Fourth, exploring the new connector component; I've got a link on this one as well. If you haven't read about it, we think we'll be able to generate some really interesting metrics about what's flowing through the collectors.
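As a sketch of what that might look like once we try it, using the spanmetrics connector from contrib as the example (pipeline wiring is standard; the exporters and endpoints are placeholders):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

connectors:
  # Acts as the exporter of the traces pipeline and the receiver of the
  # metrics pipeline, emitting RED-style metrics derived from spans.
  spanmetrics: {}

exporters:
  otlp:
    endpoint: traces-backend.example.com:4317            # placeholder
  prometheusremotewrite:
    endpoint: https://metrics.example.com/api/v1/write   # placeholder

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp, spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]
```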
We were super interested to read about this one when it was pitched. We're not using it right now, but we'll go mess around with it. And fifth, building a trace sampling extension at the edge to improve data quality. I talked a lot about data quality already. What we hope to end up with is some smart head sampling plus some tail-based sampling. We just have too much volume to keep 100% of everything, so we'll have to write something. There's some pre-built stuff for this in the collector already, in the contrib repo; if you've not looked there, there are tail-based samplers and things like that. We'll have to come up with something, so more to come here, I guess.

And to wrap up, I'd leave you with a few key takeaways. One: the OpenTelemetry Collector's plugin-based architecture is awesome. The way it's designed, the way you can plug into it without really affecting anything else going on in there, it's great. Two: the ability to send data to different destinations with a single binary. Shubhanshu touched on this, and I really can't overstate how much time we've saved because of this feature. Being able to send to commercial vendors and at the same time to open source, to wherever: this feature alone is almost reason enough to go mess around with the collector. Three: there is a rich set of extensions and processors that give you a lot of flexibility with your data. I mentioned the contrib repo; if you've never browsed through it, just go look at what's there. Every vendor you could think of is probably already there, and a lot of open source projects too, which ties into the final point. The support for this project, not just the collector but OpenTelemetry in general, feels a lot to me like the early days of Kubernetes, where everybody was buzzing about it; we're on the hockey-stick path right now, just about to go really crazy, I feel like. The community is awesome, the project is awesome. To sum up everything: if you haven't messed with the collector yet, you should definitely go check it out and see what it can do for you.
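To make that second takeaway concrete, a minimal sketch of one trace pipeline fanning out to several destinations at once (endpoints and exporter names are placeholders, not our actual backends):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp/vendor-a:
    endpoint: ingest.vendor-a.example.com:4317       # placeholder
  otlp/vendor-b:
    endpoint: ingest.vendor-b.example.com:4317       # placeholder
  jaeger:
    endpoint: jaeger-collector.observability:14250   # placeholder
    tls:
      insecure: true   # illustrative; in-cluster traffic only

service:
  pipelines:
    traces:
      receivers: [otlp]
      # One stream of data, three destinations; add or remove a line here
      # to start or stop sending to a backend.
      exporters: [otlp/vendor-a, otlp/vendor-b, jaeger]
```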
So thank you for coming. I think we have time for questions; we probably went a little short. So thank you. There was a microphone. Oh yeah, there it is. Don't look at that guy's hand either. Sure, thank you so much. For the audience, I'll repeat the question. The question is: can we make use of the data generated from spans, rather than storing all the metrics and looking at the events? So, we are doing that today. But there is always going to be a challenge here, because we want to capture less data while still giving richer information to our engineering users: those metrics help them write their own alerts, build their dashboards, and analyze the performance of their applications and systems. We also use those metrics for cost accounting and for application history, like the performance of the application over time. Yes, we can do aggregation, and we're still doing it; I'm not going to lie, we're doing a lot of aggregation on the backend as well. But it's a never-ending balance between us and the consumers, to ensure they get what they need, and we can keep building on top of it.

You talked about the different deployment methodologies that you use, as a DaemonSet, as a sidecar, and all that type of stuff. Can you talk a bit about the use cases your application teams have that merit those different deployments, and the tradeoffs you face when dealing with them? Yeah, that's a really good question. As Chris mentioned, our org is Developer Productivity, and we want to ensure developers can get things running quickly without worrying about all the nuts and bolts. For specific use cases, like an API gateway or other core components, we want to run the collector as a sidecar so we can get the data out without overloading the central collectors running in DaemonSet mode. We also recommend a sidecar for teams that are going to generate a lot of spans, or that are heavy request generators; like Chris mentioned, one of our applications takes about 60 billion requests per day. If we sent that data to the DaemonSets, they would go crazy and run into a lot of memory issues. So it's always a balance between the application and where that application resides, in terms of how it should use the OpenTelemetry Collector. Perfect. Thank you.

You talked about cleaning up the data, reducing the amount that gets sent to your storage engines or clients at the end. Would that be configurable for the developers? There could be edge cases where you're reducing the data in a way that makes them miss things. How do you keep a developer experience where they can still drill into their traces? That's my question. Let me say one thing and then you can give a good answer. Every time we've talked about sampling, there is this crazy FOMO that people have, every single time, and I'd say that's the biggest challenge to even starting the conversation. Like my example of grabbing a stack trace and going plus or minus 50 lines: that freaks people out. So yes, I'm going to take a quick stab at your question. For tracing specifically, we do have mechanisms in place where people can dial up and down what's being sampled. For logs and some of the metrics ideas I talked about, a lot of that is still mostly thoughts in our heads, so I'd say nothing's really there yet. But for tracing, we do have a plan to let people dial that up and down. No, I mean, all good. To give users more flexibility: at this moment, we do give them flexibility. It's not like we just drop data; we talk with teams before dropping anything, except for security or PII information, where there are no exceptions. We've worked with the engineers as well as the application libraries, like the blessed containers. If you're familiar with the Spring Actuator framework, it generates traces from one of its core endpoints that have no value for engineering; it's just junk for us, so we remove that at the backend level as well as at the library level. We also add some libraries alongside our blessed containers to give teams more flexibility. But these nuts and bolts are always tricky for an engineering team to modify on a day-to-day basis. Even though we give them control, they don't want to work on "do I want to send this trace or not?" They just care about their product.
They just want to send the data and be done with it. So it's a challenge. At the same time, we want to decouple this functionality from engineering so that they don't have to do the heavy lifting: they can focus on the product, and we can focus on data quality and improving those things. Awesome. Thank you.

Hi, I'm just wondering what advice you would give to teams that want to start using OpenTelemetry? More specifically, if you could go back in your journey, what lessons would you tell newcomers starting out? So let me take a stab. What I'd say is, in the early days we tried to focus on auto-instrumentation and on having this turned on by default. There was a speaker yesterday from Intuit who took a different approach: they went customer flow by customer flow and tried to get everybody in a given flow to participate. For us, we have too many products; that would have been impossible, because I don't even think we know what all those flows are. So we really had to push hard on tracing: we wanted tracing on for everyone, always, no matter what. And auto-instrumentation was the other thing. Developers want to build features; they don't want to build observability. So it was a much easier sell to say: hey, you're already doing Java, let's get this rolled in and it'll be on. And after some performance testing and validation, generally people were fine with that. I guess the theme of both of those comments is that the effort has to be almost none, at least for tracing to work.

Yeah, I would echo what Chris said. The challenge is always getting the engineering teams to work on OpenTelemetry or on these features. The easier we make it for them, the more ready they are to onboard, because there's always a priority list over there. Embedding this into an entire organization is always its own challenge. And this has been a journey of three years, and a lot of things have improved, so things we did earlier might not apply to you. Another thing I'd ask you to focus on is figuring out which workflow you're targeting, rather than targeting all three signals at the same time. If you want to get tracing enabled, start with tracing; if you want to work with metrics, start with metrics. Turning everything on for every application is not going to work out. So start with one signal and take it from there. Thank you. Thank you all for coming.