Hello, everybody. Thanks for joining our session — let's get started. In today's session we are going to be talking about realizing the value of open standards, specifically around OpenTelemetry. I am the moderator of the panel, and these are our amazing panelists. I'm Hope, I work at Microsoft, and I also contribute to the OpenTelemetry End-User Working Group. I'll let the panelists introduce themselves, starting with Iris.

Hello, everyone. My name is Iris Durmishi. I'm a senior observability engineer at Miro, so my whole workday revolves around observability. And if I have to share a fun fact about myself: I am also a crazy cat lady.

Hi, I'm Marcin Sodkiewicz. I work at Ryanair; I'm a principal software engineer there. Apart from that, I'm an AWS Community Builder in the serverless area, and I introduced OpenTelemetry to Ryanair.

Hello, I'm Daniel Gomez Blanco. I'm a principal software engineer at Skyscanner, where I lead observability. I'm also a member of the OpenTelemetry Governance Committee, since quite recently. And if I have to share a fun fact about myself: I like drumming — drumming with some weird instruments, if possible. I'm playing bass.

OK, so I forgot to mention: there is a QR code for the Q&A. During the session, if you have any questions, anything you want us to touch on, or just some comments, you can drop them there. It's simple to use — you just add your comments. We also have some time at the end for Q&A, so if you want to ask your questions live, you can do that too. Just something to keep in mind.

OK, so to start this session, let's begin with a very, very short story — it has to be very short — on your journey to OpenTelemetry in your organization: maybe where you were before you adopted OpenTelemetry, what actually motivated you, what the catalyst was, and where you currently are in terms of adoption. Let's start with Iris.

OK, so my story with OpenTelemetry started almost one year ago, and I've had experience with two projects migrating from other open source tools to OpenTelemetry, so my answers might be a bit mixed. If you hear, for example, Prometheus and VictoriaMetrics and think, whoa, what were you guys doing — it's because of two different projects. What pushed us towards OpenTelemetry, in both of my experiences, was the lack of a central tool to collect all of our data. Also, we were, let's say, not vendor locked in, but we were very limited in the number of backends that we could use for our data, for correlation. So when we saw OpenTelemetry, we jumped on it to standardize our observability platform and to give ourselves the option to use open source backends or an observability vendor depending on our needs at the time — that can change all the time.

So my journey with OpenTelemetry started many years ago. We were migrating to the AWS cloud from our own on-premise data center. We had been using New Relic since the very beginning, so we had quite a nice platform for observability. But then we started to integrate our systems with Lambda and different languages; we started experimenting, because when you go from a data center to the cloud, you have many more options. And we had a problem with having a common solution for observability in general. We were trying to find something that would be future-proof.
There were a couple of projects at the time, OpenTracing and OpenCensus, but I saw that there was a new project, OpenTelemetry, which caught our attention, and we started following it from the very beginning. So now, a couple of years later, my department is fully on OpenTelemetry, but other departments are not there yet — the journey is ongoing, because they were too coupled with New Relic in general, and there was always a bigger priority in the backlog. Right now we are going all in on OpenTelemetry. We believe in the standard, and we made a really great decision a couple of years ago in my department, and now it's propagating to the whole organization.

I guess my journey with OpenTelemetry started in early 2021. Skyscanner being a travel company, in 2021 there was a lot of change, with a pandemic going on. So the decision to adopt OpenTelemetry initially came from simplification. We wanted to simplify the amount of cruft that we were running internally to collect and export telemetry data. We had custom libraries, custom protocols, lots of different pipelines, many vendors, many open source platforms running internally — mostly an ELK stack, OpenTSDB, Prometheus. And most importantly, we didn't have a single place for all the telemetry that would allow us to correlate between signals. So we had all the telemetry, we were exporting massive amounts of data, but we were not making sense of it all, because we had no correlation between the signals. So it was a bit of both: a simplification journey and also a correlation journey. Since then we started to adopt OpenTelemetry — OpenTracing was our initial migration path — then moved on to metrics recently as well, and we are starting to correlate with other logs that could not be instrumented with OpenTelemetry, getting more into the semantic conventions for logging. So yeah, that's our journey so far.

Very similar to our journey, really.

Amazing, amazing. It's great to hear, especially, the reasons why you decided you needed to jump on OpenTelemetry, which brings me to my next question: understanding the value of OpenTelemetry. Why should somebody use it? And more importantly — because some people already know that this is amazing, this is great, this is going to help us, it's future-proof, all those kinds of things — how do you communicate that value to, say, your organization's leadership? That is many people's biggest challenge. Do you want to start?

Yeah, I can start. So, in terms of communicating value — Marcin already said it before — it's about having that future-proof strategy for your instrumentation layer. If you're using the OpenTelemetry APIs, you get an API layer with strong stability guarantees, but it's decoupled from your implementation. So if you decide to change your observability backend, for example, you're not going to be tied to a specific client implementation. And then, from the point of view of us contributing back to the community: our approach at Skyscanner used to be to do it all in-house — and then you have to maintain it.
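As an aside, here is a minimal sketch of what that decoupling looks like in practice: application code in Java written only against the OpenTelemetry API, with no reference to any vendor client. The class name, tracer name and attribute key below are made up for illustration; which SDK, exporter and backend actually process the spans is decided at deployment time (for example via the Java agent or SDK autoconfiguration), so swapping vendors does not require touching this code.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class QuoteService {

    // Only the stable OpenTelemetry API is referenced here, never a vendor SDK.
    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("quote-service"); // hypothetical instrumentation name

    double quote(String route) {
        Span span = tracer.spanBuilder("compute-quote").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("quote.route", route); // hypothetical business attribute
            return computePrice(route);
        } finally {
            span.end();
        }
    }

    private double computePrice(String route) {
        return 42.0; // placeholder for real business logic
    }
}
```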
And that's how we communicated the value to leadership: if we contribute upstream, we no longer have to maintain it just for ourselves; we can maintain it for everyone and have a common set of libraries and standards that everyone can work on together to get value out of. So that's the initial selling point. Then of course there's the point of reducing our time to detect and time to resolve for production issues.

My thinking, back when we were first considering OpenTelemetry, was that in the future, if someone builds a great new open source project, why should they have to pick New Relic for observability? Why Splunk? Why Datadog? There needs to be some standard for emitting telemetry from it, and it has to be a standard that is not vendor-locked to one single company. Right now, in Java, everything works based on aspects, as far as I know — we are injecting stuff into our libraries, and that has performance drawbacks. But if in the future we have these kinds of hooks — listeners or whatever — embedded natively into all the projects, we will get performance and observability as well. That's a great selling point.

Yeah, I think that's something we sometimes miss: we think about instrumentation libraries and forget that we're moving into a future where the libraries you use come natively instrumented with OpenTelemetry. So if you're using OpenTelemetry, you're not missing out on the telemetry added by the library authors themselves — they're probably the best people to decide what to instrument in a library, right? You're getting that out of the box, because they can do it without tying themselves to an implementation.

Or it would be great to see it even in products — say you are deploying a new database that just came to market, and it can send all its telemetry to your own backend, and you understand it and can correlate it with your other signals, right?

Exactly, yeah.

Amazing, thank you. Thanks so much for that. So I'm going to shift gears, and I'm going to ask you, Iris, because you mentioned that you've now worked on two projects migrating to OpenTelemetry. Can you talk about rolling out OpenTelemetry in your projects — what was it like for you?

Absolutely. Rolling out observability is not magic. We should not expect that one day we wake up and say: OK, OpenTelemetry is amazing, let's put it in — everything that exists, all the good signals that you are already getting and your engineers are relying on, let's disconnect them for a little bit because we're going to get OpenTelemetry and it will be better in the long run. No. I'm a big fan of rolling out OpenTelemetry slowly, at a pace that works for you and is compatible with your organization and your architecture, because there can be very complex architectures where this could take a long time.
So my first step, always, when I think about migrating a pillar of observability to OpenTelemetry, is making sure that the Collector is configured — and that is not something that requires a lot of science; there's a lot of documentation out there. Then comes the most important part: making a plan to put this Collector in a strategic place where it will not disrupt your current architecture. For example, say you are migrating tracing — which is my favorite observability pillar, so I might mention it a lot — and you are relying heavily on Jaeger. You don't want to just disconnect that, and you don't want your teams to change their roadmaps just because observability wants them to use OpenTelemetry. Yes, it is amazing, but we have to give engineering teams time. So what you do is put an OpenTelemetry Collector right there in the middle, then use the amazing tools like Jaeger to send everything through to OpenTelemetry, and then give your teams time to start using auto-instrumentation or manual instrumentation for OpenTelemetry. That makes for a very smooth and slow transition, let's say, but it will not affect your current architecture. I come from a background where we had, and still have, very good observability platforms that our engineers are used to working with, so this is very important for me. Maybe for another company that has no observability, or very low levels of observability and collection, it's going to be faster — but this is the most important part.

The second point about rolling out OpenTelemetry, I would say, is that the architecture of the OpenTelemetry Collector is as flexible as all the receivers, exporters and processors that it has. Just because it can collect all the information doesn't mean that it has to. Depending on how big your observability platform is, maybe you will need one OpenTelemetry Collector for logging, one for tracing and one for metrics. Imagine you have terabytes of data passing every day through one deployment of collectors and for some reason it fails — everything fails in the IT world — and then you lose all visibility. It is very important for us as observability engineers, or whoever is involved in such a project, to study our architecture and make sure the setup fits us. There is no one size that fits all; it needs to be based on the architecture you currently have. And if you take this into account and do it slowly, you will avoid some of the pitfalls that await you, like data loss, or finding out that your data is not compatible with the backends, and so on.

Good. Would you like to add to that, Dan?

Yeah. Well, I like talking about rolling out OpenTelemetry and best practices. I think it's worth decoupling, in a way, enablement from adoption within an organization. On the enablement side, you probably have your platform engineers and, if you have them, observability engineers who are in charge of, for example, configuring collectors or deciding on a common strategy. So you've got cross-organization alignment for your teams in terms of: what standards do you want to apply? What are your context propagation protocols, your export protocols? What are the instrumentation libraries that you enable by default?
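As a rough sketch of the rollout pattern Iris describes — an OpenTelemetry Collector dropped in between existing Jaeger-instrumented services and the backend, while newer services send OTLP directly — a Collector configuration might look like the following. The backend endpoint is an assumption, and the single traces/metrics pipelines shown here could just as well be split into separate Collector deployments per signal, as she suggests.

```yaml
receivers:
  jaeger:                 # keep accepting spans from existing Jaeger-instrumented services
    protocols:
      grpc:
      thrift_http:
  otlp:                   # accept OTLP from services already migrated to OpenTelemetry
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp:
    endpoint: observability-backend.example.com:4317  # assumed backend address

service:
  pipelines:
    traces:
      receivers: [jaeger, otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```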
So that enablement team is in charge of setting the standards and the golden path that you want your company to follow, making it the path of least resistance, right? Having that enablement team allowed us, for example, when we migrated from OpenTracing to OpenTelemetry, to do it under the hood. We replaced the OpenTracing tracer with the OpenTelemetry shim for OpenTracing. Our engineers — the end users who were originally using the OpenTracing API — didn't really see a change at the instrumentation layer. And, as Iris was saying, we then started to roll out OpenTelemetry gradually: you have time to migrate; in the meantime, under the hood, we're funneling everything to our collectors and shaping that telemetry into what we want it to be.

And then there's the adoption side, where I think it's worth having a cross-functional team of observability champions, ambassadors — call it whatever you want — a team that is able to translate those observability best practices and roll them out in their own domains. They don't need to be experts on OpenTelemetry Collectors or on SDK config, but they know how to use the API, they know the concepts and best practices for telemetry, and they're able to translate that into their domain and adapt it to their custom needs. So these two groups working together is actually quite important.

Great. So I know you've mentioned some challenges, which I will come to in a minute, but we have a question from the audience that has received a lot of attention. It says: any insights on how you can implement or adopt these open standards — specifically OpenTelemetry — across a large organization that hasn't done it before? Anyone?

Is the question about rolling out the...? Yes — it's more or less: we've just heard about OpenTelemetry at this amazing conference, I work in a large organization that hasn't heard anything about OpenTelemetry before, and I really want us to implement it. So how do you adopt it? Do you have any insights on how we can implement or adopt that?

I guess if you're a large organization, you're not talking about greenfield. I'm assuming that, being a large organization, there's already going to be quite a lot of cruft in how you approach telemetry, right? So I think this is where I... — Not always. — No, not always; there could be large organizations that don't have anything in place. But if you're wanting to adopt open standards, some of the things that could help, as I mentioned, are having that abstraction layer that allows you to do things under the hood with an observability team. OpenTelemetry Collectors can also allow you to start ingesting data from multiple sources, put it into a common format, apply some common standards for semantic conventions, and start to benefit from that. Maybe that's one aspect of it.

Yes, so I can say something from my experience about how it looked. When we started, we started with the lack of correlation in the data. It was especially visible across different domains — different teams, really. For example, there was a bigger domain, and some domains were owned by different teams.
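Looping back for a moment to the under-the-hood swap Daniel described: a minimal sketch of what registering the OpenTracing shim can look like in Java. The exact factory method varies between shim versions, and the OpenTelemetry instance is assumed to be configured elsewhere (for example via SDK autoconfiguration); existing code that uses the io.opentracing API keeps working unchanged while spans are produced by the OpenTelemetry SDK.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.opentracingshim.OpenTracingShim;
import io.opentracing.util.GlobalTracer;

public final class TracingBootstrap {

    public static void init() {
        // Bridge: OpenTracing API calls in legacy code are translated into
        // OpenTelemetry spans by the shim, so instrumented services need no code changes.
        io.opentracing.Tracer shim =
                OpenTracingShim.createTracerShim(GlobalOpenTelemetry.get());
        GlobalTracer.registerIfAbsent(shim);
    }
}
```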
Basically, some applications were written in Java, some in .NET, and there was a lack of correlation and of passing that correlation context between them. It was a long time ago, so it was a very different world regarding telemetry than we have today. What we started with was standardizing the format we were logging in and applying semantic conventions — although we didn't know the term yet. It had some problems, with indexing certain information in OpenSearch, for example. After getting that correlation in the logs, the next thing we did was tracing. And the value that you get from distributed tracing is what gains traction in the organization: it's much easier to track issues, it's a great tool for QA engineers, who can work on the dev environment and see what the issue is, and it can boost your time to market. This is where the business starts to see a lot of value. So this is something I would advise: start with tracing and get the value out of it, because there's tons of value in it. Another thing that might be interesting for a large organization is cost — we'll talk about that later, right?

Yeah. And I was going to mention as well that, depending on the organization, it could be about implementing some parts and not others. I work at Microsoft, and in my team we hadn't had a lot of OpenTelemetry, while several other teams were way more advanced with it. So I had to start from scratch in terms of advocating for OpenTelemetry: let's use this. And like Marcin said, the traces part seemed to be the hook — oh, this is going to help us. I actually did a proof of concept and showed: this is what it's going to help us do. And that got people more interested: OK, let's try this, we want to do this. So that's probably an angle you can come from: look for the telemetry data, maybe traces, that is going to get them excited about it.

And if I could add from my experience — I come from a big organization as well — my advice would be: provide the infrastructure, provide the OpenTelemetry Collector, and start sending everything there slowly. Then the adoption and the standardization will come, gradually. Another tip I could give: start with the pillar of observability that is the least valued or the least performant in your company. Usually it is tracing — that's why tracing is often the forgotten pillar, although it's getting its momentum now. Start with the one that's not providing value yet and show your company: hey, look at this value. If you need to create some proofs of concept to show how much more you can get from it, the OpenTelemetry Collector has amazing transformation processors that can help you demonstrate what standardization could bring to your company. Of course, it can be difficult to change minds, but providing the infrastructure and showing the value that Dan and Marcin already mentioned can be a great first step towards this goal.

And I think that's one key thing you're saying: showing value. We talked about communicating that value to leadership, and in my experience it was actually easier — maybe this is just the type of leadership in my organization — to communicate value to leadership than it sometimes is to convince engineers that the ways they've been observing their workloads are maybe not the most optimal ones.
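On Iris's point about using the Collector's transformation processors to preview what standardization buys you, here is a small, assumed example: remapping a home-grown attribute name onto a semantic-convention key before the data reaches the backend. The attribute names are illustrative, and the exact semantic-convention key may differ between versions of the conventions.

```yaml
processors:
  transform/standardize:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Copy a legacy attribute onto the semantic-convention key (assumed names)...
          - set(attributes["deployment.environment"], attributes["env"]) where attributes["env"] != nil
          # ...then drop the legacy key so only the standard one remains.
          - delete_key(attributes, "env")
```

Running a copy of existing data through a pipeline like this makes it easy to show teams, side by side, what their telemetry would look like once it follows the common conventions.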
We recently published a blog post on the OpenTelemetry website, in the blog section, where we took the OpenTelemetry demo and had different teams play against each other to see who could solve, the fastest, the issues you can inject into the demo. So it's a gamified root cause analysis. And it was a great example: the teams that were perhaps used to logging — who, when they see the OpenTelemetry demo, go, well, I'm just going to go to the logs to see what failed — were the ones that normally didn't win. And the ones that were used to tracing, used to context and correlation to actually find out what happened, were the ones that ended up winning our observability game day. So showing people by doing, rather than telling them what the value of observability is, can also help. As we say, humans are normally harder than technology, right?

Exactly. Which brings us to the topic of challenges. There are challenges people are most likely going to encounter doing this, whether you've already started adopting this in your team or you're just about to get started. So what are some challenges you feel people will encounter, maybe from your experience or from what you've gathered, and how can people overcome them?

I knew this topic would come up — I have a full list, but I will try to pick a few. One of them is simply how we collect the telemetry we have. We've mentioned collectors multiple times; in our case, we have migrated our collectors a couple of times. First we started with the agent approach, deployed as a sidecar. That is a failure, in my opinion — a bad practice — and the only way I think we should collect data is through a gateway collector. Of course, if you want to have extra collectors as sidecars, that's fine, but in my opinion the gateway is the way. Why? Because of many things, like security — otherwise you have to pass the credentials for your vendor around everywhere. There are a lot of things you can do in the collector in a single place; especially in a big organization, it's just one deployment. You can change the way you're collecting telemetry with one simple deployment instead of tons of deployments across the whole organization. That might be, for example, removing some attributes, remapping attributes, or introducing tail-based sampling, where you can keep, for example, mostly errors — that kind of thing. Then we changed again to go multi-region, with a single centralized account in AWS, and so on and so on. And I guess it's still not the final form, right?

And I think that brings us to the instrumentation of the code, because if we are sending something — what are we actually sending? That's very important. Lots of developers I have worked with think that after enabling auto-instrumentation their job is done. No — it hasn't even started, in my opinion. What you need to do is put the proper attributes on your spans; you need to properly instrument all of your code. And even if you instrument the code with your business data — the things you're interested in and search by — it's still not the end of it, because auto-instrumentation works for synchronous communication, where we have HTTP requests, right?
We add a new header, it gets picked up on the other end, and your traces get connected. But what if you have asynchronous communication? What should we monitor? What is actually your trace? For example, you're sending messages through some queue, or maybe you're processing a file on a schedule. You should think about it more or less the way we think in domain-driven design — about concepts like aggregates and bounded contexts. The same thinking applies to tracing, in my opinion. This is something that is usually missed during the instrumentation part. And of course semantic conventions, which are the hardest part of OpenTelemetry — agreeing on our custom attributes.

Do you want to add something?

Absolutely. If we are talking about challenges, for me the biggest challenge is always the human factor. But as I see it, it's part of the job of an observability team to work with other engineering teams and with management to find the best solution. Of course we have many ideas that we want to bring in, and of course there is going to be pushback, because sometimes people feel that what they already have works just fine. So that is the biggest challenge I see. Another one I could mention: sometimes I see usage of very old technologies where, for example, the GitHub repo might have been deprecated, or has been read-only for two or three years, but it still works magically. So you have to wonder: how can I make this fit with OpenTelemetry? OpenTelemetry is great, it has so many ways to get data in, but sometimes with these old technologies it is very challenging. These are the things you really have to put some work into. And once you see these challenges, you need to prepare a roadmap and make time for it, which is also the next part: it takes time, it takes effort, and some teams do not have the hands to implement it. But yeah, all challenges have a solution, and that's why observability teams are here to save the day.

And we need to mention at this point as well, I guess: cardinality explosion.

Yeah, I was going to mention that — one of the challenges is what to instrument, what not to instrument, and which signal to use. Because once you bring in the OpenTelemetry API — and this is a good practice — your feature teams, the teams that are not observability-focused, should also be the ones adding telemetry to their workloads. But how do you make sure that they actually follow your best practices? Because it's quite easy to go and add an attribute with unbounded cardinality, and then you sort of DDoS yourself and your metrics backend, right? And the teams are trying to do the right thing. Sometimes it's because they're following old practices — maybe they didn't have tracing in the past, they don't know tracing exists — and they think, well, I'm just going to add, I don't know, the user ID to this metric. So that's the challenge: how to communicate that.

But sometimes it's even trickier, because with cardinality it's not just one single value — attributes also multiply with each other. I'm a New Relic user, and sometimes we have accounts that are used for multiple dev environments. Adding one environment is one value of cardinality; with two environments your cardinality budget is effectively divided by two, with three environments by three, four, and so on. And then there's another multiplier on top of that — then you run it on 10,000 pods. Yes.
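Picking up Marcin's earlier point about asynchronous communication, here is a minimal, assumed sketch in Java of carrying trace context across a queue by hand: the producer injects the current context into message headers and the consumer extracts it to continue the same trace. The broker, span names and messaging attributes are illustrative, not prescribed.

```java
import java.util.HashMap;
import java.util.Map;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import io.opentelemetry.context.propagation.TextMapGetter;

public class OrderQueueTracing {

    private static final OpenTelemetry otel = GlobalOpenTelemetry.get();
    private static final Tracer tracer = otel.getTracer("order-pipeline"); // hypothetical name

    // Producer side: start a PRODUCER span and inject the context into message headers.
    static Map<String, String> publish(String payload) {
        Span span = tracer.spanBuilder("orders publish")
                .setSpanKind(SpanKind.PRODUCER)
                .setAttribute("messaging.system", "rabbitmq") // assumed broker
                .startSpan();
        Map<String, String> headers = new HashMap<>();
        try (Scope ignored = span.makeCurrent()) {
            otel.getPropagators().getTextMapPropagator()
                    .inject(Context.current(), headers, Map::put);
            // send(payload, headers) would go here
        } finally {
            span.end();
        }
        return headers;
    }

    // Consumer side: extract the context from the headers and continue the same trace.
    static void consume(String payload, Map<String, String> headers) {
        Context parent = otel.getPropagators().getTextMapPropagator()
                .extract(Context.current(), headers, new TextMapGetter<Map<String, String>>() {
                    @Override public Iterable<String> keys(Map<String, String> carrier) {
                        return carrier.keySet();
                    }
                    @Override public String get(Map<String, String> carrier, String key) {
                        return carrier == null ? null : carrier.get(key);
                    }
                });
        Span span = tracer.spanBuilder("orders process")
                .setSpanKind(SpanKind.CONSUMER)
                .setParent(parent)
                .startSpan();
        try (Scope ignored = span.makeCurrent()) {
            // process(payload) would go here
        } finally {
            span.end();
        }
    }
}
```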
Yeah. So you're already touching on that, because there is another question from the audience that has a lot of traction. It says: do you perform audits on your telemetry data? You were already talking about needing to teach users how to use telemetry properly — so do you audit this telemetry data so that you can say, hey, look at this data, you are using too much of it, you're doing this wrongly? And if you do, how do you approach it?

Yeah, I can go first, because yes, we do — we do a lot of auditing. Currently we are working to make it as automatic as possible, but right now we are in a big evaluation of our telemetry data: how much we're spending, how much we're sending, what is actually used. What we do, for example for metrics, is check the metrics with the highest cardinality, try to understand what is happening, make a plan for how we can lower it, and approach the teams. And a great way to do that kind of outreach is to use the amazing relationship that we have with FinOps, because it's very important to understand in observability that the teams are the owners of their observability data — it's not a cost that the observability team carries on its back. So it is very important, when you approach a team about usage, to say: we're using this much, here is your bill. Not literally, but of course we have dashboards where everyone can see how much they are spending. So you approach them and say, hey, this can be done better, like this. In the beginning this auditing can be very exhausting, and it requires a lot of time and hands, because there can be so much data, but once you've done it the first time and then do another iteration, it gets better and better. For logging it is easier: you just see everything at debug level — OK, this shouldn't be here for this long, it's too much — and you approach the team. For tracing, I would say it is a bit more tricky. What we are planning, and have done in the past for tracing, is to simply implement policies: instead of doing the auditing there, we actually drop the data before it reaches the backend. Everything that we deemed unnecessary, or that didn't provide any useful information — consulting with our engineers, of course — was dropped before going there. And hopefully we can implement that for our logging and our metrics as well.

We mostly audit metrics, in New Relic; there are a couple of functions there for tracking cardinality. We have dashboards, and we watch for errors, basically: if you exceed the limit of attributes per metric, you won't see the data, so that's an easy one to spot — you see drops in the data, and we have a dashboard for it. For logs, regarding following standards, we had a lot of issues. If you have rules but you are not enforcing them, it's more an intention than a rule in the company, so it will get broken: someone changes team, someone changes job, and in two quarters you have a mess in your cluster. So what we decided to do is have a common semantic convention model for logs. It was established before the official OpenTelemetry logging conventions — I guess the logging format for OpenTelemetry is not generally available yet, so we are still waiting for it, and I guess we will migrate towards it when it's done. Ours was based on a draft from a couple of years ago.
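As an assumed illustration of the policies Iris mentions — dropping telemetry deemed unnecessary before it ever reaches the backend — and of the tail-based sampling Marcin brought up earlier, a gateway Collector might combine the tail_sampling and filter processors roughly like this. The sampling percentage, health-check route and severity used here are placeholders to adapt to your own data.

```yaml
processors:
  # Keep every error trace, plus a sample of everything else (tail-based sampling).
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-all-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
  # Drop telemetry that was deemed not useful before it reaches the backend.
  filter/drop-noise:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/health"'   # assumed health-check route
    logs:
      log_record:
        - 'severity_text == "DEBUG"'
```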
And basically, what we are doing with those log standards currently is that in OpenSearch we have strict mapping. So if you are logging something in the wrong shape, you won't see it — basically there are no logs, and you get errors instead. There is just one proper way to log, in a specific format, and that's it. Apart from that, we have this thing called prefixes: you can prefix your log attributes with the name of your project, so if you do that, we'll catch it quite quickly, we'll be at your desk, and you will have to fix that log. Regarding log levels, we monitor the cost of streaming logs to the cluster, so if there is a huge increase in data flow, we will know about it. That's what we're doing now, when we have time for it.

I was thinking that tying it all back to cost and distributing cost to teams is actually quite an effective way of doing that. In our case, telemetry cost is distributed to teams in the same place as their cloud provider costs and their data pipeline costs — it's all aggregated in one place. So they're able to make a judgment — OK, how much am I spending on telemetry? — and then make optimizations. So it's those teams that come to the observability engineers and say, can you help me save? And we go: well, turn off those debug logs, try tracing, use tail sampling. And then they're able to save money as well as improve their observability, really.

From my experience, there's one thing regarding that: it's always worth checking what you are optimizing, because sometimes you are optimizing the wrong thing. In our company we were optimizing for a long time, reducing the signals we were sending to New Relic. And what we found out was that no engineer knew what the pricing model for the service we were using actually was. So we were optimizing the signals, but that was not the major cost on our platform — the major cost was the per-user, per-seat cost in New Relic. So what we have changed — are changing at the moment — is the way the data is made accessible to engineers. We had been optimizing for many years thinking it was not an issue, so it's always worth seeing what you're optimizing for before you start.

This conversation is so interesting, but we have to go to lunch, and we have so many questions in the Q&A that we can't get to. So, we have one minute left — I don't know if there's anyone in the audience who wants to ask a question live, whether you've asked it in the Q&A and just feel it's super important and you want it addressed. Just one.

You can always find us on LinkedIn or on the CNCF Slack and ask us questions there; we're completely open to answering all your questions. And we've got the OpenTelemetry Observatory here at KubeCon, so drop by — we're all happy to talk. We talked about semantic conventions quite a lot today; I think there are still some spots in the user feedback session for semantic conventions, and there are other feedback sessions as well, so make sure to join us there too.

Yeah. So, maybe one more question? Because... we're at time. So maybe we are just going to close it out here. If you have any questions that we didn't address — I know there are so many of them —
like we said, you can always find us at the OpenTelemetry Observatory from tomorrow. So with that, thank you so much. Thank you so much. Thank you. Thank you, Hope.