I'm Daniel Dyla. I'm on the OpenTelemetry Governance Committee, I'm a maintainer of OpenTelemetry JavaScript, which I've been working on for a little over four years now, and I'm a contributor to OpenFeature.

Hi there, I'm Mike Beemer. I'm a product manager at Dynatrace, I'm involved in open source contributions, and I'm a member of the OpenFeature Governance Committee. You may or may not have heard of OpenFeature, so I'll quickly cover what it is. It's an open specification for vendor-agnostic, community-driven feature flagging. It works with commercial vendors, in-house solutions, closed source, open source, whatever the case may be, and there are nice integrations with flag management tools.

If you're not familiar with feature flags themselves, a really quick level set: the main idea is that a flag is a pivot point in your code, something that can be updated without a source code change and without restarting the service itself. That provides a lot of benefits, which we'll cover in just a second.

So why would you use a feature flag? One reason is to coordinate feature releases: you can decouple a binary release from a feature release, which is really common in trunk-based development. You can also reduce the risk of a feature release. One way to do that is to control the impact radius, something we'll show in the demo in a little bit: you enable the feature for a subset of users, either targeted very specifically or selected randomly as a bucket of users. And finally, as you get more mature in your feature flag usage, you may run experiments: you have one, two, three different variations you'd like to test, you measure the impact, and then you decide whether it's something you'd like to keep.

With all those benefits, though, feature flags introduce a few challenges. The diagram here shows a highly distributed microservice architecture making lots of calls. It's already very complex, but once you add feature flags, code paths can change at runtime, possibly paths that have never been tested, and multiple flags can be evaluated on a single request. It takes something that was complex and makes it even more complex. That's really where monitoring comes into play, and I'll hand it off to Dan to talk about that.

All right. As we all know, monitoring is already critical to deploying and running our infrastructure, but as Mike said, with feature flags it becomes even more complex and even more important. So I'm here to talk a little bit about OpenTelemetry. OpenTelemetry is a collection of APIs and SDKs in various languages, I think we have 13 now, maybe more, used to collect telemetry data in a vendor-agnostic way so you can ship it to whatever telemetry vendor you use.

There are a few basic types of telemetry signals. I'm going to talk about events, traces, and metrics. For the purposes of OpenTelemetry, events and logs are more or less the same thing; they're transported and collected in the same way. So first, what is an event? An event is any individual point in time, with attributes attached. It sounds obvious, but sometimes it's not.
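As a concrete illustration, not taken from the talk itself, here is a minimal hedged sketch of emitting an event with the OpenTelemetry JavaScript Logs API. The logger name, event body, and attributes are all invented for the example:

```typescript
import { logs, SeverityNumber } from '@opentelemetry/api-logs';

// Illustrative logger name; any string identifying your instrumentation works.
const logger = logs.getLogger('sneaker-shop');

// An event: a single point in time, described by attributes.
logger.emit({
  severityNumber: SeverityNumber.INFO,
  body: 'checkout.completed',
  attributes: { 'cart.items': 3, 'payment.method': 'card' },
});
```

Because the record is just a body plus attributes, almost no client-side processing is needed, which is what makes events cheap to produce.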
Attributes are used to describe your data, and together with the OpenTelemetry semantic conventions they let your backend understand what that data is, which enables analysis later on. Events typically don't require a lot of processing on the client, which makes them very cheap to collect. But because they can be a lot of data, they can be more expensive to transport and store, and analysis later on can require scanning over large sets of data.

The second type of telemetry is traces. A trace is a collection of spans which describes a transaction in your system. A span is essentially any operation with a start and an end time; in that way it can really be thought of as two events, a start-span event and an end-span event. Very similar to events, data is stored as attributes, and spans are linked together in a tree structure. Tracing does require propagation of what we call span context in order to link everything together later, which introduces a little complexity on the client side and some additional processing requirements. But because a trace has a very specific meaning, it enables specific types of analysis later on.

And finally, we have metrics. A metric is typically numeric data aggregated from a series of events. For example, each individual failure may be an event, but the failure rate, how often failures happen, is a metric. Most often you throw away the original events in order to generate that metric, although not always. And there are typically some restrictions involved: attributes are more constrained, and you usually have to control the cardinality of your attributes, that is, how many distinct values each attribute can take. It is possible to generate metrics later from events and traces, although this moves a lot of processing to the backend, which can be expensive.

So here's a quick summary. On the left we have events and on the right we have metrics, running from unprocessed, raw data to processed, aggregated data, with traces somewhere in the middle. With unprocessed data there are very few client-side processing restrictions, and a lot of analysis options stay open later on, at the expense of storage, transport, and backend processing. On the right you have aggregated data, which is more efficient to transmit and store, but because you may have thrown the original events away, you really need to know in advance what types of analysis you want to do.

When you're deciding which signals to use, you can ask yourself several questions. How much data will I be collecting? Events and logs may end up being a lot; many metrics can be condensed into fewer data points, saving processing costs. What types of analysis do I need to do later? If you don't know yet, events may be a good option because they keep those decisions open, although you obviously have to store all of them. Is my data structured or unstructured? Events are very flexible; traces and metrics less so, especially as you move toward the metrics side. But if you already know what types of analysis you're doing and what your data looks like, the cost savings of metrics can be helpful. And obviously: am I collecting numeric data or some other type?
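To make the span and metric shapes concrete, here is a small hedged sketch using the OpenTelemetry JavaScript API; the tracer, meter, instrument, and attribute names are all illustrative:

```typescript
import { trace, metrics } from '@opentelemetry/api';

const tracer = trace.getTracer('sneaker-shop');
const meter = metrics.getMeter('sneaker-shop');

// A metric instrument: pre-aggregated numeric data. Keep the number of
// distinct attribute values (the cardinality) under control.
const failures = meter.createCounter('checkout.failures');

// A span: an operation with a start and an end time, linked into a trace.
tracer.startActiveSpan('checkout', (span) => {
  try {
    span.setAttribute('cart.items', 3); // attributes describe the operation
    // ... do the actual work here ...
  } catch (err) {
    failures.add(1, { 'error.type': 'database' }); // folded into an aggregate
    throw err;
  } finally {
    span.end(); // the start/end pair gives the span its duration
  }
});
```

Note how the counter keeps only the aggregate while the span keeps the full per-operation detail; that is exactly the tradeoff described above.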
Most often, though, you want to collect more than just one signal type; as many as possible, really.

Instrumenting with OpenTelemetry is fairly easy. It varies from language to language, but typically you have a resource, which identifies your application; exporters, which tell your SDK where to send your telemetry data and how to get it to your backend; and instrumentations. Here I have the HTTP and Express instrumentations as an example. Instrumentations hook into common libraries in order to generate telemetry data, but if your library isn't supported, or if you're doing something custom, you can always write custom telemetry as well. A good place to look for instrumentation libraries is the OpenTelemetry website: go to the ecosystem tab, click on registry, and type in the name of your library to see if there's a supported instrumentation for it. So let's see an example.

All right, thanks, Dan. We've set the stage on what a feature flag is and what telemetry and OpenTelemetry are, and now we want to combine the two to make feature rollouts safer. Today we're working with a hypothetical sneaker shop. It has a really simple architecture: a couple of users hit a load balancer, the load balancer hits the sneaker shop service, and the service hits a database. And it's doing just fine; response times are reasonable, everything's good to go. But as load increases, the site becomes more popular, they're selling more shoes, and the response time gets worse.

So we need to look into this a little more. We can drill into a trace, referencing what Dan was just talking about: a distributed trace, a collection of spans. It becomes quite apparent that the database is the culprit here; in this case, it's almost a one-second response time for this database call. Now we know the problem is the database, so we're going to add some read replicas and scale the database horizontally. To do that, we'll put the new read access behind a feature flag, enable the read replica for just a small subset of users to control the impact radius, look at the impact using OpenTelemetry and a couple of monitoring techniques, and then enable the read replica for everyone. And it's always considered a best practice to remove the feature flag once it's no longer needed.

So we'll start by adding the feature flag to the code. This example uses the OpenFeature SDK. I'll call out that we have the use-db-read-replica feature flag identifier; this is what's referenced in whatever feature flag management tool you're using. I'll also quickly call out the context. The context is a way to supply runtime information to the feature flag system, and it acts as a pivot point for making pretty advanced targeting rules and decisions at runtime. Putting those pieces together looks roughly like the sketch below.
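This is a hedged reconstruction rather than the talk's exact code: the flag key follows the identifier mentioned above, while the Postgres pools, hosts, and query are invented for the example:

```typescript
import { OpenFeature } from '@openfeature/server-sdk';
import { Pool } from 'pg';

// Hypothetical connections: the existing primary and the new read replica.
const primaryPool = new Pool({ host: 'db-primary' });
const replicaPool = new Pool({ host: 'db-replica' });

const client = OpenFeature.getClient();

async function getSneakers(sessionId: string) {
  // Evaluation context: runtime information the flag system can target on.
  const context = { targetingKey: sessionId };

  // 'use-db-read-replica' is the identifier configured in the flag
  // management tool; false is the safe default if evaluation fails.
  const useReplica = await client.getBooleanValue(
    'use-db-read-replica',
    false,
    context,
  );

  // The result of the flag evaluation decides which connection we use.
  const db = useReplica ? replicaPool : primaryPool;
  return db.query('SELECT * FROM sneakers');
}
```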
And then, finally, we use the result of the feature flag evaluation to determine which database connection to use.

Next, tying it back to what Dan was referring to earlier: OpenFeature has OpenTelemetry hooks that tap into the life cycle of a feature flag evaluation. In this case, the hook adds an OpenTelemetry event, actually an OpenTelemetry span event, onto the request, which associates the feature flag evaluation and its metadata with the overall request. We'll look at the wiring in a sketch a bit later, but it's a very powerful technique for detecting the impact a feature has on the rest of your system.

Finally, we enable the feature. This slide shows a JSON representation of what a flag configuration may look like; depending on the system you use, it could be a text file or a nice rich GUI. One thing worth noting: this identifier is the same one referenced in the code, and we're enabling it for just 25% of sessions, so we start by releasing to a subset of users.

When we do that, unfortunately, we see failures. Something went wrong. Thankfully, we only enabled it for a subset of users, so the impact was relatively small. And because we're using feature flags, we can roll it back almost instantaneously; there's no binary to redeploy and no old version to restore. We abort, roll back to normal nearly instantly, and the blast radius stays controlled.

Now we can look at what went wrong. We captured all the telemetry, so we can inspect it after the fact; it doesn't matter whether it's actively an issue, we can look back in the past to figure out what happened. In this case, we look at all of the requests to our backend and split them by feature flag evaluation. The hypothesis, of course, is that we enabled the feature flag and something failed. So here we combine the traces with the flag evaluations to very quickly see that requests only failed when the feature flag was enabled. From there, we can look at the exception messages and aggregate those as well, and it becomes really apparent that node 3 in the database was the issue, which is also why we weren't seeing failures across the board.

One thing you may see in OpenTelemetry is something called an exemplar, which is basically a representative trace. Here we can use the exemplar to open a trace that represents this error count of 32. Drilling into the trace makes it very obvious that this was indeed the issue: the database call failed with connection refused on database node 3. With that information, we can work with the appropriate team; they can look at node 3, figure out what in the world went wrong, and fix it.

Once the issue has been investigated and addressed and we're ready to relaunch, we can try again. So this is the starting state: the feature flag is off and all the traffic is going to the old, slow database connection. We're ready to go.
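Before rerunning the rollout, here is the hook wiring mentioned earlier, which is a one-time registration. A minimal hedged sketch, assuming the @openfeature/open-telemetry-hooks contrib package; hook and attribute names may vary by version:

```typescript
import { OpenFeature } from '@openfeature/server-sdk';
import { TracingHook } from '@openfeature/open-telemetry-hooks';

// A global hook runs on every flag evaluation. The tracing hook attaches a
// span event carrying the flag's key and variant to whatever span is active
// during the request, tying each evaluation to the request's trace.
OpenFeature.addHooks(new TracingHook());
```

That per-request association is what makes the split-failures-by-flag analysis above possible without any per-flag instrumentation.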
So we enable it for 25% of users. This time there are no failures, thankfully, so we don't have to abort the rollout. You can also see that the throughput on the read replica has increased by approximately 25%, and the response time is stable and quite a bit faster. That's the expected behavior, so we continue the rollout. At roughly 50%, the traffic is split evenly, response times look stable on the read replica and are decreasing on the slow database, since its throughput is now dropping. Still looking good, so we roll it out to 75% of users. Same deal, everything looks good. Now we're feeling confident and we enable the feature for everyone: throughput to the slow database drops to zero, 100% of traffic goes to the new read replica, and response times stay stable, as expected. I'll hand it over to Dan to summarize.

Yeah. So, as Mike mentioned, we rolled out an important performance fix for a problem that was affecting all of our users, but we limited the impact of any problems with the potential fix to a very small number of users. We used OpenTelemetry to monitor our assumptions, both about the problem and about the fix, and when we were confident, we rolled the feature out to everyone. Of course, we'll continue to monitor the impact going forward, and we'll eventually clean up the feature flag so it's no longer in our system and shut down the old database. That's it, thank you for your time. We're happy to take questions if anybody has any.

Is it possible to leverage an issue, a bug that you hit in the new feature, and automate the fallback to the previous version?

You're talking about automated rollbacks of the feature flag?

For example, user A is redirected to the new feature. On the first try it fails because the new feature is broken. Then, when they retry, could the next request fall back to the previous version that is working? Is that something that could be automated?

Oh, I see. So you have a user who goes to your website, something's broken, they reload, and the second time you want them on the old version. Mike is probably the better person to answer what specific rules are available there.

Yeah. With OpenFeature we work with a lot of different flag management tools, and it really depends on the implementation you use. Most of them are quite sophisticated, though, so you can build pretty advanced targeting rules, and if you pair that with a telemetry tool, you could detect an issue and automatically change the targeting rule to minimize or completely eliminate the impact, along the lines of the sketch below. And because you don't have to redeploy, it's quick.

Thank you, guys.

You're welcome.
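Nothing like this was shown in the talk, but purely as a hypothetical sketch of that idea: an alert webhook that disables the flag by rewriting a flagd-style JSON flag file. Real flag platforms would expose a management API for this instead, and the path, port, and config shape here are all invented:

```typescript
import { promises as fs } from 'fs';
import http from 'http';

const CONFIG_PATH = '/etc/flagd/flags.json'; // hypothetical flag file location

// When the monitoring tool fires a high-error-rate alert, force the flag
// back to its safe default by rewriting the (flagd-style) configuration.
http.createServer(async (req, res) => {
  if (req.method === 'POST' && req.url === '/alerts/high-error-rate') {
    const config = JSON.parse(await fs.readFile(CONFIG_PATH, 'utf8'));
    config.flags['use-db-read-replica'].defaultVariant = 'off';
    delete config.flags['use-db-read-replica'].targeting; // drop the 25% split
    await fs.writeFile(CONFIG_PATH, JSON.stringify(config, null, 2));
    res.writeHead(200).end('flag disabled');
  } else {
    res.writeHead(404).end();
  }
}).listen(8080);
```

File-based providers typically watch the flag file for changes, so a rewrite like this takes effect without a restart or redeploy.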
Hello, thank you very much, it was actually a great talk. I was asking exactly this question at a previous OpenFeature talk this week, and you've just answered part of it. I assume we can have several feature flags at the same time and have telemetry on all of them, for example including feature flags as dimensions in your metrics. So how do you fight cardinality in that case? With several feature flags multiplied by the number of machines and everything else, that becomes a challenge at some point. What are the best practices or suggestions for running experiments without hitting this problem while still getting valuable results?

Yeah, it's a great question. In the demo I was actually not using a collected metric; the metric was generated on read from trace information. That's basically how you can work around a metric explosion, because exactly as you said, generating a new metric per feature flag doesn't scale. So that's where these things tie together. Hopefully that makes sense.

Just to double-check: so this doesn't support metrics? You always generate only traces, and we need to post-process them?

We do both; that's what Dan was referring to. It's up to you to choose. If you have a sophisticated enough system that can analyze traces on demand and slice and dice them, it's not just about capturing more information, it's about discovering the unknown unknowns. If you need something simpler, you can just collect OpenTelemetry metrics, but you're a little less flexible; you wouldn't be able to tie the data back to individual requests, for example. And it of course depends on how many feature flags you have and how many variants each flag has. If that number is controlled, you already know what the cardinality will be; if it's unknown, traces may serve you better than a metric.

Thank you.

You're welcome.

In the example where you showed the error rate per service and feature flag, is that something that's already supported by current tooling?

Yeah, all the examples I showed were real. It really just depends on the monitoring tool you're using; all the data was collected with OpenTelemetry, and beyond that it comes down to the sophistication of the observability tool.

Thank you.

You're welcome. All right, well, thanks for joining us on a Friday. I appreciate it. Enjoy Paris.