So my name is Jason, and I'm here with my colleague Kevin. We both work on the observability team at Cruise, which is under site reliability engineering. Just by way of understanding the audience a little bit, I wanted to get a sense of a few things. Can you raise your hand or otherwise visually let me know if you feel like you have a good understanding of OpenTelemetry? All right, that's about 50%. Now the same question, but specifically about the OpenTelemetry Collector, the thing you can deploy to manage signals. All right, slightly less. Okay, cool. That's good, because this talk is going to get pretty technical.

This is a report from the trenches from somebody doing a large-scale migration using OpenTelemetry, shifting a lot of existing instrumentation over to it. So we're going to get in the weeds, but at the same time it's not going to be exactly about how we do it, because we don't want to be prescriptive. We just want to show you some ways you can think about doing this in your organization, and hopefully it's applicable to organizations of various shapes, sizes, and maturities.

As an overview, in this talk we're going to walk you through the thought process we followed in executing a wholesale migration from one observability vendor to another. In our case, we had historically used Datadog and are migrating to Chronosphere. It's actually funny: I think on Tuesday there was a panel where Charity Majors referred to observability teams as doing "vendor engineering," and that felt a little too accurate, because a lot of what we do is manage the relationship between the infrastructure, the teams, and ultimately a lot of vendor products that we leverage for, you know, opportunity cost.
You don't have to spend all this time owning every aspect of observability internally, because a lot of the time that doesn't make sense. That said, we're not talking specifically about vendors in this talk. It's more about the process of migrating, and why OpenTelemetry is a great assist in giving you more technical choice so you can make this kind of shift in the future. We're going to tell you some lessons we learned along the way, Kevin is going to give you a deep dive into how we run the OTel Collector at scale for a lot of different use cases at Cruise, and ultimately what we hope you get out of this is, if nothing else, some confidence that you can start using OpenTelemetry in your environment and iterate on it over time as it suits your needs.

So let's set the stage. We work at Cruise, an autonomous vehicle company that provides ride-hail and delivery services via a fleet of driverless cars in multiple cities. As I said, Kevin and I work on the observability team, so we're responsible for ensuring that the development and operations of those services are reliable, and also for making sure all of our engineers can be productive with Cruise infrastructure. To give you a sense of scale, we have on the order of a hundred Kubernetes clusters with thousands of nodes and tens of thousands of pods, and we also have hundreds of thousands of VMs that are spun up and down to assist with things like testing, training, and simulation.

If you've been in this room at KubeCon, you've probably heard a lot about the benefits of OpenTelemetry. In my view, OTel provides two key capabilities. One, it unlocks technical choice for you in observability. Two, it provides synergy: the OTel ecosystem has a lot of weight behind it, and there are a lot of nice virtuous cycles happening due to its ubiquity as an open standard. Vendors increasingly support it for ingest and export, and at the same time that ubiquity drives a lot of investment on the instrumentation side. So it's a really healthy ecosystem and community right now. We approached OTel motivated by a desire to shift our entire observability stack from Datadog to Chronosphere. It could be that you don't need to do anything nearly that extravagant, but we can talk about various ways to start depending on what your needs are.

To open up: whenever you think "I want to use this new technology in my company," you always have this dream that you have a greenfield with unlimited possibility. You've heard good things about OTel, you don't have any legacy systems to deal with, you can just create new projects, use the OTel SDK, use the Collector, and it all works great. You see lots of blog posts about it. The reality, however, is that we're dealing with lots of investment in legacy systems, things that have evolved over time. In our case at Cruise we've had a bit of a lava layer going on with the instrumentation: you've got DogStatsD from the early days.
We had some early OpenCensus users and OpenTracing, and there's also OTel already in the mix in a few places. Beyond the instrumentation and the code, you have all this additional investment on top of that: dashboards and alerts and things people depend on to do their work day to day. At the same time, you've typically got a long tail of junk that nobody is looking at, and you aren't really sure how important it is. And lastly, at larger organizations you sometimes wind up with ownership issues, where entire systems don't have a clear owner, and that applies to the observability of those systems and assets as well.

So, some of the problems we faced. I didn't say this earlier, but Cruise is a relatively old company, on the order of ten years, and there's been a lot of organic growth that we had to take on here. One problem was: how do we just get started using OTel and take advantage of being in this ubiquitous, virtuous cycle? How can we introduce this technology with minimal overhead for existing teams? I'm not just talking about code overhead, where they shouldn't have to make many changes to their code, but also performance overhead and things like that. The same applies to not disrupting their existing dashboards and monitors: how can we make it so they don't have to rewrite everything they do? Third, how can we do this safely, so that we can transition our monitors and make sure we don't have gaps in observability that could lead to us missing issues during the migration? And lastly, how can we do all of this at Cruise's scale, which is significant? We have some pretty big clusters.

So we're going to talk about taking the first steps, and throughout the talk we're going to keep referring to a diagram that looks a bit like this. I think this is basically what observability systems look like if you zoom out in a Kubernetes cluster. You've got a cluster, and it's got a bunch of nodes.
Typically you've got a daemon set running on every node; there's kind of no way to get around that. That daemon set runs an agent that is either vendor-provided or something you run yourself, and that agent has a lot of responsibilities. It handles ingest of metrics like StatsD, maybe it scrapes things, it collects host metrics and container metrics; it does a lot of work. In some cases, like Datadog, the agents also schedule custom checks and participate in a sort of check mesh that gets organized among them, so how much they're doing can really vary. On the side we have a cluster agent; you usually have some of these cluster-scoped things, like kube-state-metrics, that you only want running once somewhere. Another example is pulling cloud metrics from GCP or Amazon. All of this funnels up into what is just a simple box here that says "observability system," but that is of course also a complex beast with lots of moving parts. For our purposes we'll just assume it's something that can take in signals like metrics and traces (we're not really talking about logs today, by the way, but the same ideas would apply) and then provides a query interface that lets you visualize the data, write alerts, and so on.

One of the first things we thought about when approaching OTel adoption was: what if we just introduce this at the client layer? We could change the instrumentation from dd-trace or DogStatsD, start aligning on OpenTelemetry, and take advantage of the instrumentation there. We could actually do this, because the Datadog agent supports OTLP ingest for both metrics and traces, and it's pretty good. So this looked like a promising avenue. However, at an organization of any significant scale, you're going to run into a lot of problems actually doing this migration. You have to get buy-in from the teams. You have the issue of service ownership; for some services it's just not clear who would do the work. And if you have a lot of systems, you're signing up for a lot of work that is realistically going to fall on infrastructure. The other thing this doesn't solve is pull-based collection, things that expose Prometheus metrics, which a lot of internal infrastructure does. So while it could be a way to get your foot in the door, especially if your vendor supports OTLP ingest, it didn't really work for us: we had more use cases, and our organization was just too big for this to be a viable approach.

A more interesting idea is to still hold on to the fact that you've got OTLP ingest on the agents, but use the OTel Collector to sit in between the agents and your workloads. This lets you do some cool things. You can normalize at this layer: you can ingest StatsD, and in our case Datadog APM trace data. But you can do other things there as well; a rough sketch of what that intermediary pipeline could look like follows.
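To make that concrete, here is a minimal sketch of an intermediary Collector configuration of the kind described above. The ports and the agent address are assumptions for illustration, not our exact setup; it simply accepts StatsD metrics and Datadog-format traces and forwards everything as OTLP to the vendor agent.

```yaml
receivers:
  statsd:
    endpoint: 0.0.0.0:8125        # workloads keep sending StatsD here
  datadog:
    endpoint: 0.0.0.0:8126        # dd-trace clients keep sending APM traces here

processors:
  batch: {}

exporters:
  otlp:
    endpoint: datadog-agent:4317  # hypothetical address of the vendor agent's OTLP ingest
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [statsd]
      processors: [batch]
      exporters: [otlp]
    traces:
      receivers: [datadog]
      processors: [batch]
      exporters: [otlp]
```

The point is that workloads keep their existing clients while the Collector becomes the place where normalization, enrichment, or a future change of destination can happen.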
You might also enrich the metrics from other sources, perform other normalization, and perhaps do sampling here as well. This gets you a little bit of indirection between, for example, a vendor-provided agent and the rest of your infrastructure, and that starts to open the door to more interesting things down the road, which Kevin will talk about in a bit.

I do want to take a step back, though, because I said we didn't go down the route of instrumenting the clients with the SDK because it was too much work. You really should have a bifurcated approach when you're looking at a migration like this: you have one way of handling legacy stuff, and you should be prescriptive about what you want new systems to do, because otherwise new systems will come up and keep repeating the same patterns you're trying to stamp out. So for new systems, we provide an internal distribution of the OpenTelemetry SDK that we wrap up, in the few different languages we use at Cruise, and it's the observability library teams are supposed to use whenever they're writing a new system. This is really nice because we can put opinionated defaults in there, so we can make sure everybody tags things the same way. We also provide wrappers around OTel instrumentation libraries; for example, I'll get into this in a bit, but some instrumentation libraries had problems with high-cardinality labels, and we can take care of that and make sure they only emit the labels blessed by the observability team.

The other cool thing is that it lets us provide a pretty good dev experience, which is something that's not often thought about, especially coming from a StatsD model where developers are used to metrics just being in the background somewhere: you see them in prod, or in dev once it's deployed, but locally the workflow kind of sucks. The cool part about moving to OTel is that when you're deployed, we set up a push exporter that pushes to our infrastructure, but when you're running in dev mode, we can automatically set up a local Prometheus scrape endpoint, plus things like statsviz and zPages. That means that when you're running things locally you have a really nice feedback loop that didn't exist before, and that's one of those nice carrots you want to offer people when you're trying to get them on board with the new thing.

Some of the challenges we had: I think it's basically required to roll your own distro at this point, because if you want to do anything slightly complicated, the off-the-shelf tools just aren't enough. I don't think they can be; it's the nature of the beast. You can look at otel-config-go, which is a nice way of simplifying the setup of metric and trace pipelines in OpenTelemetry. Frankly, OpenTelemetry by itself is kind of a beast to set up and configure; you probably need at least 20 lines of code to do it, and that's another reason why we rolled our own SDK. A rough sketch of what such a wrapper can look like follows.
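To show the shape, here is a minimal sketch of an internal wrapper over the OTel Go SDK. This is not our actual library; the package name is hypothetical, and it only wires up an OTLP metric exporter with an opinionated resource, but it illustrates the "one start function, one shutdown function" idea.

```go
package o11y // hypothetical internal package name

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// Start configures the global OTel meter provider with opinionated defaults
// and returns a shutdown function that flushes and stops it.
func Start(ctx context.Context, serviceName string) (func(context.Context) error, error) {
	res, err := resource.New(ctx,
		resource.WithAttributes(semconv.ServiceName(serviceName)),
	)
	if err != nil {
		return nil, err
	}

	// Endpoint and credentials come from the standard OTEL_* environment variables.
	exp, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		return nil, err
	}

	mp := sdkmetric.NewMeterProvider(
		sdkmetric.WithResource(res),
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp)),
	)
	otel.SetMeterProvider(mp)

	return mp.Shutdown, nil
}
```

In a service's main this becomes roughly `shutdown, err := o11y.Start(ctx, "my-service")` followed by `defer shutdown(ctx)`; a trace provider and the dev-mode Prometheus endpoint would be wired in the same place.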
Rolling your own comes with overhead, though: we have to version it, and we have to make sure we modularize things well enough that we don't run into transitive dependency issues. Ultimately these costs were worth the upside. This is roughly what our library looks like: we just have a start function and then a shutdown function, and that's basically it. We expose some OTel functions, but otherwise people mostly use the OTel API directly, so they just create counters and other metrics. We also put some sugar in there to make the edges of the SDK work a bit better.

Some gotchas we had with this. Probably the biggest one was being really careful with trace propagation. Again, we use Datadog, and Datadog has its own trace propagation format. What's cool is that more recent versions of dd-trace, Datadog's own instrumentation, support sending W3C Trace Context, the new way of propagating trace context when you emit data. But if you have any old clients in your infrastructure, that's no good. The other thing is Istio: as far as I know, you can only set up one trace propagation format there. In OTel you can combine propagators and have a chain, but I don't think that's possible in Istio, and we were relying on Datadog at the Istio level as well. So we necessarily had to build in backwards compatibility with Datadog's format, which is another reason it was nice to have our own distro of the client.

I mentioned a few of these other things before. High-cardinality metrics have been a problem with the built-in OTel instrumentation. This has gotten a lot better as people have recognized the problem, but it was something we had to work around in the past; an example is putting the connection ID or the peer IP address on metrics emitted from HTTP instrumentation, which you basically never want for metrics. There's this idea of a view in the OTel SDK, which I find kind of confusing, and I'd like to see some better work there: people are used to defining metrics and their histogram buckets next to each other, and the view separates them, which can be awkward. And lastly, if you're coming from a StatsD world you expect synchronous gauges, but guess what, they didn't exist in OTel, which is definitely confusing for people. Fortunately, I believe this has landed in the OTel spec because they recognized the need, and now we're just waiting on SDK support. In the meantime we worked around it: we wrote a wrapper that exposes a synchronous interface on top of the asynchronous one, again another reason it was good to have our own client.
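As an illustration of that kind of workaround, here is a minimal sketch of a synchronous gauge built on top of an observable gauge in the Go SDK. The type and function names are hypothetical, not our actual wrapper; the idea is simply to store the last value set and report it from the callback.

```go
package o11y

import (
	"context"
	"math"
	"sync/atomic"

	"go.opentelemetry.io/otel/metric"
)

// SyncGauge lets callers Set a value imperatively, StatsD-style; the SDK reads
// the latest value through an observable-gauge callback at each collection.
type SyncGauge struct {
	bits atomic.Uint64 // float64 bits of the most recently Set value
}

// NewSyncGauge registers an observable gauge whose callback reports the most
// recently Set value.
func NewSyncGauge(meter metric.Meter, name string) (*SyncGauge, error) {
	g := &SyncGauge{}
	_, err := meter.Float64ObservableGauge(name,
		metric.WithFloat64Callback(func(_ context.Context, o metric.Float64Observer) error {
			o.Observe(math.Float64frombits(g.bits.Load()))
			return nil
		}),
	)
	return g, err
}

// Set records the current gauge value.
func (g *SyncGauge) Set(v float64) {
	g.bits.Store(math.Float64bits(v))
}
```

A real wrapper would also handle attributes and unregistering, but this is the basic shape of turning the asynchronous instrument into something that feels synchronous to callers.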
Okay, so a couple of slides ago we talked about how you can get a foot in the door by intercepting things with the Collector. Now we're going to talk about what it would take to actually run everything using just the OTel Collector, and that's what Kevin is going to cover. All right.

Thanks, Jason. Yeah, so I'll talk about how we run the collectors at Cruise. If you're not familiar, a collector is basically a piece of instrumentation infrastructure that receives metrics, logs, and traces, processes them, and then sends them on through exporters. It's very composable, and it's a nice layer of indirection, especially if you have, say, multiple vendor backends. Here's a high-level view of our architecture. I'll start with our edge collector systems on the left. These sit on all of our clusters, so they're right there with our services and they collect at the edge; there are certain things we need to collect there, like system metrics, and the services report directly to them. Then I'll move to our centralized ingest collectors, which are our second tier. It's important to note that the deployment architecture has implications for the collection strategy and the telemetry content itself; we learned the hard way that running the OTel Collector as a DaemonSet versus a Deployment comes with its own nuanced capabilities. So I'm going to go through all the metric and trace signals we needed to replace in this migration and how we architected our collectors to account for them.

The first thing we needed was replacement collection for our existing ingest, which is basically StatsD metric ingest and Datadog trace ingest from our services. You can see the existing vendor agent on the left, and on the right is what we came up with as the replacement: this collector takes those responsibilities on through the StatsD receiver and the Datadog receiver. Protocols like StatsD require consistent send targets for aggregation, which made a DaemonSet a natural option for these edge collectors, with consistent source-to-destination endpoint mapping so we know the metrics are always going to the same place. DaemonSet bloat is something we've always wanted to avoid because of the per-node overhead, but this ended up being the cleanest solution and somewhat unavoidable in our case. We use this level of indirection to normalize and enrich metrics before sending them on to the central ingest.
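On the workload side, services need a consistent way to address the node-local collector. One common pattern, shown here as a sketch with hypothetical variable names rather than our exact manifests, is to inject the node IP through the Kubernetes downward API so StatsD and trace clients always hit the DaemonSet pod on their own node:

```yaml
# Fragment of a workload pod spec: point metric and trace clients at the
# DaemonSet collector running on the same node.
env:
  - name: HOST_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: STATSD_HOST          # hypothetical names read by the app's clients
    value: "$(HOST_IP)"
  - name: TRACE_AGENT_HOST
    value: "$(HOST_IP)"
```

The DaemonSet collector then exposes the StatsD and trace ports on the node, for example via hostPort, so the mapping from source to destination stays stable.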
Next, we needed a way of implementing system metrics on our Kubernetes clusters. These are things like Kubernetes metrics (the status of a pod, how many jobs failed), container metrics (the CPU a container is consuming), and host metrics (the memory usage of a VM or node). These previously came out of the box with our vendor agent, so now we had to figure out how to implement them ourselves. For this we leveraged the same DaemonSet collector instance you saw earlier and added a series of Prometheus scrape jobs to it using the Prometheus receiver. There are already endpoints on the kubelet for host metrics and cAdvisor container metrics, so that's where we scrape locally from the DaemonSet to get our node and container compute metrics. For Kubernetes metrics we use kube-state-metrics, which as I mentioned provides a lot of the metrics relevant to the state of the cluster. We run it as a sidecar on this DaemonSet, sharded by node: there's an option in kube-state-metrics that queries the API server and provides a scrape target that only produces metrics for the node it's on. This horizontally scales what we originally ran as a centralized instance, greatly decreases the scrape time and scrape size, and basically unlocks horizontal scaling. We also have another kube-state-metrics instance for non-node-scoped metrics, things like job metrics or service metrics; it runs on its own and is not sharded.

Coming from the vendor solution, we assumed these metrics would have more attributes than they did out of the box. The metrics from the scrape jobs don't have exactly the attributes we wanted, in the shape we wanted, mainly because they come from the pre-OTel days, so there's some nuance between Prometheus and OpenTelemetry. But the collector provides really powerful, flexible ways to process metrics via processors. First, we use the attributes processor, which updates data point attributes to match OpenTelemetry semantic conventions; semantic conventions are standard naming patterns that let us unify our telemetry data, so you can see it change things like pod and namespace into k8s.pod.name and k8s.namespace.name. Next, we use the group-by-attributes processor to group these data points under a parent resource: in OpenTelemetry there are resource attributes and data point attributes, whereas in Prometheus there's just a single layer of labels. This is also a prerequisite for the next processor we use, the k8sattributes processor. It queries the API server on behalf of that node (sharded by node, similar to how kube-state-metrics does it), and it lets us map things like a pod and its status to the DaemonSet it belongs to and the cluster it's on, decorating the data with more contextual information before we send it off to our central ingest.
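Putting those pieces together, a simplified sketch of that part of the edge pipeline might look like the following. Scrape details, environment variable names, and the central ingest address are assumptions for illustration rather than our exact config.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubelet-cadvisor        # node-local container metrics from the kubelet
          scheme: https
          metrics_path: /metrics/cadvisor
          tls_config:
            insecure_skip_verify: true
          authorization:
            credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          static_configs:
            - targets: ["${env:HOST_IP}:10250"]

processors:
  attributes/semconv:
    actions:
      - key: k8s.pod.name                   # illustrative rename toward semantic conventions
        from_attribute: pod
        action: upsert
      - key: pod
        action: delete
  groupbyattrs:
    keys: [k8s.pod.name, k8s.namespace.name]
  k8sattributes:
    filter:
      node_from_env_var: NODE_NAME          # hypothetical env var holding this node's name

exporters:
  otlp:
    endpoint: central-ingest.observability.svc:4317   # hypothetical central ingest address
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [attributes/semconv, groupbyattrs, k8sattributes]
      exporters: [otlp]
```

The ordering matters: attributes are renamed first, then promoted onto a resource, and only then can the k8sattributes processor associate that resource with pod, workload, and cluster metadata.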
So this is another deployment of the collector And we basically ask service owners to annotate their services with standard Prometheus annotations like the ones you see in the operator And using the Prometheus receiver, we do service discovery to find these great targets much like you would with normal Prometheus We then use the target allocator which actually comes with the open telemetry operator This is a tertiary deployment that distributes scrape targets among the collectors So once again, this unlocks horizontal scaling and the more Prometheus scrapes there are the more collectors we can spawn and Distribute that those scrape jobs Just like the other collection scenarios we enrich these metrics and then send them on to our central ingest Finally, we have use cases like database or device or network device metrics And as you can see from the previous part of my talk, these are somewhat orthogonal to our existing collections infrastructure, which mainly sits in Kubernetes Maybe these need to run on another hardware. They require niche collector components and We also have some other teams that own these and would like to handle the collection of metrics themselves So we use the open telemetry operator for deployment and collection or management of collectors It's a really nice way of simplifying deployment for us And we basically allow and even encourage other teams with these niche scenarios to implement their own instances of their own collectors They still report to our central ingest collectors, which still gives us full control of the ingests and the way it's coming So this covers most of the use cases for telemetry on our collectors One case study we've had with the collector where we really could not find an immediate solution available upstream is metrics aggregation So we run simulations on hundreds of thousands of VMs and while we want to get metrics on them We don't necessarily need the host-level granularity for the cost You know, you can think it's hundreds or even millions of unique time series So we came up with a way of basically aggregating them to still represent, you know, that population of VMs without all the unique time series aggregate on the back end So initially we built this with some collector contrib repo processors Which worked well So things like the metric transforms the group by attributes the batch processors We basically kind of put them together and had to alter them a little bit But eventually we made our own custom processor for this We use the open telemetry collector builder, which is just a really easy way of spinning your own open telemetry collector image based on just adding and removing components via go modules and Basically this custom processor that we made is part of that image and it aggregates via accumulated windows in memory And then he mits them based on time The bottom graphs show that VM counts over time with sort of the out-of-the-box first Version we created and then with our custom processor that handles it a lot smoother and Some of the metrics we actually aggregate in histograms Which allows to have heat maps on really large VM distributions with only like a few unique time series on the back end so like this a Heat map on the top right. 
The bottom graphs show the VM counts over time, first with the out-of-the-box version we put together and then with our custom processor, which handles it a lot more smoothly. Some of the metrics we actually aggregate into histograms, which lets us have heat maps over really large VM distributions with only a few unique time series on the backend; a heat map like the one on the top right might only be 60 to 70 time series, representing hundreds of thousands of data points from hundreds of thousands of VMs.

All of these collection systems send to our central ingest. Here we normalize the data if that hasn't been done already, sample, and then route to our multiple backends. Due to the shared-services model, we own this cluster as the observability team, so this setup works best for us: we're able to roll out quicker and have more control over it compared to the other multi-tenant clusters, where deployment is a little slower. But that may differ by scenario and by your organizational structure. One major benefit is that it encapsulates observability as a service: we front the vendor backends with our own OTLP endpoint, which allows for consumable solutions for other teams, as I mentioned earlier. It also gives us a lot of control over the telemetry we send; basically everything that goes to our backends comes through here, so we also have the ability to do composite things across clusters, like trace sampling, if we need to. But it comes at the cost of maintaining really high availability, since it is a single point of failure.
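As an illustration of that gateway pattern, here is a minimal sketch of a central ingest pipeline that fronts multiple backends behind one OTLP endpoint. The exporter endpoints and the specific pairing of backends are illustrative assumptions, not our exact configuration.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}
  prometheusremotewrite:
    endpoint: https://metrics.example-backend.com/prom/remote/write   # hypothetical second backend

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog, prometheusremotewrite]
```

During a dual-shipping window like the one described later, both exporters stay in the pipeline; cutting over is then a matter of removing one, with no change to the clusters sending in.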
So data dog has their own query language And then chronosphere uses the they expose promql the sort of the sort of standard Prometheus format And these two things not the same so there's a huge amount of effort that goes into Translation and managing the translation process and trying to do that in a way that doesn't disrupt the team is very much our approach to that is kind of a combination of leaning really heavily into automation as much as we can and Designing a process that's largely automation driven but at the end of the day you need a lot of eyes on The work that's going through and so part of in our case We're getting some assistance from our chronosphere team to help with translating like dashboards and monitors over But even with that with the amount of stuff we have to move We need automation and by automation I'm talking like we have a process where teams will ask teams to like hey Go look at all your assets that you have In data dog and tag them all up in some way so that we can find them some sort of you know uniform way that Put some pressure on the team to basically figure out like what are you actually using and provides a really good first pass at cutting away some of the Cruft and then when we have things kind of tagged up We have automation that pulls all the dashboards and monitors from in this case data dog And we have them in some sort of format and then we are working then we work to basically translate that format into In this case a Grafana JSON dashboard format which chronosphere supports so the other cool part about the the sort of adherence to Things like Grafana JSON it allows us to have more flexibility as well So, you know, we actually chronosphere has a front-end that has dashboards and a trace product But we also have other use cases that involve and kind of require Grafana Like we have situations where we've got data in BigQuery or we want to maybe pull in metrics directly from GCP and that's just not really supported in a lot of vendor products and So we have an internal Grafana as well Now, it's kind of cool that the Grafana JSON format exists because we could have the same dashboard Effectively live in both systems if we want To help further manage this down the road. We've wrote internally a Kubernetes operator That's like the observability operator and what that does is like instead of We have had in the past like some Some adherence to config is code and we're trying to lean more heavily into that And so the cool part about this is you can define like your monitors and your alerts as like Kubernetes custom resources That our operator understands and then the operator can reconcile them against whatever the back end is So it could be that it creates a Grafana dashboard CR Which then the Grafana operator is happy to you know, create a dashboard for or we Actually provision in chronosphere and this opens up the ability to do more interesting things with monitors and things later on Like we could Have some like monitor redundancy within our cluster if we don't want to have them all firing centrally or something like that So that's pretty cool We are using sloth for SLOs because chronosphere is prom based and if you haven't heard of sloth it is a way for you to define SLO targets and then it will spit out a bunch of sort of recording rules and Prometheus alert manager rules that are Are you use those recording rules and that help you sort of roll up and get the nice SLOs that you want? 
This was important for us to tackle because Datadog has its own SLO product that we had a lot of investment in.

As far as monitoring goes, the approach that worked for us for the migration in general was cluster by cluster. We have a few general-purpose "gen pop" clusters, and we also have some that are a bit more special-purpose for certain teams, and we're migrating everything on a cluster-by-cluster basis; that's what we're in the middle of right now. That works best because it's easier to just turn a daemon set off once you can make sure all the use cases are taken care of, and especially in the case of Datadog, where they charge by the host, you pretty much have to turn everything off if you want to stop double paying. We have about a one-month window that we use to move people over to dual shipping, with everything going to both places. We make sure we didn't break anything in the old system, then we make sure the new system has parity, and once we have enough confidence in that, we're able to cut the cord and move them over. That's the way we've been approaching this, but it really is going to depend a lot on your organization; these are just some ideas.

So, in conclusion: it's actually a really good time to invest in OTel. There are always going to be little things you need to deal with, but in our experience it's been pretty nice; a lot of off-the-shelf stuff just works, and a lot of bugs had already been fixed by the time we got to them. The OpenTelemetry Collector is an absolute workhorse; we use it for basically everything on the write path, and we're looking at more interesting things to do with it in the future. And lastly, I'll say: read the code. The docs could use some work, but the code is there, and it will not only help you understand how everything works under the hood and help you with tuning, it may also give you ideas for how you can leverage things in the future.

So with that, thanks very much. We wanted to give a lot of attention to the content, so we didn't leave much time for microphone questions, but Kevin and I are more than happy to sit here and have a little huddle with anybody who wants more details, because there's a lot we couldn't cover here just for time. Thanks very much for coming.