Good afternoon, everyone. My name is Vijay and I work for eBay as eBay's observability architect. Today we're going to discuss how we moved to the OpenTelemetry Collector, and how we did it while running at the scale we are at today. We'll talk a little bit about what observability at eBay means, what our architecture looks like, the tools we use, the strategy we started with and how it evolved over time, some of the design changes that came with that, and then the actual changes involved in the OpenTelemetry migration and what we are going to do next.

But before that: we are hiring, specifically to work on the OpenTelemetry Collector. eBay is a great place to work. I've been here for 10 years and I consider it a blessing from God to be able to work in such a great company. So if you are interested, please use the QR code to find out more.

So, observability at eBay. What does a developer do? A developer first does some instrumentation. They either use an open source library that's bundled into one of our managed frameworks, or, if the application comes pre-instrumented, that instrumentation is used as-is. They onboard their application into the observability platform by going into eBay's cloud console and declaring the application they need to onboard. If it's a metric onboarding, they say that their application exposes, say, 8080/prometheus, and they hit submit. Then a lot of magic happens in the background: most of our scraping, log file tailing and so on is driven through annotations on top of Kubernetes, and those annotations are delivered into the appropriate Kubernetes clusters so that the actual harvesting can begin. We have agents deployed on all our infrastructure, whether it's for logs, metrics, or traces. Once an agent has enough information about what to look for, it begins the harvesting process.

After harvesting is done, a user is free to set up any alerts they want to be notified on. They can either use threshold-based alerts in conventional PromQL style, or they can do anomaly detection using some of the models we have built for the end user. And if you have a TV screen in the hallway, or just your laptop, you can build dashboards using our console so that you can observe what is going on.
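To make that onboarding step a little more concrete: the hint a developer submits in the console essentially ends up as annotations on their pods, which the agents then act on. The exact annotation keys eBay uses are internal, so the sketch below uses the widely known prometheus.io-style convention purely to illustrate the idea.

```yaml
# Illustrative only: eBay's real annotation keys are internal and will differ.
# The point is that the pod itself advertises where its metrics live.
apiVersion: v1
kind: Pod
metadata:
  name: checkout-service               # hypothetical application
  annotations:
    prometheus.io/scrape: "true"       # opt this pod into metric harvesting
    prometheus.io/port: "8080"         # port the application listens on
    prometheus.io/path: "/prometheus"  # the "8080/prometheus" hint from the console
spec:
  containers:
    - name: app
      image: example/checkout-service:1.0   # hypothetical image
      ports:
        - containerPort: 8080
```

Whatever the exact keys look like, the idea is the same: the pod declares where its metrics are, and the platform takes it from there.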
Scale is something that I'm really passionate to talk about. When we were taking stock of how much we have grown, we realized that we are scraping about 1.25 million OpenMetrics endpoints across all our clusters. That translates into roughly 32 million samples per second being scraped, and about 1.5 billion active time series. We serve around 7,500 queries every second; most of those come from recording rules, either to roll up or to alert, and only a small fraction from dashboards, so a lot of it is actually being used by systems to power intelligence. And we provide a retention of one year for all the raw metrics collected on behalf of developers.

This roughly sums up the architecture we have for the Sherlock.io platform. On the leftmost side, you have the actual applications deployed on our compute infrastructure, either VMs or Kubernetes nodes onto which workloads are scheduled. There are two major flavors of applications. One is managed applications: we have managed frameworks for Java and Node.js, which come pre-packaged with some of the open source clients. Or you can have a generic application that you deploy yourself. Once it's deployed, the agent sitting on the same node or in the same cluster starts harvesting the data. It's sent into the platform through our ingest APIs, and depending on which signal is being processed, it lands in a metric store, event store, log store, or trace store. For a certain unique set of use cases, we have something called a probing engine, which is nothing but stock Prometheus that can do SNMP walks or use a SQL exporter to query metrics out of databases; those follow the same ingest path. And we have a query layer that implements the Prometheus API, which powers recording rules, anomaly detection, and plain dashboard viewing.

So what does ingesting in a cloud native way actually mean? To collect metrics, we either need to be able to scrape an application in a generic way, without knowing what its runtime is or what language it's written in, or the application should be able to push in a well-defined format; for that we have things like OTLP and Prometheus remote write, and OpenMetrics endpoints as well. In some cases it's not that easy: you have legacy applications that might need some sort of one-off handling. But for the most part we rely on an open source client, and then we have agents that understand these well-known protocols or handle the one-offs through custom plugins. We also need to be able to discover targets, specifically in Kubernetes-like environments where things are very ephemeral. Either you drop in an annotation that tells us exactly what to look for, or, when something is pushed with only minimal information like a pod IP or a pod name and a namespace, we should be able to enrich it with all the additional metadata that was described as part of the pod spec itself.
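The agent side of that annotation-driven discovery is essentially the stock Prometheus kubernetes_sd pattern: watch pods, keep the ones that opted in, and pull the path and metadata from the pod spec. The job below is a generic illustration of that mechanic, not our actual configuration.

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                   # discover every pod the agent can see
    relabel_configs:
      # keep only pods that opted in via the scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # let the annotation override the metrics path (e.g. /prometheus)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # enrich every series with metadata taken from the pod spec
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```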
Back in 2016, we initially started using the Elastic Beats family. At that time I think there was only Filebeat, and Metricbeat was just coming along. At that point it looked really good to us: Prometheus was still a single server scraping the entire cluster, and we were already at a stage where we had several thousand nodes and a large number of Prometheus endpoints. It was difficult for us to figure out the sharding and set up multiple Prometheus instances. So we went the Beats route, because it was more deployable in the DaemonSet pattern, so to speak, and it was an agent built specifically for harvesting observability information. So until this point we have used Filebeat for collecting logs; Metricbeat, also initially deployed as a DaemonSet, to harvest metrics; Auditbeat, again deployed as a DaemonSet, to collect file integrity monitoring events and any audit rules added into the kernel, to figure out who is SSHing in or escalating to root and things like that; and finally Heartbeat for uptime checks.

A quick refresher on the DaemonSet pattern. A DaemonSet in Kubernetes basically allows you to deploy a single instance on every node that matches a given node selector. By default it goes to every node, or you can say you only need it on the set of nodes matching a certain label selector. The agent deployed onto a given node will first ask the API server for the metadata of all the pods on that specific Kubernetes node, and from that it decides, OK, I will monitor these specific pods on these specific ports. Once the data is collected, it tags it with all the pod metadata and ships it out. And the DaemonSet pattern doesn't limit you to tagging only pod-level metadata: you can also say that along with the pod metadata you need certain labels from the namespace, or from the Deployment object the pod belongs to, and so on. For those kinds of metadata you need to pull all the Deployment objects or all the Namespace objects, because those are not node scoped.

So what is the problem with this approach? The first and biggest problem is resource fragmentation, in the sense that every pod you deploy on every node carries some fixed cost for the process running in it. For example, if it costs 50 MB to run the Beats pipeline, or the OTel pipeline, or whatever, then on a 3,000 node cluster that's 150 GB. Depending on how many Kubernetes nodes you have across your data centers or your public cloud, you're going to spend that much just running the process itself. And there's a CPU cost as well, especially for pulling all the metadata from Kubernetes and maintaining it over time. The other problem: when you run as a DaemonSet, you cap how much you spend on metric scraping, say one CPU and one gigabyte of memory on every node. While we say that Prometheus endpoints need to be concise, the reality is that people tend to abuse instrumentation over time. We have seen endpoints as small as tens or a few hundred series, and endpoints with a few million entries; kube-state-metrics is a good example. So if you're giving one gigabyte to the Metricbeat running on the node where kube-state-metrics happens to be scheduled, it's going to crash, and when it crashes you lose visibility into all the other pods on that Kubernetes node as well. And given that you are watching the same objects from every Kubernetes node, you're putting unnecessary pressure on the API server, because you're doing a lot of redundant watches across all the nodes in the cluster.

To mitigate some of these problems, we moved to a cluster-local mode: rather than deploying the agent as a DaemonSet, deploy it as a StatefulSet. Depending on the cluster size, you choose, say, for every thousand nodes I deploy 10 instances of Metricbeat or the OpenTelemetry Collector, whatever, and you shard the work across all of them. If you do a simple hash mod, you can use the StatefulSet size as the modulus, and if I am the instance whose ID matches, I monitor that pod; otherwise I drop it and someone else ends up monitoring it.
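Conceptually, that hash-mod sharding is the same trick that plain Prometheus-style relabeling can express, where each StatefulSet replica keeps only the targets that hash to its own ordinal. The modulus and ordinal below are just placeholders for a 10-replica deployment.

```yaml
relabel_configs:
  # hash each discovered target and bucket it into one of 10 shards
  - source_labels: [__address__]
    modulus: 10                 # number of scraper replicas in the StatefulSet
    target_label: __tmp_shard
    action: hashmod
  # this replica (ordinal 3 in this example) keeps only its own bucket
  - source_labels: [__tmp_shard]
    regex: "3"
    action: keep
```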
So this is how it would look: rather than sitting on the node, you sit at the cluster level, monitor only the pods you are responsible for, and send the data out once you have harvested it.

What were the advantages of this approach? Tremendous cost savings. When we did this, we saved 90% of the capacity compared to running as a DaemonSet, specifically for metrics, and the pressure we put on the API server was substantially lower, because rather than 3,000 nodes doing those watches it's now 10 or 20 or 30, depending on how big the cluster is. And given that we are running a finite number of pods, we can choose to run bigger agents, say 16 cores and 32 gigs, or do a different t-shirt sizing depending on the cluster. This meant we could now process much bigger Prometheus endpoints without crashing, which was one of the bigger advantages we saw.

But somehow this was still not enough; we are never satisfied. One remaining problem is that when you do a rollout or a version upgrade of the agent, the moment one pod goes down, all the endpoints being monitored by that pod suddenly black out. And if someone has an alert on an absent query, say if the metric is absent for one or two minutes, and it took two minutes for the pod to come back up, that alert is going to fire. So this was problematic. And even though we were limiting the number of API server watches, it's still a lot of redundant watches. Sometimes, if a given cluster has several hundred thousand pods, just rebuilding that state, even with multiple workers on the control loop, takes several minutes, which definitely means someone is going to get alerted at the time of a rollout. And naive scheduling: hash mod is still hash mod, it's not rocket science.

So what we ended up doing is decoupling the discovery process from the agent itself, so that you have a centralized control loop that does the discovery and tells a set of workers, this is what needs to be handled by you. That's exactly what we did: we took the discovery process and put it into a separate control loop. We also started adding intelligence on top of it, looking at other parameters like CPU and memory; if a certain metrics agent is taking too much CPU, maybe take a few endpoints away from it and give them to someone else, so that the distribution is smarter than just hash mod. And we made the configuration generation a lot more pluggable: you have a language in which you define your scrape logic or scrape rules, and the control loop can parse that and generate a configuration for Metricbeat, or Filebeat, or even the OpenTelemetry Collector.
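To give a feel for what such generated configuration can look like on the OpenTelemetry Collector side, here is a sketch of a worker config where discovery has already been resolved into static targets by the control loop. The target address, labels, and ingest endpoint are all hypothetical.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: assigned-endpoints
          scrape_interval: 30s
          static_configs:
            - targets: ["10.12.34.56:8080"]    # pod IP resolved by the control loop
              labels:
                namespace: checkout             # enriched from the pod spec
                pod: checkout-7d9f8c5b4-x2x9k
processors:
  batch: {}
exporters:
  otlp:
    endpoint: ingest.observability.example:4317  # hypothetical ingest endpoint
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlp]
```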
So when OpenTelemetry came about, it was very fascinating to us, in the sense that, similar to Kubernetes, where a pod is a pod regardless of whether it's running on a VM, in a public cloud, or in a private cloud, it's an API. Now, in the observability space, OpenTelemetry brought the same notion: you have an API, you have an SDK, and you also have a collector that you can use to do arbitrary transformations and then write to any backend, whether that backend is owned by eBay, a vendor, or an open source technology. This was a very powerful concept, and when the metric SDKs were becoming stable, we began to reevaluate and thought, hey, maybe we should get onto this. We were already in the process of adopting tracing, so moving metrics, and eventually logs, made natural sense, so that we would be in one family altogether.

That being said, we had put in a lot of effort to optimize on cost and to solve problems in a way that makes switching to any agent easy. So it should be very easy, right? The journey wasn't that simple. Every feature that was available in Metricbeat needed a suitable alternative in OpenTelemetry, under the assumption that any given feature was being used by at least one developer inside the company. Were there any showstoppers? We identified one big gap: the OpenTelemetry Collector does not allow us to do dynamic configuration reloading. That was one of the features we really liked about Elastic Beats, and it wasn't there. Scrape parity: does a scrape from Metricbeat look exactly the same as it would from the OpenTelemetry Collector, or for that matter, does a scrape done by Prometheus look exactly the same on the OpenTelemetry Collector? And if not, is it a feature, is it by design, or is it an actual bug? These were some of the things we needed to figure out. And the last one: the OpenTelemetry Collector is a rapidly evolving community with very frequent releases, so if we decide to move to a more recent version, how do we ensure that we are not regressing on any feature and causing issues at rollout time?

So we came up with the list of features we needed and what it would take for us to move to the OpenTelemetry Collector. Some of it was already available: for metadata, you can use the attributes processor or the resource processor. Auto discovery we had figured out, in the sense that we came up with our own control loop that can handle it. Prometheus scraping: there is an alternative. Kubernetes metadata enrichment: there is an alternative. And for everything else, we figured we could work it out along the way, in the sense that if it's a feature the community would accept, we would file pull requests and get it into the community, or we could maintain it as a plugin internally.

To address the problem of not being able to reload, what we have done at this point is introduce an internal receiver called the file reload receiver, which takes a partial pipeline definition: this is the receiver, and these are the processors associated with it, and it plugs into a standard set of processors and an exporter. This lets us mimic the exact reloading feature that Beats has and bring the same capability to the OpenTelemetry Collector. Essentially, the receiver watches for file changes as configurations are added and removed, and it starts or stops those partial pipelines; if there's a change, it updates them, and so on.
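The partial pipeline definitions themselves are small files dropped into a directory the receiver watches. The schema of our internal file reload receiver isn't public, so the fragment below is only a hypothetical sketch to convey the shape of it: a receiver plus the processors to attach, which then plug into the collector's standard processor and exporter chain.

```yaml
# Hypothetical partial pipeline file for the internal file reload receiver.
# Creating this file starts the pipeline, deleting it stops it, and editing it
# updates it in place, without restarting the collector.
receiver:
  prometheus:
    config:
      scrape_configs:
        - job_name: checkout-pods
          scrape_interval: 30s
          static_configs:
            - targets: ["10.12.34.56:8080"]
processors: [k8sattributes, batch]   # joined to a standard exporter downstream
```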
We have been running this file reload receiver inside eBay for a while now, and I think we are at a point where we are comfortable starting to work with the community to see whether it is something that can be accepted into the contrib repository.

At a high level, this is how it looks: if a person using the Beats hints-based autodiscover language says, I need to scrape port 5001 at /stats, with a namespace, a module, a scrape frequency and a scrape timeout, the left side is what Metricbeat would generate and the right side is what we generate so that the OTel collector can understand it. As you can see, all the mappings needed to keep the configuration compatible across agents are done in the control loop.

We ran into a bunch of issues, especially with scrape parity: label sanitization was done one way by Prometheus and another way by the OpenTelemetry Collector, and a colon as the first character of a metric name was handled differently by OTel than by Prometheus. These are things we worked on with the community; we filed PRs and they were accepted, either behind a feature flag or as a straight-up bug fix. At least now we are in a place where the scrapes more or less match and we can confidently roll out OpenTelemetry into our Kubernetes clusters. We also put in a lot of effort to run pre-checks: we spin up Metricbeat, we spin up the OpenTelemetry Collector, point them at the same endpoint, collect their outputs, compare the left side and the right side, and make sure things are good; if they are, we move on to actually rolling out into the cluster. If we identify an issue, we triage, fix, and move forward. We kept iterating, and you would be surprised how many times we had to roll back, because, as I said, with 1.25 million Prometheus endpoints there is going to be a lot of, quote unquote, individual handwriting in how people did their instrumentation, some of it good, some of it not so good. Still, we had to make sure we did not break any metric scraping while moving from one agent to another.

So right now, and I think this happened just last week, we are done migrating from Metricbeat to the OpenTelemetry Collector. We even bridged the gap of exemplars not being scraped by the collector; one of our engineers filed a pull request for that which got merged a couple of weeks back. For tracing, we have been using the OpenTelemetry Collector and the OpenTelemetry SDK from day one. Logs are still on Filebeat; we are actively figuring out how to bridge the gaps, and sometime next year we will be fully on the OpenTelemetry Collector for logs as well. We are still hardening things by building regression frameworks that we can run periodically so that we can upgrade with ease; that is something we are actively working on. And over the coming days, weeks and months we will continue to work with the OpenTelemetry Collector community, both to contribute fixes for the issues we see and to help enhance the capabilities the collector offers.

This is our team. Payment Peter, the pillars of the agent group, did most of the migration for metrics. Aishwarya works on tracing and helped with the exemplar feature, and Edward is helping with the logs migration.
If you like what you heard, scan the QR code; we would definitely like to work with you on OpenTelemetry together.