Hello everyone. I'm Shailja Agarwal and with me is Mahesh Pai. We are staff software engineers at Intuit, and today we are going to talk about the approach we took to minimize the cost of our distributed tracing storage architecture. Before we begin: we are from Intuit, and we believe in open source and open collaboration. We are active consumers of, and have also contributed to, multiple open source projects in the CNCF space. This is the scale we deal with at Intuit: more than 2,000 services hosted across more than 300 Kubernetes clusters and 16,000 namespaces.

The agenda for today's talk: we will start with a brief overview of what distributed tracing is and its benefits, followed by the user journey with distributed tracing at Intuit. We will also touch upon the initial architecture and its challenges, then move on to the new cost-optimized architecture. We will deep dive into the trace data ingestion and query details, and we will wrap up the talk with benefits and next steps. So let's begin.

What is distributed tracing? As many of us here might already know, distributed tracing is a powerful technique that helps us track the flow and timing of a request as it moves through multiple systems, especially in a microservice architecture. It provides performance insights, helps us identify bottlenecks, and helps us pinpoint the failing service in a typical workflow scenario.

These are some of the terminologies of distributed tracing, starting with the span. A span represents the most fundamental unit of a trace. It contains information about the operation, the start time, the end time, and the context that needs to be propagated across the system. Events are structured log messages attached to a span. Attributes are key-value pairs that contain metadata about the operation we are tracking. And a trace is a collection of spans arranged in a hierarchy that mirrors the request flow.

Let's look into some of the benefits of distributed tracing. Intuit is a microservices company, and each request flows through multiple services. Each of these services has its own log endpoint, and although logs help us identify what the error is, during incident triaging it becomes difficult to identify which service to look into. That's where distributed tracing helps. It helps to isolate issues, it provides a top-level view of the end-to-end transaction flow, and it helps us integrate logs with metrics. So overall, it not only reduces the mean time to isolate, it further accelerates the mean time to repair.

Now let us look into a user journey powered by distributed tracing at Intuit. Let us assume I am an on-call engineer and I get an alert. This is one such sample alert from a front-end application. It has details about the plugin name, the web app name, the percentage of users impacted, and the failed interactions. If we follow this alert, it takes us to the exemplar traces that are degraded or failing during that particular alert window. If we double-click on any of these traces, it takes us to the full trace view, which shows the collection of all the spans and also highlights the error span that is failing for that time window. So basically, it helps us identify the service that is failing. We have also integrated this with logs, so if we click on the logs option, it takes us to the log endpoint, which shows what the error message is for that particular incident.
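To make the span, attribute, and event terminology above concrete, here is a minimal sketch using the OpenTelemetry Python SDK. The service name, attribute keys, and event are invented for illustration and are not from the talk or from Intuit's actual instrumentation.

```python
# Minimal illustration of spans, attributes, events, and trace hierarchy
# using the OpenTelemetry Python SDK. All names here are hypothetical.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())  # print spans instead of exporting them
)
tracer = trace.get_tracer("checkout-service")

# Parent and child spans belonging to the same trace form the hierarchy that
# mirrors the request flow; attributes are key-value metadata, events are
# structured log messages attached to a span.
with tracer.start_as_current_span("checkout") as parent:
    parent.set_attribute("user.tier", "premium")
    with tracer.start_as_current_span("charge-card") as child:
        child.set_attribute("payment.provider", "example-pay")
        child.add_event("card.authorized", {"latency_ms": 42})
```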
We have also used traces to build the service dependency graph. As you can see, during the alert time period it clearly shows which services are failing. In this particular image, we can imagine that there would have been multiple alerts across multiple services, but the service dependency graph helps us pinpoint what exactly is the root cause of the failure. Along with this user journey, we have also allowed users to filter their exemplar traces by service attributes. This gives them the flexibility of identifying different anomalous patterns in their service based on their service attributes and other application metrics.

Now let's move on to some challenges at Intuit for distributed tracing. We have already mentioned that we have more than 2,000 services, and each of these services can have different attributes. Overall, we have more than 1,000 distinct attributes. The traffic can also peak at up to 8 million spans per second, so indexing these attributes can have huge implications on cost. Also, due to the statefulness of the data stores, the backend storage has to be pre-scaled for maximum throughput at all times, and cost attribution and predictability become difficult. We are from the platform team and we have to attribute the cost back to service owners, so an increase or decrease in traffic on any one service can have huge cost implications across all these services. Mahesh will be talking about the architecture details now. Thank you.

Thank you, Shailja. What I'll do is first go over the initial architecture that we had and the problem that we faced, and then we'll go on to how we went about fixing it. This is a typical architecture that you will generally see in any tracing backend. You get traces from different sources: if you look at the left side of the architecture diagram, you'll see traces coming from VM-based systems, traces coming from Kubernetes systems, and traces coming from your front end as well. All these spans and traces go into a common backend where the traces get stitched together. We deploy a set of OTel agents and OTel collectors, which are used to collate all this information, and the data is then streamed to a backend pipeline using S3 and Kafka. Most of our services are deployed primarily on AWS infrastructure, which is why we use S3 as the storage mechanism. And we use S3 plus Kafka so that we can replay the traces in case we run into issues in the backend pipeline.

Once the data is pushed to S3, we deploy a component called a trace ingester. The trace ingesters go through all the spans and traces in S3, perform some massaging, and then store them in two different stores: a trace store and an indexing store. The indexing store is required because, like Shailja mentioned, we should be able to search traces based on different attributes. Our user journey is not based only on a trace ID, where you get an exemplar trace ID and then look at that trace to see what has gone wrong. We also give users the capability to search their traces based on different attributes, for example a transaction ID that has failed, or "give me a trace which has high latency," and so on. For that feature to be available, we need some kind of indexing engine, so that's why we started storing data in an indexing engine.
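As a rough sketch of what that initial ingestion path could look like, here is a hypothetical loop in Python. The assumptions are mine, not the speakers': that the Kafka topic carries S3 object keys (which is what makes replay possible), that each S3 object holds a JSON batch of spans, and that the two store writers are simple placeholders. The topic, bucket, and function names are invented.

```python
# Hypothetical sketch of the initial trace-ingester loop: Kafka messages carry
# S3 object keys, each object holds a batch of spans, and every span is written
# to both the stateful trace store and the attribute indexing store.
import json
import boto3
from kafka import KafkaConsumer  # kafka-python

s3 = boto3.client("s3")
consumer = KafkaConsumer(
    "span-batches",                              # illustrative topic name
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda v: json.loads(v),
)

def write_to_trace_store(span):
    """Placeholder for the stateful trace store (retrieval by trace ID)."""
    ...

def write_to_index_store(span):
    """Placeholder for the indexing store (search by span attributes)."""
    ...

for message in consumer:
    key = message.value["s3_key"]                # e.g. a path to one span batch
    body = s3.get_object(Bucket="tracing-raw", Key=key)["Body"].read()
    for span in json.loads(body):
        write_to_trace_store(span)
        write_to_index_store(span)
```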
And then you have the final trace store, which is used to pull up the completely stitched trace. Once you've identified a trace ID, you give the trace ID and pull the entire set of spans associated with it from this trace store. Then we have a layer called the trace search API, which is nothing but a query layer that abstracts the storage from the user.

With this architecture, when we started looking at the cost aspect, we quickly realized that the majority of our cost was coming from the two components I talked about: the trace store and the indexing store. These contributed almost 70% to 75% of our overall backend cost. So we wanted to tackle this and figure out how to optimize or even replace these components. Given that our load volume is pretty high, a peak of almost 8 million spans per second in spite of head sampling, and given that it's going to grow year on year, we wanted to do this at the earliest possible time.

So we started looking at how to attack this problem. One of the main reasons the cost is high is that the backend stores are stateful in nature, and when they are stateful, it becomes difficult to manage the scaling aspect. The bottom chart that you see is a typical trend of the tracing traffic throughout the day: it is always either ramping up or ramping down, which means your backend has to be continuously scaling out or scaling in, which makes it very difficult to maintain. So what eventually happens is that we end up scaling the backend to the maximum capacity we expect in a day, which is around 3 million TPS, which means that during off-peak times we are wasting resources and incurring higher cost. That was one of the primary reasons we were incurring higher cost.

When we started looking at the new architecture, we first decided what we wanted out of it. First, we are a system which ingests high volume; like I said, ingestion can go up to 8 million spans per second. But the actual query load is very small. For example, even if there are 1,000 to 2,000 developers logged into the dev portal or the UI, the query load that translates into is very minimal because these queries are triggered manually. Your query load would be 10 to 20 QPS at most, which is what we have seen even when there are thousands of users using it simultaneously. So we were okay with a system that could cater to, let's say, 50 or 100 QPS. That was the first criterion we looked at.

The other one was the number of attributes. Shailja mentioned we have 2,000 services, which means each service can send a different set of attributes, and there are well over 1,000, up to around 2,500, different attributes in the system. But when a user is searching for data, he or she would look at only a couple of attributes, like the example I mentioned: you could be looking for a transaction ID, or for latency, or an operation name, and so on. So the amount of data is high, but the data you are actually querying on is very minimal. That's the other characteristic of the data we were looking at. The next thing was that we were okay to compromise on the response time.
The current system gives us sub-second query latency, but we were okay even with a system whose response time goes up to, let's say, a second or two, if we could bring down the cost drastically. From a user standpoint it doesn't make too much of a difference; this is not a system where you're running automated queries. And the other thing we also wanted to build was analytical capability on the data we were ingesting.

Having looked at all these characteristics, it became clear to us that we should look at columnar data storage, where the data is stored in a columnar format. This slide explains briefly the difference between row-based storage and column-based storage. If you look at the example, there's a table with three columns. In a typical row-based layout, the rows are stored contiguously, which means that if you're looking for a subset of the data, say you're trying to search within column one, you end up scanning the entire dataset. That is how a row-based system works. On the flip side, in a column-based system, all the data within a column is grouped together and stored contiguously. So when you're searching within, say, the first column, a minimal data scan is enough to retrieve the values A, B, and C. And that is typically the access pattern with tracing data: there are thousands of fields, but you're only looking at one or two of those attributes, which means you're reading only one or two columns of the entire row. That's why we went ahead with columnar storage as the format for our data.

The format we chose was Apache Parquet. It's an open-source storage format: you write data in the Parquet format and store it in a file. A Parquet file has a self-describing schema, so when you open one up, it has metadata that tells you what schema is stored within the file, what the different columns are, what their types are, and so on. It clearly tells you what data resides within the file. And given that this is a columnar store, it is faster when you are querying on a subset of your attributes.

Coming to the Parquet ingestion: in the case of distributed tracing, each span translates into a record, which means we have to define a schema for a span. A span generally has some mandatory fields like the trace ID, the span ID, the start time, the end time, and so on. These become the mandatory fields within your Parquet schema, and then you have optional attributes, which are nothing but your span-level or resource-level attributes. All of these are described as fields within your Parquet schema.

So how do we define a schema? We started by looking at a global schema. What that means is there's one schema that is used across the pipeline. There are 1,000 or 2,000 services which emit, in total, more than 1,500 attributes, which means you define each of those attributes as a field in your schema. That's the simplest way to go about it.
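To illustrate the "span as a Parquet record" idea, here is a small sketch with pyarrow. The mandatory fields follow the talk (trace ID, span ID, start time, end time); the service column, the two optional attribute columns, and the values are invented for illustration.

```python
# Sketch of writing spans as Parquet records: mandatory span fields plus a
# couple of optional attribute columns. Attribute names here are illustrative.
from datetime import datetime
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("trace_id",   pa.string()),
    ("span_id",    pa.string()),
    ("service",    pa.string()),
    ("start_time", pa.timestamp("us")),
    ("end_time",   pa.timestamp("us")),
    # Optional span/resource attributes become nullable columns:
    ("http.status_code", pa.int32()),
    ("transaction.id",   pa.string()),
])

spans = [{
    "trace_id": "4bf92f35", "span_id": "00f067aa", "service": "checkout",
    "start_time": datetime(2024, 6, 1, 10, 5, 0),
    "end_time":   datetime(2024, 6, 1, 10, 5, 0, 150000),
    "http.status_code": 500,
    "transaction.id": "txn-42",
}]

table = pa.Table.from_pylist(spans, schema=schema)
pq.write_table(table, "spans.parquet")

# The file is self-describing: the footer carries the full schema.
print(pq.read_schema("spans.parquet"))
```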
So with the global schema there's only one schema you have to work with, and everything gets ingested using that single schema. But we quickly realized that this was not going to scale or work. The reason is that a span will typically have maybe 5 to 10 attributes, but the schema defines 1,000 attributes, which means the rest of the attributes have to be marked as null. We found that almost 42 percent of the CPU cycles were being spent writing these null values. So it was very clear to us that this would not scale, and we moved on to evaluate a different approach.

Then we looked at another way of defining the schema, where we group the services within a BU (business unit), figure out what attributes are emitted by those services, and define the schema using those attributes. This was much more efficient: the number of null columns was much smaller, and in addition to time, you could partition the data by BU as well. But we wanted to go a step further, so we tried a schema per service, which means you have one schema per service. There are 1,000 or 2,000 services, but this is very efficient because you define the explicit set of attributes that each service generates, so the number of null fields or null columns is very minimal, and you can partition the data by time, BU, and service. That means when you query the data, the amount of data scanned is very minimal, because you query only the data you are interested in and don't have to touch any of the rest. For example, if you are looking for a trace of service A, you will be looking at that service's data alone and none of the other data.

The flip side of this is managing multiple schemas. Like I said, there are 2,000 services, which means there are 2,000 different schemas to manage, and this cannot be done manually. So we came up with an automation system to manage it, and eventually we went ahead with this per-service schema. This is just an example of what the schema would look like: some required fields like trace ID, asset ID, asset alias and so on, and then the optional fields, which are nothing but the span attributes. Every span emitted from the system becomes a record in this schema.

So this is the final architecture that we ended up with. If you notice, the change is only in the trace ingester. The trace ingester writes data into Parquet files, and there are three different buckets of data that we write. One is what I just explained, where each span is written as a record; this is what powers our exemplar queries, so any exemplar query that we execute runs against this data. Then there is one set of data which is the traces, the replacement for the trace store: here we collate all the spans for a given trace ID and store them as Parquet data. The third one is the metadata, which is used to derive the schema. The metadata contains the entire list of services we are ingesting for a given asset, what the different attributes are, and for a given attribute, what the different values and the cardinality are, and so on. It is a very rich metadata set, collated by the metadata manager and then used by the trace ingester to derive the Parquet schema dynamically.
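Here is a minimal sketch of how a per-service schema could be derived from the attributes the metadata manager has seen and then used to write one partitioned file. The mandatory field list, the all-string typing of attributes, and the exact path layout are simplifying assumptions of mine, not the production logic described in the talk.

```python
# Hypothetical sketch: derive a per-service Parquet schema from the attribute
# keys known for that service, then write one file under a time/BU/service
# partitioned path. Attribute types are simplified to strings.
import os
import pyarrow as pa
import pyarrow.parquet as pq

MANDATORY = [("trace_id", pa.string()), ("span_id", pa.string()),
             ("start_time", pa.int64()), ("end_time", pa.int64())]

def schema_for_service(known_attributes):
    # Only the attributes this service actually emits become columns,
    # so almost no null columns get written.
    optional = [(name, pa.string()) for name in sorted(known_attributes)]
    return pa.schema(MANDATORY + optional)

def write_service_file(bu, service, spans, known_attributes, ts="2024/06/01/10/05"):
    schema = schema_for_service(known_attributes)
    table = pa.Table.from_pylist(spans, schema=schema)  # missing keys become null
    path = f"exemplars/{ts}/bu={bu}/service={service}/part-0.parquet"
    os.makedirs(os.path.dirname(path), exist_ok=True)
    pq.write_table(table, path)
    return path

# Usage: this hypothetical service only ever sends these two attributes.
write_service_file(
    bu="smallbusiness", service="service-a",
    spans=[{"trace_id": "4bf92f35", "span_id": "00f067aa",
            "start_time": 1717236300000, "end_time": 1717236300150,
            "transaction.id": "txn-42", "operation.name": "charge"}],
    known_attributes={"transaction.id", "operation.name"},
)
```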
So as and when data is being ingested into the system, the trace ingester can figure out which schema to use for that particular span and use it. This becomes very efficient both for ingestion and for query. And the only change required was in the ingester and the query engine, so there was no impact to the end user as such. The end user did not see any impact to their operation; we were able to replace this without any downtime.

Let's quickly look at how data ingestion is done. The trace ingester is a set of stateless pods, so they scale in and scale out based on the traffic being received. The trace ingester groups the data by service name and then writes one file into S3, so for each service, a file is generated by each of the trace ingester pods. Then we deploy a set of compactors. A compactor manager goes through this set of files and creates a compaction record; basically, it says "give me all the files that were created for service A" and creates one file out of them. So it compacts thousands of files into, let's say, a handful of 10 or 15 files. Eventually, for a given five-minute window, we have one file per service. So when you are searching over, let's say, a 60-minute window, the query only has to look at around 12 files to retrieve the data, which is very efficient.

This is the exemplar data, and it is stored in the partition format listed here: we partition the data by year, month, day, hour, minute, and finally the service. So when you are searching for an exemplar, you know the exact path you need to go into to find the data.

A similar method is followed for the trace-by-ID data. We run a hash on the trace ID, group the data based on that hash, and write it into a file, and the compactor does its job of compacting multiple files into one. When the query is triggered, we do the same thing: we run the hash on the trace ID, we know the hash code associated with it, so we know which path to retrieve the data from. If you look at the partition format, it is more or less the same; the only change is that the partition ID, the trace ID hash, becomes one of the partition keys. With this approach, the amount of data scanned when we run an exemplar query or a trace-by-ID query is very minimal.

Coming to the query part: we have used a couple of stateless SQL query engines. We played with AWS Athena, we played with ClickHouse, we played with Trino, and so on, and we deploy a couple of these query engines to power our queries. The trace search runs SQL queries, and they internally go and fetch the data from Parquet. Since, like I said, the storage is column-based, the amount of data scanned is very minimal, and because the data scan is minimal, the response time is also manageable. We see a response time of around one to two seconds for every search we do, which is manageable from a user standpoint.
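To give a flavor of the stateless query path, here is an illustrative exemplar search over the partition layout sketched earlier. It uses DuckDB only because it runs locally with no server; the engines actually mentioned in the talk are Athena, ClickHouse, and Trino, and the SQL would have a similar shape there. The path, columns, and filter values are invented.

```python
# Illustrative exemplar search over the hypothetical year/month/day/hour/minute/
# bu/service partition layout. In practice the trace-search API builds the path
# from the time window, BU, and service so the data scan stays small.
import duckdb

con = duckdb.connect()
rows = con.execute("""
    SELECT trace_id, span_id, start_time, end_time
    FROM read_parquet('exemplars/2024/06/01/10/*/bu=smallbusiness/service=service-a/*.parquet')
    WHERE "transaction.id" = 'txn-42'
      AND end_time - start_time > 100        -- e.g. only slower spans
    ORDER BY start_time DESC
    LIMIT 20
""").fetchall()
print(rows)
```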
So finally, coming to the summary, key takeaways, and benefits. Like I said, managing stateful stores is very difficult given the volume we have and the continuous flux in traffic, which makes the operational side of a stateful service very hard. Columnar storage keeps the data scan minimal, and with this we have been able to reduce the cost by 69%. We have deployed this in production, it has been running for the last couple of quarters, and we actually see the 69% reduction. This is not a theoretical number we are quoting; it is actual savings realized over the last quarter. We have also been able to scale from 500k TPS to 8 million TPS without any manual intervention and without pre-scaling any of our systems: the system automatically scales out and scales in as the traffic grows and comes down. The introduction of new services is straightforward, service owners have better control over their cost, and so on.

As for next steps, we are looking at Apache Iceberg. Apache Iceberg can be used to manage the metadata; currently we are doing it ourselves, in an automated fashion, using the metadata manager, but we want to explore Apache Iceberg to see how it can help. We also want to build a query layer that can probably replace the stateless query engines we are using; we are looking at DuckDB, DataFusion, and other SDKs to achieve this. And the last and most important thing we are looking at is tail sampling. We are looking for a very efficient tail sampling approach that can hopefully help us reduce the overall volume. Currently, even though we do head sampling at 35%, we end up ingesting almost 8 million spans per second, and typically most of the traces are successful traces that you may not be interested in. So we are trying to see how we can efficiently deploy tail sampling so that only the interesting or errored traces, the ones we actually want, are stored in the backend, which would bring down the overall volume drastically. I think that's all for our talk. Any questions?