Hi folks, thanks everyone for coming today. My name is Gustavo Pantusa, and I would like to share with you how you can leverage OpenTelemetry collectors and the OpenTelemetry protocol to create an observability solution that is able to ingest 6.5 terabytes of telemetry data per day. I can't believe they let me get away with this huge presentation title, so thanks, KubeCon, for that. I'm the tech lead manager for the observability team at VTEX. I previously worked for GoDaddy and also for Globo.com, which is sort of the Netflix of South America. Those are my handles for GitHub and Twitter, so feel free to reach out. I also write about computer science topics on my personal blog in Brazilian Portuguese, to help strengthen our local computer science community.

Let's jump into the agenda for this presentation. First I would like to share with you what kind of system we are observing, so what VTEX is about. Then, inside this context, where is the problem, which problem are we trying to solve? Before jumping into the details of the architecture and solution, I will briefly state what the solution is and the outcomes we reached with it. Then we get to the architectural part, where I would like to share how we implement those things, and also talk about the resilience of the solution, so how we avoid outages and keep these things up and running.

So what is the system we are observing? VTEX is an enterprise digital commerce platform where customers can build, manage and deliver their online stores. It's a multi-tenant platform, so we have a single platform where the brands come and deliver their stores on top of this shared infrastructure. Today we have 3,200 online stores delivered across 38 countries, and about 1,300 employees in 18 locations around the globe.

So where is the problem? Recall that we are talking about a multi-tenant e-commerce platform. We used to have a single observability vendor system for logging and so on, and all of the applications, alongside all the modules inside this e-commerce platform, used to send telemetry data directly to this single vendor, this single observability system. Within this context of having this centralized place, we had too many implementations, too many libraries for communicating with this vendor, and none of them shared common things like fields, or the resilience controls that would be nice to have across all teams. Also data governance: business people and engineers got together in the same place and could reach every sort of telemetry data, where sometimes we would like to have more granular governance. Also, when it comes to KPIs and business metrics, teams used to run queries and analytical workloads on top of a logging solution, which was not the best place for it, so we wanted to split those things out and give better solutions for the analytical use cases. Also, no single vendor offers the best solution for every kind of telemetry data: we have logs, metrics and traces, and in our opinion there is no vendor which is the best one for all of them together. And finally, inefficient ingestion control: a single proxy just sending data to this observability system elsewhere, with no controls such as sampling, for example.
Also, the libraries used to communicate over HTTP/1.1 with no encryption. Data was going through this pipeline as raw strings and unstructured logs, with some sort of metrics, and we were trying to do traces on top of a logging system. And there were no common fields spread across the engineering teams. So, with all of those problems together, if I could state in a tweet what kind of problem we are trying to solve, it is this: how to evolve to a long-term observability solution, without vendor lock-in, while improving the efficiency of our observability stack. That is the statement of the problem.

Based on this problem, what is the solution? I would like to start drafting the solution by showing a diagram, which kind of compiles everything into a single image, and afterwards we will jump into the details. In this first image, in the rows you can see the telemetry signals, and as columns, from left to right, you can see the application source code, where the telemetry starts. From there, everything goes to OpenTelemetry collectors. We are going to talk more about those collector deployments, because they are different for each telemetry signal. And then we have the data sinks. In this architecture, as the data sink for logs we are using AWS-managed OpenSearch; for metrics we are using AMP, AWS Managed Prometheus; and for traces we are using Honeycomb. In all three of them we can do visualization of the data, but we also try as much as possible to centralize the things that are common and shared among all teams, let's say common dashboards, into Grafana. Alerting as well.

All right, so what are the premises of this solution? Basically, the OpenTelemetry protocol on every possible layer. From the origin, in the libraries, we are going to talk about it, but as you can see in the previous image, we have the diagnostics library, which is a library we built internally in the company, and it speaks the OpenTelemetry protocol from the very beginning of the request data flow. So, OpenTelemetry protocol on every possible layer. Also, common library interfaces, so teams have the same sort of interface when implementing and sending telemetry data over the pipeline. OpenTelemetry collectors are the ingestion point for every telemetry signal, even logs. We are going to talk about it, but everything goes through the OpenTelemetry collectors so we can then export it to the proper data sink. And for each telemetry signal, as I stated before, we believe that no single vendor is able to handle all of the telemetry signals in the best possible way, so we decided to go with a different vendor for each telemetry signal. Also, to increase our ability to keep things up and running, we created a sharded architecture for this solution; I will jump into this as well further on.

All right, what were the outcomes after deploying this solution? In the observability slice of our cloud investments, we were able to reduce our observability costs by 41%. We also delivered a long-term solution where developers, at the end of the day, don't have to change anything, or do deployments or migrations, if we change vendors in the data sinks. Because we have the library in place, we can simply say: you are sending telemetry data over the OpenTelemetry protocol, and if we change the backend, we change something on the observability team side, but we don't require engineers to migrate anything at the source. So it allows us to evolve and innovate fast. And today, and moving further, this solution is able to handle, and is ingesting, 6.5 terabytes of telemetry data per day.
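Before going into the architectural details, here is a minimal sketch of what these premises can look like in collector configuration terms: OTLP in on every pipeline, and a different exporter per signal. The component names (otlp, opensearch, prometheusremotewrite) are standard OpenTelemetry Collector and contrib components; the endpoints and keys are placeholders, and putting all three pipelines into one config is only for illustration, since in practice each signal runs as its own deployment.

```yaml
# Illustrative collector config: OTLP in everywhere, one vendor per signal out.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  batch: {}

exporters:
  opensearch:                      # logs -> AWS-managed OpenSearch
    http:
      endpoint: https://opensearch.example.internal:9200   # placeholder endpoint
  prometheusremotewrite:           # metrics -> AMP (in practice AMP also needs SigV4 auth, omitted here)
    endpoint: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write
  otlp/honeycomb:                  # traces -> Honeycomb over OTLP/gRPC
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [opensearch]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/honeycomb]
```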
Well, let's jump into the architectural part of this solution, and let's start with the library. We have designed an interface where an engineer can simply say, for example, logger.info, logger.warn, logger.debug, or emit a metric: a common interface that we could implement in various languages. Those are the top three languages within the company, so we developed the same library, the diagnostics library, for each one of those languages. That way we were able to put common code, for example resilience controls such as backoff, retries and circuit breakers, in a single place, along with asynchronous communication, and to enforce some things like the OpenTelemetry protocol and gRPC. Here is an example of the common interface, just out of curiosity. The libraries implement those interfaces, and engineers, as the customers of the observability team, can send telemetry data using the same interface everywhere. We also delivered common fields: for all sorts of telemetry data we have common fields that either come built in with the library or can be added by the developers themselves inside the telemetry payload. This is nice when it comes to the visualization part, because they can correlate things based on those common fields and on the things that matter to the business side.

Now let's talk a little bit about the deployment of the collectors, the middle part of the previous architecture diagram. The first thing I would like to mention are the OpenTelemetry collectors themselves. They are simply a Deployment inside Kubernetes, where we run collectors inside containers, and they handle the ingestion of the telemetry data being sent, whether it is logs, metrics or traces. This is pretty much the same kind of deployment for all of the collectors; the configuration is the part that differs for each telemetry signal and pipeline. The second part of this architecture is a very straightforward way of doing the ingress, the networking part: it is simply a load balancer plus the ExternalDNS plugin, which keeps up to date how the library reaches the internal services inside Kubernetes, and everything is managed by Argo CD, which is where we change configurations for the ingress layer, ExternalDNS or even the collectors. Everything we update goes through a pipeline, and then Argo CD is responsible for deploying and putting those things into production for us.

Now, moving further, how do we build our collectors? Because we use a custom collector, we write extensions for the OpenTelemetry collectors, and we use the OpenTelemetry Collector Builder for it, so we build our own binary. In our pipeline, we have a bunch of source code where we extend the OpenTelemetry collectors, and then we have a pipeline where we build the binary, create a container image, notify Argo CD to trigger the deployment, and Kubernetes is responsible for doing the progressive rollout for us. This is pretty much how we put our collector binary into production. And here's an example of how we build the OpenTelemetry collector.
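The slide itself is not reproduced here, but roughly, and with illustrative versions and hypothetical module paths standing in for the private code, the builder manifest looks something like this:

```yaml
# builder-config.yaml: consumed by the OpenTelemetry Collector Builder (ocb) to produce a custom binary.
dist:
  name: otelcol-custom
  description: Custom OpenTelemetry Collector distribution
  output_path: ./dist

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.96.0
  # hypothetical private receiver for specialized internal use cases
  - gomod: github.com/example-org/otelcol-internal/receiver/legacyreceiver v1.0.0

processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.96.0
  - gomod: go.opentelemetry.io/collector/processor/memorylimiterprocessor v0.96.0
  # hypothetical private processor, e.g. per-index log sampling
  - gomod: github.com/example-org/otelcol-internal/processor/indexsamplingprocessor v1.0.0

exporters:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/opensearchexporter v0.96.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter v0.96.0
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.96.0

extensions:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/healthcheckextension v0.96.0
```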
If you are familiar with the OpenTelemetry Collector Builder, this is pretty much what you have inside the YAML file, and those are the parts that are important to us. In every one of those four layers we have private code running: we implemented receivers for specialized internal use cases, and exporters for internal systems or other vendors and systems we need to send telemetry data to. The processors are a special part for us, because they are the way we can manipulate data on the fly: telemetry data goes through the collectors, we can manipulate this data and take decisions about it. One example of this is sampling; I will speak more about it in the resilience part of this presentation.

Here is the telemetry data flow, to try to simplify things. The telemetry data flows from the diagnostics library and, depending on the DNS name (if the developer is calling logger.error, we know this is a log), it goes to the proper URL that reaches the load balancer in front of the Kubernetes cluster responsible for the logs pipeline. There it gets ingested, processed and then exported to OpenSearch. The same flow applies to metrics and traces. When it comes to visualization, we have the role of Grafana. As I mentioned, we use Grafana for reusable, central, common dashboards for engineers. They are also able to create their own dashboards, but we keep the common things there, so they get some dashboards out of the box just because we have Grafana and can create templates and so on. They can also create specialized visualizations directly on OpenSearch, AMP and Honeycomb. We have governance for teams and engineers on top of the company's SSO, so we have groups and the ability to create roles and permissions for teams and different types of users to access different types of data.

Well, just to reinforce the diagram after going through the architectural part, and to recall how we designed this architecture: with this solution today, we have four terabytes of logs going to OpenSearch, and 150 million active time series going to AMP. It is important to mention that we even send some business metrics there, and we are able to handle high cardinality on some of those metrics, which we know is something important. And when it comes to traces, we are ingesting today 2.15 billion individual events into Honeycomb.

All right, let's jump in and talk about resilience. With this architecture in place, we have to make it work and keep it up and running. One of the premises we have for this solution is to fail locally, not globally. When I say that, imagine a scenario where I lose a collector, or one of the data sinks, or some ingestion at the origin goes wrong. With that in mind, we designed this architecture to fail locally. If I sometimes lose one collector on the logs side, OK, I can communicate to the engineering teams: hey, we have an observability outage, but this outage is only for logs; we still have metrics and traces. In the previous scenario, when we had observability outages, teams probably had no observability at all, which was a real problem for engineers. So by breaking those things into different pipelines, and with sharding, which I will talk about as well, we have a way to fail locally, and it is also easier to communicate with engineers. A minimal sketch of what one of these per-signal collector deployments can look like follows below.
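The exact manifests were not shown in the talk; as a minimal sketch, assuming a hypothetical naming scheme, one per-signal collector Deployment plus its ExternalDNS-managed endpoint could look roughly like this. The image name, hostnames and replica counts are placeholders.

```yaml
# Hypothetical manifest for the logs pipeline (shard 1); metrics and traces
# would be separate Deployments/Services with their own config and hostname.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otelcol-logs-shard1
spec:
  replicas: 3
  selector:
    matchLabels: {app: otelcol-logs-shard1}
  template:
    metadata:
      labels: {app: otelcol-logs-shard1}
    spec:
      containers:
        - name: otelcol
          image: registry.example.internal/otelcol-custom:1.2.3   # custom-built collector image
          args: ["--config=/etc/otelcol/config.yaml"]
          ports:
            - containerPort: 4317   # OTLP gRPC
          volumeMounts:
            - {name: config, mountPath: /etc/otelcol}
      volumes:
        - name: config
          configMap: {name: otelcol-logs-shard1-config}
---
apiVersion: v1
kind: Service
metadata:
  name: otelcol-logs-shard1
  annotations:
    # ExternalDNS publishes a per-pipeline hostname that the diagnostics library resolves
    external-dns.alpha.kubernetes.io/hostname: logs-shard1.otel.example.internal
spec:
  type: LoadBalancer
  selector: {app: otelcol-logs-shard1}
  ports:
    - {name: otlp-grpc, port: 4317, targetPort: 4317}
```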
Another thing for resilience is pod autoscaling, pretty much regular Kubernetes stuff. But here I have to state that for the different deployments we have different configurations: the log ingestion behavior is different from the trace ingestion behavior, so the types of machines and the way we do autoscaling differ depending on the telemetry data. This was another important part for resilience.

As I mentioned, we have shards. How we shard is a business-driven decision, but let me share an example. Imagine, as I mentioned, a multi-tenant platform for e-commerce: we have several internal systems, modules and code. Some of them are core systems, and they should not fail or lose observability, while there are other small APIs and backend systems that are just supportive of the main applications. In this case we can split them into shards, so we can avoid losing telemetry, or observability, for the core systems, for example. So I not only have the separation into the three telemetry signals, logs, metrics and traces, but we also have shards. When it comes to failure, I can lose shard one of the logs pipeline, so the failure is even more local. It also helps when communicating with engineers: hey, we have an outage in the logs pipeline, but only for shard one, and you are probably not affected if you are not in this shard. So this is another way of failing locally and improving our observability.

Another thing is one of our extensions: we allow teams to sample logs. Error logs don't get sampled, but if you are sending logger.info or success logs, we can sample them. We implemented this extension for the collectors, and we allow teams to individually say: this is my index name, I allow you to sample me by this percentage. In this example we have a team saying, OK, sample me by 75%, but in another case we have zero percent. Imagine, for example, a PCI compliance team implementing a payment system that has to store 100% of its logs; in that scenario you just skip sampling. This is also very nice for us when it comes to seasonality: when traffic grows, we can increase our default sampling percentage, so the observability stack doesn't grow linearly alongside the platform itself. We are able to adjust those things on the fly. The default percentage is up to the observability team, but individual indexes and individual teams have the freedom to configure their own sampling percentage.

Another thing we do is the write-ahead log. We implement a WAL right before sending telemetry to the data sink. For example, if one of our vendor systems, say OpenSearch, is suffering some sort of failure, we see on retry that we are receiving failures from the data sink, we open a circuit breaker, and we start writing the entire telemetry data to an S3 bucket. Then we have a Lambda function, or a set of Lambda functions, that backfills this data into the data sink after it recovers from the failure. It is important to mention, when communicating with our engineers, that we are not losing data in this sort of failure: we are delaying data. That is a different thing, and it's a nice SLA to have with our engineers. We are not losing their data; we are delaying it because we are suffering an outage. So this write-ahead log helps us with that. A rough sketch of the closest standard collector building blocks follows below.
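This S3-plus-Lambda write-ahead log is a custom extension of ours and is not shown here; as a rough analogue using only upstream components, the collector ships per-exporter retries and a persistent sending queue backed by the file_storage extension, which buffers data on disk while a backend is down. The exporter choice, endpoints, paths and sizes below are illustrative.

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue     # placeholder path for the on-disk buffer

receivers:
  otlp:
    protocols:
      grpc: {endpoint: 0.0.0.0:4317}

processors:
  batch: {}

exporters:
  otlp/backend:                           # stand-in for the real data-sink exporter
    endpoint: backend.example.internal:4317
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 10m               # keep retrying for a while before giving up
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
      storage: file_storage               # persist the queue: an outage delays data, it doesn't drop it

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/backend]
```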
Also, we have the alerting part, which I mentioned: we use Grafana and allow teams to create their own dashboards and their own alerts, and we have an internal structure for incident management. Teams have their systems for incident management and handle the communication with each on-call engineer, so the alert goes through those other systems and then reaches their on-call engineers. The engineers themselves are able to define their alerts and which metrics are of interest to them.

Now, some tips on the migration step, because we didn't just flip a switch, going from a centralized solution to the OpenTelemetry protocol and this entire new pipeline. One of the important things for us were RFC-like documents; we call them design docs. We wrote design docs for this solution and called in the entire engineering organization. All of the teams that build the products themselves jumped in to say: hey, this works for me, this doesn't work for me, I have a problem, I need another protocol, I need a different style of ingestion, I'm not using exactly OpenTelemetry, we need help with this and that. Having this sort of strategy helped a lot throughout the migration. Also, buy-in from the C-levels and directors: everyone in the company understood that, OK, this is a company shift, this is not something the observability team is trying to push onto the engineering teams. That helped a lot throughout this migration. Also, understanding our customers: at the end of the day, the customers of the observability team in our company are the engineers, so we sat together with them, discussed and understood individual cases and common cases, and that is how we could draft and design this solution. When it comes to resilience, it was really important to engage our vendors: how does your data sink, your application, fail? How can I prevent failures, and so on? We had to engage with those vendors in order to be more resilient and stay up and running. Also, finding early adopters was helpful: as with any new technology, any innovation, there are some people who say, hey, I would like to be part of it, let me try it. Engage those people, bring them close to the team, and they help you draft and check whether the solution is going well and is stable enough to reach the other teams.

Well, just a brief recap. We saw and understood what VTEX is, a multi-tenant e-commerce platform. Then we jumped into the problem we were trying to solve, and we drafted the solution plus the outcomes we reached with it. Then we went into the architectural part, sharing how we deploy the collectors, how we do our Kubernetes part, and so on, and we spoke about resilience, how to keep this new solution up and running. And finally, I would like to say thanks to the OpenTelemetry community. This ecosystem enabled VTEX to innovate fast and efficiently, so we are really thankful for that. Thank you, OpenTelemetry. And we reached the end of this presentation. I don't want to stand between you and lunch, so thank you, folks.

Thank you, Gustavo. That was an awesome talk.
I know we are over time, but I want to give time for one or two questions. Does anyone have any quick questions? Yeah.