Good evening, KubeCon. 6 p.m. on a Wednesday. If you're hearing my voice, I hope you're here to learn about observing with Rust, OpenTelemetry, and Tremor. I am Gary White Jr. I work for Wayfair in the open source program office. You can see me here both in my emoji form and in my true form in the corner. I know it's going to be hard to keep track of who's who; the resemblance is completely perfect. Some of my most notable contributions to Wayfair include my time working on Tremor, my time building things on Buildkite for the company, including CI infrastructure and contributions to the Buildkite agent, which is on GitHub. And of course, making emojis. I have literally hundreds of emojis in the Wayfair Slack workspace. That pink spinning monkey is one of them, and I'm very proud of it. While we're talking about emojis, let's go through some of the best. I've curated my favorites for you, KubeCon. Only the best for you. I know there are a lot of passionate emoji enthusiasts showing up to my Rust talk. You can see a baby Yoda, a nonsensical upside-down alien cowboy, a cat projecting such panic energy that you can feel it yourself, and the open source program office train, where you can string the emojis together and have the OSPO logo running across the screen. That last emoji is the OpenTelemetry logo, "OTel" for OpenTelemetry. I guess we should talk about OpenTelemetry and Tremor. What an amazing segue. Thanks. I have a couple of goals for this presentation that I want you to know before we jump in. I want you to walk away knowing what Tremor is used for, what OpenTelemetry is used for, why those two things exist, how they came to be, and how they can work together. So let's get into it and start with Tremor. Tremor is an early-stage event processing system for unstructured data with rich support for structural pattern matching, filtering, and transformation.
Very astute, I know. The best part of being a presenter is reading directly from the slide; it's very engaging for your audience. I ripped that definition directly from, oh, it's that way, docs.tremor.rs. Definitions are fantastic to start with because they show you how many words don't make sense together. I hope that when we come back to it, it shows just how much we can really understand in the time we have together. I think the best way to fit those ideas into people's heads is speaking in a more universal language: GIFs and pictures. So let's start with pictures. We can receive all kinds of messages from our infrastructure. We might want to know how much business we're doing in our storefront. We may want to know what network traffic is coming in and out of Nginx, whether we're being overloaded, or whether we're having a spike. We would absolutely want to know if our database is overloaded, needs some help, or just needs some clearing out, right? If it's stuck on a transaction, that's important. If we could magically translate all the ones and zeros into signals, we would be able to make simple decisions about how to maintain our critical services. Unfortunately, data never comes in that clearly. Sorting through signals and unstructured data normally feels a little more like this. We drown in the sheer volume we receive from our services. The data is unstructured, so we build Kibana dashboards associating it, we build Influx charts with lousy indexes for the data, and we do a lot of work just to try to understand what our applications are saying. And on top of having a regular firehose, we aren't able to design for system traffic during peak events like Cyber 5, the times when our businesses might do the most for us. So the times when we are most important to the business are the times when our observability may suffer the most. We kept seeing this happen at Wayfair, so we started developing Tremor.
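The parsing work described above, digging signal out of unstructured lines, can be sketched roughly in Python. Everything here is invented for illustration: the log lines, the regex, and the resulting payload shape are stand-ins, not anything Tremor or OTel actually emits.

```python
import re
from collections import Counter

# Hypothetical unstructured log lines, the kind a dashboard query has to fight with:
raw_logs = [
    "2024-05-01T12:00:00Z upstream timed out (504) while reading /cart",
    "2024-05-01T12:00:01Z upstream timed out (504) while reading /cart",
    "2024-05-01T12:00:02Z request ok (200) /home",
]

# Extract the status code from each line and aggregate, instead of
# forwarding every raw line downstream.
status = Counter()
for line in raw_logs:
    match = re.search(r"\((\d{3})\)", line)
    if match:
        status[match.group(1)] += 1

# A simplified counter payload built from the aggregation (not a real wire format):
signal = {"name": "http_responses_total", "kind": "counter",
          "points": [{"attributes": {"code": c}, "value": n} for c, n in status.items()]}
print(signal["points"][0])  # {'attributes': {'code': '504'}, 'value': 2}
```

Three raw lines collapse into two counter points, which is the whole trick: the downstream system ingests structure instead of prose.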
What Tremor does is, oh, by the way, before I start, you'll note that this slide is pictures and GIFs. Only the best for you, KubeCon; I'm going to be very indulgent here. To keep from drowning in our dashboards, we use Tremor to stifle the flow of input. Tremor allows us to rate limit our services and applications, and even rate limit specific types of requests that we might get from them. We may not need to log, for example, every gateway timeout. If we know there's a sufficient amount of them, we can cap it and say: there are enough here that we definitely need to look into why we're seeing so many problems. The rate limiter in this situation has to run pretty fast to stifle any meaningful flow, and luckily Tremor is really fast at processing signals. Fast enough to handle the Wayfair production load during peak events like Cyber 5. For anybody who's not in e-commerce, Cyber 5 is what we call the five days from Thanksgiving through Cyber Monday: five really big days for us. In one such Cyber 5, we found that Tremor was processing about five gigabytes per second. That may not feel like an overly impressive number, because what's the meaning of it if you don't understand the infrastructure running it? With enough machines, any application should be able to handle that. So I'm going to use the approximations I can share, and the GCP price calculator, to give a better indication of how much cost savings we saw with the same load when we switched to Tremor. Before we switched, we ran Logstash, and we had to keep about a hundred Logstash nodes running at a time to keep up with the pace of our infrastructure and the logs coming in. When you put those nodes, with the compute footprint on the slide, into GCP's price calculator, it comes out to forty thousand dollars a month.
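Tremor's actual rate limiter lives inside Tremor and is written in Rust; as a rough illustration of the idea, here is a minimal token-bucket sketch in Python. All the names and numbers are invented for illustration.

```python
import time

class TokenBucket:
    """Allow at most `rate` events per second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start with a full burst allowance
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens proportionally to the time elapsed since the last call.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1            # spend one token for this event
            return True
        return False                    # over the limit: drop (don't log) this event

# Flood the bucket with 10 back-to-back gateway-timeout events:
bucket = TokenBucket(rate=2, capacity=5)
results = [bucket.allow() for _ in range(10)]
print(results.count(True))  # 5 -- the burst passes, the rest are capped
```

Enough timeouts get through to tell you there is a problem; the other five would only have told you the same thing again.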
To understand and parse the signals we needed from our applications and infrastructure at Wayfair, you know, we would fork over for it. We would spend what we needed to make sure we could see what was happening. As the Tremor team came along and realized they could process much of the same load for less, the compute workload quickly moved over. Now we're processing a similar, if not literally the same, load with an incredibly small footprint comparatively. When we plug our current footprint into the GCP price calculator, we see less than a thousand dollars a month: 40 times less. A significant change, and it's done with a programmatic interface that we were able to build inside the company. If the cost savings and the speed aren't compelling enough, let's talk about the plug-and-play onramps and offramps that give you flexibility with Tremor. Tremor allows us to interpret the raw signals we get from the infrastructure, decorate and transform them as needed, and then ship them out to downstream consumers. The signals coming in, in plenty of structures and forms, are translated through Tremor into discernible metrics and logs in a way that doesn't overload our ingesting systems, meaning Kibana, Grafana, Kafka, and Elasticsearch. This allows us to make meaningful decisions about the state of our infrastructure, with observability, without paying a ridiculous premium. We can do this primarily because of all our ingestion and export engines: we can create onramps and offramps that are hybrid, that translate from one to another. If you need to do this in your own company, you can start with a basic HTTP or Kafka endpoint, structure the data how you want as it comes in, and send it right out. It's also built to be an extendable infrastructure, so Tremor has the ability to add more onramps and offramps dynamically. Very powerful, very cool stuff that we like to publicize.
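The onramp-transform-offramp shape described above can be sketched like this in Python. This is a conceptual sketch only: the payloads, field names, and functions are invented for illustration, not Tremor's actual API or configuration.

```python
import json

# Hypothetical onramp: yields raw payloads arriving in different shapes.
def onramp():
    yield '{"service": "storefront", "orders": 3}'   # JSON from an HTTP endpoint
    yield 'nginx status=502 path=/checkout'          # a plain-text log line

# Transform: normalize every payload into one common event shape.
def transform(raw: str) -> dict:
    try:
        body = json.loads(raw)
        return {"source": body.pop("service", "unknown"), "fields": body}
    except json.JSONDecodeError:
        # Fall back to key=value parsing for non-JSON lines.
        key_values = dict(kv.split("=", 1) for kv in raw.split() if "=" in kv)
        return {"source": "nginx", "fields": key_values}

# Offramp: here we just collect the events; a real offramp would write
# them to Kafka, Elasticsearch, an OTel collector, and so on.
events = [transform(raw) for raw in onramp()]
print(events[1]["fields"]["status"])  # 502
```

The point of the hybrid onramp/offramp idea is that only `transform` knows about payload shapes; swapping the source or the destination touches neither.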
So we have onramps and offramps, and we have these cost savings. I want to go back to the original definition and summarize everything we just talked about. Tremor is an early-stage event processing system for unstructured data with rich support for structural pattern matching, filtering, and transformation. Tremor can take a lot of data without breaking a sweat, and it can do this as a sidecar or as a dedicated application. We can use it to process signals before they go into other systems, or after they come out of other systems. We can structure the data, decorate it, shape it, do whatever we have to do to make it useful, and send it along to the downstream, or rate limit it and keep it to ourselves so we don't overwhelm our observability pipelines. A better way to say what we're doing with Tremor, right? It's better than the definition, for sure. Why am I talking about it? What does it have to do with OpenTelemetry? Tremor and OpenTelemetry were sandbox projects together in the CNCF. OpenTelemetry, or OTel, has graduated to an incubating project, but when your buddy's in the sandbox, you stick together. So Tremor has built functionality specifically to work with OpenTelemetry. Before we dig in, get it? Like sand? Digging? Anyway, before we dig in, we should walk through OpenTelemetry the same way we did with Tremor, so we can cover things that will be relevant to understanding how the two work together, as well as give an introduction for folks who may not be super familiar with it. Can I get a what-what for OpenTelemetry? I mean, come on. If you're at home, nobody's going to know. Give it a little what-what? Throw it in the chat. I can't tell if you did, because I'm on the recording, but hopefully I'm right there with you. Anyway, according to opentelemetry.io, OpenTelemetry is a collection of tools, APIs, and SDKs.
You can use it to instrument, generate, collect, and export telemetry data for analysis in order to understand your software's performance and behavior. Just like with Tremor, the definition is very helpful for laying the groundwork, but practically, what does this mean? What does it do? How does it work? What is OpenTelemetry? This time I decided to skip the pictures. On this slide is a glossary of terms related to OpenTelemetry, which should give you all the information you need to understand how it fits with Tremor. I think this slide is roughly 1,200 words, and an average adult reading at 300 words per minute should be able to fully read and digest it in less than five minutes. So we'll sit here for about five minutes while everybody reads the slide and learns the important terms about OTel. I'm going to take myself off to the side so that you can see. Yep, just, uh, you know, take a second, I'll be here. I'm... sorry, sorry, I thought I turned my ringtone off, hold on. Uh, seems important. Whoa. Okay, they don't like it? Well, how... I'll be... okay. So you have... got it. Yep. Really? Okay. Okay. No, nobody thinks I'm on the phone right now, I'm holding it weird. Okay, well, yeah, let's do that. Okay, love you too. Okay, bye. All right, put that on mute for real. Uh, say that was me from the future. He told me that this slide, and people reading it, was the worst part of the presentation. People were walking out, people were saying "I hate this," and it made a really bad reputation for me, which is what matters. So he time-traveled back and changed the slides just to make sure this presentation went on. Real hero. We won't get time travel now, because this presentation is going to be great. Now we have a good presentation, which is almost as good. Let's get going. OTel, or OpenTelemetry, is a step towards standardizing the way these signals pass from applications and infrastructure to metrics and logging tools. In an ideal world, where we already had people using OpenTelemetry, everything
would magically work. OTel hopes to encapsulate sufficiently common application and infrastructure signals in a way that can be standardized across old and new observability tools. As we work towards this utopia, where application services and frameworks support observability out of the box, we still have to use the client SDKs, libraries, and the collector. So let's go through the client libraries and SDKs first. By the time I've recorded this, OTel has made its way into just about every popular language, so many applications should be able to make use of these integrations, including some of the most popular frameworks, like Java's Spring and Python's Flask and Django. One can even use the built-in support for these frameworks: OTel will provide insight into your application signals right out of the box. We'll talk more about what those signals look like in just a second, but just know that if you're a developer who likes to kick back and have the framework do the work, that's awesome. If you prefer to be more hands-on, or you have needs that frameworks can't really provide for, then you can use the client libraries and write code to your heart's content. Let's take a look at how you might put this in a Java application. That... okay, I guess this is the whole slide. I thought there was going to be more to it, but okay. Java is a language I think most of us are at least familiar with running a program in. With some very simple changes to how you start your program, OTel can inject bytecode at runtime to trace important metrics and operations about your application and report them to the OpenTelemetry collector. It's as simple as pointing to a JAR. It doesn't get much better than that. But integrating code into the framework, come on, I bet that's harder. Let's see. Okay, there's not that much to it either. Custom logic aside, you just do a standard library import, which, specifically for
frameworks like Spring, automatically reports metrics and allows you as a developer to specify anything else you need. My viewpoint, as a developer who recently had to learn OTel for this presentation, is that it was a pretty simple process to dig into and work with. Having the option of a default configuration, with the library able to help later if I need more, is exactly what an observability platform should be doing. But I digress. We all understand the client library angle at this point, and we can start looking at how these metrics are collected and sent to a backend. This is the juicy bit: getting into the collector, the agent for OTel. The collector itself is not a backend; it's a translator. You send metrics into OpenTelemetry, and it sends those signals onward in a way that Prometheus, Jaeger, Fluent Bit, whatever else you might be using to trace your metrics, can understand. This might seem a little straightforward or facetious at first. I mean, couldn't we just export directly into Jaeger or Prometheus? The benefit of going to OTel first comes out a little more clearly when we consider it as a sidecar, as something that can be distributed, which you can't really do with Prometheus or Jaeger or any single observability backend. When you place agents around the infrastructure, with the standard library allowing the same signals to come from many applications, it's relatively easy to collect them and build a bigger picture. Instead of bespoke application code built into your backends that knows how to interact with Jaeger or Prometheus, these agents are more effective at gathering and forwarding metrics. As a result, exporting into the collector means you can feed multiple backends, or change backends on the fly as you need to in the collector installation, which lets you focus on discerning what's important and less on how to configure backends. So we've talked a lot about what we will be sending, but we haven't actually talked about how you
send it, what the signals look like. If we're going to know what the standardization is, we should probably know what it looks like. There are four main types of signals available in the OTel spec, so we'll go through them one at a time: starting with traces, then metrics, then logs, and we'll finish with baggage. Traces are composite data structures in OpenTelemetry. One trace contains a series of spans connected together. The root span is usually a client application request or something similar, and it tracks the full transaction, lumping in the timing of operations under that main span. This allows us to build a picture of the services we interact with when performing a single operation. If you make a request to a backend, you can trace it along the way using Zipkin, Prometheus, Jaeger; this stuff is built in there. Moving on to metrics, I think metrics are probably the most intuitive signal. They're split into a couple of categories. You can count things with a counter, which only ever goes up: how many requests you receive in an application, how many total bytes are computed, how many requests go to different versions of an app. You can measure things, like how a particular service is processing or how long it takes to send responses. And you can observe things, like how much CPU or RAM you might be using on a machine at a given time. Together, these operations make up much of what I think of as traditional observability. They exist in tools already built into operating systems and software, especially with IaaS providers, things that help you maintain your infrastructure. The definitions are more formalized now, but they existed before OpenTelemetry; they've just been converted into a form we can continue to standardize on. Logs: everybody loves logs. They allow us to print structured or unstructured output from our applications. This is an important topic, given that logs are the most human-readable and least machine-readable part of any observability
pipeline. We'll talk about this more when we think about how Tremor and OTel fit together. Baggage is a relatively new idea to me, and I think it's a newer idea in the observability ecosystem, given that it's only recently been put into OTel. Baggage allows us to associate metadata, like API tokens, with a request. It's helpful for indexing structured data and detecting relationships between problems, using the baggage on the data to identify where those problems came from. So that's all our signals. Let's go back to the definition. OpenTelemetry is a collection of tools, APIs, and SDKs. We can use it to instrument, generate, collect, and export telemetry data, signals, for analysis in order to understand software performance and behavior. We use OTel to see what's happening in our software and make decisions on that data. Again, we kind of hit it out of the park: much easier to understand than the definition from the site. But now that we've defined these things pretty well, how do they fit together? To answer that, I've stolen, borrowed, this image from opentelemetry.io; the link is on the slide so you can see where it came from. We can see the collector on the left. Oh, I'm in the way, aren't I? Let me put myself out of the way, there I go. You can see that the data on the left is put into a receiver, then a processing pipeline, and spit out the other end to an exporter. Right in the middle is where we spent most of the first part of this talk, talking about Tremor. Let's look at some practical examples of how we might do that. One practical case for Tremor at Wayfair is rate limiting. We know that a given stream might produce signals that are overwhelming, given their volume or the state of our infrastructure. Since Tremor can handle that extreme throughput and rate limit the application, we can just put up an onramp and an offramp for OTel, and then let OpenTelemetry agents figure out how to translate into Jaeger and Prometheus. We don't have to do extra work; we just have to have an
intermediary there. You could also treat it not only as the intermediary, but as something that helps you translate signals from OTel into other frameworks. Now, this isn't the main use case for Tremor, but there's an argument to be made that if you're working with something experimental, it's much easier to integrate with existing infrastructure like Tremor, where you can use OTel as an onramp to your existing installation and verify that it brings a lot of value, until you get the adoption you need to bring OpenTelemetry to scale at an enterprise company. Both of those examples had Tremor downstream from an OTel agent, but I want to be clear that you can use Tremor as an upstream piece of an OpenTelemetry agent as well. One thing OpenTelemetry states in its documentation is that it works best with structured data; it doesn't work great with unstructured data. So when you have unstructured logs that you expect it to structure and work with, it's going to be cumbersome. But if you use Tremor to shape that data, get what you need out of it, and then emit an OTel-compliant signal, counters, or even a log that's just easier to manage, then you save compute on the OTLP (OpenTelemetry Protocol) agent, rather than having the agent do rate limiting or deal with the scale of those logs. It's the same idea; it's just that you don't have to wait until after OpenTelemetry has the data. You can do this from the start as well. That's it. That's how those things fit together; that's how Tremor and OpenTelemetry work. It's all in the presentation, and I really appreciate you taking the time to listen. I appreciate you taking the time with me here, late on a Wednesday at KubeCon. I'm just going to go back and say, as a reminder, I'm Gary White. I work for Wayfair with the open source program office. You can learn more about us and our work at
Wayfair.github.io. If you're interested in Tremor and would like to get involved, you can also check out tremor.rs; we'll be happy to have you. I sincerely hope you've enjoyed this presentation. If you have any questions, don't hesitate to make use of the Q&A, or you can reach out to me after the fact at gwhite at Wayfair.com. Until then, enjoy the rest of KubeCon, and have a great evening.
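The trace-and-span structure described in the middle of the talk, a root span for a client request with child spans for the downstream calls it makes, can be sketched in plain Python. This is a simplification for illustration, not the OTel SDK's actual types: real span and trace IDs are random hex, and real spans carry attributes, status, and more.

```python
from dataclasses import dataclass, field
from typing import Optional
import itertools
import time

_ids = itertools.count(1)  # toy stand-in for random span IDs

@dataclass
class Span:
    name: str
    trace_id: int                      # shared by every span in one trace
    parent_id: Optional[int] = None    # None marks the root span
    span_id: int = field(default_factory=lambda: next(_ids))
    start: float = field(default_factory=time.monotonic)
    end: float = 0.0

    def finish(self):
        self.end = time.monotonic()    # record the operation's duration

# Root span for a client request, with a child span for the DB call it makes:
root = Span(name="GET /checkout", trace_id=7)
db = Span(name="SELECT orders", trace_id=root.trace_id, parent_id=root.span_id)
db.finish()
root.finish()
print(db.parent_id == root.span_id and db.trace_id == root.trace_id)  # True
```

The shared `trace_id` plus the `parent_id` links are what let a backend reassemble the spans into one transaction picture.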