Hey, welcome to my talk on how testability drives observability. I'm Mateus Aguilera, a software engineer at Kubeshop, working on Tracetest. I'm passionate about testing and observability, and here are my links for LinkedIn and GitHub; if you want to reach out with questions, I'll be around and we can talk.

So let's start with: what is observability? Observability is the ability to measure the internal state of a system by examining its outputs. A system is considered observable if its current state can be estimated using only information from outputs, namely sensor data. That's the definition from Splunk. So what does that mean? Well, we have sensor data, and the main pillars of observability are metrics, traces, and logs. In this talk, we're focusing on traces.

But what is a trace? When we talk about distributed tracing, it refers to methods of observing requests as they propagate through distributed systems. That means when you send a request to a system that has tracing enabled, it can give you a map of everything that was done within that request: if it had to reach a database, that will show up in the trace. A trace is a group of spans with parent-child relationships. And what is a span? A span is a data structure that contains a unique identifier and lets you attach data to it as key-value pairs. The nice thing about spans is that you attach information related to the operation being executed. So if you are executing an HTTP request, you would have HTTP information there: what was the status code of the response, what URL were we trying to reach, what was the method, what were the headers, and so on. Another thing is that a span has two timestamps, one for its start and one for its end, so for each span we know its duration.
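To make the span structure concrete, here is a minimal sketch in Python: a unique ID, key-value attributes, and two timestamps from which the duration falls out. This is an illustration only, not the OpenTelemetry data model or Tracetest's implementation; all names are invented.

```python
import time
import uuid

class Span:
    """Toy model of a tracing span: a unique identifier, key-value
    attributes, and two timestamps (start and end)."""
    def __init__(self, name):
        self.span_id = uuid.uuid4().hex[:16]  # unique identifier
        self.name = name
        self.attributes = {}                  # key-value data about the operation
        self.start_time = time.time()
        self.end_time = None

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def end(self):
        self.end_time = time.time()

    @property
    def duration(self):
        # because we have both timestamps, the span's duration
        # is the duration of the operation it represents
        return self.end_time - self.start_time

# an HTTP request span would carry HTTP information:
span = Span("GET /products")
span.set_attribute("http.status_code", 200)
span.set_attribute("http.method", "GET")
span.end()
print(f"span {span.span_id} took {span.duration:.6f}s")
```

Because the span wraps the operation, reading its duration answers "how long did this operation take" without any extra bookkeeping.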
The nice thing is that because the span is tied to an operation, the time of the span is basically the time of the operation: if you know the duration of the span, you know how long that operation took to run. So this is an example of a trace. What does a trace look like? I got this image from the internet, and the nice thing about traces is that even though I'm not sure which system this is, I can tell what's happening based on the trace alone. I know that someone is trying to purchase something using the order service. To do that, the order service needs to verify the user, and for that it uses the user service. After verifying the user, it updates the stock of the product, and for that it uses the stock service. After that, it tries to find an order inside MongoDB. The trace is incomplete, but you can still understand what's happening. That's a great way of understanding systems.

So let's talk about how we usually instrument our code. First of all, we start by developing the product: we're adding features, like user registration, and all those things. Then we deploy the application and figure out that we don't know what's happening in production. We need a way to observe it and see if everything is fine. So what do we do? We add instrumentation. Now we know what's happening: we have traces, metrics, and so on. We keep executing this loop, but sometimes we find bugs. And every time we find a bug, because we have instrumentation, and part of instrumentation's purpose is to help us understand what's happening, we use the telemetry to help us find the bug and try to fix it. But now we have two possibilities. Here, you find the bug and can solve it, because you have good telemetry and you knew exactly what was happening based on what the telemetry was telling you.
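The example trace described above can be modeled as a tree: spans with parent-child links, which is exactly what the waterfall view renders. This sketch uses the service names from the example, but the span shape and IDs are made up for illustration.

```python
# Toy reconstruction of the purchase trace: each span points at its
# parent, and rendering the tree recovers the waterfall view.
spans = [
    {"id": "1", "parent": None, "name": "POST /purchase (order service)"},
    {"id": "2", "parent": "1", "name": "verify user (user service)"},
    {"id": "3", "parent": "1", "name": "update stock (stock service)"},
    {"id": "4", "parent": "1", "name": "find order (MongoDB)"},
]

def render(spans, parent=None, depth=0):
    """Walk the parent-child relationships and indent children."""
    lines = []
    for s in spans:
        if s["parent"] == parent:
            lines.append("  " * depth + s["name"])
            lines.extend(render(spans, s["id"], depth + 1))
    return lines

print("\n".join(render(spans)))
```

Reading that printed tree top to bottom is how you "read" a trace from a system you have never seen before.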
In that case, that's awesome: your telemetry is in good shape, and you can go back to your cycle and keep adding features. But what happens if you don't find the problem based on the instrumentation? That's bad, because you invested time and resources into adding instrumentation and trying to get it right, but in the end it didn't help you find the bug. So you have to keep adding instrumentation and trying to improve it. And that's okay, because observability is hard; it's not easy to get right. Part of the problem is that adding instrumentation and the development cycle happen at different stages: we do not think about observability when we are designing or writing our code. That makes it hard to get the instrumentation right, because when you finish coding and start adding instrumentation, you look at your code and ask yourself, "what can be instrumented?" And that's a tricky question, because you end up trying to instrument everything you can. But should you? Just as over-logging is bad, over-instrumenting code is bad as well. You should be asking what you should instrument before instrumenting things.

The other thing that makes observability very hard to get right is that we cannot test it. We only find out that something is missing, or that the telemetry doesn't help at all, when we have a bug, try to figure out what's happening, and can't. That's the problem. So wouldn't it be nice to be able to test your telemetry before deploying to production? It would be great; you could solve a lot of problems that way. But now you have another type of test, and that's a burden: you have to maintain it on top of the contract tests, integration tests, end-to-end tests, and unit tests you already have. That's a lot of kinds of tests to maintain.
But wouldn't it be amazing if, rather than adding yet more tests just for your application telemetry, you could actually merge some of your tests together? So here I present to you: trace-based testing. You might be asking, what's that? Basically, trace-based testing is a test method that leverages your tracing data, making it accessible to your tests so you can use that information to assert the behavior of the system. Okay, but what does that mean, and how does it compare to existing end-to-end tests? Let's take a look.

Imagine that you have an end-to-end test. You have a test runner; it can be a simple function written in JavaScript, or an external test runner such as Testkube. The first step is to trigger your test, and that will trigger your API: we send a request to our purchase API, trying to buy some product. For that, the API will reach the database, do some checks, and then reach the product API, because the product must be available, right? And the DB will be accessed again. Next, we probably have a shipping API; we need to create an order for shipping. How do I do that? Probably I hit the database again, and since I probably don't ship packages myself, I have to reach an external service that I do not control: I hired a company to send the package for me. Great, it works. But now, how do I know that the test passed? I might get a 200 status code from the purchase API back in my test runner, but is that enough? We don't know how that communication works. Is it synchronous? Is it asynchronous? If it's asynchronous, that 200 status code means nothing to us, because the request might be accepted successfully while something executed in the background through a queue fails. That 200 is a false result for us. So to ensure that everything is working, we have to add complexity to our test.
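The async false positive described above can be shown in a few lines. This is a deliberately contrived sketch: the API, queue, and worker are stand-ins, and the failure is hard-coded to make the point.

```python
# Why a 200 from the API is not enough when work happens asynchronously:
# the endpoint enqueues a job and answers before the job runs.
import queue

jobs = queue.Queue()

def purchase_api(product_id):
    jobs.put(product_id)  # enqueue background work
    return 200            # responds before the work actually runs

def background_worker():
    product_id = jobs.get()
    # simulated downstream failure, discovered only later
    raise RuntimeError(f"shipping service rejected order for {product_id}")

status = purchase_api("abc-123")
assert status == 200      # the naive end-to-end assertion passes...
try:
    background_worker()   # ...even though the real work fails
    worker_ok = True
except RuntimeError:
    worker_ok = False
print(status, worker_ok)  # 200 False: the test gave a false positive
```

A test that only checks the synchronous response can never see `worker_ok`, which is exactly the gap the talk is describing.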
We end up with a lot of secrets and a lot of setup to reach each one of those dependencies and check that they're right: did the product stock change? Did I get a shipping order? Did the company that will ship the product get my order? That's the kind of thing we try to test, because it's the only way of ensuring that everything's working, but that's a lot of complexity. And it's hard. Imagine that you just got hired and you have to write an end-to-end test. That's nearly impossible, because you don't have enough context: you might be on the team that maintains the purchase API, but not know a single thing about shipping or products. To be able to write those kinds of tests, you have to be an expert in almost everything, or you have to have a team of experts writing those end-to-end tests. That's not good; it concentrates knowledge.

Okay, but how does trace-based testing compare to end-to-end tests? Let's take a look. We still have a test runner, same thing: it can be a function or an external runner. And we trigger the API, which triggers a database operation. The difference is that now, every time an operation runs, telemetry is sent to my trace backend: I have a new trace, with spans going from my application to my trace backend. The flow continues: the product API is hit, the database is queried, and new spans go to the trace backend. Same thing for shipping. We do everything, and everything works. Nice. Now we kind of know what's happening, because we have the trace, the blueprint of the entire operation. The last part of trace-based testing is fetching the trace and executing the assertions. That's exactly what we do: we get the trace from the tracing backend and run all our assertions against it.
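That last step, fetch the trace and assert over it, is the core loop of trace-based testing, and it can be sketched roughly like this. The backend query is stubbed out, the span names are invented, and real tools like Tracetest have much richer selectors; this only shows the shape of the idea.

```python
# Sketch of the trace-based testing loop: fetch the trace produced by a
# triggered request, then assert over its spans instead of over each
# downstream service directly.
def fetch_trace(trace_id):
    """Stub standing in for a query against a real tracing backend."""
    return [
        {"name": "POST /purchase", "attributes": {"http.status_code": 200}},
        {"name": "INSERT shipping_order", "attributes": {"db.system": "postgres"}},
        {"name": "ship package", "attributes": {"http.status_code": 201}},
    ]

def run_assertions(trace, assertions):
    """Each assertion names a span and a predicate over that span."""
    failures = []
    for span_name, check in assertions:
        matching = [s for s in trace if s["name"] == span_name]
        if not matching or not all(check(s) for s in matching):
            failures.append(span_name)
    return failures

trace = fetch_trace("hypothetical-trace-id")
failures = run_assertions(trace, [
    ("POST /purchase", lambda s: s["attributes"]["http.status_code"] == 200),
    ("INSERT shipping_order", lambda s: s["attributes"]["db.system"] == "postgres"),
])
print(failures)  # [] -> deep behavior verified from the trace alone
```

Note that nothing here connects to the database or the shipping service: the trace carries the evidence, so the test needs no secrets for any dependency.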
Okay, but what's the advantage of using that over my existing end-to-end tests? First of all, you test your telemetry. That means that when you have a bug in production, you will not be surprised when you look at your telemetry: you know what it should look like. So again, how does it work? The runner triggers the system, the system does some work and sends traces, and we get the trace back and execute the assertions. Every time you execute that operation in production, we know the trace should look like that, so we have a way of checking it.

It also creates a feedback loop that pushes you to write better telemetry so you can write better tests. What does that mean? Imagine that you're trying to test something using the trace. If your trace doesn't have enough information, you cannot test everything you want, so you have to improve your telemetry, otherwise you have bad tests. But as soon as you improve your telemetry, you can improve your tests, and by improving your tests, you can improve your telemetry even more. It's a loop. It reminds me of when you start learning about TDD and you start writing tests first because you want to think about how to design your code in a testable way. It's the same thing: we can apply TDD here with this strategy. For example, say you have an application with no telemetry at all, and you want to start writing tests that use telemetry. What would you do? First, you write a test: you expect to have a span named getDataFromDatabase. Nice, but it's failing, because you don't have telemetry in place. So what do you do? You add telemetry to your code, and now your test is passing. But it's probably not the best code you ever wrote, so you should refactor it before merging. Then we can repeat the cycle, and we can write tests for things we don't know about yet.
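That red-green telemetry loop might look like this minimal sketch. The in-memory span list stands in for a real exporter and backend, and the function names are made up; the point is only that the span assertion fails first and passes after instrumenting.

```python
# Red-green loop for telemetry: the test expects a span named
# getDataFromDatabase; it fails until the code is instrumented.
exported_spans = []  # stand-in for a tracing backend

def get_user_uninstrumented(user_id):
    return {"id": user_id}  # no telemetry emitted

def get_user_instrumented(user_id):
    # the telemetry added to make the test pass
    exported_spans.append({"name": "getDataFromDatabase"})
    return {"id": user_id}

def has_span(name):
    return any(s["name"] == name for s in exported_spans)

get_user_uninstrumented(1)
print(has_span("getDataFromDatabase"))  # False: red, telemetry is missing
get_user_instrumented(1)
print(has_span("getDataFromDatabase"))  # True: green after instrumenting
```

As with code TDD, the failing assertion is the specification: it says exactly which span the telemetry must produce before any instrumentation is written.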
We don't have telemetry for that yet, but we have an understanding of how we want to see it in the trace. So we write the spec for the telemetry and then we implement it. That's a nice way of adding telemetry, because instead of looking at your code and thinking, "what can I instrument here?", you ask, "what should I care about in this function?" Say I'm testing a user registration: what should I care about in this flow? As soon as you write a test about it, you have a plan. You've just defined how you want your traces to look, and you only have to implement that. You don't have to wonder what to implement; you have a specification for it. That helps a lot.

It also enables deep assertions on test results without accessing internal dependencies such as microservices or databases. As we saw before, when we were writing our end-to-end test, we had to set up the test runner to access the external dependencies and other microservices just to make sure everything worked. That's a lot of complexity; just thinking about it is overwhelming, even for someone who's been working at a company for a while. Imagine that you're just starting: it's impossible. You'd have to be an expert to understand all the requirements and all the relationships between the services. That's very, very hard. With trace-based testing, you have the trace, the blueprint of your whole application, and you can just look at that blueprint and understand what's happening. You don't have to be an expert who knows that the product API creates orders; does that even make sense? Probably not, but who would guess? Maybe only the people who work on the product API. The fact that you don't need an understanding of the whole application to write end-to-end tests is amazing. You can start using trace-based testing to understand your system.
And as you go, you can add more and more details to your tests. You don't have to create a huge test at first; you can just use it for exploration. I think that's one of the most powerful things about trace-based testing.

Okay, now an example. Take a look at this trace. This is a trace from our demo API, and even though you don't have any context on how it's implemented, you can understand what's happening based on the trace. We have the trigger from the trace-based testing tool. Then we see that it triggered an HTTP request that gets the list of Pokemons, and to do that, it executes a findManyPokemon and then a countPokemon in the database; both of them generate SELECT statements. So I know that when I trigger this endpoint, I'll get two SELECT statements: one for finding all the Pokemons and one for counting them. I have no idea what the code looks like, but I already know how it works. That's the power of trace-based testing.

Another thing is that traces are data. If you generate the telemetry the same way across your environments, you can run the exact same test everywhere. Say you have to test your createUser endpoint: in production, staging, and dev, it works the same. You might use a mock in the dev environment, but the telemetry should be the same, and because the telemetry is the same, the trace will be the same as well. So you can use the same test, because we do not rely on connecting to different services or databases, and you don't have to care about secrets. You only care about the data, and we are just testing that data. So you can reuse tests across multiple environments. I think that's very powerful.

About the demo that we're going to show now: I'm going to use Tracetest to execute it. It is an API to manage Pokemons, and it has dependencies on Postgres, Redis, RabbitMQ, and PokeAPI,
a public API that returns Pokemon information based on a Pokemon ID. As for Tracetest: it's open source, MIT-licensed. It has a web UI that allows you to create and manage tests, so if you're not a fan of CLIs and test files, you can use the UI to create your tests, just as well as with the CLI. Then there are the CLI and test files, which you can use to store your tests inside your Git repository, and because you have a CLI and test files, we support CI/CD. A nice thing about Tracetest is that you can start with the UI and migrate to the CLI with no effort, because the UI can export tests as test files for you. So you can write all your tests in the UI and then just export them. That's great.

So let's go to the demo. We are going to try to achieve three things, and for that we're going to use the Pokemon API demo that we have. First, we're going to write a test for the Pokemon import endpoint. That endpoint is already instrumented, so let's check how that works. I'm already running the API and Tracetest locally, so I'm going to create a new test. The API is an HTTP API, so I can just create an HTTP request. The name is "Import Pokemon", and the description, "import a Pokemon". Nice. Now I can fill in the details of the request: it's a POST request; I'm running the Pokemon API on port 2001, and the path is /pokemon/import. There is no authentication, and the content type is JSON. In the body we need an ID; let's choose a Pokemon, 25, I think that's Pikachu, and create it. As soon as I create it, Tracetest will send the request to the API and wait for the traces. As soon as the traces are complete, it shows them to us so we can start writing our assertions. Here we have the whole trace; we can inspect it and zoom around. So that's the trace, nice. Let's try to write a test for it.
For example, as in every integration test, I expect the response to have HTTP status 200, so let's assert that: http.status_code equals 200. That's an assertion. Here we have a selector: a selector is a way of choosing which spans the assertions will run against. Since this is a very specific selector, we get just one span out of it. Let's keep it like that: it's the POST /pokemon/import span, status code equals 200, exactly what we want. The value field is filled in for us, because Tracetest auto-completes using the values we have in the trace. Let's add the assertion, and we can see that it passes. Amazing.

Let's add more assertions. I know that there's another HTTP call here: the request we send to retrieve the data for that Pokemon. What if we check that one as well? Same thing: status code equals 200. Apply the change, and it's passing. But we have a pattern here: we are checking the status code of all the HTTP spans. So why don't we write a single assertion whose selector matches on the tracetest.span.type attribute, which I want to be "http"? You can see that gives us two spans: the POST, which is the triggering span, and the other one, because both are HTTP, and I can check both for the 200. That's great, because now one assertion covers both, so let's delete this one and this one. Oh, okay, I had edited that one, which is why it changed, but that's fine; we now have a better assertion. Let's publish it. The other ones disappeared because I deleted them, so we're left with this one assertion that matches two spans and ensures that both have HTTP status 200.

Another thing a lot of people like to do: okay, I have this trace and I want to check asynchronous behavior; how would I do that? Let's see.
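As an aside, the span-type selector just used can be sketched roughly like this. The span dictionaries are invented for illustration; the attribute name tracetest.span.type comes from the demo, but real Tracetest selectors have their own query syntax.

```python
# Sketch of a selector: filter spans by attribute criteria, then run one
# assertion against every span the selector matched.
trace = [
    {"tracetest.span.type": "http", "name": "POST /pokemon/import", "http.status_code": 200},
    {"tracetest.span.type": "database", "name": "SELECT pokemon"},
    {"tracetest.span.type": "http", "name": "GET pokeapi", "http.status_code": 200},
]

def select(trace, criteria):
    """Return all spans whose attributes match every criterion."""
    return [s for s in trace
            if all(s.get(key) == value for key, value in criteria.items())]

selected = select(trace, {"tracetest.span.type": "http"})
print(len(selected))  # 2: both HTTP spans match the generic selector
print(all(s["http.status_code"] == 200 for s in selected))  # True: one assertion covers both
```

A narrow selector (e.g. matching on the span name) pins the assertion to one span; a broad one like this applies the same rule to every span of a kind, which is what removed the duplication in the demo.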
One thing we know here is that the Pokemon API sends a message to a message queue, and that message is then consumed and the Pokemon is imported: the consumer gets the Pokemon from PokeAPI and saves it into the database. So let's check the duration of that span. In Tracetest, we have attributes such as tracetest.span.duration, which measures how long the span lasted. Let's add an assertion for that. This is a long selector; we can simplify it, because we know the span name is unique within this specific trace, so we can just use the span name as the selector. Nice. I want to make sure that the duration is less than, say, one second. Nice, it passes: I can see that 596 milliseconds is less than one second, which is why it passed. I can publish the test, and now we have v3: v2 contained the two HTTP assertions, and v3 contains this one. We could assert other things as well, like how long database operations take to run.

Now that we've created the first assertions, we can run the test, and the assertions will be executed right after the triggering transaction finishes. Right now, Tracetest is sending a request to the service and waiting for the traces. Okay, we got the trace back, and if you look here, you can see that the assertions were executed and everything passed: the status code is 200, the message from the queue was processed in about 500 milliseconds, so under one second, and the other HTTP request was successful as well. That's great.

Now let's take a look at the import Pokemon span. If you click "all", you'll see all the attributes of that span, and if you notice, most of them are the defaults from OpenTelemetry; we basically just have a name and a service name.
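The duration assertion used a moment ago (tracetest.span.duration less than "1s") can be sketched like this. Storing durations in milliseconds and the string-threshold parsing are assumptions of this toy model, not Tracetest's actual internals; the 596 ms value mirrors the demo.

```python
# Sketch of a duration assertion: compare a span's recorded duration
# against a human-readable threshold like "1s" or "500ms".
def parse_threshold_ms(text):
    """Parse a simple duration string into milliseconds."""
    if text.endswith("ms"):
        return float(text[:-2])
    if text.endswith("s"):
        return float(text[:-1]) * 1000
    raise ValueError(f"unsupported unit in {text!r}")

# the queue-consumer span from the demo took 596 ms
span = {"name": "consume message", "tracetest.span.duration": 596}

passed = span["tracetest.span.duration"] < parse_threshold_ms("1s")
print(passed)  # True: 596 ms is less than one second
```

Because the duration lives on the span as plain data, latency budgets for async work become ordinary assertions, no stopwatch in the test runner required.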
And that's bad, because it means this span doesn't contain any information at all beyond the basics. We should fix that. Imagine that we're debugging something and for some reason we don't get the rest of the trace; we wouldn't even know which Pokemon we were trying to import. So we're going to write a test for the attribute pokemon.id, and for that I'm going to use the test definition file. If you go here, to the test definition, you can see how a Tracetest test is written as a file. I'll copy this file and simplify it a bit. Now that we have the test imported, let's change it: the selector selects the span named "import pokemon", and we want to make sure it contains pokemon.id equals 25, because that's the value we're using in the body. Okay, let's run that test from the CLI, passing the definition file and waiting for the results. Let's see what happens; we can check the result in the UI as well. If I refresh the page, we see there's a new test run. Let's wait. Okay, we have a failing test, because the "import pokemon" span should have pokemon.id equals 25, but the actual value is empty. That means our test works; the only thing left is to fix the application to satisfy the test. Let's do that right now. Let's find the import code: here we have it, where we create the "import pokemon" span, and we have the Pokemon ID available. So let's set an attribute, call it pokemon.id, with the Pokemon ID as the value. With that, the test should pass. Let's stop, rebuild the application, and rerun the test. Waiting for the result... okay, now we have no failing tests, because "import pokemon" now has the ID 25, and if we go to that span, we can see the ID there.
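The pokemon.id fix just shown can be condensed into a toy before-and-after. The span shape and function names are illustrative only; in the real demo the fix was a set-attribute call on the actual tracing span.

```python
# Sketch of the attribute red-green cycle: the test asserts the import
# span carries pokemon.id = 25; it fails until the attribute is set.
def import_pokemon(pokemon_id, instrumented):
    span = {"name": "import pokemon", "attributes": {}}
    if instrumented:
        # the one-line fix from the demo, expressed in this toy model
        span["attributes"]["pokemon.id"] = pokemon_id
    return span

def assert_attribute(span, key, expected):
    return span["attributes"].get(key) == expected

before = import_pokemon(25, instrumented=False)
after = import_pokemon(25, instrumented=True)
print(assert_attribute(before, "pokemon.id", 25))  # False: the test fails first
print(assert_attribute(after, "pokemon.id", 25))   # True: telemetry now covers it
```

The failing run is the valuable part: it proved the span was empty before anyone hit a production incident and discovered it the hard way.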
So this is a nice way of improving your telemetry: before you even touch the code, you create a plan, see what's missing, create a test to cover that scenario, and then write the code for that telemetry. One thing worth mentioning: we use Tracetest to test Tracetest, and when we were adding telemetry to Tracetest, we did it by writing tests first. Everything was failing, and then we started instrumenting the API. Doing it that way made the process much easier. For example, when we were building this Pokemon API, we first tried to build the API and then add all the instrumentation without any plan, just looking at the code and trying to instrument it. But it wasn't clear what the application was doing; the trace was hard to read. As soon as we stepped back and thought about what the instrumentation should look like, it became much easier, and now we have instrumentation that makes things clear. That's how you can use Tracetest, or any other trace-based testing tool, to guide your instrumentation.

That's all for the demo. If you want to know more about Tracetest, here are the main links: you can check the code and open issues or PRs in our GitHub repository, and if you have any questions and want to discuss with the team, we have a support channel you can join; any feedback is more than welcome. And now I'll save some time for questions. Have a great day.