Hey everybody. How's it going? Good to see you all. How's the conference going? Good, good. We're going to talk about how testability drives observability today. It'll be a lot about OpenTelemetry, a lot about tracing, and how to instrument distributed applications. A little bit about me: that's what I like to do. I like to be on the river fishing. I do catch and release, so that fish lived. A lot of fun. I'm from Memphis, Tennessee, and I've started a couple of startups before. I started one in 1999, so I'm old as anything, and it was in the real estate industry. The 2008 real estate crisis hit, and I started my second business, which was cross-browser testing and which, by the time I sold it, had 4,500 customers around the world. Our competitors were Sauce Labs and BrowserStack. From there, I went to work at SmartBear, who acquired the company, and who are a leader in testing software, with tools like SoapUI, Swagger, and other API-related tools. I'm a long-time technology developer; I was in the last semester at my local college to use punch cards, so once again, old as hell. So, agenda. We're going to level set on observability, OpenTelemetry, and distributed tracing. Then we're going to look at the typical adoption lifecycle as you start to instrument your app: what does that process look like? We're going to introduce a new concept called trace-based testing, which leverages the data in an OpenTelemetry trace to enable testing. And then we're going to show how testing with those traces makes for better observability, so it helps the whole cycle. If you're basing your tests on your traces, you're going to find that when it's two o'clock in the morning and the system is breaking and you look at your observability tool, the quality of the data is going to be much better.
And last, we'll wrap up and go through some questions. All right. So what's observability? Observability allows you to look at an application and tell what's going on without pulling out the code and without adding more log statements to it. There are a couple of different definitions. I like this one: the more observable a system, the more quickly and accurately you can navigate from an identified performance problem to its root cause. And this is key: without adding additional testing or coding. So do you ever actually have a perfectly observable system? I doubt it. There's probably a lot of cost involved in that, et cetera. But you can get close, and the goal of this talk is how you can have a system that gets closer to being observable. All right. Another definition of observability is a little more technical: the ability to measure the internal states of a system by examining its outputs. A system is considered observable if its state can be estimated using only information from the outputs, i.e., you don't have to go add code or add new logs. The system is providing the information you need to tell what's going on. Typically, when we start talking about the observability space, there are three signals that people look at. One is metrics, so CPU usage, et cetera, time series data. One is the oldest form of observability, logs. We've done logs since I've been writing code. And lastly there are traces, which are newer and super powerful, and I'm really excited about them. At my last business, we had racks filled with phones, Androids, iPhones, we had Mac minis, we had VMware. So we had lots of different languages, lots of different systems. When you did an API call to look at an iPhone in a live test, the request went through like 12 different processes to get there. And when something broke, we would have 12 different screens up, and we'd be looking at logs from all of them, which you can do in a development environment.
There's no way you could do that in a production environment; there's just too much data streaming. So we implemented our own version of tracing. Now there are standards around tracing to enable it, and they would have helped us many a time. So what's the standard? OpenTelemetry. Who's heard of the OpenTelemetry project? Okay, great. Tons of people. That's great. So it is a standards-based approach, and it covers the three signals: traces, metrics, and logs. It is vendor-neutral, and the industry support for it is really outstanding. I threw this slide together, it's kind of cheesy, but I went through some names: Google Cloud, AWS, Microsoft, Sumo Logic, Splunk, New Relic, Dynatrace. If you go through all these sites and look at what they're doing, they're all supporting OpenTelemetry tracing. So it's really expected in the industry now. I think it's a good thing for everybody. As a customer, it's better to implement with the open standard, and that way if you want to change vendors, you can. From the vendor point of view, even for large companies like a Datadog or a Lightstep, it's hard to support everything, to have tracing work on every version of MySQL, every message queuing system, every language, every version of every language. That's a lot of load for any one company to take. With the open source approach, everybody's contributing to managing that load of work. So we're going to drill down and focus specifically on what distributed tracing is. Distributed tracing refers to methods of observing requests as they propagate through distributed systems. To spell that out a little more: if I have five services and an API is called, and the request goes from one to another, to another, to another, to another, distributed tracing provides you visibility for that one call across all five of those services. So here is a Jaeger dashboard. Who's used Jaeger before? Okay. And I'm going to flip to software for a minute.
Going live with software is always scary. So this is a transaction we're going to deal with a lot today in the examples: it is a POST to a Pokemon import endpoint. We built a little microservice app based on the Pokemon API, and this is Jaeger showing a trace from it. In a trace, the different actions that happen as we go through are referred to as spans. And particular spans are going to have tags here, but we call them attributes. These are details about this specific action. What was it? Okay, it was a database call, we got the database response, the operation was create, and there's the type of database. So we get this detailed information about it. We'll also see, if we scroll down, there's a span ID on the right. Every span has a unique ID, and every trace has a unique ID. And this covers it: a trace is a group of spans with parent-child relationships, and that's why you get the fan diagram. Each span has a unique identifier, and the spans contain information about the operations. So that was just a level set on what OpenTelemetry is, what tracing is, what observability is. Now we're going to talk a little bit about the lifecycle of adding instrumentation to your app. So you develop a product. Maybe you don't have tracing or OpenTelemetry implemented, so for the first time you're adding some instrumentation. Time goes by and you have a production problem. You have a bug. You bring up your trace and you look at it. Did it help you solve the bug? If it did, awesome. I've got a developer I'm working with, and he uses the word awesome, and he puts passion into it. So you'll hear "awesome" throughout this presentation. So it's like, awesome, it worked. But what if it didn't? What if you looked at the trace and you're like, I still can't tell what's going on. I need to add logs. I need to change my telemetry. Oh no.
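Before going further into the lifecycle, the trace model just described can be sketched as plain data. This is a minimal illustration of the concepts, not the OpenTelemetry SDK; the span names and attributes are invented to mirror the Pokemon example.

```python
import uuid
from collections import defaultdict

def new_span(trace_id, name, parent_id=None, attributes=None):
    # Every span shares the trace's ID and gets its own unique span ID.
    return {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        "name": name,
        "attributes": attributes or {},
    }

trace_id = uuid.uuid4().hex
root = new_span(trace_id, "POST /pokemon/import")
queue = new_span(trace_id, "queue message", root["span_id"])
db = new_span(trace_id, "create pokemon", queue["span_id"],
              {"db.system": "postgres", "db.operation": "create"})

# Rebuild the parent/child "fan diagram" a tool like Jaeger draws
# from the flat list of spans.
children = defaultdict(list)
for span in (root, queue, db):
    children[span["parent_id"]].append(span)

def print_tree(parent_id=None, depth=0):
    for span in children[parent_id]:
        print("  " * depth + span["name"])
        print_tree(span["span_id"], depth + 1)

print_tree()
```

The point is only that a trace is nothing magic: one shared trace ID, unique span IDs, parent pointers, and attribute maps.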
We need to go back and add instrumentation. And this is a cycle, and I don't think there's a way to totally stop it. But I think there are ways to do it better, to instrument better on the front end. This process of continually going through it, a bug, it didn't work, add some stuff, another bug, it didn't work, kind of reminds me of a movie I watched: Live. Die. Repeat. It's kind of a brute-force attack. We'll just keep trying day after day until we kill the creature that's trying to take over the world. So observability is hard. It's not a natural part of the normal development cycle. I think over time it becomes one, but initially it's kind of a bolt-on. And the only time we know something's wrong is when we need it most, and we need accurate information. You might have a trace that might show value, but did you actually test the trace? Do you actually know the values that are in it are proper and working? So wouldn't it be nice to be able to test your telemetry before you deploy it to production? And this is what gets me excited: wouldn't it be amazing if we could do more than just test your application telemetry? Who here has implemented OpenTelemetry tracing in a production environment? Yeah, great. Tracetest, which is a product that I'm a founder of and that we'll be using in the examples, we use Tracetest to test Tracetest itself. It's hard to implement tracing, but it's cool when you can get other benefits out of that work, and we'll talk about that in a second. Okay, immediately, in fact. So I'm going to introduce the concept of trace-based testing and the value of it, what it is, what's interesting about it. Trace-based testing is a testing method that leverages your current tracing data, which you spent a lot of work getting into your system, and makes it accessible to your tests. It then uses that information to assert against the behavior of the system under test. So I'll give you an example. Who's done some form of API testing in their career? Okay.
So what are the parts of it? Usually you trigger an API call and you check the response. You assert against the response. If I were naming that, I would have called it response-based testing. With trace-based testing, you trigger an API call or a gRPC call or whatever, and you test against the response and the full trace of what happened. So it's a much deeper view of what happened in the system. It introduces some interesting concepts. With a response-based testing system, I don't have to ask you which thing you want to test against; I only have to ask which attribute you want to check, say the HTTP status code. That's the attribute, but I don't have to specify that it's on the response. It's always the response. With trace-based testing, you have to tell it: these are the spans that I'm interested in, and then you tell it, okay, in those spans, check this attribute. All right. So let's compare this to a traditional end-to-end test. In a traditional end-to-end test, you have a system, say three microservices, one talking to an external service, some talking to a database, and your test runner needs to do calls out to these different systems to gather the information, to get visibility into what happened when you did an API call. So basically, each time you write a test, you write your own instrumentation for these. With a trace-based test, your system's already instrumented. All these microservices are already sending information to your tracing back end. All you do is put in place a test runner that can initiate a call, and it needs to have access to the finished traces. So it executes a call, it orchestrates with the tracing back end to get the full trace, and then it runs assertions against the response and the trace. All right, so what are some of the advantages of trace-based testing? We'll show this with live software in just a second.
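To make the contrast just described concrete, here's a rough Python sketch of the two styles, with a hand-built response and trace standing in for real test data; the span names and attribute keys are invented for illustration.

```python
# Hypothetical result of triggering one POST /pokemon/import call:
# the response, plus every span recorded as the request propagated.
response = {"status_code": 200}
trace = [
    {"name": "POST /pokemon/import",
     "attributes": {"http.status_code": 200}},
    {"name": "GET pokeapi.co/api/v2/pokemon/52",
     "attributes": {"http.status_code": 200}},
    {"name": "create pokemon",
     "attributes": {"db.system": "postgres"}},
]

# Response-based testing: one implicit target, always the response.
assert response["status_code"] == 200

# Trace-based testing: first say which spans you're interested in
# (the selector), then assert an attribute on every span it matched.
def select(spans, attr_key):
    return [s for s in spans if attr_key in s["attributes"]]

http_spans = select(trace, "http.status_code")
assert len(http_spans) == 2
assert all(s["attributes"]["http.status_code"] == 200 for s in http_spans)
```

Same trigger, but the trace-based assertions reach spans deep inside the system that the response alone never exposes.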
You can test your telemetry before deploying changes to production. Who tests their code when they write code? Okay. Who tests their telemetry? That's good. I'd like to talk to you about how you're doing that; that would be interesting. But how would I put it? What's everybody's track record on untested code working properly? It doesn't. More times than not, it doesn't. So testing is good. And this is, I think, a really cool thing: you create a feedback loop where you improve your tests, and because the tests are based on your telemetry, by its nature your telemetry ends up having the stuff in it that's actually important, because you only write tests on stuff that's important. So your telemetry gets better, and as you refactor and add test cases, your telemetry continually gets better and your tests get better. And I think we kind of addressed this on the previous slide: it is hard to add all this instrumentation in a traditional end-to-end test. With trace-based testing, you rely on the work you've already done. You've already instrumented for OpenTelemetry tracing, so you get to leverage that work. And lastly, if I show you a trace, which I can't do right now with the screen dead, but if I were showing you a trace and I'm new to the organization, a QA person you brought in, I can look at that picture and go, all right, you're getting a Pokemon, and it looks like you do two calls out to a database. I can figure out what's going on. People are good at looking at pictures or diagrams and understanding what's going on. So that's a great way to write tests. If I can do it graphically against a picture, it's much easier than, hey, we've got 20 microservices and there might be some old documentation somewhere that describes how they work together. Good luck. So this slide should really just say "demo" instead of "show how testing with traces and TDD impacts instrumentation."
As we go through this, we're going to end up using test-driven development. So I'll write my test first, then I will implement code changes, and then we'll run the test. We'll run the test with it failing first, then we'll change the code, and we'll see that the test passes. I'm going to start the demo off, though, with just the basics: this is how you build a normal trace-based test. All right, the system we're going to show is a very important system. It's the Pokemon microservice app, otherwise known as the PMA. It's a little demo app that we have out on GitHub. It's got two microservices: one is an API front end and one is a worker process, with a RabbitMQ message queue right in the middle. So an API call comes in, a message gets put on RabbitMQ, and then the worker pulls it off. After the worker gets the data, it accesses an external service, pokeapi.co, and does a call out to it to get a Pokemon, and then it writes that to a database. So a few pieces, not as complicated as most microservice systems, but enough complexity to demonstrate the use case. We're going to be using Tracetest. I'll try not to make this the commercial. Tracetest is an open source project, MIT licensed. It gives you a web UI to create and manage tests. It's got a command line; we will create tests both with the UI and with the command line. And with the command line, you can put these in CI/CD processes and run them. All right, let's take a look. Let's start with the documentation on the endpoint we're going to look at so that we understand fully what's happening in the code. This is a flow diagram for the first microservice. We're going to be testing the Pokemon import, which is a POST. There's a validation step that happens, and then we put it on a message queue. The worker app pulls the message off the queue and does a call out to the PokeAPI website, so that's the external call. Once it's got the information, it saves the Pokemon in a database.
So fairly straightforward flow. I'm going to start by creating a test. We've got different ways to trigger tests: we can do it via HTTP requests, gRPC, some other methods. I'm going to pick HTTP. I'm going to fill in some details that I have to type. We're going to do the import Pokemon call, but against my local environment, so I'll paste that in. And we're going to get the Pokemon with the ID of 52. We'll create the test, and what the system did just now was run the POST, and you saw it came back super quick, in seven milliseconds. That's because the API part of the process just gets the request, throws it on the message queue, and returns immediately with a 200. The real processing happens in the background. If we look at the trace, we can see that we did the import, the validation, and the put on the message queue. We did an import of the Pokemon where we reached out to the external service, and then once we got the data back, we inserted it in the database. So that's the flow we're going to be working with. Now I'm going to use trace-based testing to ask some questions and build a test on this. I'm going to go to the top. I'm going to say, when we do the POST, I'm going to look at some of the attributes, and I expect the response code to be 200. So I'm going to create a test spec to assert that against this specific span, and this selector is tuned to get this one span. If I were to change it, let's say I wanted to apply this test against all the HTTP spans, we now see that it selected this one and this one. So the selector lets you say which parts of the system the check should be applied against. And then we said we want to check that the HTTP status equals 200. We'll save that, and we can look at the test and see that we expected 200 and we got 200. So that would be a black-box test: I checked the incoming call and said it should return 200. What's cool about trace-based testing is that you can go deeper than that.
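A test spec here boils down to a selector plus an assertion, evaluated against the finished trace. Here's a rough sketch of that idea in Python, with invented span data; Tracetest's actual selector language is richer than this, and the `tracetest.span.type` key is used only as an illustrative attribute.

```python
# Invented spans standing in for the trace Tracetest would collect.
trace = [
    {"name": "POST /pokemon/import",
     "attributes": {"tracetest.span.type": "http", "http.status_code": 200},
     "duration_ms": 7},
    {"name": "get pokemon",
     "attributes": {"tracetest.span.type": "database"}, "duration_ms": 12},
    {"name": "create pokemon",
     "attributes": {"tracetest.span.type": "database"}, "duration_ms": 48},
]

def run_spec(trace, selector, check):
    """Apply `check` to every span matched by `selector`; pass only if all pass."""
    matched = [s for s in trace if selector(s)]
    return len(matched) > 0 and all(check(s) for s in matched)

# Spec: the HTTP span should carry a 200 status code.
http_ok = run_spec(
    trace,
    lambda s: s["attributes"].get("tracetest.span.type") == "http",
    lambda s: s["attributes"].get("http.status_code") == 200)
assert http_ok

# Spec: every database span should finish in under 100 ms.
db_fast = run_spec(
    trace,
    lambda s: s["attributes"].get("tracetest.span.type") == "database",
    lambda s: s["duration_ms"] < 100)
assert db_fast
```

Broadening or narrowing the selector is what switches a spec between "this one span" and "every span of this kind."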
So I'm going to go down further in the system, and I'm going to look at what we get back when we call the external service. All right, we see what we get back. I can't say this. How do you say Meowth? What I do know is that I don't know my Pokemon as well as I need to. But I'm going to create a test spec for it regardless, on the response body, and I'll just say it should contain "meowth". Save that as another test spec. Okay. And then lastly, I'm going to do a more generic test, and I'll do this one from scratch. I'm going to say that I want all my database spans, and if I did it right, we're going to see them selected. I think I did not do it right. Yeah, lowercase. All my database spans: we got three selected. And I'm going to do a timing-based test. I'm going to expect that the length of time each of those takes should be less than 100 milliseconds. I'm going to save that. So we added three test specs to this test, and we can publish the test. At any point now, we can rerun the test and we'll get test results. I could take this test and run it via the CLI, and I could take this test and run it in a build process. So here's our test. Does that make sense? I'm going to stop for a second before I move on to the next stage, because in the next stage we're going to change code. Everybody with me? Okay. We've got a production problem. You know, none of y'all have had production problems, but I always seem to have production problems. I don't know why. And this is it: we've got a failed Pokemon import that's happening, and it's causing issues. So I'm in operations, and I'm going to look at the trace and try to figure out why. I see that we're getting all the way down to where we're reaching out to the external system, but I don't see the database writes.
So let's take a look at some of the attributes and figure out what's going on. All right. I see we're getting the Pokemon, so I know what the request was. I don't have much response data. What do I have? Oh, I've got something saying there's an error. Anybody know what error? Looking through it, that's all I've got. So what's my next choice? I can go look at the code and try to figure out what's wrong. But in the course of doing that, I really want to improve the observability of the system, so that the next time we have a problem like this, we can fix it. But first I've got to figure out what's wrong. You know, I did notice that the request was for Pokemon 999, and I'm not sure how many Pokemon there are. So we'll go and look at the PokeAPI site, where we can test things. If we try to get a low number, we get, there it is, a JSON object back. Now I'll try to get 999. And we get nothing. We don't even get JSON back. That's not nice; I would expect it to maybe give me a JSON structure with an error or something nice to deal with. I'd never tested this boundary, so I didn't know what happened with this external service when we called it this way. So I'd like to do a couple of things. I'd like to look at the code, I want to add some instrumentation to the code, and I want to have a test that's going to check that. I'm going to start off with the test, specifying what it is I want to be checking. If you ever watch any of those cooking shows on TV: okay, I've pre-cooked this cake, so you don't have to watch me type. I've started creating a test called Bad Pokemon, and down here is our test. Currently it's just checking that the OTel status code is equal to error, which we saw was already there. I want to add a couple of assertions, and the assertions I want to add are up here. So what are we going to do?
In addition to checking that the OTel status code is error, we're also going to check that the HTTP status code returned is 500, so that we know there was a failure. And then we're also going to make sure the HTTP response body contains "Not Found", which is what that API returns when it doesn't find anything. And it is not JSON. So I'm going to save this test. And with test-driven development, I'm going to go ahead and run the test. I haven't changed the code; that stuff's not there, so it's going to fail. But we'll go ahead and see it fail. So we are running that test. It's collecting both the response and the trace, and then it's asserting against them. And we see that, yep, the OTel status code was set to error, but the other two assertions failed because they're not in the code yet. So we're going to go add those to the code: both the HTTP status code and the response body. Let's go into Visual Studio, open this section of code up, and uncomment it. So what does this code do? First off, I can see now what the problem was. Without this code, we were trying to get the Pokemon and convert the response to JSON, and we didn't have JSON when it was a high number, so the code would fail at that point. So we're going to add in: if the response coming back was not okay, we're going to put some more information into our OpenTelemetry trace. We're going to put in the response status and the response text, and we picked the fields to put them in. For the response status, OpenTelemetry has a concept called semantic conventions, which are documented on the OpenTelemetry site: standard strings to use to identify certain attributes. So we're using the standard attribute for the HTTP status code. There's not one for the response body, so we've added our own attribute for that into the system. So we've added our code.
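A sketch of what a change like that looks like, using a stand-in span object rather than the real OpenTelemetry SDK (the demo app itself isn't written in Python). In real code you'd call `set_attribute` on the active span; the `http.status_code` key follows the OpenTelemetry semantic conventions, while the response-body key is a custom name invented here, since there is no standard one.

```python
class Span:
    # Stand-in for an OpenTelemetry span; the real SDK's
    # span.set_attribute(key, value) has the same shape.
    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value

def import_pokemon(span, status, body):
    # If the upstream call failed, record what actually came back on the
    # current span before bailing out, instead of failing later while
    # trying to parse a non-JSON body.
    if status != 200:
        # Standard semantic-convention attribute for the status code;
        # the body attribute name is our own choice.
        span.set_attribute("http.status_code", status)
        span.set_attribute("http.response.body", body)
        return None
    return body  # real code would parse JSON here and save to the database

span = Span()
result = import_pokemon(span, 404, "Not Found")
assert result is None
```

With that in place, the trace for a failed import carries the upstream status and body, which is exactly what the Bad Pokemon test asserts on.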
I'm going to save it, and now we're going to restart that microservice. This is the worker process. We've restarted it, and now we're going to take that same test we ran a second ago and see if it works. Okay, it sort of worked. It worked mostly. It sees that the response body contained "Not Found". But if I recheck the Pokemon service, I will see that it actually returns a 404; it doesn't return a 500. So I made a mistake. I can go fix my code real quick, or fix my test. We'll run it again, and the test passes. So what did we do there? We had an error, and we didn't have good observability into the system. We used test-driven development to say this is how the test should work when we finish development. We changed our code, and then we tested it. And we now have a trace that's going to give us detail that will help the next time we have a problem in production. So I wanted to do a little wrap-up. We said some of the benefits: it helps you test your telemetry before deploying your changes to production, which we just saw it did. It creates a feedback loop to help you build better telemetry into your system; if we base our tests on it, we're going to have the valuable information in our traces. Trace-based testing also enables you to make deep assertions without needing a lot of connection information for the back-end databases, different logins, different security keys, et cetera. You just need the trace. And it's really easy to write tests. It's easy to understand the flow and the relationships, and because of that, I think you can write more tests, and the people writing the tests don't have to be quite as technical. We are an open source project. We live and die by feedback and input. So we would love stars, but more than stars, I'd like feedback. I'd love to see y'all download the product, try it, and tell us, hey, it'd be useful if it did X. So how long have we been in production? We released our first version in June, and it was pretty usable.
By July, we were testing ourselves with it. We're now adding variables and chaining to tests, so it's going to get even more useful, and that'll be out before KubeCon. Yes, any questions? Very good. Yep. So you can export the results. We have JUnit as a standard, so you can specify, hey, when you finish, export JUnit, and we also expose that via the UI. This one doesn't have any tests on it, so it's not popping up, but it would give you a JUnit result. And that's one thing we're kind of interested in: what format should we output everything in? I think there are probably multiple answers to that. Any other questions? I think so. I think there's a case for setting up tests and running them at scheduled times as a monitor. And if there's an asynchronous process in the middle of that, a plain response check really isn't testing much of anything. Any other questions? I can't read that timer, I don't have my glasses on. One minute. Okay. I'd like to thank y'all. I'd love for you to go out to Tracetest, try it, and give us feedback. Thanks, guys. Thank you.