Hi, I got the sign from the back, so I'll get started. So hi, I'm Dave McAllister. I am an open source evangelist for NGINX. Some of y'all may have heard of NGINX as a product. We are part of F5. And what I'm going to talk about today, a little bit, is some lessons we learned in taking a modern application architecture and adding OpenTelemetry for observability into it. We'll probably go a little bit beyond some of the purely OTel aspects. I've been involved in observability for about five years at this point, in OpenTelemetry since it started in March of 2019 at KubeCon EU, and OpenCensus before that. So first of all, let's talk about where it's happening and why this becomes important. Every company is on this path to becoming a cloud native company. And cloud native is often combined with this modern application environment, where things are broken up and they're looking at a microservices model. But we've all actually started someplace. Unless you are a pure greenfield, probably a startup, you've started with a monolith. And so you go through this monolithic approach, lift and shift approach, refactoring. But we're looking at this cloud native aspect. Cloud native gives us the need for observability in this space because we have more moving parts. We're dependent on asynchronous communication paths inside of that. So we need to understand what each part is doing and how to bring each piece back together. And come on, we know that everybody is going to end up there eventually. And so by 2025, according to Gartner, about 85% of companies will be running something in the cloud, probably based on this modern architecture. And so when we start talking about modern apps, they basically are defined by capability, not by the implementation. And I don't care if you write it in COBOL, but they all have these certain characteristics that come into play. You can read through here. But when we come to this observability space, when we start looking at OpenTelemetry, there are three things that really pop out that we need to deal with. We want to find our performance bottlenecks. Today, consumers have a really short attention span, roughly 3.7 seconds; if somebody has to wait that long on a website, an e-commerce site, that's a lost sale. We want to provide answers to the questions that platform engineers, SREs, or DevOps folks are going to ask. And then we want to provide context on the errors, so when something breaks, we can figure out where it is. So those are the things that we look at when we're looking at observability and taking it into this application space. And so we started with this thing called MARA, the Modern Application Reference Architecture. It's a microservices architecture, it runs on Kubernetes, and it's designed to be production ready. We didn't want a toy app. And so we built this thing to reflect what happens in the real world. Interestingly enough, the application really is only one part of this. We've also got monitoring and observability concerns, and we have infrastructure concerns. The application is only the piece that we see, that we're touching, at this point in time. Part of this, by the way, is that the application we went through is actually open source itself. I'll give you a pointer to it, because we want people to steal the ideas.
And if you have better ways of doing it, we want to know about them too, because you'll find out there were some challenges involved in bringing OpenTelemetry to this. So here's a quick, sort of simple viewpoint of what we're building here. The front end was Python, and Python had the user services and contacts. Sort of the back piece here was Java, and then we were working with the PostgreSQL database. Because we also know that today, microservice applications are very seldom written in one language, we wanted to reflect that, so we had multiple languages involved. And when we look at this, this is sort of the shorthand version of an SBOM here. We're using Kubernetes config, AWS; it runs on Linode, it runs on Digital Ocean. We're using tracing with OpenTelemetry, management pieces. We did have a load generator so we could test it; we're using Locust for that. The application itself is known as Bank of Sirius, and it's actually a banking application: deposits, withdrawals, those kinds of operations. We're using continuous integration to mirror what happens in the real world, using Jenkins, and we're using Pulumi in Python to do that infrastructure as code. So we were trying to make sure that we were using as much open source as we could. I'm an open source guy. NGINX is a company based on open source, and so we wanted to continue that role and use as many open source products as we could. There's nothing in here that limits any of this to just being open source, but nonetheless, this is where that structure is. So, quick viewpoint, there are a lot of things over here. We've got people that are using it. They're going into the app code, the automation code. It goes into building an application. The application is running inside of Kubernetes. Inside of that there are observability pieces, there are security pieces, there's key management, there's logging. We're using NGINX as the base for communication: reverse proxying, talking to the back end, and doing all that communication. And it also can talk out to a cloud service. Once again, we're trying to make this as close as we can to what we expect to see in the real world. And here's where the problem first starts. If you're not familiar with this, this is known as the Cynefin framework. When we start building these modern applications, we actually have multiple sets of moving parts that change. And as long as we're only changing one thing, it's pretty easy to get an answer. But in this case, we're working with microservices, which make things a lot more complicated. We don't know where they are, we don't know how many there are, all those pieces, which gets us into the chaotic side, the elastic piece and the ephemeral piece. And so those two things together meant we faced some unique challenges taking an existing application and making it ready for this space. If you are a greenfield, you are starting your application from scratch, a lot of the things I'm going to tell you today may not apply to you. If you are taking an existing application and putting OpenTelemetry into it, you may suffer some of the same slings and arrows, for those of you who missed the earlier Hamlet soliloquy, slings and arrows that are going to bite you here. So the Cynefin framework is something I use all the time just to talk about how we have two moving axes that cross the boundaries. And so microservices create these really complex things. They really are complex in here.
The debugging, monitoring, and observability piece is very challenging, because it may not be the same every time you touch it. And we are also constantly doing faster cycles. One of the headaches that comes into play is that your scale of data is massive. When we used to have the standard monoliths, with a nice simple log file, we kind of knew everything was there. We didn't have to worry about clock skew. We didn't have to worry about drift. We just knew where it was. But now we're in a microservices environment, and we have to worry about all those things times the number of microservices that are going on. And so all those different pieces start dropping into this. We usually hear microservices and cloud together when we're talking about this, but the same concepts will apply to a monolithic application. There's nothing unique except the communication pathways. So observability is a data problem. I've just talked about that. And we usually look at this data as three classes of data. Metrics: do I have a problem? Traces: where is the problem? Logs, near and dear to my heart: what does the application think is going on when it's inside of that? But when we started looking a little deeper here, we identified that there were a number of other things that were necessary in a modern application environment to really get observability. So logs, metrics, and traces were just a start. We wanted error aggregation. We wanted to be able to pull things out and understand what was going on inside of the system on a daily basis. We wanted to see reporting on those pieces. We wanted to have health checks, so we could be comfortable that the system was up and running, and that each piece of the system was up and running at each point. We wanted core dumps. Core dumps are amazingly powerful tools, particularly when something goes wrong, to let you know what's going on. For instance, if you have a problem in a thread pool and you don't have a core dump, you are not going to figure it out. Just flat out not going to happen. And then we wanted runtime state introspection, to be able to look at all those different pieces, the caching aspects, again, that thread pool coming into play. We wanted to be able to understand what was happening. And so the engineering team and I started with the concept that says, why don't we find something that does it all? And so we listed across the top bar all of these different things that we wanted to have happen. And then we started listing down the side the tools that we could come up with, focusing on open source again as much as possible, and then checking the boxes: where it fits and where it doesn't fit. And we rapidly found, very honestly, that apples do not equal toasters. They weren't even the same class of fruit. We were completely off the wall. You can't compare Elastic APM and Prometheus, and you can't really compare a Zipkin product to something like Graylog. They are very different beasts. And so we decided that we would scrap this approach and go a little more organic. And because NGINX does have some expertise in OpenTracing, we started by looking at OpenTracing. And then somebody said, yo, look at OpenCensus. And the first thing that comes up when you look at OpenCensus says: OpenCensus and OpenTracing have merged into OpenTelemetry, and the OpenTracing project has been archived. Go to OpenTelemetry. And so we started by looking at OpenTelemetry and found out really quickly that it was someplace we wanted to be.
So a standards-based agent, cloud integration, all just a standard part of this. It had this concept of automated code instrumentation, at least the promise that we could get tracing data without instrumenting the code manually. It had lots of developer frameworks. It had lots of different language capabilities. And when we started looking at people using it, it was in massive use. The tracing aspect, because it was the merge of OpenTracing and OpenCensus, was already solid and stable. When we came forward a little bit from that viewpoint, we also started finding that there were a few headaches that came into play. But likewise, when we looked at this cloud-native modern architecture, using my three data classes, we also identified the things that we needed to have. We needed to have the API, and the API had to be available for every language that was of interest. We needed the SDKs, same question here. We needed a way to collect the data and send it someplace where we could do the aggregation, analysis, and visualization. And we wanted to make it as interoperable as possible. We did not want to build a single structure based on a single physical offering. Because of that, it turns out that OpenTelemetry actually solved all of those really well. It follows the W3C Trace Context standard, and it's an incredibly active project. It's currently the second most active project in the CNCF, it's adopted by almost every one of the observability vendors, and it has one of the most active end-user communities. And so we felt like we were on the right path. But OpenTelemetry deals with tracing, metrics, and logs. Keep that in mind. So let's start with logs. So everybody here, I hope, understands what a log is: something that the system thinks is worth writing out. Actually, something that the programmer decided was worth writing out and retaining. And so logs seem to be really simple. And they are, kind of: harvest the log out, put it someplace else, and figure out how you're going to make use of that data. Logs are pretty straightforward. However, when you start digging into the logs a little bit, you find out that you've got transport issues. How do you get the logs from wherever they are, times however many places they are, into someplace you store them? And by the way, logs are not small. At a former company, one of our clients delivered 50 terabytes of logs per hour. I've heard worse stories than that, but that's the worst one I've ever actually personally touched. Nightmare stories; maybe I'll tell you some later. Indexing: once you're going to do this, how do you find a log that you're interested in? Most of the time it becomes an indexing problem. How do you index inside of this? And then, how long do you want to keep them? If you're in the security space, you may want to keep them for two to five years. If you're purely debugging, you may want to keep them for 15 minutes. How do you decide? And so we also wanted to meet some of those other questions. How do we answer the questions for platform ops? How do we answer the questions for the programmers? So the logs have to be easily searchable, and they have to be searchable on any criteria that someone could think up. And so when we started looking at that, the logging structure was not daunting. Logs are a solved problem. But we decided that we would start with the Elastic Stack for now. And we used Filebeat to pull the data in through a Kubernetes DaemonSet.
We used the Bitnami charts to break it down and split the deployments out into the right places. And then we used Kibana because it had a bunch of preloaded search capabilities. So we got indexing. There were some interesting things. We got visualization, which was really useful. We got search capabilities, which were really useful. But my gosh, does this thing eat resources. Scaling up gave us fits. And even today, even running a very simple version of this, the logging lags the actual application by a substantial amount. Query variance was OK. You could pretty much search what you wanted to search. If you were indexed, your searches were really good. If you were doing a raw search, the search time could vary drastically. So we weren't really happy with logs, but they worked. We also, and I'll get into this a little bit later, were trying to figure out how to do the best correlation between all of these different services. But as you can see, the Elastic Stack is one of the areas where we're looking to see how we can improve. Distributed tracing came next. Distributed tracing was complex and sort of chaotic. We have a fairly clean pathway going through our system. Our microservices are broken into very clean classes, broken down in the traditional microservices way, where the microservice does as much as necessary; not as small as possible, but as small as needed. And so it was pretty clear what we were going to see here. We also had to have something that supported all our languages, and in our case, we only had two. We were looking at Java and Python, and we were looking at frameworks. We looked at Spring Boot, and we looked at Flask inside of that. What we found was really important when we started doing this is that in the OpenTelemetry space, they have this magical little thing called the collector. The OpenTelemetry Collector is incredibly brilliant. It is incredibly flexible. It has the ability to bring in any kind of data that you're willing to invest in writing the interface for. It can put data out any way you want it to go. It can do processing in the middle, and it acts both as an agent and as an aggregator. And so we could put the agent with each of our services, and we could aggregate it together and deliver it wherever we wanted. So where we delivered it was Jaeger. And Jaeger is a nice distributed tracing environment. And so it kind of looks like this from a trace viewpoint. The tracing up here started with an NGINX front end proxy. And then you can see, as it goes down, somebody asked for their balance. The balance request has to go out to the ledger. You can see how long each piece took. The length of time gives me each span's breakdown, and you can see the spans as they relate to each other all the way down. And I probably went off the line here for this one, but you can tell that this one took about 1.1 seconds. Not bad. Now, the nice thing is that if something has taken an abnormal amount of time, we can now tell exactly where it is. So imagine, for instance, that this piece, the read to the ledger, suddenly took 12 seconds. We now know where to go look for the problem. We don't have any better clues into what the problem is, but we do know that we can start looking at what's happening in each of these pieces. And so, interestingly, what we did was we wired the collector just to dump all the trace data to a local Jaeger, just to be able to see what was going on inside of that.
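To make that wiring a little more concrete, here is a minimal sketch, in Java, of how a service can hand its spans to a local collector over OTLP using the OpenTelemetry SDK. This is an illustrative sketch rather than the MARA code itself; the endpoint, port, and service name are assumptions.

```java
// Minimal sketch (illustrative, not the MARA code): configure the OpenTelemetry
// SDK so spans are batched and shipped over OTLP to a collector running locally.
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.propagation.ContextPropagators;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class TracingBootstrap {

    /** Wire the SDK once at startup; everything else uses GlobalOpenTelemetry. */
    public static void init(String serviceName) {
        // The collector's OTLP/gRPC receiver conventionally listens on port 4317.
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://localhost:4317")
                .build();

        Resource resource = Resource.getDefault().merge(
                Resource.create(Attributes.of(
                        AttributeKey.stringKey("service.name"), serviceName)));

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .setResource(resource)
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();

        OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                // W3C Trace Context is what carries the trace ID between services.
                .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
                .buildAndRegisterGlobal();
    }

    public static Tracer tracer() {
        return GlobalOpenTelemetry.getTracer("bank-of-sirius");
    }
}
```

The collector then decides where the spans actually land, a local Jaeger while debugging or any other backend, without the service code changing.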
And then, because we now had visual representations of the data, we started figuring out how to correctly wire all those pieces together to do that aggregation aspect. The aggregation aspect was really, really useful. When you start aggregating, though, one of the things we found is that keeping track of clocks across multiple systems and multiple services becomes a unique challenge. NTP, the Network Time Protocol, is what most people use, and what we use for almost all of these things. NTP can vary by up to about 100 milliseconds in its readings. And so occasionally you're going to see traces where the sub-trace, the subspan, actually starts before its parent span does, because that's the way the data was reported. This is called skew. And so skew is one of the problems. Jaeger is nice enough to say: this obviously must fit somewhere inside this line, so it can tell you that it is part of that span, and it will try to put it inside the parent span. It won't let it start before the span starts. But it does hide additional data from you. So, a nice viewpoint of all the spans that were inside of that. But it really is all about languages. And in OpenTelemetry, the maturity of each language is dramatically different. So looking at what's in Rust versus looking at what's in Java are very different things. They're all GA'd for tracing, because tracing was the first one that was stable. There are some of them that have good maturity on metrics, and there are some of them that have, quote, not yet implemented on metrics. And then logging is still in beta. We expect to have the logging aspect by the end of this year or the first part of 2023. But logging is just a wild dream right now for most of these languages. So when we looked at this from a language viewpoint, we actually had to figure out all these different pieces. Python was really easy. Remember, Python is more of our front end environment. So we had to add two files, and then we updated the requirements to include the dependency structures. And you can see what's going on here. When we did this, we did something additional. When we generate the trace ID, we write that trace ID into any log line in that request cycle, so that we can now correlate between the trace and the log with no headaches; there's a short sketch of that pattern below. This is amazingly useful. It's really hard to go back from a specific trace to a specific log entry if all you're depending on is the timestamp. And so adding the trace ID was something that did take a little bit of effort on our part, but it was really useful. Our logs were all written in Bunyan format, JSON Bunyan format, because the guy who did the logging really likes JSON Bunyan format. The trace ID lets us track down those dependent services. So Python was really simple. Java, on the other hand, had some interesting challenges. If you were writing in Java in a greenfield, it's not too bad. You can pretty much make it all work perfectly transparently: import the libraries, use the APIs. And with Spring, Spring Boot, things looked really easy. They have this thing called Spring Cloud Sleuth, which adds in the appropriate functionality. It gives you a trace and span ID. It ties in the common ingress and egress points. It adds traces to scheduled tasks. And it can generate Zipkin traces. The collector doesn't care; it's perfectly capable of pulling Zipkin in, Jaeger in, and handing out Zipkin, OTLP, or any of those.
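Since that trace-to-log correlation comes up a couple of times, here is a small illustrative sketch of the idea on the Java side, using the OpenTelemetry API and SLF4J's MDC so every log line written while a span is active carries the trace and span IDs. The field names and the wrapper method are assumptions for illustration, not the project's actual module; the Python services do the equivalent in their request cycle.

```java
// Illustrative sketch: stamp the active trace ID onto every log line written
// during a request so logs and traces can be joined later without timestamps.
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public final class TraceLogCorrelation {
    private static final Logger log = LoggerFactory.getLogger(TraceLogCorrelation.class);

    /** Wrap request handling so trace/span IDs ride along in the MDC. */
    public static void withTraceIds(Runnable handler) {
        SpanContext ctx = Span.current().getSpanContext();
        if (ctx.isValid()) {
            MDC.put("trace_id", ctx.getTraceId()); // shows up as a field in JSON logs
            MDC.put("span_id", ctx.getSpanId());
        }
        try {
            handler.run();
            log.info("request handled"); // carries trace_id/span_id via the MDC
        } finally {
            MDC.remove("trace_id");
            MDC.remove("span_id");
        }
    }
}
```

If no trace is active, the fields simply stay blank, which matches how un-traced log lines are treated later in the talk.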
Java also has this really great auto-instrumentation library, which is the first thing we tried. And boy, did we get traces. And they were useless. Not because the traces themselves were broken, but because the traces didn't give us any information about what was going on inside of our application. It was showing us common points where we were touching the various packages, but it wasn't telling us anything useful from the application viewpoint. And at this time, which is about six months ago now when we did this, the autoconfiguration was a milestone release. The ability to simply add Spring Cloud Sleuth into play was a milestone release, which meant that it didn't support the current versions. And even today, it now supports more recent versions, but it still doesn't support the latest version of the Java runtime. I think currently, the last time I looked at it, it's supporting 13, and I think we're at 15 or 16 at this point in time. So there were a few headaches here. And because of the nature of this milestone release, we had to pull directly from the Spring snapshot repository because of dependencies that were built into the Spring Cloud Sleuth project. And so we had to change our build structure and pull in different sets of information at the same point in time. None of which is hard to deal with, but if you're in a production application, the last thing you want to be dealing with is something that's already down-rev, and something that's going to require your build process to change just for those unique items. So between the auto-instrumentation and the Spring Sleuth thing, we had some interesting challenges when it came to Java. And so we literally had to start looking at how to manually build our tracing structure. Again, right now, auto-instrumentation will give you a tremendous set of information around things that are common. And if the libraries and dependencies it knows about are in place, and there is a huge list for Java, you will get even more detailed tracing information. Because we were working with a basically in-house-written application, we couldn't go that route. And so we actually manually built the tracing structures. That's pretty straightforward. It's pretty much: you instantiate your tracer, you instantiate your spans, and you call the spans. And it all magically happens for you. But it does take a little bit of effort to do that. So those things are going to change, and we'll continue looking at what's going on. Our answer to doing this, rather than going in and rebuilding every single package, was to build a module, a single common telemetry module, that provided this tracing functionality. It used the autoconfiguration classes so that we could pull in the data that Cloud Sleuth was giving us. We also added other things that were important to us, other trace attributes, so that we could track not only what the trace was doing, but where the trace was coming from. We also wanted to make sure that tracing wasn't putting an undue load on the system. It's one of those cases where tracing should be really fast and the collector should be really fast, but how do you know? And so we took the effort, because we wanted to actually do this test, and built an entire no-op class of tracing. We can flip a switch, and tracing now goes to no-ops, and you can get a really clear viewpoint. It turns out that there's very little overhead. If you go from no-op to trace, there's really very little overhead.
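The manual pattern really is about that small. Here is an illustrative sketch, not our telemetry module, of instantiating a tracer and a span by hand, with a switch that swaps in the API's built-in no-op implementation so you can measure the overhead of tracing itself; the system property, span names, and attribute names are assumptions.

```java
// Illustrative sketch of manual instrumentation plus a no-op switch for
// measuring tracing overhead. The flag, span, and attribute names are made up.
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class LedgerReader {
    // Flip this switch and every span becomes a no-op, so you can compare runs.
    private static final boolean TRACING_ENABLED =
            !"noop".equalsIgnoreCase(System.getProperty("tracing.mode", "otel"));

    private final Tracer tracer =
            (TRACING_ENABLED ? GlobalOpenTelemetry.get() : OpenTelemetry.noop())
                    .getTracer("ledger-reader");

    public long getBalance(String accountId) {
        Span span = tracer.spanBuilder("ledger.get-balance").startSpan();
        try (Scope ignored = span.makeCurrent()) { // child spans created here nest under this one
            span.setAttribute("account.id", accountId);
            return queryLedger(accountId);          // the work actually being traced
        } catch (RuntimeException e) {
            span.recordException(e);
            throw e;
        } finally {
            span.end();                             // always end the span
        }
    }

    private long queryLedger(String accountId) {
        return 0L; // stand-in for the real database read
    }
}
```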
We added a trace name interceptor to our trace names so that we could standardize them, because each language uses different naming formats. We wanted to standardize our trace names so that we could connect them together easily and understand what variables they were talking about. And then we added things like the service name, the instance ID, the machine ID, and a handful of other things, just because we were nosy. And then we wanted to take the statements that the trace was giving us and put them into comments that preceded our SQL calls, because we needed more information about what was going on with the database structure. So all those things were really what we built ourselves, closest to the application. We added the ability to use this with Apache; it was built to use NGINX, but again, we wanted to prove that this was a common environment, not a specific one. On the NGINX side, all of this uses the open source NGINX OpenTelemetry module talking to the collector. You can go play with it yourself, but literally it just grabs the data and makes the tracing information available; the data is automatically available to it. And then sort of the last piece here was metrics. And we got lazy; we skipped Python. There are some pretty decent options, but we decided we didn't want to deal with Python. I'm not sure whether we got lazy or got bored, actually, come to think of it. Java required some serious consideration here. The original code used this thing called Micrometer with Stackdriver for the Google Cloud folks, GCP. And what we found, when we started looking at it from an OpenTelemetry viewpoint, was that there were some significant limits to the number of metrics and the types of metrics that we could collect coming from OpenTelemetry directly into our environment. And metrics don't really tell you much other than everything's working right or something's gone wrong. They don't give you a clear indication of where, or any of those things. But I bet almost everybody in this room depends on metrics at some point. If you are in the DevOps space, SRE space, platform ops, even in pure programming, you're probably dealing with metrics at some point. So metrics turned out to be really important, and we ran into some interesting limitations. Micrometer turned out to be this really mature environment for Java virtual machines, and it was the default API for the Spring aspect. And so when we did this, we simply said: we're not going to reinvent the wheel, we're not going to wait for OTel to catch up with us on the metrics environment, we're simply going to use this. And oh, by the way, that collector thing can just pull the metrics directly from this and send them any place you want them to go. And so the collector, again, allowed us to do the correlation that we otherwise would have had to build separate correlating capabilities for, without too many problems. The OTel Collector can scrape metrics from just about anything, Prometheus, StatsD, put them into OTLP, and get them where they need to go. Micrometer supports lots of backends. There are some interesting crossovers. And so we could simply make this work. We could take the metrics from Micrometer and pull them. So this was a pull activity rather than a push activity, through the collector, through a receiver, and out through the exporter.
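For a rough idea of what that pull path looks like on the application side, here is an illustrative sketch of Micrometer publishing through its Prometheus registry, with a scrape endpoint exposed by hand so the collector's Prometheus receiver can pull from it. The metric names and port are assumptions, and in a Spring Boot service the Actuator would normally expose this endpoint for you instead.

```java
// Sketch of the Micrometer side of the pull-based metrics path described above:
// the app exposes a Prometheus scrape endpoint, and the OpenTelemetry Collector's
// Prometheus receiver pulls from it. Names and port are illustrative.
import com.sun.net.httpserver.HttpServer;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class LedgerMetrics {
    public static void main(String[] args) throws IOException {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        Counter deposits = Counter.builder("ledger.deposits")
                .description("Number of deposits processed")
                .register(registry);
        Timer balanceLookups = Timer.builder("ledger.balance.lookup")
                .description("Time spent reading balances")
                .register(registry);

        deposits.increment();
        balanceLookups.record(() -> { /* balance lookup work would go here */ });

        // Expose /metrics in Prometheus text format for the collector to scrape.
        HttpServer server = HttpServer.create(new InetSocketAddress(9404), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = registry.scrape().getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```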
And then we chose to send it to Grafana so we could get the dashboard view. But we also said at this point, OK, OTLP should work, so we decided that we'd send it to Lightstep as well, just to see that it worked that way. And so again, because of this magical OpenTelemetry Collector, we can send the data anywhere we want to, and the backends generally can receive the data any way we want to send it. And so that became some really attractive functionality. So, three principal challenges that we identified in our after-action report here. What libraries you use with your application can cause you massive heartburn. Do you use the auto-instrumentation library? Do you have to write everything yourself? Do you grab a library that somebody else has put into open source from a vendor? Is it up to date? There were a lot of moving parts just in the library functionality. And sometimes we looked at frameworks, so again, Spring Cloud, Spring Cloud Sleuth; or we could do it directly through the OTel libraries. So there were lots of moving parts in here. Setting up the collector and figuring out how to send data to it was the next big thing. Once we got the data into the collector, it was easy. But getting it to the collector the first time did require us to think about the problem, about how we were sending it, and about what collectors we were going to be using. And then finally, and I've talked about this a little bit: how do you reconcile the metrics viewpoint, which is usually an aggregation viewpoint, the trace viewpoint, which is a single-request viewpoint, and the logs that are being generated actively by the application? And so we started looking at how to coordinate those. OTel in the future will do all that for you. We've seen the work and the direction, and we're fully convinced, yeah, it's going to work. We're not there yet. So we have to do a lot of this work ourselves. Our metrics are not coordinated to a trace ID, but they are coordinated to a timestamp, and I've already discussed a little bit about the timestamp issues. Our logs are matched to the trace: if the log is written while a trace is active, there is a trace ID in our logs, and so we can automatically correlate back across those. In the future, we're going to catch up. This is about six months out of date; OpenTelemetry is a rapidly moving subject. And we are beginning to look at what our next steps are and how to bring things in. We now have GA for tracing across almost every language. We expect, and are already seeing, the release candidates and the GA approach for metrics in our languages of choice. And we're waiting on logs. We're all waiting on logs. But we'll continue to add those things. So logs are probably the biggest thing we're really waiting for to come through from the OTel group, the CNCF activities for OpenTelemetry. We're really interested in keeping up with logs, and we track that work pretty heavily. We're also now taking that and extending OpenTelemetry capabilities, the ability to collect this metric data and this tracing data, to other NGINX projects. So we've taken this process and said: this is what we did for an application. We write applications; we have these open source applications. How do you take the data that's part of that open source structure and pull it into wherever you want it? And so we're now looking at what it's going to take to start pulling OpenTelemetry data, tracing, metrics, and at some point logs.
Right now we write logs out totally separately. And we want to pull it all out so that you can simply drop a collector in, pull the data where you want, send it where you want, and understand what's happening in that software layer. So, as a quick summary, because I can do this here: this was not a trivial effort. The lead programmer on this was regularly ripping his hair out, and we were talking him down off the ceiling and feeding him a couple of drinks, and he would calm down and get some more work done. So with Java, we used Spring Cloud Sleuth: Sleuth, the OTel exporter, the OTel Collector, going into a pluggable store. Python, very similar: the Python libraries, the collector, the store. And then NGINX used the NGINX OTel module, to the collector, to the store. So the OTel Collector is the common item inside of that. The front end library varied by the type of framework and language we were using, but the store and the collector are what drove it. Similarly, when we got to metrics, we used Micrometer via Spring, out to Prometheus. Prometheus is a common environment; it's probably sort of leading to the OpenMetrics work that's going on, and that also went into the collector. Kind of nice that we could do that. Python used Gunicorn and StatsD, which went into Prometheus, et cetera, to the collector. Common thing: collector, collector, collector. The only one that was not was the log files. The log files went into Elasticsearch and Kibana. We are not happy with that. It is still a work in progress. We're also, as part of those future plans, looking at other logging solutions while trying to remain with open source, or free, as much as possible. There are some really great, fast log management products, but we don't want to tie this to one specific product. So we're looking at Graylog, for instance. Not my favorite, but maybe it's faster. We're looking at Loki, Grafana's log capabilities. Each of these has unique challenges. We know, for instance, right now, just from doing the research, that a lot of them are very performant at small scale and not performant at large scale. And so when we're looking at this as a production environment, we have to ask: what's going to happen when we go to scale? And so we'll continue to improve and deal with that. Some of the pieces I didn't talk about, we actually did as well. We did build this error aggregation capability using tracing, so we can see what each trace is reporting and build a report that puts that error aggregation together. It's not perfect. It doesn't catch everything. But the reason we did this is that a trace is an individual request flowing through the system, and therefore we can see what happened on a per-request basis when we're doing the aggregation reporting. With health checks, we basically used the back end, or the framework tools that were available. So the Spring Boot Actuator feeding into Kubernetes, and likewise the Flask management endpoints feeding into Kubernetes, so that we can keep the health checks going. Runtime introspection, same basic thing. And when we got to core dumps, the Spring Boot Actuator also supported the capability to do thread dumps. Not pure core dumps, but thread dumps. Thread dumps in Java are really, really useful once you figure out what they're telling you. But Python doesn't have that capability at this point in time, and so Python was not part of that. So, sort of quickly summing up: this is an open source project.
All these pieces are available for you, and you can go play with it. If you have ideas, or something that we did wrong, let us know, please. Just be part of the group that's trying to figure out how to do these things. So, metrics and traces took longer to get where we wanted them, but they were worth the effort, very honestly. And without the OpenTelemetry Collector, we would probably still be trying to figure out how to do this. The collector was clearly our friend. We're deeply happy with the collector, and we can't figure out how to live without it on anything we're working on like this anymore. Metrics and traces, partially because of maturity, had some interesting gotchas; I've talked a little bit about that. The metrics output may be stable, but the metrics right now don't give us all of the information that we consider important when we're looking at this application space. This is a snapshot in time. Things change. We will be revisiting this and bringing it up to speed with what it looks like going into the future. And then finally, auto-configuration is really great when it works and delivers what you want. If you are looking at pretty much the infrastructure impact on your application, or your application is an endpoint-driven model, the auto-configuration for Java is a godsend. Just drop the jar files in at runtime, and you're done. If you are looking for application-specific information, you will need to do some level of manual instrumentation. The nice thing is it's pretty straightforward. In Java space, there are about four calls that you need to stick into the right places to make that happen. If you're doing the crossover to add tracing to logs: build your trace, instantiate your trace first, grab the trace ID, and just write it out into your log files. So it was pretty much the easy way to go. So with that, let me open it up. If you've got any questions, I can take a few now. Yes? I'm sorry, I can't hear you over the mic; there's a wonderful air conditioning hum right in front of you. How are you propagating the trace ID amongst all the different services? So, they asked me to repeat: how am I propagating the trace ID among all the different services? We have actually taken the trace ID into a storage element that we are passing as part of a message that we pass back and forth. This is what happens with the standard tracing structures as you go out to microservice calls anyway: you're passing the trace ID. We're actually grabbing the trace ID as it comes into each of the systems as part of that module we built, which makes it available to do this. We simply add that as a variable to every log line that we write out. So if it's active, we see which logs were written with that trace. If there's no trace, it's simply a blank field. And so we can drop it out in a column or structure and just not have to worry about it. Yeah. So at least in the C++ API, there's the ability to add tags and events to the trace. So couldn't that be a replacement for logs? Yes. So the question is really: can I use tracing information to replace logs? The answer is a qualified yes. Let me qualify that a little bit here. In the application space, and we are beginning to see this start creeping in, the trace actually contains the necessary elements that we would normally write out in the logs. We actually have this trace message space.
And so we did use that somewhat, where we would write the log, and the log message is bound into the trace message. We got a little nervous about what we were going to see from a performance aspect, because we were still writing logs out at the same time. We also wanted to have our logs in a structured format. Like I said, we used Bunyan, and we wanted to keep that structured approach to logging. And so we wrote both of those out. But that's actually a really great point. In the future, if you are using traces, it is possible for your application to write the logging message into the trace message field, and that will give you the ability to not have to worry about having a trace ID, because the log is written into the trace to begin with. That will help solve a lot of these problems. But there's another headache that can creep into play with this: the data loads that are generated are massive. I mean massive. Think about how many logs you write out on a regular basis. It is not unusual for a single trace to cross 16 to 18 spans, easy, and I've seen them cross 100 spans. I think at one point OTel defined, I forget what the number was, that if you crossed 2,000 spans, it truncated. But think about that. You're now writing out logs, you're writing out traces, and at the same point in time your data load is unbelievable. So what happens is most people tend to end up sampling traces. And the collector has the ability to actually do trace sampling for you. Head-based sampling: OK, give me five out of 100, rather than tail-based sampling. Most of the back ends do some level of sampling. So if your logs are written into your trace and you've sampled them out, you've lost your log. Any more? I think we have one. Do you have one? OK. I have to make sure I was properly paying attention, but I do believe you mentioned the word Rust. I did mention the word Rust. And then you didn't list it in any of the integration challenges you had, so I was curious. I want to live vicariously through your suffering. Yeah, so currently, OpenTelemetry has a list of about 12 different languages. So the question was, why did I mention Rust? Most of our base-level tools are written in C. You won't find a C subsection for OpenTelemetry. So we can write Rust headers to fit into the C calls, and that gives us the ability to reach into them. But it's a very immature SDK at this point in time. I'm sorry? Yeah, that's fine. Yeah. So actually, somewhere in here, I literally keep a chart. I go check it about every two weeks and write down the latest version of each of the languages and the stability level of each of the languages. Java is definitely the most stable and the most advanced. But yeah, we have an open source app server that we are actively looking at using Rust to do OpenTelemetry work with, and that's why it cropped up in my vocabulary. Anything else? If not, I think we're done. We have one from online. OK. Timestamps, synchronization: it can be troublesome in a distributed environment. Is it important to have accurate timestamps? How could this be solved? Oh, jeez, there's a booth upstairs. I know this person's online, but there's a booth upstairs that can talk to you all about clocking for this. Right now, for most cases, your timestamps are good enough as long as you understand that you are going to get a certain amount of artifacts.
You will occasionally see subspans start before spans. In most cases, it doesn't matter. Remember that the trace is a single request going through your system. So only if you, quote, have a problem with the trace do you care what the trace says and what it looks like. But like I said, there's a vendor upstairs, and they did a really great talk for OpenTelemetry Day yesterday about the need for being able to build accurate clocks. They're building clocks in a mesh environment, and their meshing approach actually improves the clock accuracy for timestamps down to about 10 milliseconds from about 100 milliseconds. So that's pretty good information to look at. Other than that, I haven't looked at this much. I have a little list of questions that I plan on going up there and asking them about, because there are all sorts of challenges that can come into play when you get into distributed timing. So with that, thank you all for coming and listening to me, and we'll see you around the show.