We might want to wait a couple of minutes for people to get on the call, because it takes about five minutes. Let's see. Oh, I'm giving a summary. All right then. Good to know. I should prepare the summary already. So we have Michael and Mike from Couchbase. Do you go by Michael, Michael? Is that how you guys differentiate? Yeah, actually, I think so. I'm the Michael and Mike is Mike. All right. That makes it easier. You're muted, Chris. Actually, two more from Couchbase on the call. Oh, great. Yeah. Sweet. Whoever is managing the attendees. There you see. 8:35, right on the money. Perfect. All right. That looks like a good group of people. Hello everyone. Welcome to Friday. Welcome to the OTSC call. Great to see all your lovely faces. We have a pretty cool call today. We've got the Couchbase team here in force, and Mike and Michael are going to present the great work they've been doing integrating OpenTracing directly into their libraries and products. Please make sure you add yourself to the attendee list. Looks like everyone's doing a good job of that right now. And yes, Matt, we are recording this, so this will go up on YouTube on the OpenTracing channel when we're done. And on that note, Michael, do you want to take it away? Yes. Can you... I think I can just share my screen. Can everyone see? Yeah. Okay. Perfect. So let me jump over to the slides here. All right. So thanks everyone for joining, and everyone who is watching the recording in the future. A slight correction: I'll be presenting this alone. Mike had his hands full with other stuff, but he is around for any .NET and other related questions, so you'll just have to bear with me for now. So yeah, my name is Michael Nitschinger. I work for Couchbase as a software engineer, and lately, over the last couple of months, amongst other things, I've been knee-deep in what we call response time observability, which of course also feeds into the whole OpenTracing theme, which I'll talk about over the next couple of minutes. Then we have a demo, and then I think we can just open it up for questions. So I'm going to tell you a little bit about Couchbase, just a couple of slides for those who haven't heard of it before and what we've been doing, then the challenge which brought us into the whole OpenTracing and distributed tracing theme: timeouts. Then what we've done with adopting OpenTracing, the good and the challenging pieces, then a live demo, then a call to action, and then we can do some questions. All right. So what is Couchbase, for those who haven't heard of it before? Basically, Couchbase is a distributed document-oriented database focused on scalability and performance. That's the one-sentence description of what Couchbase does. It has all kinds of properties like auto-sharding and a flexible data model, but for the purpose of today's call, what's important is that it has a memory-first architecture and that everything inside the database itself is done asynchronously, and we'll see how that works in a second. The other thing is that you can scale workloads independently. If you look at where Couchbase is coming from, originally, years and years ago, it basically was a distributed managed cache and key-value store, and then it evolved into the document database theme, and then we've been adding other things like Couchbase Mobile and the N1QL query language, which is very SQL-like. We added full-text search, and now we're adding analytics.
So you can see there are more and more components being added to the system, which, as you can immediately imagine, is probably not going to make it easier to troubleshoot performance issues. But just so you get an idea, this is where we're coming from: basically from being a distributed managed cache and then adding functionality towards document-oriented workloads, analytics, full-text search and so forth. One thing to call out here is that an interesting property of Couchbase is that it supports what we call multi-dimensional scaling. You can basically enable every kind of service, be it key-value, be it full-text search, be it querying, on every node in the cluster, but you can also choose to only enable individual services on each node. So you can say, okay, I have a 50-node cluster; on two nodes I run the query service, on some others I do indexing, and then I have a couple of other nodes where I store the data, the KV service. And the important piece here is that on the client side, our SDKs are actually intelligent. They don't just take your data and dump it onto a remote socket, which is mostly what happens with relational databases; the SDKs play an integral role as part of the distributed system. They basically get cluster information in near real time, receiving updates of the topology, and then, when a request comes in, they decide where to dispatch it to, including handling certain retry scenarios when you are rebalancing the cluster. Rebalancing means you can add and remove nodes on the fly without downtime. So that's another challenge the SDKs handle, right? Making sure the data gets to the right places at every point in time without user disruption. And here's an example of how we would do write operations. For example, you create a document in your, let's say, Java application, and then you call the upsert method on the SDK. What happens is, of course, we send it over the wire to the server and it lands in the managed cache, our KV layer. Once it's there, the managed cache will basically acknowledge the write to the application itself, and then it will asynchronously send the operation to the replication queue, asynchronously to the disk queue, asynchronously to the secondary indexing engine. So all of these steps are happening asynchronously, and obviously they can also happen across the cluster: the replication queue will eventually send the operations to the replicas on other machines. So even if you just perform a single operation from an SDK point of view, many different spots in the distributed system are actually affected. And when we come to the point where something is slow, we need to figure out which places in the distributed system have been touched and where the slowness comes from. The other operations are similar, but I just didn't add more slides. If you have questions on specific functionality of Couchbase, just let me know and we can cover that later. So with that basic knowledge in mind of how Couchbase works and operates, what's the big challenge? The thing we've come across over the years — and I've been with Couchbase for many years now and we've been handling support escalations — is that one of the bigger challenges users and customers are running into is timeouts.
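For readers following along, here is a minimal sketch of what the upsert path just described looks like from the application side, in the style of the Couchbase Java SDK 2.x. The bucket name, document ID, and credentials are placeholders, and exact method names can vary between SDK versions.

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class UpsertExample {
    public static void main(String[] args) {
        // Connect to the cluster; the SDK learns the topology and routes
        // each operation to the node that owns the document's partition.
        Cluster cluster = CouchbaseCluster.create("localhost");
        cluster.authenticate("Administrator", "password");
        Bucket bucket = cluster.openBucket("travel-sample");

        // The upsert is acknowledged once it lands in the managed cache;
        // replication, persistence and indexing then happen asynchronously.
        JsonDocument doc = JsonDocument.create("airline_demo",
                JsonObject.create().put("type", "airline").put("name", "Demo Air"));
        bucket.upsert(doc);

        cluster.disconnect();
    }
}
```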
And the problem is, everyone who has looked at timeouts, worked with them, and tried to figure out what's going on comes to the realization that timeouts are always the effect, but never the cause. Timeouts are the symptom of something being slow in the system or taking longer than expected. And that realization hasn't happened for lots of users out there: you see a timeout and you think the problem is the timeout. So one of the important pieces here is to help users first realize that timeouts are not the real problem and that they need to go deeper and figure out what caused the timeout in the first place, but then also to put the tools and functionality in their hands to actually go troubleshoot it. One of the challenges with, for example, the current Java SDK — and it's similar in other languages — is that in this stack trace, the user has been performing a get request, just fetching a document, and what got returned was a timeout exception. The problem is that the timeout exception only tells you, well, something was slower than expected: the deadline you gave it as a timeout value was exceeded, the operation took longer than that deadline, but it doesn't tell you exactly what went wrong or what was slow, and then it's very hard to troubleshoot. The next step most of the time is to go look at the logs, figure out if you can see something, go fetch information from the server to see if something's slow there, and so forth. A very iterative, explorative process, but also sometimes very time-consuming, which stretches the time from something going wrong to detecting what went wrong and exactly how to fix it. Shortening that is the whole purpose of response time observability. So, common causes. Obviously there are three players or three components in our distributed system: there are the app servers, and there can be many of them, then we have the network, and then we have a cluster of Couchbase nodes, and each of them can have several causes. If you look at the application server: at the very bottom you have the networking card, and on top of that you have the operating system with potentially many different layers of virtualization, including Docker, Kubernetes, whatever, which doesn't make it easier to troubleshoot what's going on, and we have seen weird bugs at the OS level. I don't want to go too deep into that, but we've had issues there as well. Then obviously you have the runtime: if you develop a Java application, you have your application server, or at least your JVM, garbage collection, all the fun things that you have to troubleshoot. Then inside your runtime you have your application, where something can go wrong, logic problems, and then inside the application you have the SDK, which can also have bugs — no code is perfect. So you have all these causes on the application side. Then, if we go down the layers, we get to the network, and there you have firewalls, switches, load balancers, proxies, all potentially causing latencies, maybe in batches, maybe spontaneously. We have seen firewalls dropping packets, firewalls basically blackholing the sockets, neither telling the server nor the client that the socket got closed, operations going into the void, all these fun things on the network. And with shared networks on EC2 and other cloud providers, it's even trickier.
And then, once we have the application and the network handled, we can look at the Couchbase Server cluster, and there everything about the OS layer obviously applies, but instead of the application server we have our Couchbase Server node, where each individual service can cause slowness. For example, if you fetch a document, the disk can be slow, so it's coming from the KV service, or there's garbage collection in the N1QL query engine, something like that. So you have all those causes inside Couchbase, and on top of that we need to figure out where the slowness comes from starting from a single timeout exception. That led us to look around for ways to make this easier on our users and on our support folks, and this is what led us to OpenTracing. We basically sat down, did some brainstorming, and figured out some key requirements for adopting a certain API inside our languages. There is vendor neutrality, because we are not in the business of application performance monitoring. We are not providing tracing implementations, but for us it's important that we plug into as many tracing implementations as possible; we don't want to force anything specific onto our users. Then, for us, it's also important that at least by default it has a small footprint. If we bring in more dependencies, there are potentially clashes with other application dependencies; the more baggage we bring into the system, the more challenges specific customers will probably have deploying it and so forth. So we're looking for a minimal footprint. It needs to be supported across all the SDK languages that we support — I've put all those logos here on the slides: Java, Go, .NET, C, PHP, Python and so forth. We officially maintain a large array of SDKs, and every piece of functionality we roll out we need to provide in all of those languages, so that if you are coming from Java and you're switching over to .NET, you feel right at home and have the same functionality. And in large, diverse ecosystems in enterprises, you have different teams running different languages, but if they eventually settle on the same distributed tracing engine for the whole system, or you have different microservices in different languages, you want the same functionality available. And it should be actively developed. It shouldn't be something that we adopt and then there is no adoption out there anymore. So that's why we settled on OpenTracing: it's vendor-neutral; it supports all the languages that we need for our SDKs; and — this is very important for us — it's an API where no other decisions are made. We don't want to force any specific network transport or protocol decisions onto our users, and we don't want to force other implementation details that they might want to override or customize in their system. And it's backed by the CNCF, so it has certain momentum; it's not some small one-off solution where users end up building customized code anyway. So it basically brings all this backing with it. And it's a moving target, and that's a little challenge for us, but the good thing is we can influence the moving target: we can participate in OpenTracing and move it forward. What we mean by moving target is the versions. It's not like there is a one-off spec that is set in stone; it's evolving, which is a good thing.
But it also means that we need to keep track of the changes, and then with each incremental SDK version bump the OpenTracing dependencies to make sure we are tracking it appropriately. One other key design decision was to implement a default tracer. So if someone upgrades, let's say from Java SDK 2.5 to 2.6, where there is OpenTracing support, the user doesn't even have to know that OpenTracing is in place, other than getting one more dependency. The reason is that we want to make it plug-and-play, zero friction. The idea is that we ship with a default tracer called the threshold logging tracer. It's a tracer that is enabled by default, and it aggregates slow spans on a per-service basis — key-value, N1QL, all the services we provide — and logs them at an interval. We have set specific thresholds, but you can tune them. They are set so that, for example, any KV operation that takes longer than 500 milliseconds gets aggregated and then logged in a 10-second interval, and then, for example, the top 10 slowest operations are printed into the log with additional information. And with the information that we provide, timeout correlation suddenly becomes possible. The timeout exception shown here changes from a simple timeout exception without context to a timeout exception which gives you an identifier that you can use for looking at the logs. We provide the operation ID, the local and remote sockets, and the timeout that was set. So we suddenly enable the user to look at this information and correlate it with the log output from the threshold log reporter, which is part of the tracer. Every 10 seconds it will dump out the top 10 (or whatever you configure) slow operations and give you the same information, and in addition give you the timings for specific parts of the process. Because let's say you have a timeout of 500 milliseconds and the operation hasn't returned yet; if one second later the operation does return from the server, we can still get information out of it and put it in the log. So in this case, you can see the total time the operation took to execute. The dispatch time is in there, which basically combines the network time and the server time, and then, if the service supports it, it also gives you the server time in microseconds. For example, if you store a document or retrieve it from our KV engine, as part of the response it will tell you how long the operation took on the server. Then we also give you additional information. So by looking at all those different timing spans, you suddenly get immediate insight into the different timings in the system, whereas before you had this opaque timeout and you had to figure out on your own what was going on. As a Java example — this is the demo setup that you'll see in a bit, but I wanted to include some screenshots here as well — by accepting a tracer into our client, you can configure whatever you want. You can just use our default Couchbase threshold logging tracer, you can use Jaeger, you can use LightStep, you can use whatever you want. And the way you put in your tracer is just the way you do it with the Couchbase Java SDK, or any other SDK in general: there is this environment where you just give it a tracer instance and we will use it.
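As a rough sketch of that wiring in Java: the class and builder names below are recalled from the developer preview described here and may not match the shipping SDK exactly, but they show the shape of configuring the threshold logging tracer and handing a tracer to the client through the environment.

```java
import com.couchbase.client.core.tracing.ThresholdLogReporter;
import com.couchbase.client.core.tracing.ThresholdLogTracer;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

import java.util.concurrent.TimeUnit;

public class TracerWiring {
    public static CouchbaseEnvironment environmentWithThresholdTracer() {
        // Tune the built-in threshold logging tracer: report every 10 seconds,
        // keep the 10 slowest operations per service, and only consider KV
        // operations slower than 500 ms (the defaults described in the talk).
        ThresholdLogTracer tracer = ThresholdLogTracer.create(
            ThresholdLogReporter.builder()
                .logInterval(10, TimeUnit.SECONDS)
                .sampleSize(10)
                .kvThreshold(500, TimeUnit.MILLISECONDS)
                .build());

        // Any io.opentracing.Tracer (Jaeger, LightStep, ...) could be passed
        // to tracer(...) instead of the built-in one.
        return DefaultCouchbaseEnvironment.builder()
            .tracer(tracer)
            .build();
    }
}
```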
So just by changing the environment — giving it another tracer instance — you suddenly go from our built-in zero-friction logging engine to a distributed tracing engine without having to do anything else. For example, here is a screenshot I took from Jaeger where we are performing an operation called getFromReplica, where you are basically asking: give me the document from the active node and from all the replicas. And you can see how the parent getFromReplica operation has sub-spans of the get going to the active node — here I have one replica configured — so you can also see how the operation goes to the active node and the replica at the same time, and how long each took. And then you can see all the spans that previously, in the logging tracer, were just fields in the JSON; here you can see them as extra spans: how long the dispatch to the server took, how long the response decoding took, and so forth. That's all out of the box in the system. And then another quick example: here's a N1QL query where you can see we are also adding tags to our spans. You get the component, you get the service that you're running, you get the statement that got executed, the specific operation ID and so forth. We are basically storing all the tags along the way, which you can use for filtering and further troubleshooting. So with that, let me jump into a quick demo so you can see the whole thing working. Or maybe, are there any questions so far? Okay, then I'll just move on. This is awesome though. Okay, thank you. So, the demo. I have just a Couchbase node running locally — this is our UI, nothing fancy. We have two buckets. One is our travel-sample bucket, which has airports, airlines; it's just some sample data that you can use for querying, and we'll use it. And then I have a Jaeger instance running locally, and I'll show you how that works. So here we have our code. Let me just show it to you briefly and then we'll run it. As a first step we'll do the Couchbase tracing and then the Jaeger tracing. Other than that, we connect to localhost, give it our credentials, open our bucket, and then we perform a document fetch. Then we replace the document we just fetched, and then we run a N1QL query: SELECT DISTINCT type FROM `travel-sample`, which will give us all the distinct types that are in the bucket. We just sleep a little bit, so we give both the Couchbase and the Jaeger tracers some chance to send their data to the remote system. And the way we set things up for the Couchbase tracer: I've been modifying the reporter a little bit. I've lowered the threshold to one microsecond to make sure that every operation actually gets logged, and I increased the sample size from the top 10 to the top 100. You don't need to do that, but you can see how to configure it. I'm also setting pretty to true. You would only do that while debugging, obviously, because in a production system you want to keep your logs short, but we log it as JSON. So if you have something that is taking log files and putting them into another system — for example, if you don't have a distributed tracing engine right now — you can still make use of the JSON blob and feed it into another system to analyze later, or for support staff who can just grep for the stuff and look at things that are slow. Or we configure the Jaeger tracer: we point it to localhost, give it some params. A pretty simple setup. And let's run this first.
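For anyone reconstructing the demo, here is a rough sketch of the program just described: connect, get, replace, run the N1QL query, and sleep briefly so spans can flush. The Jaeger client package names changed between versions (the talk may have used the older com.uber.jaeger artifact), and the document ID, credentials, and environment builder method are assumptions for illustration.

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;
import com.couchbase.client.java.query.N1qlQuery;
import io.jaegertracing.Configuration;
import io.opentracing.Tracer;

public class TracingDemo {
    public static void main(String[] args) throws Exception {
        // Build a Jaeger tracer that reports spans to a local agent.
        Tracer tracer = new Configuration("couchbase-demo")
            .withSampler(Configuration.SamplerConfiguration.fromEnv()
                .withType("const").withParam(1))
            .withReporter(Configuration.ReporterConfiguration.fromEnv()
                .withLogSpans(true))
            .getTracer();

        // Swap the default threshold logging tracer for Jaeger via the environment.
        CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
            .tracer(tracer)
            .build();

        CouchbaseCluster cluster = CouchbaseCluster.create(env, "localhost");
        cluster.authenticate("Administrator", "password");
        Bucket bucket = cluster.openBucket("travel-sample");

        // The three operations from the demo: get, replace, and a N1QL query.
        JsonDocument airline = bucket.get("airline_10");
        bucket.replace(airline);
        bucket.query(N1qlQuery.simple("SELECT DISTINCT type FROM `travel-sample`"));

        Thread.sleep(2000); // give the tracer a chance to flush its spans
        cluster.disconnect();
    }
}
```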
Let me get out of here. I'll just run this with the Couchbase logging setup first. And what you can see here is — so we were doing three operations. Let me just... All right, what you can see is that we've been performing those three operations: the get, the replace, and the N1QL query, and they all show up here in our log. So we have the get request, and this identifier — it's maybe not that important to you right now, but what this thing does is, once we connect to the server during the handshake process, we pass this ID to the server. The server also has a threshold log of sorts and it will log this ID as well, so with this ID plus the operation ID you can uniquely identify any operation in the system on the server side. So even if you do not feed this into some distributed tracing engine, you still get the chance to have better correlation of what's slow in the system. And we see the timings: 20 millis total, the decode took 9.7 millis, so you get all of these out of the box. Fetching the document on the server, excluding network time, took 49 microseconds. Then we have the replace operation and we have the N1QL operation. And if we do the same thing with the Jaeger tracer... there you go, run it again. So in our Couchbase UI, in the bucket statistics, you can see the operations going through. And then if we go over here and find traces, you see those spans in Jaeger, without doing anything. Click on the N1QL span: you can see the dispatch-to-server time, and on the N1QL span you have all the tags — the query that just got executed, the operation ID. Everything you've seen in the logging tracer also gets stored here in Jaeger. And we can look at the others, like the replace. The other thing is that depending on what kind of operation you run, if it's a mutation, we will add a span for the encoding part; if it's a fetch operation, we will add a span for the decoding part, because we've seen in the past that if you have huge documents, JSON encoding and decoding can take a long time, so you would immediately see that here as well. All those things are in place. So I think I'm pretty good on my 25 to 30 minutes. Oh, one more thing — I forgot the call to action before we go into questions. Our OpenTracing support is currently in developer preview. We are planning a beta at the end of next week, or two weeks from now, and then once Couchbase Server 5.5 ships, in a couple of weeks or months, something like that, this will become GA. So our call to action right now is that we're actively looking for feedback in all kinds of languages. It doesn't matter if you are doing Java, .NET, whatever — we'd really like your feedback on the implementation and on the API, and on where we can do better with our response time observability. We have a concept called SDK RFCs: every feature we develop for the SDKs is basically put into an RFC where we discuss it, and these are open access. So I put the link here for the draft; take a look, and put in questions and remarks if you have them. The other thing I wanted to point out is that we have a blog at blog.couchbase.com. We're currently working on a series of blog posts on that topic, so watch that space, there is more to come there. If you're watching this in the future, you can go there right now, since they will have been published. With that, thank you very much for spending the 25 or 30 minutes with me.
And thanks for the opportunity to show what we've been doing for the last couple of months. And with that, I'll open it up for questions. Please, Mike, Matt, Graham, since you're on the call, if there are any questions that affect you, please jump in. Thank you — an awesome talk. Thanks so much. I do have one quick question. So, comment first: I guess I just said this is awesome, but this is awesome. It's really exciting to see this, and it reminds me a lot of Bigtable at Google, which had pretty thick clients that did a lot of important logic involved in the same sorts of optimizations you're doing. And I seem to recall that tracing in those clients was essential for the same reasons that you've outlined. I was curious, though, from kind of a business standpoint for Couchbase, or for the people using Couchbase: how important are these sorts of timeouts in terms of the support team and so on? Is this pain point number one, or where is this on the list? So, from working with the support team, I can tell you that in general, timeouts are maybe number one or two on the list. It's a very big pain point. And one of the reasons I didn't mention is that, especially on our KV operations, we have a default timeout of two and a half seconds, and some users set it even lower. If you compare that to traditional databases where you have 70 or 75 seconds, or even no timeout, you run into those timeouts very quickly. Where in other systems the thread will just block and you have to go look at a profiler and see that your threads are blocked, people run into timeouts with Couchbase way more often just because our timeouts are lower by default, which in my opinion is the right decision, because it's an escape hatch where, if something gets slow, you can retry, you can do whatever you want, and we give you back the control. But the average user is not used to handling timeouts, especially combined with asynchronous operations — the Java SDK is asynchronous as well, so there's handling asynchronous retries and so forth. You need all of that for running a scaled-out distributed system, but it's just not something that the average developer is immediately used to, if that makes sense. Yeah, and just to mention a couple of quick things: one of the challenges I guess we have at Couchbase is that we are Apache 2 open source, and there's an enterprise subscription, all that stuff. So we certainly have those commercial customers, but we also have lots of two- and three-node deployments. In those cases, you know, sometimes this becomes an issue for them, but they're not always running at that kind of full tilt. Others, like LinkedIn — their public reference is out there, and they have something like 1200 nodes of Couchbase, and they obviously want their site fast, and so getting that observability is really important to them. And just to clarify, part of the value prop here is not just that you've given instrumentation to your customers; you also wrote this instrumentation with some playbooks in mind, right? So you actually have some playbooks you're going to be giving your customers, or otherwise using when they come to support, that are integrated with the hooks and trace points you put into the client. Yeah, absolutely.
One — I don't think I can reference who they are — but one user is actually a member of the CNCF and OpenTracing, and they have some specific needs and will probably implement their own tracer. And then of course there are going to be plenty of commercial products and projects that you can plug in. Yeah. My one comment on that is, you know, there's been a discussion about automated tracing versus manual tracing, which I think is maybe the wrong way to slice it, and going forward I would like to talk about two kinds of automated tracing. There's dynamic tracing, which is the traditional agent-based thing, and then there's pre-provided tracing from the service provider. And they're kind of mutually exclusive, right? Because the point of this instrumentation is that the service provider is giving it to you because they have playbooks about what you're supposed to do when these things pop off. And that's, like, you know. As you said, Brennan, the other thing is that, as a service provider, we just know from history where the pain points are, so we can provide very narrow and focused instrumentation on the usual pain points, whereas if you have some generic agent-based tracing, you don't have that insight, because you just can't learn every library out there 100%, how the usage patterns are and so forth. Yeah. I would love for us to start explaining this to people: these are both useful things, but they're useful for solving kind of mutually exclusive problems. Yeah. And what we have here is pretty modest — it doesn't do a whole lot — but at the same time it gives you a fair amount of insight very easily, which was kind of our goal. And Mike Goldsmith in particular, he and I spent a lot of time thinking about, okay, how do we make sure that we can run this out of the box, not spam the logs, and still get useful information out of it. That's awesome. So is this currently done only on the clients, or do you also have back-end instrumentation? At the moment — oops, sorry guys. Oh, go ahead. At the moment, it's only on the client. However, and Michael showed this a little bit, we do actually grab certain statistics that are returned in the responses and put those in. That was one of the important things for us, actually, because frequently we'll see these things where people wonder what's going on — we saw the one issue Michael slightly referenced, where it was SSD wear-leveling, and it would affect nodes fairly randomly. But how do you find it? It was intermittent, and so we needed to be able to see that server-side measure of how long it took. The other thing that we did here is there's a correlation ID, so you can take the correlation ID and go look at other information on the server side from a logging perspective. That's maybe a little less distributed than we'd like, but it's there out of the box, and in the future we're hoping to take care of it. Yeah, I mean, I think it's important to point out that this is not the end state for Couchbase; this is the first step, and we have more plans to integrate it potentially with the server and have even more instrumentation. But this is the first step into this adventure.
Yeah, and we didn't really show it, but it's also important to us that the user can do their own instrumentation and have that pass along into our libraries. Michael has shown some of that in other demos. Yeah, I wanted to mention that too. I've seen that there are a few other databases that have OpenTracing instrumentation, and I think when they get plugged into application code that also has OpenTracing, being able to see that full stack has been quite powerful for those end users. Okay, so we've got 20 minutes left on the call and maybe a couple of other things to get through, but this has been a great talk. Are there any final questions for the Couchbase team before we move on? All right. Well, we can certainly continue this conversation in Gitter, and this video will get posted on the internet so we can start sharing it around, because I thought that was a great presentation personally. But moving on: someone put an update about happenings in the larger tracing community on the agenda. I think I put that down. Did you have something specific there? Maybe. I mean, I might consolidate it with the item a few lines down around the conversation that we had on Tuesday. Since the last OTSC call, there was a Gartner report that came out about microservices and APM, which ended up basically saying that the enterprise software market that Gartner studies is moving towards increased adoption of explicit white-box instrumentation, and it mentioned OpenTracing by name a couple of times, which is cool to see and I think reflects reality. That was met with a number of blog posts from other folks, with varying levels of positivity about OpenTracing and instrumentation in general. That was not causally related to a blog post that Erica Arnold from the OTSC put up a few weeks ago, but it's topically related to it, which just kind of described the different aspects of tracing — which I think I had put on the agenda. But in general, there seems to be a growing need within the larger tracing community to describe the different aspects of tracing, name them, specify which problem is which, and make sure that people who are trying to solve problem A understand that problem B and problem C and problem D are still important. So I think that's the basic narrative I would give about this whole thing. There are certainly other takes on it that would be a little more emotionally driven, but in general I think a lot of people need to understand the value of the different problems. To that end, there was actually a very productive meeting on Tuesday that was organized by Alois Reitbauer from Dynatrace, and he got a bunch of folks together, many of whom had been working on the W3C context propagation standards effort, and tried to start a conversation — including representatives from New Relic and AppDynamics, as well as OpenCensus, and Sergey from Microsoft — around the importance of white-box instrumentation, without naming OpenTracing or OpenCensus specifically. It was a useful conversation for sure.
I think everyone agreed that it was very important for there to be a lowercase-s standard API for describing transactions that was separate from anything else — which is OpenTracing's charter — and then AppDynamics and Dynatrace, and to a lesser extent New Relic, have certain things they'd like to express at different levels of application complexity: they'd like to separate span management from higher-level concerns like describing HTTP requests and database calls and things like that. So we had about a day-long workshop, and I felt it was quite productive and there was a lot of alignment. I just wanted to say that we're going to continue that conversation. If people are interested, you should ping me or Ted or whomever — all are certainly welcome to participate — but I just wanted to let people know that it's happening. I don't like that it feels like there are several different small conversations going on about tracing, with people from different continents, and I wish that everyone could just be in one conversation. So this is just my attempt to keep people updated on the other conversations happening, and it's open to anyone who wants to take part, but in general I thought it was very positive and there was a lot of alignment. Yeah, I would second that and say that conversation kind of dovetails with the W3C working group that is mostly focused on the trace context wire protocol. That wire protocol conversation is important. It hasn't been super directly related to OpenTracing in the sense that we're wire-protocol agnostic, but it is obviously important to the members of the OpenTracing community. Where these two things start to get tied together more closely is on the other side, with a sort of data export format. If you're going to use a standard wire protocol to tie multiple tracing systems together with, you know, unified correlation IDs across tracing systems, you're still going to have this problem where you have to get the data out of one of these tracing systems and into the other one so that you can analyze it. So a standardized data format is the other half of that puzzle. And once you're defining the data format, you're getting into something that relates much more deeply to the kind of instrumentation you're doing. The model of that format ideally should line up with the model that the API is built around, and then, more concretely, the tags and keys and values that the data format uses to describe things like HTTP calls or any other higher-level concept really need to match what the instrumentation library is doing. So I think that's an area where the W3C tracing working group and OpenTracing — those projects — need to gel up, because there is a lot of overlap there; there's a small sketch of that inject/extract boundary at the end of this update, for anyone curious how the API stays wire-format agnostic. Other people were at that meeting; if people want to keep talking about that, please go ahead. Otherwise we'll move on. When's the next one? I don't know, Chris, what is it? You can all come to Providence, Rhode Island. Sweet. All right, Chris is organizing the next one in Providence, so ask him what's going to happen. Yeah, as soon as another one gets on the books we will make sure to spread the word around the OpenTracing community. Other happenings, going down the list on the agenda: there was an inaugural Austin meetup. This is awesome. I believe this is the first official OpenTracing meetup that has occurred, at least in the U.S., to my knowledge.
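A minimal sketch of the inject/extract boundary mentioned above, using the OpenTracing Java API: instrumentation only says "put my span context into these HTTP headers" or "pull a parent out of them", and the configured tracer implementation decides the actual header names and wire format (W3C traceparent, x-ot-span-context, or anything else). The adapter class names below are from the 0.31-era opentracing-java API and may differ in other versions.

```java
import io.opentracing.Span;
import io.opentracing.SpanContext;
import io.opentracing.Tracer;
import io.opentracing.propagation.Format;
import io.opentracing.propagation.TextMapExtractAdapter;
import io.opentracing.propagation.TextMapInjectAdapter;

import java.util.HashMap;
import java.util.Map;

public class PropagationSketch {
    // Client side: the instrumentation is agnostic to the wire format;
    // the tracer fills in whatever headers its propagation format requires.
    static Map<String, String> outgoingHeaders(Tracer tracer, Span span) {
        Map<String, String> headers = new HashMap<>();
        tracer.inject(span.context(), Format.Builtin.HTTP_HEADERS,
                new TextMapInjectAdapter(headers));
        return headers;
    }

    // Server side: extract whatever format the tracer speaks, then continue
    // the trace as a child span of the incoming context (if any).
    static Span incomingSpan(Tracer tracer, Map<String, String> headers) {
        SpanContext parent = tracer.extract(Format.Builtin.HTTP_HEADERS,
                new TextMapExtractAdapter(headers));
        return tracer.buildSpan("handle-request").asChildOf(parent).start();
    }
}
```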
There was a meetup group in Austin that was formed with, I believe, mostly people at HomeAway and Under Armour kind of holding it down — sorry if I left someone out there. It consisted of a number of talks: I gave a talk, Eduardo from HomeAway gave a talk, and there was a panel on tracing. It wasn't all OpenTracing related, but there were people who worked on the Haystack system at Expedia as well as people from PayPal, who have had a centralized log aggregator for a while that is like a tracing system. And so there was a panel discussion on tracing and the pros and cons. My big takeaway was that this is really useful to people. Application developers don't like instrumenting the third-party software that they're given; they don't like doing that themselves. They would prefer that it come with something out of the box, like what Couchbase has created, that they can just plug into. And if it's a third-party plugin, that's great; if it's first party and comes with a playbook with deeper information about what those trace points are trying to measure, that's way better for them. So it was interesting to see users on one end asking for this, and then hearing this Couchbase presentation about service providers giving them the thing they're asking for. That's making me feel really happy. Basically, it was a great meetup, and we should perhaps start more in some of these other towns where people are clustered. So anyone who wants to start and run an OpenTracing meetup, I'd recommend doing it, and just ping me and Priyanka if you want some help with that. I'll take you up on that; we'd be glad to maybe do something in the South Bay area. Awesome. Yeah, a South Bay meetup would be great. Okey-doke. We've got 10 minutes left. We've gone through basically everything. Someone put "tracing in four parts summary from Ted" on the agenda, so I guess I will talk briefly about that. We've already, I think, covered this. Erika wrote a great post called, you know, Tracing, Tracing, Tracing, and I gave a talk on a similar topic down in Austin that I would like to turn into a blog post. The way I think people have been receiving it well is that there are four parts of tracing — I put a link to the particular slide that I think shows this. You've got a tracing API that you're using to instrument your code. There's a wire protocol that's standardized to talk between these systems. There's a data protocol that's sending things between analysis systems. And then there's an analysis system itself. Talking about it in those terms — the API, the wire protocol, the data protocol, and the analysis system — is a nice way to break it down, because different people are focused on those different components. So depending on your role in this cloud ecosystem, you might find one of these things way more useful than the others. For example, if you work on cloud infrastructure at Google or Amazon and you're providing cloud services to people, you really care about the wire protocol and the data protocol, because there's no way for them to install a Jaeger tracing client in S3 for you, if that's what you're using. So the API layer is not very useful to people who come from that background, and because internally at places like Google they tend to write all the software from scratch in-house, something like a vendor-agnostic API isn't too useful for them internally either. So people from that background tend to really focus on these protocol-level things.
Whereas people who are not yet trying to glue together multiple tracing systems are less likely to look at this wire protocol and data protocol stuff and say, that's my big pain point. Without a standard data protocol, how am I supposed to get my information out of Jaeger? And the answer is, I don't know, I just put it all in Jaeger and there it is, so it's fine. So there is a bit of a situation where people in the community sometimes talk past each other because they're each feeling a different part of the elephant, and I would like to get that cleared up, maybe by getting some common language around that stuff. On that front, I'm going to try to turn this into a blog post to follow up on what Erica wrote. I don't think we can do this enough. That's all I have to say on that subject — look for the blog post. So we've got a couple of minutes left and we've actually made it all the way through the agenda. Does anyone have any announcements? I think Chris added something to the agenda. I did add an item, because at last month's meeting there was an action item that I was going to change my PR to do trace context header detection in basic tracer for Go, starting with a first step suggested by Yuri: the inbound trace ID in the trace context header would be stored as a correlation but not used for the basic tracer's own trace, and then a second phase could be a separate PR that actually upgrades basic tracer to 128-bit trace IDs. I'm also going to those W3C meetings, and I wanted to have something to show for both this group and that group. So — I guess I should just put this on the Gitter — but what do you think is the right method of correlating a W3C trace context trace ID with an OpenTracing basic tracer trace ID? Is it just a tag with the name "trace context" and that's it, or do you want to do anything more sophisticated than that? What does it mean to have basic compliance for basic tracer? Basic tracer probably doesn't deserve to be where it is; it probably should be in contrib or something. I think it's more of a reference implementation than something anyone is really depending on in a meaningful way, that I'm aware of. And this was just intended to be a reference implementation of how you might parse a trace context header and use it for your own trace. Yeah. It's a good question. So your proposal is to parse it and then add it as a tag to the basic tracer span, but not to change the main propagation format? Yeah, I got the impression that's what Yuri was suggesting last month when he said, well, why don't you start out with just that — it's in the notes from last month — rather than going into a deeper integration where I was just going to change basic tracer to do 128-bit trace IDs, because it's doing 64-bit trace IDs today, and naively trust the sampling bit and everything. It's kind of a punt, but another thought — I'm not attached to this at all — is that I can imagine the options basic tracer uses at startup time could include some kind of designation as to which propagation format it's intended to use. I mean, that X-OT-Span-Context thing, which was added sort of on a whim I think, has actually caused people a lot of anger, which is hilarious to me. It's like, we had to choose something. It's a reference implementation.
Replacing that, or at least having the option to replace that with the W3C thing, would be completely fine as far as I personally am concerned — although obviously there are other people — but there could just be a construction-time or initialization-time option on the tracer to say, hey, use the W3C thing as your propagation format, and personally I think that would be fine. But again, I don't know if you already had something else in mind, and I don't need to dictate it. That's awesome. Part of what you were trying to explore, Chris, was just: what if people are in this situation, right? Like, I have a tracer that uses a wacky trace ID but it's also OpenTracing compatible. And also, there exists no actual implementation of the W3C standard that I'm aware of, so that was the other reason for doing a reference implementation — there's like a half-implemented version in OpenCensus that doesn't actually match the spec as it currently stands. So Chris, for LightStep we've made some changes to our internal format, which no one is claiming to be any kind of standard, and when we've done that we've basically had, you know, an if-else thing where we attempt to parse different formats. And that would be another option: make basic tracer prefer the W3C format and, if it's unavailable, fall back on the X-OT-Span-Context thing or whatever. I mean, that is another option. Yeah, okay, I think I'll try that then — certainly easier than doing it the weird way. I think the original idea was: let's say you go through, I don't know, from Dynatrace to X-Ray and back again, and you want to see the trace ID that X-Ray put on the thing, assuming everyone's using W3C trace context, and that implementation would be a good example of how that might work. That scenario is a lot more complicated, but that was the idea the reference implementation was intended for as well. I think it's a great idea. I mean, that scenario you just outlined is sort of the graduate-course-level version; maybe we're talking about the high-school-level version or whatever. Yeah, that's what I want to eventually show: here's how you could actually have two different tracers, in terms of two different trace IDs, that sort of interoperate. I still think you'll need some sort of correlation ID tag, because even if the tracer supports W3C fully, sometimes you want to give the user the option to either participate in the trace or only connect to the trace, right? Right, yeah, I agree, and that's sort of how the Dynatrace/Amazon sort of thing would work: you wouldn't just trust other tracers or even use their ID, you'd just add it as a correlation and then push it into the magic correlation context header. I certainly think there's value in just the simpler task of trying to parse that header — trying to actually implement that standard.
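A tiny sketch of the "store it as a correlation, don't adopt it" idea being discussed: the PR in question targets basictracer-go, but the shape is the same in any OpenTracing client, so this is illustrated in Java for consistency with the earlier examples. The tag name and method are made up for illustration; only the W3C traceparent layout (version-traceid-parentid-flags) is taken from the spec.

```java
import io.opentracing.Span;
import io.opentracing.Tracer;

public class TraceContextCorrelation {
    /**
     * Start a new local span and, if the request carried a W3C traceparent
     * header, record the foreign trace ID as a correlation tag instead of
     * adopting it as the local trace ID.
     */
    static Span startWithCorrelation(Tracer tracer, String traceparent) {
        Span span = tracer.buildSpan("handle-request").start();
        if (traceparent != null) {
            // traceparent layout: version "-" trace-id "-" parent-id "-" flags
            String[] parts = traceparent.split("-");
            if (parts.length == 4 && parts[1].length() == 32) {
                // "w3c.trace-id" is a hypothetical tag name, for illustration only.
                span.setTag("w3c.trace-id", parts[1]);
            }
        }
        return span;
    }
}
```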
This is getting into W3C business, but my one concern on that front is that there's been lots and lots of talk about optimization, and I feel like some of that talk has been focused too much on the kind of optimization you would get out of building your own custom HTTP client that was super optimized for grabbing these things, because again, I think infrastructure developers will do that, because it's worth it to them. But I have wondered what the actual performance implications are for everybody else who is going to use this: first you parse the headers, and then you've got to go into this trace context header and parse that thing, so there are multiple rounds of parsing versus separate headers. Some of those optimization debates have felt a little easy without anyone plugging this into Spring and other real systems out there and seeing what actually happens. Yeah, no, I just want there to be code that mirrors the specs that they're writing over there so that people can look at it and talk about it. So just write it. Awesome. Okay, cool, see you then. Great. Okay, I think we're at the end of our hour. This was a great call. Thanks again to the Couchbase people for giving that awesome presentation, and I would encourage OpenTracing people to follow up with some of this W3C work and start getting more involved in that group, because they're starting to gel up a bit. Thanks to you guys for putting this stuff together; we do, you know, want to work with you, and there's no reason we can't work together on both of these paths. Thanks. Excited for the next OpenTracing meetup. Yeah, see you later.