Good afternoon, or evening, as the case may be. I'm Ben Sigelman. I am one of the co-creators of the OpenTracing standard. And I'm also a co-founder and the CEO at LightStep, where we deliver insights about complex production software systems. So I'm here to talk about the service mesh, and in particular, how we can make better sense of a service mesh architecture. I will also be talking about OpenTracing a bit, observability, and how to make things more operationally sane. And I'm really excited to talk about it. So I'll just get started. This is a picture of a lot of birds, lots of them. There are hundreds or thousands of them. This is a murmuration of starlings. And I chose this picture because I think it's evocative of the vision that we have for microservices. You have independent actors that are working in concert. Each one is sentient and has its own agenda, but taken together, they're greater than the sum of their parts. And honestly, it's beautiful. Like, the idea of microservices is beautiful. I was at this conference last year, and there were a lot of people there, but a lot fewer than this year. I think that this vision is taking hold in our industry for a good reason. And it's a vision that we should all be pursuing together. And I'm happy that we're doing that. Oh, I was like, that wasn't a laugh line. That bird, man, I'm going to send that bird a thank you note. Speaking of birds, this is perfect. Perfect. Really couldn't have planned that any better. I'm so pleased about that. So thank you, bird. This is how it feels. You're up on stage with your microservice deployment. You're worried that it's going to have a bowel movement on you while you're on stage. So operating microservices is very challenging. Very challenging. Stop it. And it's for a lot of reasons. Some of them are fundamental, things that we're never going to get past. Software engineering is hard.
Everyone here has a lot of job security if they're on that side of the house as a result. But some of them are not such good reasons. I think that we often find ourselves wanting to build application software, and yet what we're actually doing is writing the same code over and over again to do something that's actually a hard computer science problem and should be factored into some kind of common interface. And yeah, we're doing it manually. So service mesh can help with discovery of services, interconnection of those services, circuit breaking, load balancing, that sort of thing. These are hard problems. You don't want to solve them yourself. Service mesh can help with security and authentication. You want to get all of that without a lot of effort or custom code. And that's something that is near and dear to all of our hearts as well. We need to build secure things. Security should be built in from the foundation. And that's what service mesh can offer us. What I'm really here to talk about is transactions, though. So this is evocative of what it feels like to look at a transaction in a typical microservices architecture. Each piece, each service, is written in a slightly different way, maybe a slightly different language, or a very different language, different idioms, different frameworks. And when they don't work, they don't work in spectacular fashion. It's often incredibly difficult to figure out what happened. And yet, we have to do that. The transactions that don't work are the most important. They're often stranger than fiction when you actually do the postmortem and figure it out. But you want to understand that explanation rapidly. And that can be a difficult thing. So what about this? This is my favorite tweet of all time: "We replaced our monolith with microservices so that every outage could be more like a murder mystery." This is something that we need to move away from as an industry. It's a really bad feeling.
I've had it. I'm sure everyone here who's ever been on call has had this feeling of, oh, gosh, I don't know what's going on, and I need to figure it out quickly. Microservices should not be getting in the way. And yet, sometimes it feels like that's exactly what they're doing. So the canonical answer to that problem of understanding transactions is distributed tracing. Distributed tracing is about telling stories about specific transactions across distributed systems, across service boundaries. And this could be microservices. It could be serverless. It could just be a bunch of monoliths that are all talking to each other. It doesn't matter. The idea is that you need to stitch things together and develop a unified picture and a unified explanation for the behavior of individual transactions. What you see here on the left is Jaeger, which is another CNCF project. On the right is my company's trace viewer. Both of these are OpenTracing compatible. They're actually OpenTracing native. Zipkin also works with OpenTracing. Many other awesome vendors that I have a lot of respect for, New Relic and Datadog, are also coupled to OpenTracing. And the reason why we're doing this is because we shouldn't have every developer manually integrating with distributed tracing. There are too many touch points. There's too much diversity of language, too much diversity of framework, to have everyone integrate with everything. OpenTracing is a lingua franca for describing the behavior of transactions and making that possible. So when the service mesh became a thing, really quite rapidly, and took off, it was a natural point of integration for something like OpenTracing. So let's talk about transaction tracing without a service mesh. You've got four services. They're talking with each other. They're all written in different languages or different frameworks.
And you have these touch points in and out of every service where, at a minimum, you need to think about tracing. And the data from those touch points goes into the tracing system. There are many points of integration here. Again, if you're dealing with many languages, that's a lot of work. And OpenTracing can help a certain amount here. But gosh, it would be nice if at least those touch points between services could be factored out. And that's exactly what the service mesh gives us. So now, tracing with a service mesh. It starts off the same. You have your services. But now you have these sidecars that run either on the way in or the way out, on ingress or egress, from your services. And the service mesh is a Layer 7 proxy, so it's completely reasonable to integrate with application-level things like distributed tracing systems. So now our connections go through the service mesh. The service mesh integrates with OpenTracing. And OpenTracing integrates with, oh, whoops, sorry. Sorry, slide problem. OpenTracing integrates with many different vendors and open source tracing systems, as well as other tools. You can build a bridge from OpenTracing to Prometheus, for instance. And great, now we've got visibility. So we're done. We're victorious. It's a great feeling to be done. So let's look at some of our traces. I took a screenshot of one earlier. If you do it this way, this is not what we ordered. We wanted a trace that has a lot of structure and tells a story about a transaction. We wanted something that looks like this, where you have a shared timing diagram that shows many different services and how they interact. But somehow we got this instead. So that's not what we wanted. We did integrate with every RPC, but this isn't what we wanted. So what happened? We're not so victorious after all. So observing every RPC is great. I would say it's necessary. And service mesh gives us an incredible lever to do that.
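To make the failure mode above concrete, here is a toy Python model of edge-only tracing. This is a sketch, not Envoy's actual implementation, and the `x-trace-id` header is a simplified stand-in for real propagation headers like B3 or `traceparent`: each sidecar records a span from the inbound trace context, but if the application does not copy that context onto its outbound calls, the hops end up in unrelated traces, which is exactly the flat, disconnected view on the slide.

```python
import uuid

def sidecar_handle(headers):
    """Toy model of a tracing sidecar: read the trace id from inbound
    headers, or start a brand-new trace if none is present, and record
    a span for this hop."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    return {"trace_id": trace_id, "service": headers["host"]}

# Service A's inbound request carries a trace id from the client...
span_in = sidecar_handle({"host": "service-a", "x-trace-id": "abc123"})

# ...but service A does NOT copy that header onto its outbound call,
# so the sidecar on the way to service B sees no context and starts
# an entirely new trace.
span_out = sidecar_handle({"host": "service-b"})

# The two spans land in different traces; the UI cannot stitch them.
assert span_in["trace_id"] != span_out["trace_id"]
```

The point of the sketch is only that the break happens inside the application, where the sidecars cannot see, which is why edge instrumentation alone produces the trace we didn't order.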
And I'm incredibly excited about it. Personally, I've seen what's happened at Lyft with Envoy. They were able to stand up tracing across their entire system of hundreds and hundreds of services with a configuration change. It was profound, honestly. It was incredible to watch. But you do need to tie things together. Just observing the edges alone is not sufficient. And there's a couple of ways to do this. The documentation will say something like, oh, you need to forward the header from ingress to egress. And that is true. Like, that works. It can be easy for a simple service. If your services are more complex, if they have queuing in the middle of them, or if there's some kind of fan-out, fan-in behavior, it gets a little bit harder. And in certain languages it's even harder than that. And this is the problem that OpenTracing was designed to solve. OpenTracing was designed to describe that sort of behavior, so that no matter what sort of tool or what sort of solution you're pursuing, you can see all of that. So we can use OpenTracing both at the service mesh layer and within the process to connect these dots in a way that's truly end-to-end. So we're using the service mesh as a single integration point at every RPC boundary, and we're using OpenTracing within the process to propagate all of this state. And this really does work and is actually a really wonderful thing in practice. The cool thing here, as an application developer, the somewhat cool thing, is that you haven't coupled yourself to a particular downstream tracing system or observability tool. The really cool thing is that you're not writing a ton of code yourself to integrate with tracing. And if you're running microservices and you want to debug things quickly, I think you need it. I'm biased, but I don't think it's something that's optional. I think it's a necessity.
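The "forward the header from ingress to egress" pattern described here can be sketched in a few lines. This is a minimal toy in Python, assuming a single `x-trace-id` header as a stand-in for real context (OpenTracing's actual API does this uniformly via `tracer.extract` and `tracer.inject` on an HTTP-headers carrier); the handler names are hypothetical. The key move is that the application pulls the context out on the way in and copies it onto every outbound call, even across fan-out:

```python
import uuid

def extract(headers):
    """Ingress: pull the trace context out of inbound request headers."""
    return headers.get("x-trace-id")

def inject(trace_id, headers):
    """Egress: copy the trace context onto outbound request headers."""
    headers["x-trace-id"] = trace_id
    return headers

def handle_order(inbound_headers):
    """Application handler: propagate context from ingress to egress so
    every hop, including fan-out to multiple downstreams, shares one
    trace id that the sidecars' spans can all be joined on."""
    trace_id = extract(inbound_headers) or uuid.uuid4().hex
    # Fan out to two downstream services; both calls carry the same context.
    fry_headers = inject(trace_id, {})
    topping_headers = inject(trace_id, {})
    return fry_headers, topping_headers

fry, topping = handle_order({"x-trace-id": "abc123"})
assert fry["x-trace-id"] == topping["x-trace-id"] == "abc123"
```

With queues or async work in the middle of the service, the same idea holds but the context has to ride along with each unit of work rather than living in a request-scoped variable, which is exactly the bookkeeping OpenTracing's span and scope APIs are meant to take off your hands.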
And when we integrate with OpenTracing, we get more detail about the processes as well, which will help for root cause analysis. So I think this is getting a little bit dry. I even saw some of you yawn. The birds have stopped. It's not as interesting anymore. So we need to apply this to something important, something with food coloring, something that we can really get excited about. So let's talk about donuts. I want to introduce Donut Zone. I like doing demos involving donuts. Donut Zone's motto is "move fast and bake things." We are offering DaaS, donuts as a service, scaling really fast, running into performance problems, built with a service mesh, using Envoy in this case, and also built with OpenTracing. And so yeah, this is Donut Zone. I'm not going to spend a lot of time describing the architecture here. I don't think it's necessary for the purposes of this presentation. But suffice it to say, there are several services. There's a controller. There's a service that fries donuts. There's a service that applies toppings, whether they be sprinkles, or chocolate, or cinnamon. And there are external dependencies on Brain Pal, the popular payment system, and a number of clients, web clients, that order donuts and restock things. So at this point I think it's time for a live demonstration. I will go over here for that. All right, so first I'm actually going to show Donut Salon. This is the same idea, slightly different domain. Donut Salon is hooked up to Jaeger. I wanted to show that this works with multiple different tracing systems, the same code in both cases. It's a pretty simple idea here. You can order donuts. We're featuring three different types of donuts for this demonstration. You can also restock donuts. If you go to the bottom here, you can restock donuts. So I can add more chocolate donuts and see the numbers go up, or I can order some more. OK, great.
So you can order your donuts, and then there's some asynchronous JavaScript and XML, some Ajax here, that allows you to see which donuts you've received. And oh, someone's already come in here and ordered donuts. You rascals. You can't trust anybody these days. I didn't say to do that. Anyway, so I can go in here, and in Jaeger I can say I want to see things that took at least half a second, and indeed, here's a trace showing one of those particular requests. This one resulted in an error, because we'd run out of inventory due to whoever in the audience sneakily went to this domain on their own. But you can see it propagating across this simple distributed system as we top the donuts and fry them and so on and so forth. So it's cool. I do actually want to encourage some audience participation. But I want you to go to donut.zone instead of Donut Salon. .zone domains have not been fully tapped the way .com has. So I was able to get a donut domain, even if it was a .zone instead of a .com. So I encourage you to go there and order things on your phone. I'll do the same. It looks like there are too many people ordering. So I would also recommend going to the restocking page and restocking some donuts. It's pretty fun. People are probably already kind of going crazy. I can see the numbers here are going nuts. I will admit this was load tested in a somewhat insufficient way, given the number of people in the room here. So if it falls over, I apologize. But yeah, maybe we should stop. It's been a good run, everybody. Wow. That's a lot of donuts. It's still up. It's still up. This is great news. Let's stop. Let's really stop. So let's look at some of the latest examples of orders of sprinkled donuts. I can bring these up. Well, these are cases where we've run out of donuts. But that's OK. You see a lot of errors here because of donut exhaustion. I think I actually have an example of a donut order that actually did come through.
This is a trace from just before the talk. These traces are all legitimate, but there are no more donuts. I think that's what's happening there. This is a trace showing this timing diagram. Actually, I wanted to show what this looks like if we just have the service mesh. It's definitely useful. You can see the propagation from the browser into the front-end proxy and through the frying process as well as the donut topping process. But that's really all we get to see. We can look at URLs and things like that, but it's not possible to really get to the bottom of what's going on. All we know is that it's slow. If I look, for example, at this trace that involves the interior of the process, not only have we stitched things together, but we've also added more detail. In this case, the mutex library actually gives information about how many waiters were in front of us for a mutex lock. In my mind, this sort of thing, well, in fact, I know this from our own internal debugging at LightStep, can be really invaluable for understanding that the slowness is coming from contention around a particular shared resource. This is the kind of thing where you don't want to just see the service. You want to dig inside the service and really make sense of it. And we can even do things like go and look, for example, at traces that had 20 waiters in front of us instead of just four, and see where the latency comes from. And indeed, you can see that when you have to wait for 20 waiters in front of you on the mutex, that ends up dominating the end-to-end latency for these requests. So in my mind, I don't want to go into great detail about debugging this fake app, but I do hope it comes through that these sorts of analyses are fairly trivial if you have all this data. And they're quite the opposite if you don't. Incredibly the opposite, especially if you have a lot of concurrency.
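The instrumented-mutex idea described here, recording how many waiters are ahead of you and attaching that to the active span, can be sketched in a few lines. This is a toy Python version under stated assumptions: the `InstrumentedLock` class and the `mutex/waiters` tag name are illustrative, not the actual library used in the demo, and the span is modeled as a plain dict of tags rather than a real OpenTracing span:

```python
import threading

class InstrumentedLock:
    """Toy instrumented mutex: before blocking, record how many callers
    are already waiting, and attach that count to the caller's span tags
    so traces can later be filtered by contention level."""

    def __init__(self):
        self._lock = threading.Lock()    # the lock callers contend on
        self._meta = threading.Lock()    # guards the waiter counter
        self._waiters = 0

    def acquire(self, span_tags):
        with self._meta:
            waiters = self._waiters      # how many are ahead of us
            self._waiters += 1
        # Tag the span BEFORE blocking, so slow traces show the queue depth.
        span_tags["mutex/waiters"] = waiters
        self._lock.acquire()
        with self._meta:
            self._waiters -= 1

    def release(self):
        self._lock.release()

lock = InstrumentedLock()
tags = {}
lock.acquire(tags)   # uncontended in this single-threaded run
lock.release()
assert tags["mutex/waiters"] == 0
```

With tags like this in place, the query from the talk, "show me traces where 20 waiters were ahead of us instead of four," is just a filter on span tags in whatever tracing backend you use.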
And although this feels like a big room with a couple of thousand people in it, which it is, it's a small room if you compare it to the user base for any kind of scaled-out enterprise product or consumer application. This is the kind of concurrency we're all dealing with all the time. And it's crucial to have technology like this to make sense of it. So I only have a few minutes left. I want to get back to my speech here. So OpenTracing is integrated or nearly integrated with all of these: Envoy, Istio, Conduit, which was just announced by Buoyant a few days ago and which I'm excited about, as well as NGINX and the nginMesh project. These support OpenTracing, which means that any OpenTracing-compatible project or vendor can just plug into these things directly. And one of the things I get really excited about, about CNCF, actually, and about this sort of integration, is that Envoy, Istio, Linkerd, Conduit, NGINX, these are exciting technologies. OpenTracing is also an exciting technology. But when you combine them, they're actually greater than the sum of their parts. This family of projects is not just independently useful. And it goes back to my flock of birds at the very beginning. I think the sum is greater than the parts in this case. And it's something that's really exciting to me as a developer and as a technologist, to see how these things make each other better. And when you combine them, it's even more powerful, more powerful still. And I just think that's exciting. Speaking of feel-good messages and preschool and stuff like that, we're going to have some small group discussions that are more friendly to Q&A than something like this, relating to both OpenTracing and service meshes, tomorrow from 3:50 until 5:10. There is a salon for OpenTracing. I would love people to attend. There will be a bunch of us there. Matt Klein from Envoy will be there as well.
And we can talk about the sort of material in this talk, help people get basic questions answered, as well as just have a kind of support group about deploying this kind of stuff in production and the types of problems that we're all facing. Jaeger, an awesome CNCF project that is a distributed tracing system, is having their salon tomorrow, as is Linkerd. The Envoy salon was earlier today, unfortunately, so you can't go to that one if you didn't already. But I hope you went. I heard it was fantastic. And oh, that's it. So yeah, I wanted to thank everyone for listening. It's been a great pleasure to be here. And I'm really excited about the rest of this incredible conference. And with that, I thank you. And I will plug the ethernet jack in for the next presenter.