All right, thank you everyone for coming. This session is on distributed tracing: implementing latency analysis for microservices on Cloud Foundry. And I said that in one sentence, so it's a long title. But before we start, I always like to take a selfie with everyone so I remember. Are you guys OK with that? That's why I asked everyone to come forward. OK, let's see. Got it. Unfortunately, I didn't get this side, so let me take another one with this side. OK, got it. All right, we have 30 minutes for this session, and it's right before lunch, so I'll keep it short and keep it interesting. And I want to be collaborative, so I'll stop in the middle for questions and everything. So who am I? My name is Reshmi Krishna, and I'm a senior platform architect with Pivotal, based out of New York. I'm also active in the women in tech community, so I go to a lot of conferences and meetups, and I host the Cloud Native New York meetup. And when I'm not doing all of that, I love to visit places. I have 72 countries on my bucket list, and of those I've visited 34. I got back at 4 AM today because of flight cancellations and everything, so it's always fun. For today's session, we'll be talking about distributed tracing. What is a tracer, and what are tracing systems? These terms are always thrown around, there are a lot of technologies out there, and it can be confusing, so I want to level set first. Then we'll be talking about Zipkin, which is an open source distributed tracing system. It was built by Twitter, but it's open source now. And then we will see how you can incorporate these technologies into your existing applications. So before I start, how many of you use Java or Spring? OK, most of you. That's great. Node.js? A couple. OK. All right, so I'll be focusing my talk on Java, Spring, and Node.js. And how many of you use PCF? OK, that's great, most of the room. So we'll also be talking about PCF Metrics.
If you're using PCF Metrics, you don't really need Zipkin. But Zipkin is very good for testing locally, or if you want to use something that's open source and supported by the community. And at the end, I'd like to do a demo of all these technologies. So, I always start with this. When we move from a monolith, which is a big ball of mud, to microservices, it looks something like this. We always talk about the good things about microservices and why we should move to them, but there are a lot of challenges associated with microservices, like configuration management and service discovery, and sometimes they become magnified. If you look at this architecture, this photo was taken three years ago, and this is how Netflix's architecture looked then. So imagine how complicated it must have been. There are issues you run into when you're moving to microservices, and since it's a journey, there are a lot of roadblocks you need to figure out along the way. One that everybody comes across is latency analysis. What happens is, when you break down your monolith into separate microservices and deploy them in different containers, sometimes network latency is added, or there are issues where you need to figure out which application is adding latency. It's not one application anymore; your request is now spread across various applications, so you need to look across all of them. And you also want to know which one was responsible, so you can talk to the team that handles that microservice. So, distributed tracing. How many of you have heard about distributed tracing? Oh, wow. OK, so I don't need to give a long introduction. It's the process of collecting end-to-end transaction graphs in near real time, right?
I'll be talking a lot about traces and spans; are you guys familiar with those? OK, so most of you know traces and spans. For those of you who don't, a trace represents the entire journey of a request. For example, if my request A goes through three different microservices, the trace constitutes the whole journey, whereas a span covers just one microservice; it's a single operation call. So if my request A hits microservice x, that's a span; then it hits y, that's another span; and the aggregation of those is a trace. You will also hear the term tracers. Tracers are libraries that add tracing logic into your application. They have certain properties, like being lightweight, so they don't add additional overhead to your application. An example of a tracer is Spring Cloud Sleuth. How many of you have heard about Spring Cloud Sleuth? OK, great, most of you. Spring Cloud Sleuth is specifically for Java and Spring; other tracers exist for Node.js, Python, and other platforms. And distributed tracing systems are just used for collecting, indexing, and viewing this information. So, Zipkin. How many of you have heard about Zipkin? OK, that's great, so there's no need for an introduction, right? As you guys know, Zipkin was open sourced by Twitter. It was built during a Hack Week at Twitter in 2012, and now it's part of the community; OpenZipkin is the primary fork. What it does is collection and indexing, it makes searching the data easier, and it aggregates all of this information. So my job as an application developer is to send all the information from my application to Zipkin, via RabbitMQ or any messaging stream, or HTTP, or logs. Then Zipkin does its logic and makes it viewable to the human eye, to spot problems and see what's happening in near real time. So this is my PCF.
And since most of you are familiar with PCF, I have deployed Zipkin as an application. Earlier, when Zipkin started out, there were three different components you needed to deploy, but now you can deploy Zipkin as one single application, and you can just attach MySQL to it. MySQL, or Cassandra, or some other data store needs to be attached to Zipkin so it can collect and store data, so you can look at it later, and it can index from there. And RabbitMQ is needed if you're using stream; it's not needed if you're using something else, like HTTP or logs. I'm using stream in my demo, which is why I've added RabbitMQ to Zipkin. The architecture looks something like this: I deployed Zipkin as an application, bound it to a span store, which is MySQL in my case, and attached it to a transport layer, RabbitMQ in my case. My applications are already existing Java Spring applications, and all I had to do was add Spring Cloud Sleuth to the classpath and attach them to the same RabbitMQ binder. So in terms of what changes you specifically need: if you do not have a binder in your application, you need to add one to your build path, and you need to add Spring Cloud Sleuth to your build file. Spring Cloud Sleuth also has a concept of samplers. The reason samplers exist is that Zipkin is attached to a data store, and sometimes you don't want to send 100% of your traces to Zipkin, because if you're taking care of it operationally, it might be harder for you to manage. By default, the Spring Cloud Sleuth sampling percentage is 10%. Now, it might be difficult to analyze latency with just 10% of traces, so you can control that. You'll see in my demo it's 1.0, which is 100%, but in your application it will surely be different.
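As a minimal sketch of those build and configuration changes, assuming the Sleuth 1.x plus Spring Cloud Stream setup used in this demo (artifact IDs and property names vary by Spring Cloud release train, so check them against your version):

```xml
<!-- pom.xml: Sleuth itself, the stream integration, and the RabbitMQ binder
     (artifact IDs as of the Spring Cloud Sleuth 1.x line) -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-sleuth</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-sleuth-stream</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-stream-rabbit</artifactId>
</dependency>
```

```properties
# application.properties: sample everything, as in the demo.
# The default is 0.1, i.e. 10% of traces; newer Sleuth versions
# rename this property to spring.sleuth.sampler.probability.
spring.sleuth.sampler.percentage=1.0
```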
You can also set different properties based on what messages you want to send. Do you want to send the information if your request was 200 OK? Because it adds these trace IDs and span IDs into the request and response headers. And then, as I mentioned before, you deploy Zipkin as an app and just connect it. So before I go into my next slides, are there any questions so far? No questions. Wow, all clear. Am I going too fast, too slow, good? All right. So in terms of PCF Metrics, are you guys familiar with PCF Metrics? Some of you are using it. One, two of you. OK. The good news is, if you are using PCF, PCF Metrics can be downloaded from network.pivotal.io. It's a product that, when you install it through Ops Manager, comes with several different components. It comes with a data store, and it also comes with queues and caches, because all the information from the Firehose needs to be sent to that data store. And when you query from your web browser, and I will show you in the demo how you do that, the Metrics API, a component that also comes with PCF Metrics, queries the data store and the cache and sends you results. Now, the good thing about having PCF Metrics is, first of all, it's integrated with UAA. So for example, if I do not have access to certain applications in production, I will not be able to look at their logs. The second thing is, with Zipkin, you'll see I can only look at traces, and I can only see latency from there. If I need to look at logs, I then need to go back to Splunk or whatever log management tool I'm using, take the trace ID from Zipkin, and search there. That's an additional step, and PCF Metrics has eliminated it. You'll see in the demo how you can view logs along with your traces; the feature is called correlated logs. OK, and if you're using PCF Metrics, it's very simple.
You don't need to control the sampling percentage and everything, because we are collecting all the data from the Firehose, right? Everything is coming in from the Firehose. All you need to do is use the correct Spring Boot and Sleuth versions or later, and add the Sleuth starter dependency to your application. OK, so I will go right into my demo after this. In my demo, you will see four microservices. I was talking about traces and spans before. The way it works is, of course, it's not one-to-one; these are unique 64-bit IDs. The trace ID is propagated throughout the trace, and as a result, I know which spans belong to that trace. And the span IDs: if you look here, span ID one, span ID two. Every time your request hits a microservice, a span is created. The parent ID just tells me where that span was generated from. So for example, the back office microservice interacts with the customer and account microservices; as a result, the parent ID is two, because I want to know that the back office microservice is the parent of those spans. So you can think of this as a tree. It's basically that. All right, are there any other questions? Otherwise, I'm going to go right into my demo. No? OK, great. Now, pray to the demo gods and see what happens. Oops, not this. OK. I'm going to zoom in a little bit. Can you guys see from the back? The back row? No? Can you guys come forward? I think this is the maximum I can zoom. Still can't see? Yeah? OK. All right, so let's go. I'm going to be doing two demos. The first one is with Zipkin, and this is deployed in PCF. The second one is with PCF Metrics. So in terms of Zipkin, once I've deployed it, I'm just going to close everything and bring it up. And it looks something like this. This is the UI for Zipkin.
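To make those ID relationships concrete, here is a small plain-Java model (my own sketch, not the Sleuth API) of how trace, span, and parent IDs form that tree:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative model of trace/span/parent IDs (a sketch, not the Sleuth API).
public class TraceDemo {

    public static class Span {
        public final String traceId;  // shared by every span in the same trace
        public final String spanId;   // unique per operation
        public final String parentId; // spanId of the calling span; null at the root
        public final String name;

        Span(String traceId, String spanId, String parentId, String name) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentId = parentId;
            this.name = name;
        }
    }

    // 64-bit random id, hex-encoded: the same shape Zipkin uses.
    static String newId() {
        return Long.toHexString(ThreadLocalRandom.current().nextLong());
    }

    // Root span: starts a new trace (span id == trace id is a common B3 convention).
    public static Span newTrace(String name) {
        String id = newId();
        return new Span(id, id, null, name);
    }

    // Child span: keeps the trace id, gets a fresh span id, records its parent.
    public static Span childSpan(Span parent, String name) {
        return new Span(parent.traceId, newId(), parent.spanId, name);
    }

    public static void main(String[] args) {
        Span ui = newTrace("ui");
        Span backOffice = childSpan(ui, "back-office");
        Span customer = childSpan(backOffice, "customer");
        Span account = childSpan(backOffice, "account");
        // All four spans share ui.traceId; customer and account both point at backOffice.
        System.out.println("trace " + ui.traceId + ": customer.parent=" + customer.parentId
                + " account.parent=" + account.parentId);
    }
}
```

Walking the parent IDs upward from any span reconstructs exactly the tree described above, with the UI span at the root.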
And if you see, it has registered my applications already. Because I have RabbitMQ attached to Zipkin, it sends the Spring application name to Zipkin, and that's what it registers as the application name. It also gives you the ability to see different endpoints. So now let's look at the demo and see how it works. This is a very simple application. If I go to an endpoint, which is demo start, what you're seeing here is my UI calling my backend microservice, which in turn called my customer and account microservices, as I stated before, right? Now what I need to do is look at why there was some latency associated with that. I go to Zipkin, refresh it, and look at this UI application, because that's where my request originated from. I'm going to look at this endpoint, because it looks like that is my endpoint, the demo start endpoint. And if you look here, you can see that you can control the start time and end time. You can also filter by latency. For example, if in your production application you're seeing latency greater than, say, 30 microseconds, which is not what you expect, you can target just those traces and spans, just that duration. So you see all of them in a single row. Now, moment of truth. OK, that's great. Now I have my trace, with the spans that came from that application. If you see, it's from one minute ago, so it looks like that's the one I generated. OK, it looks something like this. What you see here, when you zoom in, is the entire trace along with the spans. You see that my UI was hit on this demo start endpoint, and that in turn created these different spans. Of course, this is a demo application; your production application is going to look much different and crazier, right? And then you also see how many services were associated with that. Now, that's very important, because in my demo, of course, there are only four microservices.
In your production application, you want to see, when a request came in, which microservices were associated with that request, right? You can also zoom in, and of course, you see the latency; it was around four seconds here. An additional thing you can see is that I have generated my own custom spans. You can do that, and as I was mentioning before, it's very simple: you just write tracer.createSpan, and that's all you need to do in your application, one single line. It's going to generate custom spans, and that's very useful if you're testing in dev or QA, because you can see what's going on. Now if I zoom in, you see these two circles here; the two circles are my custom events. So for example, if I want to get notified when my application restarted, or something like that, I can put in events here, and this is what you see. These annotations, again, I've added as custom annotations, so they're coming from there, and they're very useful for adding additional information. For example, here I've added a value, success. It could be an error, or a 400, or something else; it depends on how you're using it in your application. So before I go forward, is there any question here? No? Is this useful? OK, there's a question. You cannot add a custom span on a method call; you can add it before the method call. Yeah. OK, and one last quick thing: if you see here, my back office microservice is an aggregate of the account microservice and the customer microservice. So it's a bottom-up tree, right? And that's what Zipkin is helping you visualize. Now, an additional thing that I find very useful is dependencies. OK? All right, great. So of course I have four microservices, right?
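As a rough sketch of what that one line buys you, here is a toy tracer. This is my own stand-in, not the real Sleuth Tracer bean (which exposes a similar createSpan/close pair); it just shows the shape of opening a custom span, tagging it with a value like "success", and closing it:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy tracer: records when a named span starts and stops, and lets you tag it.
// Purely illustrative; in the demo this is handled by the injected Sleuth tracer.
public class ToyTracer {

    public static class Span {
        public final String name;
        public final long startNanos = System.nanoTime();
        public long endNanos = -1;
        public final Map<String, String> tags = new LinkedHashMap<>();
        Span(String name) { this.name = name; }
        public long durationNanos() { return endNanos - startNanos; }
    }

    public Span createSpan(String name) { return new Span(name); }

    public void addTag(Span span, String key, String value) { span.tags.put(key, value); }

    public void close(Span span) { span.endNanos = System.nanoTime(); }

    public static void main(String[] args) throws InterruptedException {
        ToyTracer tracer = new ToyTracer();
        Span span = tracer.createSpan("calculateTax"); // custom span around a unit of work
        Thread.sleep(5);                               // the work being timed
        tracer.addTag(span, "result", "success");      // custom annotation, e.g. success/error
        tracer.close(span);
        System.out.println(span.name + " took " + span.durationNanos() + "ns, tags=" + span.tags);
    }
}
```

The span name and tag here ("calculateTax", "result") are hypothetical; the point is just that a custom span is a named, timed, taggable unit inside the larger trace.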
But think about it: if you have many microservices, this is really useful, because now you're seeing which microservice is calling which microservice, and these lines become thicker. For example, my UI called the back office microservice, and if you see here under "used by", it only called it three times. If it called it 10 times or 100 times, versus calling these services only three times, this line would appear much thicker than the other line. So it's very quick to visualize what's going on. And the same thing you can see here: which service is used by which service, and the number of calls. I feel this is very useful for quickly visualizing things, especially in your production application. And for all of this, all you need to do is deploy Zipkin as an application and add Spring Cloud Sleuth to your application, along with the appropriate sampler percentage. Now, let's look at PCF Metrics. OK, I'm going to look at the back office microservice. If you install PCF Metrics through your Ops Manager, you will see something like this appearing in your Apps Manager dashboard. All you have to do is click here. OK, of course I need to log in through UAA. Now what it's doing is fetching all the applications associated with my account. I don't want to look at all the applications; I just want to look at one for the purposes of the demo. So I'm going to look at my back office microservice, which calls the account and customer microservices. OK, the demo is not about events and container metrics, but you see here that it's aggregating everything from the Firehose: the events, the metrics, and the logs. That's why it's very useful. Now, let's zoom in, because I know it's probably hard to see from the back. OK, we're going to concentrate on this logs part. This is out of the box, and you can get all types of logs here, right?
I'm not interested in the API, the router, or any of these other logs; I'm only interested in the app logs, because I want to see why the latency was there. So I'm just going to look at the app logs, and while it's doing that, I can also search for certain keywords. I know that in my application I had output that said "hello from" several microservices, so I'm just going to search for the term hello, OK? Now if you see, it tells me my back office microservice got a response from the customer service, and it shows you all the logs associated with that. This is useful not just for distributed tracing, but any time you want to search your logs for certain keywords and look at your application. Now, in terms of distributed tracing, you will see this arrow-looking thing when you install this product. I want to look at this customer microservice and see what's going on. This is the new feature, correlated logs, introduced in PCF Metrics 1.3, and what you see here is the same thing as Zipkin, right? But this is built in, out of the box. You don't need to set any of the Sleuth sampler percentages, and you don't need to install the application, so you're not really managing Zipkin here. The design is very similar. You see my UI called back office, which in turn called the customer and account microservices. And if you look at the trace, all the trace IDs are there. You can also filter logs based on the spans, and that's the power of Metrics, because in Zipkin you'd need to copy and paste the trace IDs from the JSON file. So I would like to stop here and ask: are there any questions? Yes? Yes please, right. So Dynatrace is, of course, a product that you need to install, and you need to manage that entire product. This gives you the ability to just see something out of the box. It's utilizing information from the Firehose and helping you see it.
You can use Dynatrace on top of this, and with Dynatrace you'd probably need to build those dashboards; I'm not sure if they come out of the box in Dynatrace, right? Whereas here, you see the dashboards out of the box. It gets everything from the Loggregator Firehose. Pardon me, I can't hear you. Yeah, so it's getting it from the Loggregator. When you configure the port, 443 or whatever port you configure, it's just going to get it from the Loggregator. You don't have to use Splunk or anything else like that. Does that make sense? OK, yes. Yeah, so there's some logic implemented in the CF router as well for this feature, right? As a result, you see this. Yes, in terms of sampling, right? There are two different answers for the two different technologies. For Zipkin, you would need to manage your database, so you'd probably be better off not sampling 100%, because it's going to get flooded very quickly. PCF Metrics currently comes with two data stores, including an in-built MySQL data store, so it's a little heavyweight in that sense. And I know on the roadmap we are working on customizing how much data you want to store, because right now I believe data is stored for two weeks, right? In the future, we're working on making that customizable. Is Mocache in the room? No, he's not. He's the product manager; he can probably tell you the timelines. Yeah, yes. So in terms of standardizing that, it really comes from the community, and it looks like that has been standardized at the tracer level. For example, Spring Cloud Sleuth and the other tracers used by other tracing systems like Jaeger (you know, Uber is using Jaeger). A distributed tracing system's job is just to collect, index, and show you the data, right? Your application needs to send that data, and needs to send it in a standardized way, typically in a header.
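Those standardized headers are the B3 headers that Zipkin-style tracers propagate on every request. As a minimal sketch, with a plain Map standing in for whatever request object your framework gives you, pulling the IDs out is just a header lookup:

```java
import java.util.Map;

// Reading the B3 trace headers that Zipkin-style tracers add to HTTP requests.
public class B3Headers {
    public static final String TRACE_ID = "X-B3-TraceId";
    public static final String SPAN_ID = "X-B3-SpanId";
    public static final String PARENT_SPAN_ID = "X-B3-ParentSpanId";

    // Returns the trace id from the request headers, or null if the request is untraced.
    public static String traceId(Map<String, String> headers) {
        return headers.get(TRACE_ID);
    }

    public static void main(String[] args) {
        // Hypothetical incoming header values, for illustration only.
        Map<String, String> headers = Map.of(
                TRACE_ID, "5af38b0b1c2d3e4f",
                SPAN_ID, "9c8d7e6f5a4b3c2d");
        System.out.println("trace=" + traceId(headers) + " span=" + headers.get(SPAN_ID));
    }
}
```

The same trace ID is what you would then paste into a log search tool to find every log line belonging to that request.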
So that's what you need to do with the HTTP request and response, right? That's already standardized. Yeah. So for Zipkin, I know it's standardized; for Metrics, it's standardized. For Jaeger, it might be slightly different, but it's still standardized based on, I think, the things they're looking for, the things they're indexing. So, Zipkin can connect to your data stores. It can connect to key-value data stores, for example Cassandra. I'm not very certain about DocumentDB, but I don't see why not; it should be able to, because all it needs is a data store. It can be a key-value data store or a relational data store; that's all it needs. So I think it's doable. Are you interested in a specific data store? I know they're working on the database implementation. It's going to show up as a separate row, in terms of your application calling the database and getting the response back, but it's not going to show you what's happening internally in the database. That's not there yet, because that needs more integration with your database, right? So you will definitely see a span going from your application to the database and coming back, but you won't be able to see what's going on internally. So can you please repeat your question? Yes. So your tracer is not going to send information until your response is received, right? Yeah, yes. So it will send, for example, the information for the first part of your batch processing once it's completed; the request-response is done, right? Then, if something else is waiting for the response, that information is going to be sent afterwards. Your tracer is not going to send until your response has finished. Yes. Yes, same concept. Multiple parents on a span? No, you cannot have that.
Right now you only have one parent. So, I guess I need to understand the question more, right? I need to dig deeper. If you're talking about multiple parents of a span, are you saying that your response is already sent back, but the parent span isn't done yet, because it's waiting for the other hundred? I'm not sure about that; I'll have to research it more. He probably knows, you probably know more. Is there any other question? Yeah. So all you have to do, when you add the tracers, is just extract that, because the tracers are going to add the trace ID and span IDs into your headers. You just extract that, right? So before I take any more questions, thank you everyone for coming. I know everybody is running for lunch.