Alright, so let's get started. Hi, my name is Nakul, and today I'm going to speak about distributed tracing, a very important tool that helps us debug and understand distributed systems and microservices. If you are building distributed systems and microservices, this concept can help you address some difficult, hard-to-reproduce problems such as long-tail latency. I would like to kick off this talk by discussing why we should bother about latency at all and what role it plays in distributed systems. After that, we will take a look at what distributed tracing can offer us. Next, there will be a short demo, just to get a feel for distributed tracing. After that, I would like to show you what Zipkin is and how we can benefit from using it in our tech stack. And finally, I would like to walk you through the code used by the demo application.

So what is latency? Well, one may define it in simple words as the amount of time it takes to complete a task. For instance, let's say I want to fly from city A to city B and it takes four hours. In this case, my latency is four hours. Some people tend to mix up throughput and latency and assume that improving throughput is also going to improve latency. Well, that's not really the case. Coming back to my example of flying from city A to city B: improving throughput means that instead of flying one plane, we fly five planes. Now we can carry more passengers from city A to city B.
However, the latency, that is, the overall travel time for each passenger, still remains the same.

In programming, whatever we do takes some amount of time. Every operation takes time: accessing the L2 cache, acquiring and releasing a mutex, a main memory reference. Maybe the amount is so small that it seems negligible. However, if we are hit with huge scale, then all those small, small numbers start to add up.

You may say, well, who really cares about latency after all? It's not hard to guess that it's the end user who suffers the most. But the funny thing is that they are not sitting with a stopwatch counting how long your application takes to respond. On the contrary, they are just driven by some need, and they want your application to finish that task. In some cases, though, speed is the name of the game. For example, in a trading system, low latency can make the difference between grabbing an opportunity and losing it. Users don't really care how fast your application responds, but they do start to care when things start to get slow.

So what makes distributed systems so interesting? Well, in a distributed system, a single request may end up touching hundreds of services written in different programming languages, running on various runtimes, deployed across many machines. In a monolithic world, we can somehow squeeze the whole picture into our heads using diagrams, documentation, you name it. In the microservice world, however, the picture is ever-changing. New services can spin up in a matter of days or weeks, old services can be replaced, and some services may need to be upgraded because of performance-related issues. It's really impossible to keep all that information in your head. And what happens when, all of a sudden, things start to get slow? Well, it's difficult to pinpoint the cause.
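To make "small numbers start to add up" concrete, here is a back-of-the-envelope sketch. The per-operation costs below are illustrative numbers I picked in the spirit of the well-known "latency numbers every programmer should know" cheat sheets, not measurements from any real system:

```java
public class SmallNumbersAddUp {
    // Assumed (illustrative) per-operation costs, in nanoseconds.
    static final long MUTEX_LOCK_NS = 25;       // one lock/unlock pair
    static final long MAIN_MEMORY_REF_NS = 100; // one main memory reference

    // Total time spent on these "negligible" operations across many requests,
    // returned in milliseconds.
    static double overheadMillis(long requests, long locksPerRequest, long memRefsPerRequest) {
        long perRequestNs = locksPerRequest * MUTEX_LOCK_NS
                          + memRefsPerRequest * MAIN_MEMORY_REF_NS;
        return requests * perRequestNs / 1_000_000.0;
    }

    public static void main(String[] args) {
        // One million requests, each doing 50 lock operations and 200 memory references:
        // individually invisible, collectively more than 21 seconds of CPU time.
        System.out.printf("total overhead: %.0f ms%n",
                overheadMillis(1_000_000, 50, 200)); // 21250 ms
    }
}
```

Each request spends only about 21 microseconds on these operations, which nobody would notice in isolation; it is the scale that makes the total significant.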
Maybe it's a new service, maybe it was an old service, or maybe it was some upgrade that happened in between.

So now I'm going to tell you a short story about Bob and how Bob encountered a long-tail latency issue. Bob is a programmer, and he just migrated his legacy application to microservices in order to address business needs. However, he started receiving random emails from customers complaining that the system was working really slowly. Since there can be many sources of potential trouble, like slow disk I/O, networking errors, garbage collection kicking in somewhere in between, or some slow queries, Bob had no idea where to even begin looking for the problem, let alone search for a solution. Bob didn't know at the time that he was suffering from long-tail latency. But he tried his level best to troubleshoot those problems.

The first option he tried was log analysis. Since there were a large number of services, he ended up with lots and lots of log files. When he started to dig into the log files, he found an overwhelming amount of information. There was a lot of noise, like thread dumps, asynchronous stack traces, and so on. Moreover, not everything was on the critical path, because some of the services were asynchronous by nature: when they received a request, they responded immediately and spun up some background process. So in reality they were not the cause of the problem, and couldn't have contributed to it. Moreover, correlating all those logs required manual effort. In the end, Bob was left with a bunch of numbers, and what do you do with those numbers? Take the mean, the mode, the average, the standard deviation? In a nutshell, all those numbers and all that log analysis simply didn't help him resolve the issue.

So the second option was to go and ask metrics for help.
Well, metrics can tell you that something is wrong; however, they can't tell you the cause of the problem. Furthermore, the way in which we aggregate this data can give us false positives. For example, say there is a country where 100,000 people live, one person makes a billion dollars a month, and the others make only a dollar a month. If we take the average, we will conclude that everyone in that country is doing fine based on income, but that's far from reality.

So Bob started to wonder whether it was possible to find out how many clients were impacted by this problem. That's when Bob started to learn about percentiles. He was precisely interested in the 99th percentile of his system. The 99th percentile simply means that one out of every 100 visits experiences some delay D, and in our case this delay was 50 milliseconds. The total number of visits experiencing the delay can easily be calculated as n/100, where n is the number of visits. At this point the number is really not so alarming: 5,000 out of half a million. However, we are in a distributed system. A single visit by a client resulted in 8 downstream calls, and those downstream calls interacted with a highly active service that was 99% fast but 1% slow. When we take this into consideration, the likelihood of encountering latency during a single visit becomes 1 − 0.99^8 (99% fast is 0.99, raised to the power of the number of downstream calls, 8 in our case), which works out to about 8%. The total number of affected visits can then be calculated as 8% of n, and as you can see, this number is now 40,000. That's a lot of visits, and moreover this number depends directly on the number of downstream calls. If instead of 8 we have 20 downstream calls in the future, this number will be different.
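The arithmetic above is easy to check in a few lines. Note that the talk rounds: 1 − 0.99^8 is about 7.7%, rounded up to 8%, and 40,000 is likewise a rounding of roughly 38,600 exact affected visits:

```java
public class TailLatencyMath {
    // Probability that at least one of `calls` downstream calls hits the slow 1%,
    // given each call is fast with probability `fastRate` (0.99 in the talk).
    static double slowVisitProbability(int calls, double fastRate) {
        return 1.0 - Math.pow(fastRate, calls);
    }

    // Expected number of visits that experience the delay, rounded to whole visits.
    static long affectedVisits(long totalVisits, int calls, double fastRate) {
        return Math.round(totalVisits * slowVisitProbability(calls, fastRate));
    }

    public static void main(String[] args) {
        // 500,000 visits, 8 downstream calls each, every call 99% fast:
        System.out.printf("P(slow visit) = %.2f%%%n",
                100 * slowVisitProbability(8, 0.99));          // about 7.73%
        System.out.println(affectedVisits(500_000, 8, 0.99));  // about 38,600 visits

        // The dependence on the number of downstream calls: with 20 calls
        // the fraction of affected visits more than doubles.
        System.out.printf("P with 20 calls = %.2f%%%n",
                100 * slowVisitProbability(20, 0.99));
    }
}
```

This is why tail latency is so vicious in microservices: a 1% slow dependency, fanned out over enough downstream calls, is felt on a large fraction of all visits.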
An interesting thing to notice is that this can affect probably every customer in our domain. Why? Because a customer usually visits the website a few times a day, so there exists a likelihood that all customers will notice those latency issues. So Bob was happy to find that there was some small progress. However, when he went to the boss, the boss really needed a solution to address the problem. But Bob still didn't know the request timeline: when the problem started, how many services were involved, which operations were involved, and how many of them were in the call graph, sorry, how many of them were on the critical path. Also, he didn't know how to correlate the logs, and moreover he had no way to test whether a single request was really slow across various clusters, or only slow in, I don't know, this zone but not in the west zone. He didn't have that data to work with, which could have helped him. Moreover, he had no clue how far the delays deviated from acceptable values. And last but not least, he didn't know the call graph, that is, how a particular request traveled through his system. So, in a nutshell, he was missing a few things.

What does distributed tracing do for us? First, it can track request flow: when a single request flows through our system, it can track how that request propagated. This data is usually available within minutes, so we can react really fast. Moreover, applications can be dynamically instrumented, so no special application code is needed to make it work. And it can provide us great benefits, like knowing which services were involved, which operations were invoked, and at what time, and so on: call it system insight.
It helps us measure end-to-end latency. And last but not least, we can use distributed tracing to perform some optimization, like spotting redundant requests, or calls that were supposed to execute asynchronously but were executed synchronously. I will show you this during the demo.

So now the question comes to mind: how can we apply this knowledge we just discussed? Well, the answer is simple: we can use a tracing system. A tracing system should obviously be able to trace, and by trace I don't mean only across particular clusters or nodes; it should trace across various programming languages, potentially using different runtime environments. It should also have low overhead, because we don't want to bring production down just because we are tracing something in production. It should be scalable: it should be able to keep up as our business grows and our services grow. It should work around the clock, 365 days a year, because production bugs are difficult to reproduce. And it shouldn't rely on programmer collaboration, because that would make it too fragile, and then the whole concept of tracing is not going to work so well for us.

OpenZipkin is an open source tracing system. Initially it was called Zipkin: Zipkin is a distributed tracing system created by Twitter, based on Google's Dapper. OpenZipkin is more like a GitHub organization; they took the primary fork of Zipkin and removed the bits that were specific to Twitter, so every one of us can access it. It's open source, and, super important, it has a pluggable architecture, so we can plug in the type of storage depending on the competence of our team.

One thing we should learn before diving deep into Zipkin is the Span. A Span denotes a logical unit of work done. It has a start time and an end time.
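As a sketch of that definition: a span is essentially an operation name plus a start and end time, carrying the IDs that tie it into a trace. The field names and this tiny model are my illustration, not Zipkin's exact data model, which has more fields:

```java
import java.util.concurrent.ThreadLocalRandom;

public class Spans {
    // Minimal model of a span: operation name, start/end timestamps
    // (microseconds here), and the IDs that tie it into a trace.
    record Span(String name, long traceId, long spanId, Long parentId,
                long startMicros, long endMicros) {
        long durationMicros() { return endMicros - startMicros; }
    }

    static long newId() { return ThreadLocalRandom.current().nextLong(); } // 64-bit randomness

    // Root span: no parent ID. (In classic Zipkin/B3 the root's span ID
    // conventionally equals the trace ID.)
    static Span root(String name, long start, long end) {
        long traceId = newId();
        return new Span(name, traceId, traceId, null, start, end);
    }

    // Child span: same trace ID, fresh span ID, parent ID pointing at the parent.
    static Span child(Span parent, String name, long start, long end) {
        return new Span(name, parent.traceId(), newId(), parent.spanId(), start, end);
    }

    public static void main(String[] args) {
        Span get = root("get /catalog", 0, 5_000);
        Span cost = child(get, "calculate cost", 1_000, 3_000);
        System.out.println(cost.durationMicros());           // 2000
        System.out.println(cost.traceId() == get.traceId()); // true: one shared trace ID
        System.out.println(cost.parentId() == get.spanId()); // true: linked to its parent
    }
}
```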
The work done is expressed as a human-readable string, for example "create catalog", "calculate cost", "foo", "bar": these are the operation names. Spans are created by a Tracer, the instrumented code that runs in your application and generates all those Spans, and they are super slim: they carry just the bare essential information. Also, the root Span, that is, the initial Span that gets created, is a Span without a parent ID.

After knowing what a Span is, let's take a look at how Zipkin annotations work. We have a client that makes an HTTP request to get something, in our case a catalog, but it really doesn't matter what. This marks the beginning of the Span, so the Span gets started here. After some time, the server receives the request, and now we can calculate the network latency as SR − CS: the time at which the server received the request, minus the time at which the client sent it. After some time, the server is done with its processing, and the processing time can be calculated as SS − SR, server send minus server receive. Then the server hands the response back to the client. The client receiving the response marks the end of the Span, and the network latency of the response can be calculated as CR − SS, client receive minus server send. Finally, the overall response time can be calculated as CR − CS: the client receive time minus the client send time gives us the overall response time.
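Those four annotation timestamps turn directly into the latency numbers Zipkin shows. A tiny sketch with made-up timestamps (microseconds are arbitrary here):

```java
public class ZipkinAnnotations {
    // The four core Zipkin annotation timestamps:
    // cs = client send, sr = server receive, ss = server send, cr = client receive.
    static long requestNetworkLatency(long cs, long sr)  { return sr - cs; }
    static long serverProcessingTime(long sr, long ss)   { return ss - sr; }
    static long responseNetworkLatency(long ss, long cr) { return cr - ss; }
    static long overallResponseTime(long cs, long cr)    { return cr - cs; }

    public static void main(String[] args) {
        // Made-up timestamps for one request/response round trip.
        long cs = 0, sr = 30, ss = 130, cr = 165;
        System.out.println(requestNetworkLatency(cs, sr));   // 30
        System.out.println(serverProcessingTime(sr, ss));    // 100
        System.out.println(responseNetworkLatency(ss, cr));  // 35
        System.out.println(overallResponseTime(cs, cr));     // 165
    }
}
```

Note the overall response time (165) is the sum of the two network legs plus the server processing time, which is exactly the decomposition the annotations exist to provide.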
However, if we go deeper into this HTTP request, we notice something: when we call "get catalog", what the catalog service does is ask a price service to fetch the price, and then it calls another service, a product service, which makes a database call and a call to a data analytics service, and finally hands the response back to the client. All these green boxes you see are Spans, and the collection of these Spans is what we refer to as a trace.

One important thing to notice is that when Zipkin creates the initial Span, there are some fields associated with every Span. First is the trace ID. The trace ID remains the same for all the Spans, and as you can see, it is 1 here and 1 here. Just to keep the number small on the slide I have chosen 1, but in reality it is 64 or 128 bits of randomness, so the IDs are unique. Why 64 or 128? It depends on the version of Zipkin: previously it was 64-bit, and now they also support 128-bit because of collisions in the IDs. Importantly, Zipkin takes care of linking a particular Span to its parent using the parent ID. You can see these Spans have a parent ID of 1: the Span ID of this one becomes the parent ID of that one, and for every Span a new Span ID is generated. And here you can see the parent ID of these two Spans is 3, because Zipkin takes care of it. So this is how it handles Spans. And what is a trace? A trace is a DAG, a directed acyclic graph of Spans, and it forms a latency tree.

So now I would like to take some time to show you a small demo application so that we can move forward. The code of the demo application is available on GitHub, so you can fiddle around with it after the presentation. Alright. Shall I make it bigger? Can everyone see it? Alright. What I am doing right now is making a GET request to some service running on port 8060, and it gave me
a bunch of JSON back. The interesting thing is that so far I haven't told you anything about this application, what it does, what this demo application is all about. So let's go to Zipkin and ask whether it can help us. First of all, I would like to know what this application actually does. This diagram is generated entirely by Zipkin; I can make it bigger so you can see it much better. What this application has is a bunch of services: one is called the price service, used by the catalog service, and one is the product service, again used by the catalog service, and it uses Apache Spark and Mongo. So now I have a fair idea of what the sample demo application consists of.

Now, if I find a trace, I can see what happened when I made the HTTP GET request: a bunch of spans got generated. I will make it bigger, maybe, so I can expand it here too. What happens is, when I issue a request to the catalog service, I make an HTTP request which in turn calls "create catalog", and during "create catalog" I fetch the price, the product, and the email, and all this information is gathered by Zipkin. After that I can see which class it was and which method was involved; I can even see that I am using a circuit breaker here, Hystrix, which threads were involved, and the local address on which this service was running. After that, the create catalog operation calls the price service; again it makes an HTTP request, and the price gets calculated by another service. Finally it calls the product service, which makes another HTTP request, and during product assembly I retrieve the products and filter them, and finally there is some asynchronous operation that takes place in the background. All of this graph is generated by Zipkin by itself, so at this point I know what happened in my system
and I can even go to Zipkin and ask for some more things. Let's say I make another request, to some other product, and I see that this product takes some time. I can go to Zipkin and ask again whether it received some data, and instead of "longest first" we can sort by "newest first", and sure enough, it received it. If I expand it all, I can see that the price calculation had a cache miss: I can click here and see that the cache was missed, that the calculation took 30 seconds, and on which cluster the request got executed. What is even nicer about Zipkin is that if you have a lot of requests, you can go to Zipkin and ask for all the requests during which, for instance, a cache miss happened. So if you are running in production and you face some sort of problem, you can go to Zipkin and query the traces using this information.

Just to show you more of what Zipkin is capable of: it can also highlight traces based on errors. If you get some error while processing some of your requests, and you can decide what an error condition is, maybe a database query takes too long, maybe the cache was missed, and so on, Zipkin can highlight all those traces. In this case I executed some Mongo query and encountered a problem while fetching the products. What more can we do? We can ask Zipkin about all the services, and this information can be fetched either via the command line or in the drop-down section here. Zipkin also lets you fetch information about which version it is running, here at this endpoint. And finally, all the information I have shown you is also available if you want to process it with, I don't know, some machine learning algorithm: it provides API endpoints and you can fetch all of it. And as you can see, just by using some JSON and
parsing the relevant bits of information, I created the same view in the browser.

So, coming back to the slides, let's try to understand the architecture of Zipkin, how it really works. We have some service in which instrumented code is running; this code creates spans, and those spans get transported via Scribe, Kafka, or HTTP. The collector receives those spans, and the storage layer deserializes, samples, and schedules them for storing; under the hood it can use Cassandra, MySQL, or Elasticsearch. The API, the service I showed you at the command prompt, uses the storage to retrieve the data, and the UI in the browser that you see is just the part that visualizes this data. So this is how Zipkin works internally.

What is a tag? A tag is an important concept in Zipkin: it denotes a key/value pair. This information is useful when you want to attach some context to your request; in my case it could be the calculation time when the cache was missed, which can also help me during debugging. We can have tags like which class was invoked, which method was executed, and so on. A log denotes some really meaningful moment in the lifetime of a span. This info is timestamped, and a span may contain zero or more logs; in our case it was the product fetch, where we used logs to elaborate the diagram generated by Zipkin. And what are annotations?
Annotations help to explain latency with timestamps; they are often a code like "server receive" or "client send", and so on. Binary annotations are nice because they tag a span with context, and they are usually there to support querying or aggregation. So for those queries you saw in the browser, where I wrote cache.miss, you can for example use http.path to query the relevant spans. They are repeatable, and they can vary by host, which means they can have a different value on the client and on the server: for example, if you have URL rewriting enabled on the server, the HTTP path will be different for the client and for the server.

A question which generally comes to mind is: can we have a large span? Well, the answer is that it is not recommended. Why? Because it decreases the usability of the tracing system and also increases its cost.

One thing we should really be aware of is clock skew. Clocks tick at different rates, and in a distributed world, if our clocks are not synchronized somehow, then all the data that was gathered and presented by Zipkin is not going to help you address and resolve those problems. What you can do is use NTP, or maybe have a GPS receiver installed in your data center, to make sure that all of your services agree on the time.

What is a tracer? The tracer does the most heavy lifting: it is the one that takes care of data propagation, context generation, and passing info throughout your spans.

Sampling is how we control how much data we want to record. If you have really high traffic, then a small fraction of the spans is enough to give you an idea of what is going wrong in your system. Why? Because if something is going wrong and you have high traffic, it doesn't really make sense to record all those spans; the problem is going to repeat itself. That's why a small sampling rate can help us. On
the contrary, if you have a low-traffic system, you have to go and adjust it based on your needs. But don't make it too aggressive, because if you make sampling too aggressive and start to record everything, it is eventually going to slow you down: those spans have to be written to your DB, and somewhere you will end up hitting the disk. Note that if you want to debug some request, and that request has to be recorded no matter what the sampling rate is, you can use the debug flag supported by Zipkin: spans which have this flag enabled are going to be recorded regardless of the sampling rate.

What is OpenTracing? OpenTracing helps to standardize tracing: it provides a vendor-neutral tracing API. The implementation is currently available in six languages, and you can go and read the documentation for more details if you are interested in OpenTracing.

Spring Cloud Sleuth brings distributed tracing to Spring. It supports Hystrix, async calls, RestTemplate, Spring Integration, and much more. Thanks to this, if you are building applications using Spring, you just add Spring Cloud Sleuth as a dependency and you have distributed tracing working out of the box; basically, it has support for Zipkin.

Now I would like to show you the code I used to build this application, just to give you an idea of how much work is involved in building such an application. Here I have a catalog service. It really doesn't matter which language I am using; maybe some of you are comfortable with Java and maybe some of you are not, but the concept remains the same across programming languages. What I have here is a method called createCatalog: this is the endpoint which gets invoked when I issue a request to localhost:8060 with some product ID. And I can log some events here, like the price fetch, product fetch, and email send which you saw in Zipkin.
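Stepping back to sampling for a second before the code walkthrough continues: the rate-plus-debug-flag policy just described could be sketched like this. This is a toy sampler of my own, not Zipkin's actual implementation:

```java
import java.util.concurrent.ThreadLocalRandom;

public class Sampler {
    final double rate; // fraction of traces to record, e.g. 0.01 for 1%

    Sampler(double rate) { this.rate = rate; }

    // Record the trace if the debug flag is set; otherwise flip a weighted coin.
    boolean shouldSample(boolean debugFlag) {
        if (debugFlag) return true; // debug-flagged requests are always recorded
        return ThreadLocalRandom.current().nextDouble() < rate;
    }

    public static void main(String[] args) {
        Sampler sampler = new Sampler(0.01); // keep ~1% of traces on a busy system
        int sampled = 0;
        for (int i = 0; i < 100_000; i++) {
            if (sampler.shouldSample(false)) sampled++;
        }
        // Roughly 1% of the 100,000 simulated requests get recorded.
        System.out.println("sampled: " + sampled);
        System.out.println(sampler.shouldSample(true)); // true, regardless of rate
    }
}
```

The decision is made once at the root of the trace and propagated downstream, so a trace is either recorded in full or not at all.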
But under the hood it is really very simple. I have a getPrice method, and when I call getPrice I create a new TraceCommand, thanks to which I can then invoke fetchPrice, and fetchPrice is nothing but making another REST call to some endpoint and fetching a price. The price is fetched, if I remember correctly, by the price service. And when I was calculating the cost, here I am first checking whether I have it in the cache; if it is in the cache, I return the price. Otherwise, for demo purposes, I just add a check here that if the ID is less than 100, then tag the value cluster-1, or else cluster-2. But if you are running in production, you can have your own logic here to decide what kind of logs or tags you want to create. Coming back to the code, in the same way I can fetch the products here, and here I am just simulating a GC pause. What is important here is that the tracer has this method, supported by Spring; it is not something standard in tracing that you can call tracer.getCurrentSpan, but usually what you need is to get the current span, the one created by Zipkin, or you can create one manually if you want to, and then you can log events there, or log some tags, some key/value pairs.

Another service I used here is written in Go, just to show you how it works. Alright, so again, I don't know how many of you are comfortable with Go, but I am showing you that the request was initially made to a Java service, and then we end up calling a service, or microservice, however you would like to call it; this is not a full microservice application, because I wanted to keep all the extra bits out of the code, but it is a service written in a non-JVM language. What happens here is, again, Zipkin has support for various languages, and OpenTracing also currently supports six languages. I can create a span from the request. If you want to create the span from the request, what you do is you have to
extract the carrier from the headers, and the carrier is the thing that carries the tracing state. I am extracting this info from the HTTP headers, and once I have it, I check that there was no error during the extraction, then create a new span here: I am starting a span, or continuing another span by creating a child span for it. So this is how we can do it ourselves, without relying on some third-party framework like Spring or an open source framework; if such a framework is not available for you, you can just see how simple it is to create those spans and start them. Also, once I have this, I can create my parent span here, but what is important is that every span supports some operation, and the operation has a start time and an end time. So if you are starting a span, you should take the responsibility of stopping that span at some point in the future. What I am using here is just some Go syntax: I am deferring the call that closes the span, which means that once I exit this function, this call is made for sure; it is like a finally block on the JVM. In non-JVM languages, just make sure that once you exit the function you are closing the span, otherwise nothing is going to be recorded for you. And one more important thing: you can manually set some tags using setTag, providing the key and the value here; this can be any string that makes sense in your case. You can also set the kind of the span: here, for MongoDB for example, I am setting the span kind to resource, using ext.SpanKind.Set.

So, back to the presentation. You may ask, well, who uses OpenTracing, or to be more specific, Zipkin? Well, there are a bunch of big, medium, and small companies dealing with distributed systems that are adopting and using it: Twitter, Google, and companies like Facebook and Lyft.
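On the JVM side, the same extract-from-the-carrier idea can be sketched against Zipkin's B3 propagation headers. The X-B3-* header names are the real B3 ones; the TraceContext type and the extraction logic here are my simplified illustration, not a library API:

```java
import java.util.Map;
import java.util.Optional;

public class B3Extract {
    // Minimal view of the tracing state carried in the headers.
    record TraceContext(String traceId, String spanId, String parentId, boolean sampled) {}

    // Pull the trace state out of incoming B3 headers (the "carrier"): the same
    // job the Go demo does with its extract call. Returns empty if there is no
    // incoming trace, in which case the service would start a new root span.
    static Optional<TraceContext> extract(Map<String, String> headers) {
        String traceId = headers.get("X-B3-TraceId");
        String spanId = headers.get("X-B3-SpanId");
        if (traceId == null || spanId == null) return Optional.empty();
        return Optional.of(new TraceContext(
                traceId,
                spanId,
                headers.get("X-B3-ParentSpanId"),
                "1".equals(headers.getOrDefault("X-B3-Sampled", "1"))));
    }

    public static void main(String[] args) {
        Map<String, String> headers = Map.of(
                "X-B3-TraceId", "463ac35c9f6413ad48485a3953bb6124",
                "X-B3-SpanId", "a2fb4a1d1a96d312",
                "X-B3-Sampled", "1");
        extract(headers).ifPresent(ctx ->
                System.out.println("continuing trace " + ctx.traceId()));
    }
}
```

A real instrumentation library would then start a child span whose parent ID is the extracted span ID, exactly as the Go code does.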
You can read more about them online. And some of them have their own solutions: rather than using Zipkin, they created their own tracing systems. For example, the company Uber used Jaeger as the initial version of their tracing system, and now they are moving more and more towards OpenTracing. You can also use Zipkin with Prometheus: it exposes an endpoint, and you can configure Prometheus to scrape the data from that endpoint. Also, for those of you who use Hawkular from Red Hat, it has support for Zipkin too. And Zipkin is currently available for various languages like JavaScript, Java, Python, and so on.

The last thing I would like to say is that we should remember that latency is never zero, so rather than trying to run from it, we should embrace it. Distributed systems are really hard to reason about because of the distributed nature involved; they can have complex call graphs, and distributed tracing can help to analyze end-to-end latency and understand those call graphs. Moreover, instrumentation is a tricky thing to do: when you are instrumenting some code, you have to make sure that you are passing the information correctly across various thread pools, callbacks, and asynchronous calls. OpenZipkin provides an open source tracing solution for us and visualizes the request flow, Spring Cloud Sleuth brings tracing to the Spring world, and OpenTracing attempts to standardize tracing. So thanks for your attention, and if you have any questions, they are welcome.

Maybe no questions? Alright, thanks a lot. Alright, there is a question. I will repeat the question. The question is whether it is possible to do manual instrumentation or whether we can rely on instrumentation out of the box, and second, whether we can trace applications in which the developers have done nothing about instrumentation or distributed
tracing at all. The answer to the first part is yes: out of the box, if you take Zipkin and just use it, you have instrumentation out of the box; you have to do nothing to make it work for yourself apart from adjusting the sampling rate. In reality, in all the code I showed you, I was not doing any instrumentation myself; I was just using some open source libraries to add the more specific details I wanted to show you, to demonstrate how you can go beyond the out-of-the-box behaviour when you want to add some custom fields. But if you don't want to add any custom fields like the cache miss, and all you want is to receive the request, visualize it, and see which services were touched, which endpoints were invoked, and how much time they took, then you can rely on OpenZipkin to do it for you.

The second part of your question was how to do it for applications in which the developers have done nothing about distributed tracing at all. The answer is: it depends. If it is a JVM-based application using Spring, you just put Spring Cloud Sleuth as a dependency of the application, and if you redeploy the application, it will automatically start gathering all those traces for you; Spring Cloud will wire it in at runtime and do it for you. But if it is a non-JVM-based solution, or you have services written in non-JVM languages, then you have to make sure to add the necessary bits there in order to trace your application. For example, in Go (I haven't done tracing in Python or anything, but I have done it in Go), the minimum thing you have to do is create a tracer and use it, or maybe with OpenZipkin it works in a slightly different way, but
you have to do a minimal amount of work so that your application can detect that there is a tracing library available. All the spans, all the data, and all the timing calculations are taken care of by OpenZipkin, and it tries to record as much information as possible so that you don't have to write it yourself.

Any more questions? There is a project called HTrace which you can use, basically tracing for Hadoop and similar clusters, if you are going in that direction. Sorry, I will repeat the question. The question is what tracing looks like for a solution like Apache Spark, or, to generalize the question, for machine learning solutions, or things like Hadoop that do batch processing. Yes, there is a separate, dedicated project there too: HTrace. It is also open source, I used it for this, and I haven't faced any problems yet. Especially if your jobs are executed on a large number of clusters, then maybe sometimes the disk is too slow, or, I don't know, the network drops out sometimes, and it is really able to detect all those things; it's open source and you can find it online.

More questions? Alright, so once again, thanks a lot.