Good afternoon everyone, I am Bhavin and I work at ProCloud Technologies. My work mostly involves Kubernetes and containers, and in this talk we will discuss what tracing is and how you can implement it on a function-as-a-service platform. We have been seeing this trend of using microservices, and when we say function as a service or serverless, those are just ways to implement microservices; they give you one more abstraction that makes your life easier. With microservices we basically treat each service as a product: we divide one application into multiple services, each team works on their own service, and they take care of compatibility and so on. This makes things easier to manage. When you do this exercise of splitting your application, some things do get more complicated, but they are worth the effort. A typical small application might look something like this. At some scale you will want some way to trace things, to find out what is going on inside your application.

So let's look at monitoring, starting with the black box monitoring we have been doing so far. What we do is keep checking whether the service is running, and when we find it is not working, when it returns a 404 or 503 or some other error code, we page people or send out notifications. But when it comes to microservices, are they ever up 100 percent of the time? With microservices we have multiple replicas of the same application. Say I have one image service with 100 replicas, so you might have 100 containers running it. At any point in time some containers are still starting up and not yet ready, some are up, and some have issues, but your service as a whole is still running. That basically means a microservice is never 100 percent up across all its replicas. And nines don't matter if users aren't happy: if you do black box monitoring on this setup, you might see 99.9 percent uptime for all your services, yet there can still be cases where your monitoring says the service is up while your users are facing issues. How will you get visibility into that?

There is this term observability, which basically means you get a whole view of your system. At any point in time I want to go to my system and ask: what is the situation right now? What are users experiencing right now? What is the latency at some place? How is an application performing, and what might the trend look like after some time? You want this to be a system you go and check, not a system that only comes to you with alerts saying something is down. So it is not just about disasters; you can use observability as a tool to improve your applications as well. There is a quote from Kelsey Hightower: stop reverse engineering applications and start monitoring from the inside. You do one operation called instrumentation, modifying your application code in such a way that you get insights from inside the application. We will see more about this later. Right now these are considered the three pillars of observability: one is logging, the second is metrics monitoring, and the third is tracing.
There are also ongoing discussions about what else could be part of observability. Let's start with logging. When our applications are running, each replica logs things: the errors it faced, events, all the HTTP requests, anything the application has done. But logging has its limitations. The developers and architects decide in advance which things to log, and in most cases we have far more happy paths than failures, so the signal-to-noise ratio is very low: lots of noise, little signal. For us, the signal is the exceptions and errors we received; those are what we want to look for. Logs serve well when we already know in advance what can go wrong. And at some scale, logs don't scale that well. Say you have 100 replicas today and 500 replicas of the same application tomorrow; given that signal-to-noise ratio, you start generating a lot of logs. In a typical setup, all the logs from your containers are sent to some centralized system like Elasticsearch, so that pipeline gets bottlenecked and the network used for shipping logs keeps growing.

The next pillar is metrics monitoring. Now that we have logs and know what events are happening, I want insight into what is happening inside my application over time. This can be the number of HTTP requests my application served, its CPU usage, its memory usage. How do these statistics help? They help you find the trend, see when and how the application's behavior is changing, and predict behavior based on that.

And the next one is tracing. What does tracing mean? Consider this scenario: you are trying to book a cab in an application; you search for a ride, set a location, and from that point until the cab is actually booked, you want to know which services were called and what that path looked like altogether. The request might go through tens or even hundreds of microservices, and you want to know what happened in each service and get a high-level view just by looking at one page. You could say we will do this based on logs, but that solution won't scale. So there are really only two approaches: one based on logs, and the other based on instrumentation, the "start monitoring from the inside" idea. Tracing helps you connect the individual components a request touches, and within one particular trace (we call one path through the system a trace) you can also attach small details such as a few log lines or metrics, roughly like the sketch below.
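For instance, here is a minimal, hedged sketch of attaching tags and a log event to a span with the OpenTracing Python API (the opentracing package that comes up later in the talk); the tag keys, values and the logged event are invented for illustration, not taken from the demo.

```python
# A minimal sketch, assuming the opentracing package; the tag keys, values
# and the logged event are invented for illustration.
import opentracing

tracer = opentracing.tracer  # no-op global tracer; later replaced by a real Jaeger tracer

with tracer.start_span("image") as span:
    # Tags: key/value metadata describing the whole span.
    span.set_tag("component", "image-function")
    span.set_tag("http.status_code", 200)
    # Logs: timestamped events inside the span - the "signal" lines you
    # would otherwise dig out of a central log pipeline.
    span.log_kv({"event": "comic fetched", "source": "xkcd"})
```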
So why tracing in serverless or functions as a service? A serverless or function-as-a-service platform is more abstracted than what you get with other tools, and there is the short-lived nature of functions: they are ephemeral, and at times you will have many invocations of the same function. Tracing helps you get insight into these small, ephemeral functions, debug things, and see the whole call path.

Let me introduce the greetings-as-a-service application. We will take this as a demo and see how tracing helps us as we go. It is basically two functions: one is the greeter function and the other is the image function, and they communicate using Kafka queues. There is a message queue, the user puts their input into one queue, it gets processed by one function, which puts its output into another queue, and so on: input queue, greeter function, next queue, image function, and the output is taken from the output queue. You can think of it as one small part of your whole application, and we will see what all the components look like when we implement this on a Kubernetes cluster.

Now we introduce some new tools. One is Fission and the other is our tracing backend, which is Jaeger. You can see we have the greeter and image functions. When there is an input message in the input queue, Fission gets triggered and hands that message to your greeter function; the function passes its output back to Fission, and it is put onto the next queue. The next queue triggers the other function, and similarly the output ends up in the output queue. All of this happens via Fission, and invoking the functions is taken care of by Fission itself. At the bottom you can see something called tracing events, which are sent to our tracing backend, Jaeger. What happens is that we have instrumented our application code, so our application knows what to do when it receives a request: it creates some events, which we call tracing events, and those are sent to the Jaeger backend.

Let's see what Fission is, and then we will get to Jaeger as well. Fission is built on top of Kubernetes; it uses custom resource definitions, which are a Kubernetes feature, so everything lives on the Kubernetes cluster itself. Moving forward we will discuss these specific tools, but what we learn can be applied to any serverless platform plus tracing backend. It is Kubernetes-native, so let's see the components we will be dealing with today.

The first is Fission functions. These are nothing but our function code, which can be in any language, and it has one entry point, one function from which your code starts executing; we will see a small sketch of one in a moment. The next is the Fission function environment. When you have your code, you need something to run it in: these are container images, and they are basically HTTP servers. Every request that reaches your function is an HTTP request, and the response is returned by your function implementation, the code you write. So your code is put inside this environment, and the environment is an HTTP server that serves your requests. The next thing is function triggers. Now we have our environments and our code inside them; how is that code going to run, how does Fission know when to call that function? That is what triggers are for. Right now Fission supports four types of triggers; we will focus on HTTP triggers and message queue triggers. The HTTP trigger is simple: you get an endpoint, you call that endpoint at some path, and it eventually triggers your function.
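For reference, here is a rough sketch of what a function like greeter could look like in Fission's Python environment; the main entry point name and the JSON field names are assumptions based on Fission's Python examples and the demo's output, not the actual demo code.

```python
# A minimal sketch, assuming Fission's Python environment and its default
# "main" entry point; the JSON field names (name, greeting) follow the
# demo's output but are not the talk's actual code.
import json
from flask import request  # the environment is a Flask HTTP server

def main():
    payload = request.get_json(force=True)  # body of the HTTP request Fission makes
    name = payload.get("name", "world")
    greeting = payload.get("greeting", "Hello")
    # The return value becomes the HTTP response body; the message queue
    # trigger forwards it to the next Kafka topic for the image function.
    return json.dumps({"name": name, "greeting": f"{greeting}, {name}!"})
```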
Similarly, you have message queue triggers: Fission keeps listening to a particular queue, and when there is a new message in that queue, Fission takes that message and makes one HTTP request to your function. It typically looks something like this: you have one Kubernetes pod containing our Python environment, and our code is mounted inside that pod. For the first request, the code gets mounted into that pod and the pod receives the request; any subsequent requests are served from that same pod. So the function is invoked once and then keeps serving requests for some period of time.

Now let's get back to the tracing part. These are the detectives in our story: Jaeger and Zipkin are the tracing backends, and in the center we have OpenTracing and OpenTelemetry, which are APIs defining standards for how your instrumented code and the tracing backend communicate. Why do we need tracing backends at all? Their major role is to collect all the traces, or rather the spans: these tracing events are called spans, and you want to aggregate them somewhere. Why is that needed? In order to get the complete path and link all the spans together, you must have something that does all of that work. There may also be cases where your applications use different ways of sending these tracing events, so there must be something that serializes the tracing events, stores them, and gives you one interface where you can view them. And how do we collect these tracing events? By instrumenting your code. There are client libraries which follow a standard, in our case OpenTracing, and those libraries take care of collecting the events on the application side and sending them to the tracing backend. When you configure them, you say where the tracing backend is and how to send the events to it.

Let's see how we can implement all this in our application with its two functions. One way is to instrument each function's code: you go into each function, import the client libraries, and do this work there. But a simpler, more straightforward way is to modify the environment itself. Since the Fission environment is just a container image running an HTTP server, we can modify it so that any request coming to your function is recorded as a tracing event. Let's see what a typical environment looks like; we will consider the Python environment, which uses Flask, a web framework for implementing an HTTP server, and which has a server.py and some other files. Inside the lib directory you can see our tracing code and server.py. Let's look at one piece of server.py. Whenever there is a new request for your function, the environment is triggered at this endpoint. Say you have the greeter function and a new request comes in for it: an HTTP request is made to this endpoint, which imports your code and saves it in the userfunc variable. This is importing your code for the first time. At this point we simply wrap the imported userfunc inside the initialize_tracing function, so it is a kind of wrapper around your function code, roughly like the sketch below.
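A minimal sketch of what such a wrapper could look like, assuming the jaeger_client package; initialize_tracing, the sampler settings and the service name here are approximations of what the talk describes, not the actual Fission environment code.

```python
# A minimal sketch, assuming the jaeger_client package; names and settings
# are approximations of what the talk describes, not the actual code.
import functools
from jaeger_client import Config

def make_tracer(service_name):
    # Builds a Jaeger tracer; the client ships spans to the backend
    # asynchronously in the background. Create it once per process.
    config = Config(
        config={"sampler": {"type": "const", "param": 1}},
        service_name=service_name,
        validate=True,
    )
    return config.initialize_tracer()

def initialize_tracing(func, span_name):
    tracer = make_tracer(span_name)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # One span per invocation: opened before the user's function runs,
        # closed automatically when the with-block ends.
        with tracer.start_span(span_name):
            return func(*args, **kwargs)

    return wrapper

# In server.py the imported user function would be wrapped roughly as:
# userfunc = initialize_tracing(userfunc, span_name="greeter")
```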
So let's dive into initialize_tracing itself and see what it does. In Python you have decorators, which are nothing but a way to wrap your function code inside another function. What we do here is take the user's function, func, the one imported from the file, call it as part of the wrapper, and return its response. The tracing work happens before and after the call to the user's function.

Let's go through it step by step. The first step is that we create a tracer object. This tracer helps us create the tracing events we call spans, and we save it in a variable. The next step is to start the event: we call tracer.start_span. As you can see, it is inside a with block, and whenever that with block ends, the span is closed. So there are two moments, the start of the function call and the end of the function call, and together they are recorded as one span; we will see what a span looks like. When we start the span, we give it a name, the span name, which in our case is just the function name: greeter or image.

Let's see how you can visualize this. Let's run our greetings-as-a-service application and see how Jaeger visualizes it. As you can see, we are listing the available functions, and we have our greeter function and our image function here. We are also listing our triggers, which are message queue triggers; there are two of them, one triggering the greeter function and the other triggering the image function. Let's put something into the input queue, which is nothing but messages: we start one producer at the top, and we also start one consumer on our output queue. Let's try passing in one JSON message in a specific format, which has a name and a greeting. As you can see, we pass in this JSON, it is put into the input queue, and after our message travels through the whole system we get output in the output queue; as you can see here, we got the name, the greeting, and an image as well. Let's see what that image is. How many of you know XKCD? It is a website with really nice comics.

Now let's see what this execution looks like in the tracing backend. This is the Jaeger UI; as you can see, we selected the greeter function, and when we do find traces we get one span for our greeter function. It has some tags, like who generated it, the IP address, the version, and so on. Let's check the image function as well: it shows how much time it took. Now let's get back to what we had. As we saw, we have a greeter function span and we also have an image function span. But you can see there is one problem: we said you get the whole path, but right now we are getting one separate span for one function and another separate span for the other function. Let's take a look at why that is. In order to get the spans linked together, you need to do something called context propagation.
Basically, you have to say where the previous request came from, and when a request goes out, you have to make sure that context is passed along with it. That is how the Jaeger backend can link all the spans together. In our implementation we pass the context as HTTP headers: when you make the first request you create a span and some headers, and these headers are passed along the way. Say there is an incoming request to a function, and it has a header carrying the trace context as a string. We call tracer.extract on it, which gives us an object called a span context. Once we have the span context, created from that HTTP header, we start a new span, and while starting it we specify that this new span is a child of the span context we created from the header. And whenever we pass on the response from our function, or make another request to some other system, we create new headers and pass those along. So the context is carried the whole way. While implementing this we also save it in a variable called g: in Flask there is a request-scoped global object called g, and you can set values on it at any point during the request and use them later. We will see why we do that.

The whole flow looks like this. You get an HTTP request which has a trace context header key, you call extract on it, and you get a span context. The span context contains a trace ID and a span ID; the span ID identifies the previous request, the previous function call. Now you start a span which says its parent is that previous span ID, the trace ID stays the same, and your new span gets its own span ID. Whenever your response, or another outgoing request, leaves your code block, you do the inject and create a new header which has the same trace ID but the new span ID. There is a standard that specifies how all of this is encoded; each client library or backend tool may implement it in its own way, but the standard stays the same.

Let's see how the whole thing translates to actual code. We create the span context from the HTTP headers we received. When starting the span, we say it is a child of that previous span context. We also create something called generated headers, and we save both the generated headers and the current span we just created, and then we call the user's function code. So we do the inject ourselves, prepared out of the box: the user doesn't have to play around with any of this, they just take the generated headers and pass them along with their request or response. Roughly, it looks like the sketch below.
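Here is a hedged sketch of that extract / child-of / inject sequence using the OpenTracing Python API; traced_call and generated_headers are illustrative names, not the exact environment code.

```python
# A minimal sketch of the extract / child-of / inject sequence, assuming
# the opentracing package; names are illustrative, not the exact code.
from flask import g, request
from opentracing.propagation import Format

def traced_call(tracer, span_name, userfunc):
    # 1. Extract: rebuild the caller's span context from the incoming
    #    HTTP headers (returns None if no trace context was sent).
    parent_ctx = tracer.extract(Format.HTTP_HEADERS, dict(request.headers))

    # 2. Start a new span as a child of that context: same trace ID,
    #    with the caller's span ID recorded as the parent.
    with tracer.start_span(span_name, child_of=parent_ctx) as span:
        # 3. Inject: serialize the new span's context into headers so the
        #    function can pass them on with its response or any outgoing call.
        generated_headers = {}
        tracer.inject(span.context, Format.HTTP_HEADERS, generated_headers)
        # Stash them on Flask's request-scoped g so the user's function can
        # reach them without importing any tracing code itself.
        g.generated_headers = generated_headers
        g.span = span
        return userfunc()
```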
Now, the issue we had: when you use the Fission message queue trigger, it creates one HTTP request whenever there is a new message in the input queue, and it didn't have support for Kafka record headers. Kafka also supports record headers, so you can pass extra headers as part of your Kafka message. The whole workflow looks something like this: the Fission message queue trigger is running; whenever there is a new message in the input queue, it takes it and converts it into an HTTP request; your function processes it and creates an HTTP response; and that response is translated back into the Kafka format. During that translation our headers were being lost: even if our function generated headers and put them in the HTTP response, the message queue trigger was not passing them along. This was the change I made in the Kafka message queue trigger of Fission itself, and Fission now supports it: whenever there is an HTTP header as part of the response, it is converted into a Kafka record header, and similarly, whenever there is a Kafka record header on a message, it is converted into an HTTP header.

Let's see how that looks now. We have upgraded our Fission installation; you can see we are running the upgrade process here, and we will send another message. Let's put in "Rootconf" and see what we get. As you can see, we got the message here as the output. Now let's get back to our tracing backend, which is Jaeger in our case, and see what it shows when we look at traces for the greeter function. This is the change: you can now see the greeter function as well as the image function, and they are linked together. There is also a graph which shows how they are linked; you can see they are connected now.
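As a side note, this is roughly what a Kafka record header carrying trace context looks like from Python with kafka-python; this is only an illustration of record headers, not Fission's internal Go code, and the topic name and the uber-trace-id key (Jaeger's default header) are assumptions for this sketch.

```python
# Only an illustration of Kafka record headers with kafka-python, not
# Fission's internal code; topic name and header key are assumptions.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send(
    "input",
    value=b'{"name": "Rootconf", "greeting": "Hello"}',
    # Record headers travel with the message, so the consuming side (and a
    # header-aware trigger) can recover the same trace context an HTTP
    # header would have carried. The value here is only a placeholder.
    headers=[("uber-trace-id", b"<trace-id>:<span-id>:<parent-id>:<flags>")],
)
producer.flush()
# On the consuming side, each ConsumerRecord exposes these as record.headers.
```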
Now there is another issue when we use child-of. We said the new span is a child of the previous span, and in that case the tracing backend applies something called clock skew adjustment. Let's look at it as a diagram. When you have a parent-child relationship, the child is expected to finish before the parent span itself: here you can see we have a parent span, then child span A, and the parent also calls child span B, and both of these should finish before the parent span. That is how the Gantt chart is drawn. But in our case we have something asynchronous, with a message queue in between: we don't wait for the next function, we don't wait for someone to finish their call and get back to us. We just do our work, fire and forget, put our output somewhere, and it gets picked up by someone else. In that case the chart looks something like this, and the tracing backend thinks this is caused by network latency; it applies some averaging and calculations and tries to pull the child span inside the parent span, which for us results in wrong timestamps in the UI.

In order to handle this, there is a different reference type called follows-from, where we don't say the span is a child of the previous one, only that it follows from it. In that case, clock skew adjustments are not applied. The next issue was that jaeger-client-python didn't support references at all; you could only use a child-of reference. This was a change I made in jaeger-client-python itself, so now you can pass a single reference or multiple references, and you can specify follows-from as well. With that change, let's see how our instrumented tracing code in the environment looks. We still have the span context, which we still extract from the HTTP headers, but now we create a span reference of type follows-from, pass the span context to it, and while starting the new span we say it references that span reference. This says it is not a child of the previous span, just a reference to it. With this change you get proper timings, and (not sure if this is visible) there is another way to represent the graph in the Jaeger UI where you can see self-time: it shows that the greeter function takes less time compared to the image function, which is a darker red. Roughly, the follows-from version of the wrapper looks like the sketch below.
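A hedged sketch of that follows-from variant, mirroring the earlier child-of sketch; again the helper name is illustrative rather than the exact code from the environment.

```python
# A minimal sketch of the follows-from variant, assuming the opentracing
# package; it mirrors the child-of sketch above.
import opentracing
from flask import g, request
from opentracing.propagation import Format

def traced_call_async(tracer, span_name, userfunc):
    parent_ctx = tracer.extract(Format.HTTP_HEADERS, dict(request.headers))

    # follows-from instead of child-of: "this span was caused by that one,
    # but the caller did not wait for it" - fire and forget via Kafka.
    refs = [opentracing.follows_from(parent_ctx)] if parent_ctx else None
    with tracer.start_span(span_name, references=refs) as span:
        generated_headers = {}
        tracer.inject(span.context, Format.HTTP_HEADERS, generated_headers)
        g.generated_headers = generated_headers
        return userfunc()
```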
There is one more thing: the call tracer.close. When our application code sends tracing events to the tracing backend, that happens asynchronously: your application keeps doing its work, and in the background the client library keeps sending the data to the tracing backend. The issue is that tracer.close is not fully synchronous. What I mean by that: say your application has buffered some spans in its queue, still waiting to be sent, and you call tracer.close. It closes and deletes the tracer object, and the spans that were still in the queue, not yet sent to the tracing backend, are lost. The only workaround for this is to create the tracer and never close it; that issue is still open, so if someone is interested, they can pick it up. In practice, when your application starts it creates the tracer object, and even in our Fission functions, as long as the function keeps receiving requests the pod is not deleted, so the tracer stays around and keeps sending messages. There is also a timeout: if there are no requests to a function for five minutes, only then is the pod with that code deleted. So within those five minutes we can be reasonably sure the tracing events reach the tracing backend.

Let's quickly watch one small video where we debug an issue in this demo. We push one update to our tool which has something broken in it. As you can see, we run the upgrade and send one message, "Rootconf audience", but sadly we don't receive any response, so we must debug this. In the UI there are some specific standard tags you can filter on: we select one service, in our case greeter, and say error equals true. In the greeter we don't find anything, but in the image function we can see there is one error. And as we said, we can attach small logs as well, so here you can see the error kind, and it says it failed to fetch the URL.

Quickly, before we wrap up: make sure you are using 128-bit trace IDs, because with 64-bit trace IDs there is no guarantee you will always get unique IDs and you can run into collisions. If you are dealing with asynchronous applications, make sure you use follows-from. And if you are using something like serverless, or something on AWS where you are not sure about the connectivity between your functions and your tracing backend, make sure you use TCP or HTTP to send the tracing events.

Any questions? You can reach out to me; this is my email. One question from the audience: if I introduce this tracing within my microservices, which are expected to finish within a very short time frame, how much of an overhead will this incur? The client libraries are designed to impose a very minimal overhead, practically none; they do the sending of tracing events as a background activity, so it won't noticeably affect your application. A follow-up: is there a way to measure that overhead itself, will I be able to see it in Jaeger? As far as I know, no; I would need to check. There are some references I have linked in the slides on the proposal page, two books and the documentation. Do we have any other questions? Another question: these events get queued, so they are asynchronous and we don't wait for them, but eventually, when the events are processed, those CPU cycles are still spent, so that overhead would still be there, right? Yes, the overhead will be there, but it won't be much; the whole aim of how the client libraries are built is to keep the impact on the application to a minimum.