Hello everybody, my name is Pavol Loffay, and in this session I will talk about data analytics integration with the distributed tracing system Jaeger. Before we start, I would like to know: how many of you are running distributed tracing somewhere? How many of you would like to do distributed tracing? And how many of you would like to run data analytics jobs, or are here to learn about machine learning? I think I will disappoint you a bit: in this talk I will not talk about machine learning models or any AI in general. I will talk about the data analytics integration we built on top of Jaeger, so that later we can run jobs to aggregate data and extract additional features from the tracing data we collect.

But before we start, I would like to talk about the difference between monitoring and observability, because these two terms are often used interchangeably and I don't think they mean the same thing. Observability, according to many conversations you can find on the internet, is something that requires human interaction: you debug a system, but by using telemetry data. Not debugging with a traditional debugger, but using telemetry data to answer the question of what happened after the fact, after the request, after the transaction. Monitoring, on the other hand, is something that can work fully autonomously: it just collects the data and then validates, for example, alert definitions. My point here is that distributed tracing is something we use to debug our systems; it is used for root cause analysis when there is a problem in our environment. So it makes a lot of sense to provide a capability in distributed tracing so users can ask very complicated questions about what happened with a request. Our goal with this integration, allowing data science on top of tracing data, is basically to derive more information from the traces we collect.
Traces contain a lot of rich information: we see the transaction end to end, we see everything that happened, with a lot of metadata attached. You can run these models as part of your standard Jaeger deployment to derive metrics applicable to all the data you collect. Another use case might be that there is an incident and you would like to do some on-demand incident analysis: you would spin up a Jupyter notebook and write some very simple code to verify your hypothesis about whether a specific action is responsible for bringing your system down. This platform can be used by the operators who run the system, but also by data scientists who don't know how to deploy Jaeger, don't understand how tracing works, and just want to write data analytics jobs.

So, today's agenda: I will talk about how we approached this problem in Jaeger; there will be some architecture, what kind of architecture allows us to do these things; then something about a trace domain-specific language, because we want to make it very easy for developers to aggregate traces and extract features from them; and last, what kind of metrics we have been able to derive from traces so far. So, no AI, I'm sorry.

How do we approach this in Jaeger? At the beginning we have to query the data: we have a lot of data and we need access to it. Then we apply some kind of filtering, and after the filtering we run the model, or we just extract the features we are interested in. The result is again data, and we have to store it somewhere. To get the data: a Jaeger deployment can already be configured to use Kafka, so we can just connect to the same Kafka instance, get the data from there, and aggregate it over time-based windows.
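The four steps just described (query, filter, model or feature extraction, store) can be sketched as a tiny pipeline. This is purely illustrative: the function and field names below are assumptions for the sketch, not part of the Jaeger API, and a real deployment would read from Kafka or the query service rather than an in-memory list.

```python
# Sketch of the four-step analytics pipeline: query -> filter -> extract -> store.
# All names are illustrative, not part of the Jaeger API.

def query(spans):
    """Step 1: obtain raw span data (here just an in-memory list)."""
    return spans

def filter_spans(spans, kind="client"):
    """Step 2: keep only the spans we are interested in."""
    return [s for s in spans if s.get("kind") == kind]

def extract_features(spans):
    """Step 3: derive a simple feature, e.g. average duration in ms."""
    if not spans:
        return {"avg_duration_ms": 0.0}
    return {"avg_duration_ms": sum(s["duration_ms"] for s in spans) / len(spans)}

def store(result, sink):
    """Step 4: the result is data again -- persist it somewhere."""
    sink.append(result)

sink = []
spans = [
    {"kind": "client", "duration_ms": 120},
    {"kind": "server", "duration_ms": 80},
    {"kind": "client", "duration_ms": 60},
]
store(extract_features(filter_spans(query(spans))), sink)
print(sink[0])  # {'avg_duration_ms': 90.0}
```

The important point is the shape: the output of the model step is data again, which is why the storage question discussed later is open-ended.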
That is for the new data you collect; for old data we can use the Jaeger query service, which connects to the database. For the filtering: a trace is basically a graph, a directed acyclic graph, and navigating a graph might not be super easy, so it makes a lot of sense to provide a high-level API that simplifies navigation. For example, in the trace domain-specific language a query might look like: get me all client spans with a specific duration; or you might want to answer whether two spans are even connected. Then, for storing the data, this is an open question; we don't know yet how we should approach it. Different models or different types of analysis might produce different results, which would require a different schema for each result. But maybe some generic schema can be used for most of the models, or, since some of the feature extraction will result in metrics, you can directly expose the metrics to Prometheus and just collect them from Prometheus.

This is the current Jaeger architecture. On the left side there is your application, instrumented with a Jaeger client, the OpenTracing API, whatever; the important thing is that it reports data to the Jaeger collector. The collector sends it to Kafka, and from Kafka there is a separate component which reads the data and stores it in the storage.
So we can basically hook our job up to the same Kafka instance, the same topic, and read from there in time windows. The problem with tracing is that, say you have a microservice architecture with 10 services and a request goes across them: each of these services reports tracing data, spans, and these spans arrive in Kafka at different times. You don't get the trace as one object, you only get spans, so you have to aggregate over time windows. Once you do that, you can aggregate by trace ID, extract the traces, and then run the analysis. This is for the models you would like to run as a standard deployment: you just deploy something like Spark Streaming or Flink streaming, extract the results, and expose metrics or store the results to a database. For on-demand analysis, you can spin up a Jupyter notebook and connect to the same data store, or to the Jaeger query service, to get historical data.

About the trace DSL: a trace is a directed acyclic graph, so it makes a lot of sense to represent it as a graph, and we tried to take two open source graph query languages and extend them, so you can use these languages to query the trace data. First I experimented with Cypher from Neo4j. It's something like SQL, a declarative language: you define the query and then you run it against the database. But our problem is that we don't expose a graph query API; our storage doesn't expose that, because we store the spans as documents, so we couldn't use Cypher.
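The windowed reassembly of spans into traces can be sketched as follows. This is a toy in-memory analogue of what a Spark or Flink streaming job would do over Kafka; the span fields and the tumbling-window scheme are assumptions for illustration.

```python
from collections import defaultdict

def assemble_traces(span_stream, window_seconds=60):
    """Group spans into tumbling time windows, then by trace ID, so that
    spans arriving at different times are reassembled into whole traces."""
    windows = defaultdict(lambda: defaultdict(list))
    for span in span_stream:
        window = span["timestamp"] // window_seconds  # tumbling window key
        windows[window][span["trace_id"]].append(span)
    return windows

spans = [
    {"trace_id": "t1", "service": "frontend", "timestamp": 10},
    {"trace_id": "t2", "service": "backend",  "timestamp": 15},
    {"trace_id": "t1", "service": "backend",  "timestamp": 12},  # same trace, different arrival
    {"trace_id": "t1", "service": "db",       "timestamp": 70},  # falls into the next window
]
windows = assemble_traces(spans)
print(len(windows[0]["t1"]))  # 2 spans of trace t1 landed in the first window
```

Note the inherent trade-off the speaker mentions: a span that arrives outside the window (the third `t1` span above) is separated from the rest of its trace, so window size matters.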
The next choice was Apache TinkerPop, which is a framework with a lot of components. One of the components is Gremlin, a graph traversal language. It's very different from Cypher: it's a functional language, and Gremlin is great because you can actually add new methods to the language, as we'll see in a minute. Then there is TinkerGraph, an in-memory representation of the graph. So what we actually do: once we read the data from storage, we aggregate it and construct an in-memory graph for each time window, and then run the query against that in-memory representation.

When you want to extend Gremlin, you basically extend two APIs, or two interfaces. One of these is the graph traversal. For example, the first method basically says: get me the trace with this specific ID, if it is in the graph. It calls the Gremlin core API, which checks whether a given property exists on an entity, usually a vertex, and returns it. A more complicated one is, for example, root span: we want to jump to the root span, which means looking for vertices which don't have any incoming edges, i.e. the root vertex. From my experience with Gremlin, I think it makes a lot of sense to provide these methods, because it's not trivial, and I don't think people would bother to learn Gremlin just to verify a hypothesis about the data they collect. These ones are relatively easy to write, but you quickly end up with complicated ones when you want to verify something non-trivial. So how do we use it?
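The "root span" idea (a vertex with no incoming edges) is simple to state even outside Gremlin. The actual Jaeger extension is written against TinkerPop's Java traversal interfaces; the sketch below expresses the same predicate in plain Python over a hypothetical edge list, just to show the logic.

```python
def root_spans(vertices, edges):
    """Find root spans: vertices with no incoming edge (i.e. no parent).
    `edges` is a list of (parent_id, child_id) pairs."""
    children = {child for _parent, child in edges}
    return [v for v in vertices if v not in children]

# A small trace: a -> b -> c, and a -> d
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "d")]
print(root_spans(vertices, edges))  # ['a']
```

In Gremlin proper this would be a traversal step filtering on the absence of incoming edges; wrapping it in a named method is exactly the kind of convenience the extended interface provides.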
For example, the first question: is this a client span with a specific duration? We use hasTag from the extended interface, providing the tag key and value, and then duration, which is also from the extended API, providing a predicate for the specific duration we're looking for. Another one: are two spans connected? Again, you provide the name of the span and then use the Gremlin API: repeat following outgoing edges until there is a vertex which has a given property, for example the child's name. I also think it makes sense to wrap this into a utility method like this one, which asks: is the span with this operation name a parent of this other one? It just returns a traversal.

As I mentioned, Jaeger doesn't support graph queries, so you cannot write a complicated Gremlin query and run it against our database or our query service, which kind of sucks, because sometimes you would like to run the analysis on historical data. Our thinking is that maybe we will never support the full graph API in our services, but maybe we can identify the common, well-defined use cases and implement just that subset of the Gremlin API.

So what kind of metrics do we derive from traces at the moment? The first one is network latency between two services. I don't think you can derive this from any metrics system: you get, for example, server-side request duration metrics, and you may get client-side request duration, but you are never able to split them by host name or by service, because there is no such tag in those metrics. Another one is trace depth, which is basically the height of the trace tree.
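The two queries just described (filter client spans by duration; check whether two spans are connected) look roughly like this when written out. These are Python analogues of the Gremlin traversals, with hypothetical span/edge structures; the second function mirrors Gremlin's `repeat(out()).until(...)` pattern.

```python
def client_spans_slower_than(spans, min_duration_ms):
    """Analogue of a DSL query like hasTag('span.kind','client').duration(gt(...))."""
    return [s for s in spans
            if s["tags"].get("span.kind") == "client"
            and s["duration_ms"] > min_duration_ms]

def is_ancestor(edges, parent, descendant):
    """Analogue of Gremlin's repeat(out()).until(...): follow outgoing
    edges from `parent` until `descendant` is reached (or no edges remain)."""
    frontier, seen = {parent}, set()
    while frontier:
        nxt = {c for p, c in edges if p in frontier} - seen
        if descendant in nxt:
            return True
        seen |= nxt
        frontier = nxt
    return False

spans = [
    {"id": "s1", "tags": {"span.kind": "client"}, "duration_ms": 250},
    {"id": "s2", "tags": {"span.kind": "server"}, "duration_ms": 300},
]
edges = [("s1", "s2")]
print([s["id"] for s in client_spans_slower_than(spans, 100)])  # ['s1']
print(is_ancestor(edges, "s1", "s2"))  # True
```

Wrapping `is_ancestor` behind a named utility method is exactly the "is this span a parent of that one" helper mentioned above.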
Very similar to trace depth is service depth: the length of the chain of services calling each other. For example, service A calls B and B calls C, so the depth is 2, because there are two edges between services, between hosts. It's basically the number of edges, or number of network calls, between the services. And the last one, for example, is dependent services: how many services depend on your service, or how many services call your service. These metrics can just be exported as Prometheus metrics; the Spark streaming job can expose a Prometheus endpoint and Prometheus will scrape those metrics. Here is an example of the network latency and the trace depth.

The other type of metrics we are able to derive at the moment is trace quality metrics. In tracing, the most difficult part is actually instrumenting your systems, for example when you deploy tracing in a new environment, and there are usually a lot of problems: people can forget to instrument some APIs and then you get split traces. So it also makes a lot of sense to measure the quality of your instrumentation. This is actually a UI from Uber, the taxi company: they built a custom tool which is able to measure these metrics. What they do, for example, for a given service, is measure whether a trace has all the expected metadata: for example, whether there are both client and server spans, whether the Jaeger client is reporting them in the appropriate format, whether they are using the right Jaeger client version, and things like this. So this is the KPI; in this case it's one where everything looks fine. The problem with this tool is that they calculate the results and then store them in a Cassandra table, which means another dependency on a database. And when you think about these metrics, each one is basically a counter.
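To make the structural metrics concrete, here is a sketch that computes trace depth, service depth, and dependent-service counts from one assembled trace. The span shape (`id`, `parent`, `service`) is an assumption for illustration, not Jaeger's actual span model.

```python
def trace_metrics(spans):
    """Derive structural metrics from one assembled trace.
    Each span: {'id', 'parent', 'service'}; parent is None for the root."""
    by_id = {s["id"]: s for s in spans}

    def depth(span, cross_service_only=False):
        d, cur = 0, span
        while cur["parent"] is not None:
            parent = by_id[cur["parent"]]
            # service depth counts only edges that cross a service boundary
            if not cross_service_only or parent["service"] != cur["service"]:
                d += 1
            cur = parent
        return d

    dependents = {s["service"]: set() for s in spans}
    for s in spans:
        if s["parent"] is not None:
            caller = by_id[s["parent"]]["service"]
            if caller != s["service"]:
                dependents[s["service"]].add(caller)

    return {
        "trace_depth": max(depth(s) for s in spans),          # height of the span tree
        "service_depth": max(depth(s, True) for s in spans),  # network hops between services
        "dependents": {svc: len(callers) for svc, callers in dependents.items()},
    }

# A -> B -> C, with an extra internal span inside B
spans = [
    {"id": "1", "parent": None, "service": "A"},
    {"id": "2", "parent": "1",  "service": "B"},
    {"id": "3", "parent": "2",  "service": "B"},  # internal call inside B
    {"id": "4", "parent": "3",  "service": "C"},
]
m = trace_metrics(spans)
print(m["trace_depth"], m["service_depth"])  # 3 2
```

This matches the A → B → C example: the trace tree is 3 spans deep, but there are only 2 network hops, because the extra span inside B doesn't cross a service boundary.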
It counts spans which satisfy or don't satisfy the criteria. So I thought maybe we can just export them as Prometheus counters, and then you get the same KPIs. But there is a problem: the tool also provides the ability to jump to the traces which don't satisfy the criteria, the bad traces. We don't have this capability at the moment, because the metrics APIs don't allow you to label individual traces; there is no correlation between traces and metrics. But maybe later. So this is an example of the Prometheus metrics: we see the trace quality, for example per client version, and we see that, for example, the service route with this client version is failing; basically all of its spans fail this criterion. Then I thought about how we can calculate this KPI as just one number. We don't want to look at many metrics; we want one number that says what the quality of your instrumentation is. I think it's actually possible with the Prometheus query language: you can just sum up the numbers and then divide by the total count. But still the problem is how to navigate to the trace instances, the trace exemplars. There is actually ongoing work in open source metrics libraries to support trace exemplars, so you will be able to jump from metrics to trace instances.

Okay, basically everything I have presented is a new effort in terms of where we want to go with Jaeger. We want it to be more of a data analytics platform, not just visualize the data, because that's what we do at the moment; we don't do any kind of post-processing. It's very new, so if you have any feedback, any comments on what kind of metadata we should derive from traces, or what more we could provide to users, then come and talk to me or create an issue. We are happy to hear your feedback. Okay, this is everything that I have. Do you have any questions?
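The "sum up and divide by the total count" idea could look roughly like this in PromQL. The counter name and labels here are hypothetical, just to illustrate the shape of the query:

```promql
# Hypothetical counter: jaeger_quality_spans_total{result="pass"|"fail", check="...", service="..."}
# Single KPI: fraction of spans passing all quality checks.
sum(jaeger_quality_spans_total{result="pass"})
  /
sum(jaeger_quality_spans_total)
```

Dropping the outer `sum()` in favour of `sum by (service)(...)` would give the per-service breakdown instead of one global number.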
[Audience question, partly inaudible, about deploying Jaeger and doing analytics without the full pipeline.] Yes, so the question is: is it possible to do data analytics with Jaeger without Kafka? Maybe yes: you could get the historical data from the Jaeger query service, but you will not be able to connect to the stream of incoming data, because the other storages we use don't support this. Any other questions? Yes? So the question is how this differs from other analytics platforms, or whether we could export our data to them. I think you can do it if the other platforms support importing data from other systems. I don't know what kind of features they provide, so it's hard to tell. But I was thinking, for example, at Red Hat there is a component called the log anomaly detector, or something like this, so maybe we could transform traces to logs and then run the log anomaly detection on those. It's maybe the same type of question you asked. So I think it also makes sense to use other systems if they provide such a capability. Yes, so the question is whether we considered the shape of the trace as an input for data analytics. Definitely; actually one of the new features in Jaeger is a new visualization where you can see the differences between two traces. When you see an incident, you can usually find a normal trace with expected behavior for the same endpoint and roughly the same parameters.
Then you compare the structures, and you will immediately see where the error happens, because that's the place where the structure differs between the two traces. No, not so far, but I think it makes sense to have such an AI job which does that. Any other questions? So maybe one of the use cases is that your service is calling itself recursively, although there can be a better metric for that. But, for example, you might want to limit the number of network hops, to keep it minimal for performance reasons. Other questions? Okay, then thank you very much.