Today, we want to share with you our learnings from using OpenTelemetry to monitor GraphQL queries in production, and how our SREs and developers can best work together to troubleshoot these issues. For this talk, Ahmet is going to represent the developer side of things, and I'm going to represent the operations side of things. So I am responsible for making sure that everything runs smoothly in production. But these developers, they keep coming up with new technologies, introducing new problems and complexity in production. It gets very overwhelming. And today, Ahmet wants to push a new release to production using GraphQL. Hey, Ahmet, why do you even want to use GraphQL? What is that thing?

Well, as SREs, you've probably heard about GraphQL. Your development teams might be thinking about using it, or you've started adopting it, but you might not be fully up to speed on the pros, the cons, the nuances of GraphQL, or even what it's useful for. So over the next few minutes, we'll go through a quick introduction to make sure you've got enough information to understand what we're going to be dealing with.

SQL, as I'm sure we all know, is a query language used to manage and manipulate data in a database. It allows you to access many records with a single declarative query, and it introduces the concept of accessing records by some key. Operations that can be performed on data within a database include select, insert, update, and delete, and with a single query we can access data across multiple tables using a join. These concepts made their way into REST, which gives us access to resources using different HTTP methods, like GET, POST, PUT, and DELETE. But to access a specific resource or a nested resource, you need to do so via different endpoint URLs, and you can't perform joins using REST.

To me, GraphQL takes ideas that were originally developed to query databases and applies them to the wider internet, so that we can expose them as an API. A single GraphQL query can return connected data. Like SQL, you can use GraphQL to change or remove data, too. Unlike SQL, though, rather than just querying data stored in a database table, GraphQL is a bit more general purpose. Data can and does reside almost anywhere: in a database, across multiple databases, in different file systems, in other REST, gRPC, and SOAP services, and even in event-based systems like Kafka. So just as SQL is a declarative query language for databases, GraphQL is a declarative query language for the internet.

OK, this sounds cool. But why do we need GraphQL? What problem does it solve? Isn't REST good enough?

Well, the three requests above, in SQL, REST, and GraphQL, are all roughly equivalent. We need to get all the posts which have been published, and also find out each author's name, in order to populate the front end for a page on a blog site. We just need two properties from authors, id and name, and three properties from posts: title, author ID, and status. Overfetching is simply when an API request returns far more data than you're going to use. With REST, we get back the entire resource, so we end up overfetching from the API, and this can be really inefficient, especially for large resources with list-based responses. My REST API call queried all the posts with the published status, but still got back too much information. All I need is the post title and the author ID.
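As a rough illustration (the field names here are assumptions for the blog example, not taken from the actual demo), a GraphQL query that avoids that overfetching might look like this:

```graphql
# Ask for exactly the fields the front end needs, nothing more.
query PublishedPosts {
  posts(status: "published") {
    title
    authorId
  }
}
```

The server resolves only those two fields per post, so the response contains nothing you didn't ask for.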
Underfetching is when an API call doesn't return enough data, so it necessitates extra calls to enrich that data. I might need multiple entities at one time, and if each request underfetches the data that we want, we need to perform several round trips to the server to build our required data model. So in the same example, I got back the author IDs, and now I need to call each author endpoint in order to obtain the author name. That's a lot of calls: wasted bandwidth, added latency, and pain for the end users. Every additional API call adds precious latency to each interaction, increasing the response time, and adds a lot of complexity for our customers integrating with us.

Boom, you got it. So now that you know a little bit about the kinds of problems GraphQL solves, let's take a look at what GraphQL looks like. We can see an example schema, a query operation, and an example response (there's a small sketch of this trio after this exchange). You'll see that when you query the API, your GraphQL request syntax conveniently mirrors the shape of the JSON that you'd expect in the response. API consumers can request exactly the data that they need with absolute flexibility, so they know exactly what data is going to be returned. Nothing more, nothing less. And API producers can define their schema and write some clever code, or do it declaratively with a tool like Tyk, to dynamically resolve those requests. Because we've got a shared schema for both producers and consumers, we get the benefit of being able to scaffold and generate both server stubs and type-safe clients, just like with gRPC.

Okay, I get it. That sounds interesting. Tell me more about the service you're about to push to production. What do I need to know about it?

So let's move away from our contrived blog example for a moment and start talking about a contrived travel business. If you imagine that every type in GraphQL might be represented by a microservice, we could think of GraphQL as an ingress which conveniently exposes those microservices as a combined, purposefully designed API product. We've got three different microservices here, a country service, an image service, and a weather service, and a GraphQL schema which neatly provides an API product on top of them. In the simplest use case, I just have one client, which happens to be an internal React app for our website. Given that the React app is an internal consumer of our KubeTravel service, our dev team will probably reach out via Slack if something isn't working quite right. But in reality, if we just have one consumer, GraphQL is probably a little bit of overkill. Given that KubeTravel is an API product, there might be a whole slew of different apps and consumer types built by different developers, from internal teams through to partners and even the general public, building against our GraphQL API and even publishing back to our API marketplace. If something goes wrong, we need to know about it, because our users aren't going to be so forgiving.

Okay, so another question: how do we monitor GraphQL in production? Let's try to apply the RED method. As a reminder, the RED method is a monitoring strategy used to gain insight into the health and performance of distributed systems. RED stands for rate, the number of requests per second your service is serving; errors, the number of failed requests per second; and duration, how long each request takes.
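Here's that schema-query-response trio as a minimal sketch for the blog example (type and field names are assumptions for illustration). A single nested query replaces the extra per-author round trips, and the JSON response mirrors the shape of the query:

```graphql
# Schema (SDL)
type Author {
  id: ID!
  name: String!
}

type Post {
  title: String!
  status: String!
  author: Author!
}

type Query {
  posts(status: String): [Post!]!
}
```

```graphql
# Query operation
query PublishedPostsWithAuthors {
  posts(status: "published") {
    title
    author {
      name
    }
  }
}
```

```json
{
  "data": {
    "posts": [
      { "title": "Intro to GraphQL", "author": { "name": "Ada" } }
    ]
  }
}
```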
Based on those RED metrics, you can understand how well your service is doing and set up your SLOs, alerting, and everything that you need. So how are we going to do that with our application? The first step is to instrument our GraphQL service with OpenTelemetry to get distributed traces. Now, there are different implementations of GraphQL available on the market, including Tyk's implementation, but here we are going to use a Node.js implementation of GraphQL with the Express framework.

Okay, so let's check whether there's already an OpenTelemetry instrumentation available for Node.js and GraphQL. We can go to the OpenTelemetry website and, in the ecosystem section, have a look at all the instrumentations that are available. If we search for GraphQL, we see that we're lucky: there is already an instrumentation library for GraphQL on Node.js, and this is what we are going to use in this presentation. In Node.js, we can use a trace.js file to instrument our service with OpenTelemetry, like you see here, and this is where we add our instrumentations. So this is where we added the GraphQL instrumentation alongside the HTTP and Express instrumentations that we were already using. If you look at the code, you will notice that we're exporting the spans to an OpenTelemetry Collector.

And here's the result: an end-to-end distributed trace in Jaeger. We can see the Tyk API Gateway starting the transaction and reporting some spans, then the GraphQL service reporting its spans, and finally the GraphQL service calling a REST service, which is also instrumented, so we get its spans as well. The end-to-end distributed trace is now available in Jaeger.

Okay, we have tracing in place, but to monitor in production, we need some metrics: the RED metrics we talked about earlier. So how do we get from the traces to those RED metrics? Well, Jaeger has an out-of-the-box integration for RED metrics. It uses a component of the OpenTelemetry Collector called the span metrics connector (previously called the span metrics processor), which generates metrics from the spans. The span metrics connector creates two metrics based on the spans: calls_total, the number of spans, including error spans, with errors identified as a time series with a status code label; and the latency, which is reported as a histogram. Those metrics are then stored in Prometheus, and Jaeger is able to connect to Prometheus to display them. And here's the result: in Jaeger, there is the Monitor tab when this is set up, and I now have my request rate, error rate, and duration for my GraphQL service.

Hey, that's good, right? I'm ready to monitor this in production. What do you think? What could go wrong? So let's look at two different error scenarios involving GraphQL queries: upstream errors and resolver errors. I'm going to send a request to get some information about Italy, along with its weather data, but it's not working. Are you able to detect those kinds of issues? Let's check. Okay, let's try this out. Yes, you see, I see an increased error rate here on my dashboard. Let's check the traces to understand the issue. I can find the traces with errors in Jaeger by filtering on the error tag and looking at those traces that have errors. I can quickly see that the GraphQL service is returning a 500 HTTP error code because the weather service it was calling returned a 400 error status code.
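For reference, here's a minimal sketch of the trace.js setup described above. The package names are the real OpenTelemetry ones; the service name and Collector endpoint are assumptions:

```js
// trace.js: bootstrap OpenTelemetry before the app starts.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { GraphQLInstrumentation } = require('@opentelemetry/instrumentation-graphql');

const sdk = new NodeSDK({
  serviceName: 'graphql-service', // assumed service name
  // Export spans to an OpenTelemetry Collector (endpoint assumed).
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' }),
  instrumentations: [
    new HttpInstrumentation(),    // inbound and outbound HTTP spans
    new ExpressInstrumentation(), // Express middleware and router spans
    new GraphQLInstrumentation(), // parse, validate, execute, and resolver spans
  ],
});

sdk.start();
```

You would typically load this before the application code, for example with node -r ./trace.js server.js.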
So in that upstream error example, I get all the information I need from OpenTelemetry. And take a look: I can even see which query is having the issue. I have a look at the errors, and then I have the query, so you can fix it. Okay, I fixed the bug. Let's try again. An error, with the temperature not being shown in the customer app. Okay, let me check. Let's check the traces. No, no error, all good. Can you maybe try to reproduce it? Look, Sonja, the error is in the response body. Here it is.

GraphQL errors are fundamentally different from REST API errors. We can't rely on the HTTP status codes and status texts. According to the spec, the response of a GraphQL endpoint should contain either a data field or an errors field, and in some cases both. In this particular case, we can see that the errors field returned is an array of objects containing the following fields: a message and locations. The message is the actual error message, and locations contains the line and column, the location of the error within the query. So what can we do to catch these kinds of errors?

Okay, so right now we are missing this error information on our spans, so let's take a look at the OpenTelemetry semantic conventions to see if there's an attribute we can use for that. The semantic conventions are a set of recommended attributes for different technologies. It's not mandatory to use them, but it's recommended, and it makes it easier to process the information on the observability backend. Each instrumentation library is then responsible for implementing them. And we can see that the semantic conventions for GraphQL contain only a couple of attributes, the operation name, the operation type, and the document, but nothing specific about GraphQL errors. So let's add it ourselves. With this code, we are adding our own attribute, graphql.error.message, to the span through the instrumentation (there's a sketch of what this might look like after this exchange). If we do that and run into the same error, we can see the error being recorded on the span. Notice that it's not on the first span: the first span is still the one traced by the HTTP layer, and that one says 200, everything's fine. But at least we can start recording our GraphQL errors and reporting on them. And in Jaeger, the error rate diagram still doesn't change, because it's based on that first span, but we can use Prometheus and add custom queries to retrieve the error rate for GraphQL queries based on the manual instrumentation we've added. So what we have learned here is that GraphQL error detection doesn't work like it does for a standard REST API, and we need more logic around it. We solved it by adding a GraphQL error attribute in our instrumentation, but that's something that could be added to the instrumentation library down the line.

So we have spoken about errors, but what about performance issues? Can we use the OpenTelemetry data in Jaeger to detect performance issues? So, Ahmet, this is what I have. What do you think? Is it good enough for performance? All the different queries that are being sent to my GraphQL server are being aggregated under the /graphql endpoint. Yeah, because they're all sent to the /graphql endpoint. So yes, what's wrong with that? Well, it's not that easy with GraphQL. We don't know what queries clients are sending us.
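Before we dig into performance, here's the error hook we just described, as a minimal sketch. It uses the responseHook option that @opentelemetry/instrumentation-graphql exposes; the attribute name graphql.error.message is our own invention rather than part of the semantic conventions:

```js
const { SpanStatusCode } = require('@opentelemetry/api');
const { GraphQLInstrumentation } = require('@opentelemetry/instrumentation-graphql');

const graphQLInstrumentation = new GraphQLInstrumentation({
  // Called with the execute span and the GraphQL execution result.
  responseHook: (span, result) => {
    if (result.errors && result.errors.length > 0) {
      // Record the GraphQL errors on the span with our custom attribute,
      // since the HTTP layer will still report a 200.
      span.setAttribute(
        'graphql.error.message',
        result.errors.map((e) => e.message).join('; ')
      );
      span.setStatus({ code: SpanStatusCode.ERROR });
    }
  },
});
```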
Back to performance: one client can send multiple different queries, and we could have multiple clients consuming this API, each in their own weird and wonderful way. So each client might have a different performance profile on a per-query basis. And while one client could be experiencing performance problems, another client could be performing perfectly fine. It's all about the type of query and the resolvers. Like with databases, right? Yes, but with added complexity, because you wouldn't expose your SQL database to your API consumers, would you? So if we go back to that screen, the p95 latency doesn't really mean anything, because it's aggregated over every single GraphQL query sent to the server. We need more granularity.

Also, Sonja, there are a couple of things you need to know about performance. We'll touch on a few of them, from the N+1 problem to cyclic queries. Complex and deeply nested queries can also impact performance. One of the most commonly used features of GraphQL is the ability to get nested data within a single server request. Unfortunately, it's very easy to configure the data loading in a way which won't scale as the data grows. In this example, you can see what looks like a pretty harmless query: it requests the continents of the world, then for each continent it grabs the countries, and then for each country it gets some weather data.

You know what? I think using OpenTelemetry, we can detect that. You can nicely detect N+1 queries; let me show you. In this particular example in Jaeger, looking at the trace diagram, we can see 27 HTTP GET calls made to the weather service from one single GraphQL query (there's a sketch of the kind of resolver that produces this after this exchange). Of course, I would expect this kind of query to work pretty well on the development side, on your machine, but in production this kind of query will not scale, so we need to be able to catch them. One thing that we could do is get that number into Prometheus, for example the average number of outgoing requests per GraphQL query, and you can even use it to set alerts in test or production environments. Very cool.

So, cyclic queries: they reuse the same object one or more times in a cyclical manner. They're pretty unusual; I mean, why would anybody actually want to grab all the continents, then for each continent grab the countries, then dive back into the continent of that country, and so on? So this could be done in error, but more likely it's a potential vulnerability for GraphQL APIs, and they need to be protected against, as they can lead to a denial of service. And some queries are easier for a server to resolve than others. The depth of a query is the number of nesting levels; in this example, we've got a query depth of three: a, b, c. The complexity of a query is typically defined by the number of fields that have been requested; in the example to the right, you'll see a query with a complexity of six. It's also possible to assign custom complexity values on a per-field basis. So we can assume that queries with higher depth and higher complexity are going to be more expensive for a server to resolve than simpler queries, and as such add precious latency.

Okay, let's try to see what we can do better to track the performance of a GraphQL service. Again, we need to check what information we have on our spans. And here we see we have the full query and the field names. Actually, this doesn't really match the semantic convention we saw earlier.
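As an illustration of that N+1 pattern (the weather service URL and field names are assumptions), a naive resolver like this fires one outgoing call per country, so a single query touching 27 countries becomes 27 HTTP GETs:

```js
// Naive resolver: one outgoing HTTP call per resolved country, the classic N+1.
const resolvers = {
  Country: {
    weather: async (country) => {
      // Assumed weather service endpoint; called once per country.
      const res = await fetch(`http://weather-service/weather?code=${country.code}`);
      return res.json();
    },
  },
};
// The usual mitigation is batching (for example with the dataloader package),
// but the point here is that the trace makes this fan-out visible.
```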
So here we see different libraries using something other than the semantic convention. But still, we can leverage the attributes that we have on our spans to report a new set of metrics. So let's update our OpenTelemetry Collector configuration to add those attributes as metric dimensions, so that the metrics for the number of calls and the latency, the RED metrics, gain new GraphQL dimensions: the source (the full query), the field name, and the field path. And here you go: we can group them by unique queries. What do you think?

I guess this could work, but if we've got an API product, we're simply going to have too many similar-looking queries that are all different, and it's going to be very difficult to get any kind of useful information. Could there be a better way to group them?

Okay, let's think about something else. Let's look again at the OpenTelemetry semantic conventions to see if there's an attribute we could use for that, and maybe in the future we could contribute it back to the instrumentation library to fix this. So we have again those three fields: the operation name, the operation type (whether this is a query, a mutation, or a subscription), and the query document itself. So what if we grouped the query performance by the operation type and name (there's a sketch of the Collector configuration for this after the summary)? What do you think?

Hmm, that's much better. But at the same time, we've still lost a lot of detail. For example, for the query continents, we've got an average response time of 2.4 seconds. Is this because most queries are deeply nested? So maybe we could try adding client identification: for each group of operation name and type, we could add the identity of the client making the call. Then you would know if, say, a mobile client is sending more complex queries, and where you should look in particular. I guess that might work, but it's very much dependent on the clients' specific queries and those kinds of workloads. I think we need to explore this further with the community. What do you think?

So we are almost at the end of our talk; let's summarize what we have learned so far. OpenTelemetry is helpful for monitoring and troubleshooting GraphQL queries in production, but there are still a couple of things we need to improve together. The first one is the semantic conventions: you have seen that there is a definition, yet some libraries don't use it, and it would be super helpful to have all the libraries at the same level of semantic convention. We are also missing a definition for GraphQL errors, because you have seen that GraphQL doesn't behave like a standard HTTP call, so we need additional attribute information to understand what's happening in GraphQL; it's required to be able to catch errors at all. And if you think about error-based sampling, where you only keep a certain number of traces, you want to make sure that you capture all the errors, so you would need that kind of attribute to catch the GraphQL errors. Finally, the RED method of monitoring is useful for GraphQL, but it needs to be extended with GraphQL specifics: operation type, operation name, maybe query depth, maybe query complexity. It also depends on the workload you have, whether there are many different clients or just one making calls to your GraphQL service.
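For reference, here's a fragment of what that Collector-side grouping might look like with the span metrics connector. The GraphQL attribute names follow the semantic conventions; the receiver and exporter names are assumptions, and their definitions are omitted from this fragment:

```yaml
connectors:
  spanmetrics:
    # Promote GraphQL span attributes to metric dimensions so the
    # RED metrics can be grouped per operation type and name.
    dimensions:
      - name: graphql.operation.type
      - name: graphql.operation.name

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics, otlp/jaeger]  # keep shipping traces to Jaeger
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]     # metrics land in Prometheus
```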
And using OpenTelemetry in production, the operations team now has all the visibility they need to ensure the reliability of the system. So thank you very much. Let us know if you've got any questions, either now or online using this QR code. A copy of our slide deck is also available via that QR code if you want it. And we've got the best stickers, a special edition just for KubeCon, so make sure you grab some on your way out. Thank you.