I think I have. Ultimately, we put it in the evolving technology track. And now we'll have "OpenTelemetry in Kubernetes: Oh, Tell Me How, Oh, Tell Me Why" by Parul and Sally O'Malley. The stage is all yours, Sally and Parul.

Thanks. I just want to mention that at the end we have a short video, and if the sound is too low in that — and I'll share it, Shelby, I'm sorry, Selby — if the sound is too low, you might just want to bump it up; I recorded it a little low. So here we go: OpenTelemetry. I'm Sally O'Malley. I work at Red Hat, I work with Parul, and I'll let her introduce herself.

Hey, everyone. Sally and I work on the same team in emerging tech, and this is our journey with OpenTelemetry.

Yes, what a journey. OpenTelemetry is a CNCF sandbox project. It's pretty new: it brought together two other open source projects, OpenCensus and OpenTracing. Those two merged only in 2019, and ever since then the community has been growing OpenTelemetry. We have a vision of instrumenting the core components of Kubernetes, and that's what we want to tell you about here. And I have the slides, so I must.

Our agenda: we want to tell you about that drive to enable OpenTelemetry upstream in Kubernetes. We want to talk about our contributions, why OpenTelemetry is awesome, and how you can enable OTLP tracing in Kubernetes. Let's get started.

A brief history. About a year, a year and a half ago, there was a Kubernetes Enhancement Proposal, or KEP, to add distributed tracing for all object lifecycles. That scope was made smaller, and the KEP ended up merging as adding distributed tracing to just the kube-apiserver. Even that addition took over a year of debate and back-and-forth in review. So it's a big step that at least one component is now instrumented, and etcd is now also instrumented, experimentally.

So why? Well, debugging latency issues in Kubernetes is just really difficult. There are so many services and transactions happening that when you have a latency issue, it's hard to figure out where it's occurring. Tools like logs, events, and metrics are great, but they have limitations. With logs, you really have to know what you're looking for, because you have to search each component individually; you don't get that overall picture of the system. Events aren't in a standard format — a lot of them are custom messages, and it's hard to make sense of them if, again, you don't know what you're looking for. And metrics are useful for showing that a process is slow, but not why it was slow. That's what OpenTelemetry adds.

I think a relay race is a pretty good analogy for tracing, except the runners are the transactions, and the baton would be the context that's passed between services, operations, transactions. But this race would have to be a really crazy race, going in all different directions, starting asynchronously and in parallel, many different races at once. So it's sort of a good analogy, but a little too simple. What OpenTelemetry does is help you visualize these transactions across components — in this case, across Kubernetes. It really streamlines debugging latency issues by giving you a picture of where everything is occurring. You can identify regressions, and you know who to blame when they happen. So we started this by first introducing instrumentation in the CRI-O code, and we have a PR for that.
Next, we also instrumented the kubelet and opened a KEP for the kubelet, so that we can capture the entire transaction all the way from etcd to the API server to the kubelet to CRI-O and back. We are also pushing to enable the feature gates in OpenShift, and honestly trying our best to make sure it does not take as long as it took in Kubernetes — we are hoping it gets merged soon. And lastly, we are creating an up-to-date guidance document for the tracing pipeline, because we had a bumpy ride, and we are hoping that after this presentation you don't hit the same bumps we did. Well, the best-case scenario would be that you don't hit any bumps, but at least not the same ones we did. So, next slide.

Yeah, so when we are done with everything, this is what our contribution is going to look like: we would be able to trace all the gRPC calls made from the kubelet to CRI-O. Whenever the kubelet gets a request to create a pod, it calls the CRI-O engine, and CRI-O starts launching the containers that will go inside the pod. All of these transactions would be sent out as traces, so that if there is any latency issue, you can pinpoint where it is. Next one.

And with that, we started on a journey to OTLP. We chose OTel because it is an observability framework — of course, it is CNCF — and it comes with a whole set of tools, APIs, and SDKs to instrument, generate, collect, and export telemetry data. It is also vendor-neutral and interoperable, and by that we mean it doesn't matter what backend you are using for visualization: OpenTelemetry is compatible with many tools, like Jaeger or Zipkin. So your underlying instrumentation does not depend on how you are exporting or visualizing it. Also note that OpenTelemetry is a young project — it only started in 2019 — and the documentation is still sparse and outdated; things evolve quickly and are not reflected in the documentation. So we started very confidently, but then we ran into so many issues that it was super frustrating. In the end, though, we were able to tie all the components together, and we hope you enjoy this presentation. That's you, Sally.

Got it. So: oh, tell me what. What is observability? It's understanding the internal state of a system through its outputs. There are traces, logs, metrics, and events, and these help you monitor, debug, and optimize. What's telemetry? It's often these things taken together, thought of as a system's telemetry: the tracing, the metrics, the logging, the events, the context. We're focused here on trace data. Most applications already have metrics and logs handled. The trace APIs in OpenTelemetry are stable. Metrics is barely stable — newly stable — and logs are still a work in progress. So eventually all of metrics, logs, and traces will be available, but for now we're focused on OpenTelemetry trace data.

What is a trace? It represents work being done. It measures the end-to-end latency of a complete request. You click a button and add a product to a cart: how long did that take? Did it get held up? Where did the operation get held up? That's a trace. And a span is a building block of a trace — a single operation that takes place as part of a request. A trace can include many spans, and spans contain metadata that tells you about the operation: the name, timestamps, attributes. You can add custom key/value items and events. And so, let's start collecting. Parul, finally? Of course.
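To make that trace-and-span vocabulary concrete before the next part, here is a minimal, hypothetical Go sketch — the function, tracer name, attribute key, and event text are all invented for illustration — of one operation creating a span with a custom attribute and an event:

```go
package demo

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// addToCart is a hypothetical operation; the span it creates would be one
// building block in the end-to-end trace of the whole request.
func addToCart(ctx context.Context, productID string) error {
	// Tracers come from the globally registered provider.
	tracer := otel.Tracer("cart-service")

	// Start a span for this operation; it records the name and timestamps.
	ctx, span := tracer.Start(ctx, "addToCart")
	defer span.End()

	// Attributes: custom key/value metadata attached to the span.
	span.SetAttributes(attribute.String("product.id", productID))

	// Events: timestamped annotations within the span's lifetime.
	span.AddEvent("inventory reserved")

	_ = ctx // downstream calls would receive ctx to continue the trace
	return nil
}
```

The returned context is the important part: passing it into downstream calls is what links their spans into the same trace.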
So the first step is instrumenting the code, and by instrumenting I mean you have to have a mechanism that creates the trace data. For that, the first step is to import and configure the OpenTelemetry API and SDK, and then create the telemetry data and export it. Since we were working with CRI-O, and CRI-O is written in Go, we used the Go SDK for OpenTelemetry and injected the code to create the trace data.

Once you have created the instrumentation, the next step is configuring the OpenTelemetry pipeline, which I will come to later. But basically, the collector needs to receive, process, and export the telemetry data it is getting from the application. We do this by means of OTLP, the OpenTelemetry Protocol, which is the generic protocol to receive and send the data over gRPC. And once you have received the data, you want to see it somewhere — you want to send it somewhere — and that is how you set up the pipeline. So, next slide.

Okay, once you have instrumented and set up the pipeline, you obviously want to see what's going on, and for that you need to set up the visualization backend. We did this using Jaeger. While Jaeger itself has a complete set of tools to enable distributed tracing, we are only using three components: the collector, which receives the spans and adds them to a queue to be processed; the console, which helps you visualize the distributed tracing data in a UI; and the query component, a service that fetches traces from storage. So if you want to look up a particular trace, you use the query service. Next one.

So this brings us to the pipeline. It is the hardest thing to develop — this one's yours, Sally. It's the most difficult part, at least it was for us, but once you have hacked on the pipeline, once you have understood how to build your pipeline, it becomes a smooth job. But what is a pipeline? Well, the pipeline is the data path in the collector, starting from receiving the data, to processing it, and finally exiting the collector via the exporter. Now let's dig into each of the components specifically.

We can start with the receivers. What receivers typically do is listen on a particular port on the network and receive the telemetry data. There can be many kinds of receivers: you can have OTLP, OpenCensus, Jaeger — and there's a small code snippet attached showing how you configure the receivers (see the collector config sketch below). For our purpose, we are using OTLP, which again is a simple specification that describes the encoding, transport, and delivery mechanism for telemetry data.

Next is processors. Processors can transform the data before forwarding it: you can remove attributes from a span, or you can drop the data entirely and decide not to forward it — and this is how you do sampling, by dropping data, simply by not forwarding it. There are multiple kinds of processors. What we are using is the batch processor, which collects a batch of traces and forwards it to the next component in the pipeline.

And next we have the exporters. The exporters typically forward the data they get to a destination on the network: they keep listening, and once they receive something from the pipeline, they forward it on to the next component.
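As a rough illustration of that first step — configuring the Go SDK and exporting OTLP over gRPC — here is a hedged sketch. The endpoint, service name, and semconv version are assumptions, and OTel Go module paths have shifted between releases, so treat this as the shape of the setup rather than the exact code in CRI-O:

```go
package demo

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

// initTracer wires up the SDK: an OTLP/gRPC exporter pointed at the
// collector agent, a batching span processor, and a service name.
func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
	// Export OTLP over gRPC to the agent on the node (conventional port 4317).
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"), // hypothetical agent address
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter), // batch spans before export
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("crio-demo"), // hypothetical service name
		)),
	)
	otel.SetTracerProvider(tp) // register globally so otel.Tracer() finds it
	return tp, nil
}
```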
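And here is roughly what a collector pipeline config with those three pieces looks like: an OTLP receiver, a batch processor, and logging plus Jaeger exporters. The Jaeger endpoint is a hypothetical Service address, and exact config keys vary across collector versions; in the agent's copy of this config, the jaeger exporter would be swapped for an otlp exporter pointing at the collector's Service:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}        # accept OTLP over gRPC (conventionally 0.0.0.0:4317)

processors:
  batch: {}           # buffer spans and forward them in batches

exporters:
  logging: {}         # print spans to the collector's own log
  jaeger:
    endpoint: jaeger-collector.otlp.svc:14250   # hypothetical Jaeger address
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, jaeger]
```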
For us, the destination is the Jaeger UI, where you can see all the traces — all the requests that were made. We also use the logging exporter. We don't have a backend storage; we just use the host of the collector pod to store all the logs. Obviously, that's not recommended in a production environment — you should have a backend storage like Elasticsearch, or you can use the Jaeger backend as well — but we have not attached any backend storage in our case.

Yeah, so at last, tying all these components together, this is how our pipeline — the CRI-O pipeline — looks. We have a cluster, and inside the cluster you have a node, and CRI-O is running as a systemd process on that node. We have the instrumented CRI-O code running on the node, and we also have the collector deployed as an agent on the node. Both CRI-O and the agent use the host network so that they can talk to each other. The first component, the receiver, receives the trace data from the instrumented CRI-O code and sends it to the processor, which then sends it to the exporter. Now the exporter inside the agent on the node sends the data to the collector — I know it's getting confusing, so just bear with me. The collector is not running on the host network; we have one collector per cluster, and it uses a Kubernetes Service to receive and send data. And finally, it sends everything to Jaeger, which has the UI. Next slide.

The previous slide represents only a single node, but we were dealing with six nodes, and CRI-O was running on all six. So our pipeline has CRI-O running on all the nodes, the agent running as a DaemonSet on all the nodes, and the collector running as a Deployment in the cluster. For instance, with six nodes we have six instances of CRI-O, six agent pods from the DaemonSet, and one instance of the collector. You might be wondering: what is the difference between the agent and the collector? Well, the binary is the same — there's no difference. The difference is only in the way they are deployed. Again, CRI-O runs under systemd, so we need something on every node where CRI-O is running to collect the data and be able to talk to it; that's why we deployed the agent as a DaemonSet, and it talks to CRI-O over the host network. The collector, on the other hand, is a Deployment that uses a Kubernetes Service to interact with the rest of the components. Next one.

And here we are. I'm going to stop sharing for a minute because I'm going to share this video. It's about, I don't know, I think six minutes, so if we don't have time for questions, we'll take them in the breakout room. Let me make sure it's full screen. I believe we're ready to start. Okay, I am sharing again. You can also make the video full screen — it shows a terminal, so it's a little easier to see in full screen. Here we go.

This recording is going to show some experimental features in Kubernetes. etcd, the API server, and CRI-O have experimental distributed tracing support with OpenTelemetry instrumentation, so I'm going to set that up and we'll see what it looks like. I have a VM running, and I have configured it to build CRI-O from source, because the PR is not quite merged yet. I have configured the machine to run a kubeadm cluster, and in this repository you will find documentation on how I did that. Then I'm going to run kubeadm with a config file that adds extra arguments to enable tracing in the API server and etcd.
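That kubeadm config file presumably looks something like the sketch below — the API versions, file paths, and flag names are assumptions that vary by Kubernetes and etcd release, so treat it as a shape rather than a recipe:

```yaml
# tracing-config.yaml — a separate file, volume-mounted into the API server;
# the alpha TracingConfiguration used by the API server tracing feature.
apiVersion: apiserver.config.k8s.io/v1alpha1
kind: TracingConfiguration
endpoint: localhost:4317          # the agent on the node, OTLP over gRPC
samplingRatePerMillion: 1000000   # "very high for testing" — sample everything
---
# kubeadm ClusterConfiguration excerpt
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    tracing-config-file: /etc/kubernetes/tracing-config.yaml
    feature-gates: APIServerTracing=true
  extraVolumes:
  - name: tracing-config
    hostPath: /etc/kubernetes/tracing-config.yaml
    mountPath: /etc/kubernetes/tracing-config.yaml
etcd:
  local:
    extraArgs:
      experimental-enable-distributed-tracing: "true"
      experimental-distributed-tracing-address: localhost:4317
```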
We'll send to a collector deployment, and from there the OTLP data will be exported to whatever backend you choose; I am running Jaeger in this demonstration. So again: CentOS Stream 8, 8 CPUs, 32 gigs of memory, and 20 gigs of disk space. I have no idea what the minimum requirements for this would be, but you know you're safe with this, because it works on my machine. Let's get started.

I will show you the kubeadm config that we're going to run. Then we deploy the agent and the collector, as I already mentioned. And once the collector is set up, you need to set up Jaeger so that the OTLP data can be exported to Jaeger, where you'll finally be able to see it in the Jaeger UI. That is it.

Before I switch over to the terminal, I want to show you the kubeadm config file. This is how you tell kubeadm to run with CRI-O. We're setting up the API server with the trace config file and with the feature gate enabled, and the trace config is volume-mounted. For the controller manager, I'm passing the cert file and the key file so that I can later deploy cert-manager. And then the etcd extra args: the experimental distributed tracing arguments.

Now let's switch to the cluster. Again, I'm running in GCP, so we'll SSH in. Right away, I'm going to alias oc to kubectl, because I'm too lazy to install it and also because I kind of switch back and forth. You can see I already have a cluster running, but that's no fun, so let's tear it down — and you have to be root. All right, the cluster is gone, you can see that. The trace config I already copied, but let's check it out: I set the sampling rate very high for testing. I showed you the kubeadm config file from the repository on GitHub; I didn't set a hostname — it's no big deal for our purposes, it won't matter.

All right, let's check it out. Here are all the pods running in the kube-system namespace. I removed the master taint — otherwise the master node would not be schedulable and the pods would be pending, all right? So right now, hopefully, we have the API server, etcd, and CRI-O trying to export trace data. We won't know until we set up the OpenTelemetry collector, so let's do that. I am lazy, so I'm going to give this service account cluster-admin rather than figure out exactly what permissions it needs. Here is a YAML file that includes the agent ConfigMap, the collector ConfigMap (which is exactly like the agent ConfigMap), the OpenTelemetry collector Deployment, the agent DaemonSet, and the collector Service. In a real Kubernetes cluster, you'd have multiple nodes, and each node would have a systemd CRI-O service and one agent collector per node. Those agents can all send their OTLP data to the single collector, and you can export to the backend of your choice — here we will use Jaeger.

So I'm going to need the service address of the OpenTelemetry collector, and I'm going to edit the ConfigMap of the agent so that the agent knows where to send — where to export — the data. Here is what the ConfigMap looks like, for both the agent and the collector. It has a receiver — here we have the default OTLP gRPC receiver. The exporters: we're exporting to logging, and we're also exporting to the OpenTelemetry collector. You can set up the processors. And here's the pipeline: trace data with the receiver, the processor, and the exporter. Now we need to refresh the agent pod to pick up those changes in the ConfigMap. In the logs, you can see the logging exporter is working — so the trace data is in the logs.
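For reference, the agent half of that YAML bundle is roughly a DaemonSet on the host network. This is a hypothetical sketch — the names, namespace, and image tag are invented, not the demo's actual manifest:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent          # hypothetical names throughout
  namespace: otlp
spec:
  selector:
    matchLabels: {app: otel-agent}
  template:
    metadata:
      labels: {app: otel-agent}
    spec:
      hostNetwork: true     # share the node's network, so the systemd-managed
                            # CRI-O process can reach the agent on localhost:4317
      containers:
      - name: otel-agent
        image: otel/opentelemetry-collector:latest  # same binary as the collector
        args: ["--config=/conf/agent.yaml"]
        volumeMounts:
        - {name: config, mountPath: /conf}
      volumes:
      - name: config
        configMap:
          name: otel-agent-config   # holds the receiver/processor/exporter pipeline
```

The collector side would be the same image as a Deployment with a regular ClusterIP Service in front of it, which is the "same binary, different deployment" point from the talk.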
In the collector, you can see there is an error, because it's trying to send the OTLP trace data to the Jaeger collector I have set in the ConfigMap, but Jaeger is not running. So the next step is to deploy Jaeger. If you're running a cluster with OperatorHub, like OpenShift or OKD, all you have to do is deploy the Jaeger operator, and it pretty much takes care of running the commands I'm about to run for you. So: we're going to create a namespace for the operator and all of its resources, the first being the custom resource definition. Now we're creating a service account, and we're going to do a role and a role binding. Once we deploy the operator, we have to edit the deployment to watch all namespaces, setting the namespace to the empty string.

It's a very simple Jaeger custom resource — the all-in-one. Once the operator detects this resource in whatever namespace it's in — we're going to deploy it into the otlp namespace — the operator triggers the deployment of various resources. Just by creating that Jaeger instance, I have a new deployment, a new ConfigMap, a couple of Jaeger services, and the Jaeger pod. As for the error in the OTel collector where it couldn't find Jaeger: we need to delete that pod to refresh it. Now we're ready to view the traces. First, I'm going to forward localhost 16686 from the GCP machine to my localhost at 9876 — that will drop me into my GCP instance again — and then I need to port-forward to localhost there.

Once that is done, I can go to my local machine, and you can see I have CRI-O, etcd, and the API server. I have a webhook application — a test application — running, and I just scaled it to 30 pods so I could generate some activity. The CRI-O traces are sort of on their own, but if we look at etcd, you can see that the API server and etcd spans are included in one trace. As an application owner, you can instrument your code to add information: in this trace, you can see the request and the response, and the overall put information is included with the span. You might include the hostname, the application, maybe the user, a UID — it's up to you. The trace API has baggage, where you can inject custom key/value pairs into a context, and that context is propagated between services, between operations. I'm not an expert at analyzing the data — I'll leave that to you all — but if you follow along with this video, you'll be on your way. Thanks.

The webhook application that Sally was running was not instrumented. So the benefit of instrumenting the core Kubernetes components is that you don't need to instrument your own test application, or your own consumer application running on a Kubernetes cluster, as a requirement — because if you're creating a pod, if you are creating any of the resources, they would eventually... oh, we're here. So, yeah. Yeah, yeah, that's fine. But yes, we definitely want to make sure that's clear: by instrumenting the core Kubernetes components, all applications get the benefits of tracing, because any application that's running is making requests to, and getting responses from, CRI-O, the kubelet, the API server, and etcd — and next would be the kube-scheduler and the controller manager. So if we have the whole core instrumented, everybody benefits, and then it's a bonus for you to instrument your own application. Anything else I forgot, Parul? No, no, we covered it all.
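For the record, the "very simple Jaeger custom resource, the all-in-one" from the demo is essentially just this — the name is arbitrary, and the namespace is assumed from the demo:

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: otlp   # the namespace the demo deploys into
# with no strategy specified, the operator defaults to the all-in-one deployment
```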
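And the baggage API Sally mentioned looks roughly like this in Go — a hypothetical helper with an invented key; note that for the values to actually cross service boundaries, a baggage propagator also has to be registered (for example via otel.SetTextMapPropagator with propagation.Baggage{}):

```go
package demo

import (
	"context"

	"go.opentelemetry.io/otel/baggage"
)

// withUserID attaches a custom key/value pair to the context. Configured
// propagators carry it across service boundaries alongside the trace context.
func withUserID(ctx context.Context, userID string) (context.Context, error) {
	// A member is one key/value pair; keys/values are validated here.
	member, err := baggage.NewMember("user.id", userID)
	if err != nil {
		return ctx, err
	}
	bag, err := baggage.New(member)
	if err != nil {
		return ctx, err
	}
	// Downstream spans and services can read this back with baggage.FromContext.
	return baggage.ContextWithBaggage(ctx, bag), nil
}
```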
So, yeah, in conclusion: we would like to instrument everything, and we would like you all to try it out, because, man, it took a lot of trial and error for us to get everything working, and we don't want that to go to waste. So please check out these links, set up a VM, and run some kubeadm clusters. Any questions? We have about two minutes. You can also ask us questions on Twitter — I can paste my Twitter ID, and Sally can do the same, if I can remember it. It's also on the slide deck, on the first page. Yeah, go with the slide deck. I believe that's it, so thanks, everybody.

Someone is asking where they can find the recorded video: it will be uploaded to the DevConf YouTube channel once it's processed. We'll add a link to our slides — we have to make them public first — so I'll add that to the slides and send out a link. Thank you.

Perfect, thank you so much, Parul and Sally. That was a very nice presentation — I hear a lot of claps in the background. And if you could upload your presentation to the schedule, that would be nice, so people can access it afterward. Our next talk will be at 2:30 p.m.