process, at least to me, is usually pretty scary. There are lots of issues getting access to the right production data. And usually, while this is happening, you're under a lot of pressure to fix the thing causing the outage. You typically have to send out updates every 20 or 30 minutes, or whatever the SLAs are. So you're under immense pressure without access to a lot of data, and it's just very, very challenging as a developer. Let's take a look at some of those challenges and how we might be able to mitigate them with better tooling.

To start, let's look at a monolithic application. In a monolithic application, you're primarily deploying one large service, which may be talking to some databases and other back-end systems. And generally, the problems look like: oh, I forgot to add a particular log line or metric. Or your program just crashed, and you don't really know why or what's going on. And the famous line for everybody: it works on my computer, I don't understand why it's not working in production. Or, I don't even understand why it's not working on someone else's machine. So this is a pretty classic set of problems.

Now, add in the fact that you start moving to microservices, and you have the monolith's problems on every service, plus a whole new set of problems. Why can't my services talk to each other when it worked before? Did an upstream API change and cause something to break? Which services are making this transaction slow? This isn't meant to be a comprehensive list; it's just supposed to show you the breadth of the types of problems you're going to run into.

One of the cool and interesting things is that Kubernetes is making the deployment of microservices a lot easier, so the number of services is exploding. In this diagram, you're seeing how this problem works for two services. In reality, what we actually see is something like this: this is a real service diagram we pulled from a cluster running a workload. And if you look at other popular services like Netflix, you'll see there are hundreds of services talking to each other. So if you're trying to debug this, it can be a pretty big challenge.

I went through a lot of information there, so let's summarize the key problems and bucket them loosely so we can understand what they look like. One problem is collecting the right data: for example, getting the right log lines and metrics, or figuring out why two pods can't talk to each other. Then there is the flexibility, or rather the inflexibility, of analysis: your program just died, and you really want to understand why and be able to analyze it. Then, obviously, the classic problem: it doesn't happen in my dev environment. And then there's contextualizing the information: understanding why a problem is happening and the context it's happening in. This is like: I deployed my service and something broke. Did it actually break because of a change in a different service? So that's contextualizing it against API changes in other services. And figuring out which of these services is making a particular transaction slow is another part of contextualizing information.

With that, let's quickly summarize some potential solutions. This is by no means a comprehensive list.
And there are many different ways to solve these problems. For collecting the right data, one approach we'll talk about today is auto-telemetry with eBPF, which is something we do in the project we work on, Pixie. To deal with the analysis problem, we try to create scriptable, API-driven interfaces that give the necessary level of flexibility. And the third thing is to build a more Kubernetes-native system that understands actual Kubernetes entities, so we can build an entire experience for debugging.

With that, I'll move forward and quickly talk about Pixie, the project that Natalie and I work on. It was contributed by New Relic to the CNCF Sandbox earlier this year. Pixie is an observability platform for developers, and there are three key things we work on. The first is auto-telemetry with eBPF, and we'll see some of this in a demo; basically, we can automatically instrument an app without adding code a priori. The second is building a fully scriptable and API-driven interface. And the third is being Kubernetes-native. With that, I'll hand it over to Natalie to quickly go over a demo.

Thanks, Zane. Yeah, so we're going to do a lightning demo here. If anyone's interested in learning more about Pixie's capabilities, you should also come to our talk on Friday at 11. Right now, we're just getting the display set up so that I can actually see what I'm doing while I'm demoing. No, we're good, yeah. As Zane said, I'm also an engineer working on Pixie and really excited to be here today.

OK, so first, we're going to show you a slightly sped-up video of how to deploy Pixie. We're showing this because we think that part of the challenge with collecting the right data is that it's just so hard to instrument our systems. eBPF provides this amazing opportunity to automatically collect data without the need for manual instrumentation. While we sped up this video a little, it usually takes about two or three minutes to deploy Pixie onto your cluster. The key thing here is that in just one command, we're automatically deploying all these eBPF probes to your cluster to collect a rich set of telemetry data, which we'll then show in the UI demo next. An interesting fact about this: the Pixie team was at KubeCon in 2018, and Linkerd, which is another really awesome project, gave a demo that heavily inspired this deployment flow. We hope that deployments of observability and other types of infrastructure on Kubernetes can get a lot simpler. So, deploying... Oh, it finished. OK, great.

All right, now let's switch over to the UI. OK, I deployed Pixie to my cluster; what can I do now? That was a video of deploying Pixie to this exact cluster. Right off the bat, we can see traffic between various Kubernetes services. We can list out the nodes in my cluster, and we can see things like the namespaces that exist. This is where the Kubernetes-native part of the systems we want to build comes in: if you know the system you're running on, like Kubernetes, you can attach much more helpful context for debugging Kubernetes-specific problems. That's why we want to reason about entities like pods and services, and Kubernetes helps us do that. So when we cluster our service graph here, we can see that there are two main apps running here.
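As a concrete sketch of what that means in practice: every view like this is backed by a PxL script, Pixie's Python-like query language, which comes up again later in the demo. Something roughly like the following could produce a per-service view of the HTTP traffic the eBPF probes collect. This is a minimal sketch: the http_events table, the ctx helper, and the aggregate functions follow Pixie's public docs, but treat the exact names as illustrative rather than authoritative.

    import px

    # HTTP spans captured automatically by the eBPF probes (no manual instrumentation).
    df = px.DataFrame(table='http_events', start_time='-5m')

    # Kubernetes-native context: resolve each request to the service it belongs to.
    df.service = df.ctx['service']

    # Aggregate per service so a slow or chatty service stands out.
    per_service = df.groupby('service').agg(
        requests=('latency', px.count),
        latency_quantiles=('latency', px.quantiles),
    )
    px.display(per_service, 'service_http_overview')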
I heard in a bug report that customers are finding my online boutique to be slow, so let's dig into why that might be. We'll go into our services and widen this service view. These are some of the services running in my cluster, and we want to sort them by latency to see which ones are slow. We can see that this checkout service right here is taking almost a second on some of its requests, which is highly unexpected to me as a developer. So I'm going to click into it and try to figure out more about what's going on.

I can see these HTTP requests, errors, and latency. And for all of this, I didn't have to instrument my system manually; it's just being collected through the eBPF probes that were deployed via that simple px deploy command. I can see the inbound traffic, and I can see that the traffic coming from this IP is actually a lot slower on average than the traffic coming from the other IP. There's only one pod here, so let's drill down into this pod and see if we can figure out more about what's going on. This is a lightning demo, so we're not actually going to solve this; we're just trying to show what I might do in a real incident.

We can see both infrastructure-level and application-level metrics in this view. One of the things we're most excited about is that we recently added continuous profiling to Pixie, so you can actually see performance flame graphs: not just the fact that something's slow, but which function is slow, which function is the problem. In this particular case, no single function really stands out, but we have used this in our own production cases, where we were like, wow, that function is really slow and we didn't even know it.

One point Zane was talking about is the idea that we want things to be API-driven and scriptable. We want to be able to compose data pipelines out of different pieces, and we don't want it to be in some walled garden that's very hard to access. So one thing we did when we built Pixie was to make sure that everything you see, in all the views I showed, is generated by a script. And we have client APIs for this, so you can build Slack bots with Pixie data and things like that.

Just to demonstrate, we can see that these are some of the HTTP requests running in the cluster, and we can see headers, request bodies, and things like that. Let's say I just want to do a simple filter and only get a particular request path. Oops, I pressed the wrong button. Okay, well, I'm probably running a little long anyway, but what would have happened is I would have added a filter to narrow all of these requests down to a particular endpoint, and then I could drill down into those particular requests. I don't think the technical gods are blessing us today with the system.
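For reference, the filter step the demo didn't get to would be roughly a one-line change to the same kind of PxL script as the earlier sketch. Again, this assumes the http_events table and req_path column from Pixie's docs, and the endpoint path here is made up for illustration.

    import px

    df = px.DataFrame(table='http_events', start_time='-5m')

    # Narrow all requests down to a single endpoint, then drill into them.
    df = df[df.req_path == '/cart/checkout']
    px.display(df, 'checkout_requests')

The same script could also be run through the client APIs mentioned above rather than through the UI, which is what makes integrations like that Slack bot possible.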
Okay, great. So, back to the presentation. The thing is, we are so excited to be part of the CNCF because of the quality of the projects the CNCF has. We're thrilled that we now get to be part of the CNCF alongside projects like Prometheus and OpenTracing, which are also solving these problems in really important ways. We have OpenTracing, which is creating a standard for telemetry data that people can use to query and analyze all of their data together. That's huge, huge for flexibility of analysis. We have systems like Prometheus, a world-class time-series database that's so easy to use and incredibly scriptable with PromQL, so you can make highly composable data pipelines with it. Fluentd for logging is huge for default observability on Kubernetes; it's so much easier to get at all this information during an incident today than it used to be. And Linkerd, as I mentioned, directly inspired a large part of our architecture. They and Envoy, with the service mesh approach, have had a huge impact on observability.

We would love to invite you all to come to our happy hour tonight. It's sponsored by New Relic, where both Zane and I work. It's going to be at the InterContinental, and the whole Pixie team will be there, along with a large part of our community; we would love to see you there. Also, here are some talks we're going to be attending. We'll be attending way more than just these, but these are some we're really excited about. As mentioned before, we're giving a different talk on Friday at 11, so we would love to see you there. And finally, if you want to check out the source code, Pixie is fully open source; please check out our GitHub, and we would love to see your issues or feature requests. Thanks, everyone.