Thanks, Sean. Hi, everyone. I'm Zain Asgar, co-founder and CEO of Pixie. I'm also an adjunct professor of computer science at Stanford University, working in the areas of edge AI and systems. Today I'm going to talk about what Pixie is. In a nutshell, Pixie is a developer tool that allows software engineers to debug, monitor, and analyze applications that run on Kubernetes without having to manually add instrumentation to their source code.

So with that in mind, why are we building Pixie now, and where do we see things going? Over the past few years, software has become more decoupled: it's moving into microservices, getting orchestrated by Kubernetes, and many production systems even run across multiple clouds. Teams have basically figured out how to build software in a scalable and agile way. On the other hand, when we look at how we secure, manage, and debug our applications, we still spend a lot of time just collecting and wrangling the data we need to debug them.

So where do we typically see this time getting wasted? The first area is rigid, predetermined data collection. A lot of it comes in the form of boilerplate, usually language-specific code that you add to your application so you can understand what's happening. In this very simple example, there's an HTTP request, actually a gRPC request, where Go-specific code has been added to capture information about the request, and you can see the business logic starting to get buried in the source code; I'll sketch what this kind of boilerplate looks like below. One little tip I want to share: as an ex-Googler, one of the things we were used to was getting lots of instrumentation for free. That allowed us to focus on the instrumentation that was actually important, like capturing customer information, and even to write tests around it. One of the disadvantages of manual instrumentation is that it can end up being brittle and break right when you're trying to debug a production outage. So, to summarize, it's quite painful to manage and maintain all this boilerplate, and we'll look in a little bit at how Pixie makes this easier.

The second area, and some of this is based on work I do at Stanford along with research coming out of Google and Facebook, is that most telemetry data, typically metrics, traces, and logs, is useless: something like 98% of the data being trucked to the cloud is just telling you that your application is doing okay and nothing is really wrong. But the other 2% is extremely important for understanding why your application is broken and being able to debug it. So one of the things we're interested in at Pixie is moving enough of the capture and compute all the way to the node that we can determine what that 2% is, which lets us efficiently transmit just that information to the cloud or use it for deeper analysis.

The third area is that our UIs are pretty rigid and sometimes very difficult to use. As software developers, we like that we can quickly go get metrics and logs that hopefully give us enough information to debug what's going on. But what's missing is the ability to easily extend these interfaces, especially in a way that codifies the knowledge that already exists on the team.
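To picture that first problem area, here's a hedged sketch of the kind of manual instrumentation boilerplate I mean. The slide's example is Go and gRPC; this sketch is Python for readability, and the handler, the load_cart helper, and the request.user_id field are all hypothetical:

```python
import logging
import time

logger = logging.getLogger(__name__)

def load_cart(user_id):
    # Stand-in for the real business logic.
    return {"user_id": user_id, "items": []}

def get_cart(request):
    # Boilerplate: per-handler timing and logging that has to be hand-written,
    # per language, per service. This is what starts to bury the business logic.
    start = time.monotonic()
    logger.info("GetCart start user_id=%s", request.user_id)
    try:
        return load_cart(request.user_id)
    finally:
        logger.info("GetCart done in %.1f ms", (time.monotonic() - start) * 1e3)
```

Multiply this by every handler and every language stack in a polyglot cluster, and it's clear why it becomes painful to maintain.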
And we'll take a look at how Pixie addresses some of these things. So, why are we building Pixie now, and what are we planning to do? Part of Pixie's promise is to extend the observability stack to the edge, which helps us reduce both the complexity and the cost of the system. With that in mind, we have three pillars that we think about for Pixie.

The first is the rigid, predetermined manual collection: we want to change that into code-driven collection that can happen on the fly, even after the application has been deployed. This removes the crucial redeploys that are typically required to add additional instrumentation; when you have a production outage, you can get information about the application without having to modify it. We'll talk about that in a few minutes. The second area, as we discussed, is that most current systems are cloud-only: they do all their storage and all their machine learning in the cloud. Part of what we want to do is move some of the storage and machine learning to the actual cluster, and even the actual node, so that we reduce the amount of data transfer, making the entire system a lot more efficient and allowing us to collect a lot more data without increasing the burden on your system. The third area is to move away from these very manual, rigid interfaces toward something that provides out-of-the-box functionality but also gives us what we want as developers: a good API, and the ability to extend the user interfaces by writing scripts.

All right, now let's talk a little bit about what Pixie is. As we mentioned earlier, Pixie is a platform that allows you to do instant, code-driven debugging, and we provide information about application performance metrics, infrastructure metrics, network performance, and debug logs. We have three different modalities you can interact through: our CLI, our UI, and a version of our UI that runs on mobile devices. That's a lot of different surfaces for a small company like us to support, so they all utilize the same API and underlying code, which keeps them consistent and working out of the box.

What we typically do is primarily target application developers who are interested in looking at performance issues in production. The way we think about this problem is the notion of a T. The top of the T, as I like to refer to it, is giving everyone no-instrumentation baseline visibility: once Pixie is up and running in your system, we'll give you information about all your HTTP requests, infrastructure metrics, and a bunch of other things without you having to do any work at all. Further, on some language stacks, like Go, C++, and Rust, we can give you code-level context. Pixie does this by leveraging new technologies like eBPF to capture information, and we correlate it with information available in your Kubernetes cluster to make it digestible. Ultimately, as I mentioned, every single thing in Pixie is a script. At the bottom, you can see that we send Pixie scripts over to our API, and we get back metrics, traces, logs, events, and insights about the traffic that you've seen.
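To make that concrete, here's a minimal sketch of the kind of PxL script that goes over the API. The http_events table is the one Pixie's probes populate automatically; treat the exact names as illustrative:

```python
import px

# Fetch the HTTP events Pixie captured automatically over the last 30 seconds
# and hand them back through the API for display.
df = px.DataFrame(table='http_events', start_time='-30s')
px.display(df, 'http')
```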
One of the nice things about this abstraction layer is that the developer community has been able to build out many other use cases, which form the bottom of the T. Some of those use cases are things like network performance and application security, which are enabled by Pixie but aren't something we're currently focused on as a product.

With that in mind, how do you actually install Pixie, and what are the implications? The easiest thing to do is to just grab our install.sh file and execute it; we'll download the CLI, help you authenticate, and get everything going with a single command. If you want a little more insight into what's happening, you can use one of our other deployment schemes, like Docker, Debian packages, or YAML and Helm charts, so you can deploy directly to your cluster without using our CLI. With that in mind, I'll go through a quick overview of how Pixie is deployed using our CLI. Since I don't really want to subject everyone here to watching a Kubernetes cluster deploy all our services and pods, I'll just show a quick video of how this works. To get Pixie deployed, you type px deploy; we discover clusters, run some checks, and then install Pixie. You can see on the bottom we have a timer that shows how fast everything is running. Once Pixie is installed, which takes about two and a half minutes on my cluster, you can list all the scripts and instantly start seeing data. So just to recap: with an install process that takes about two and a half to three minutes, you instantly get access to data without having to modify the cluster state or the application, other than installing Pixie itself.

Cool. With that in mind, I'll switch over to the UI, pointed at the same demo cluster from the video, where Pixie is running. Since Pixie instantly gets access to all the traffic, we're able to generate these service graphs right out of the box. This is a high-level cluster view: as you can see over here, I've selected my demo cluster and I'm running this script called px/cluster, and we'll see what that means in a little bit. In a nutshell, we're seeing service graphs that tell us: here are all the services, and here are the communication patterns between them. In this demo today, I'm primarily going to use an application called Online Boutique, which is a shopping application developed by Google to showcase Kubernetes. Inside of Online Boutique, I can see there's this thing called the checkout service; we'll be coming back to that in more detail later. If you take a look at the edges of the graph, we summarize all the high-level information, like requests per second, errors, and latencies. Pixie understands relationships between objects, so you can double-click on these entities and we'll generate entity-specific views, like your server's requests per second, latency, or CPU usage. Each of these views in Pixie has a default for every type of Kubernetes entity. So quickly, I'm going to switch over to the namespaces view, which summarizes all the namespaces on this cluster, and from there I'm going to go into online-boutique, which is the application we're interested in. Once again, you can see the service graph for Online Boutique, summarized for this particular namespace.
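For a sense of what sits behind views like this, here's a hedged PxL sketch of a per-service rollup, modeled on Pixie's documented idioms; the grouping and aggregate column names are illustrative:

```python
import px

# Roll recent HTTP traffic up per service: request counts, error rate,
# and the latency quantiles behind the distribution plots shown next.
df = px.DataFrame(table='http_events', start_time='-5m')
df.service = df.ctx['service']
df.error = df.resp_status >= 400
df = df.groupby('service').agg(
    requests=('latency', px.count),
    error_rate=('error', px.mean),
    latency=('latency', px.quantiles),
)
px.display(df, 'service_summary')
```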
Oh, one of the things we see right away is this plot showing the latency distributions. With this, you can quickly see what the p50, p90, and p99 latencies are. And one thing that's pretty apparent over here is this spike sticking out that says the p99 latency is 1.9 seconds. If we dig into this, we can try to get a sense of where the latency issues are arising, and we can see there's a huge latency spike a couple of minutes ago. If we scroll down, we see that the inbound traffic is quite slow while the outbound traffic is fine. We haven't summarized this in a waterfall chart, but it's pretty apparent that there's some problem with this service. And interestingly, if we scroll down further, we've actually been able to figure out what the slow requests are. Over here we see a request that took 2.9 seconds, and we captured an example of that slow request. If you click on it, you can see that we do full-body tracing of HTTP requests. In this particular case it's actually an HTTP/2 request, because this is a gRPC endpoint on Hipster Shop. So we can see the protobuf message that captures basically the entire HTTP request body, along with the source, the destination, and the response for that gRPC call. Furthermore, Pixie captures contextual information about every single request, which is stored and easily accessible. Natalie will be talking a lot more about that in a later talk.

To go into one more level of detail, Pixie also understands many different database protocols. For example, if I switch over to the MySQL view, I can look at every single MySQL request that's been happening in the cluster. And as promised earlier, everything inside of Pixie is done with scripts. If I hit Command-E, I can open up the editor and see what the script does. In this particular case, we can see PxL, which is a Python DSL based on Pandas that Natalie will talk about a little bit later. It's a pretty simple script that says: fetch all the MySQL events from the last 30 seconds, pick the top 100, and display them in the UI (I'll show a rough sketch of that script below). One of the nice things about this is that, using this scripting language, developers in the community can build customized views that go beyond simple charts.

So far, we've seen how Pixie provides high-level visibility for every single application that runs in your system without actually going into the code. But as promised, for one more thing, I'm going to show how Pixie can give you code-level visibility. One of Pixie's core focuses is giving you on-the-fly code-level visibility for Go, with support for C++ and Rust coming soon. We already saw the no-instrumentation baseline visibility, and now we're going to dive into what code-level context and visibility mean. As a software developer, I'm sure everyone has run into production bugs they've wanted to solve and thought: what if I could just add a print statement somewhere in my source code to see what's going on? And that's one of the examples we have here: a very simple function written in Go, where you're thinking, I really just want to look at these variables, and I want to add a print statement.
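As promised above, that MySQL script boils down to roughly this in PxL; the shipped script may select and format specific columns, so treat this as a sketch:

```python
import px

# Fetch all MySQL events from the last 30 seconds, keep the top 100,
# and display them in the UI.
df = px.DataFrame(table='mysql_events', start_time='-30s')
df = df.head(n=100)
px.display(df, 'mysql')
```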
Back to that print statement. What you typically have to do is go add the log statement, wait for the code to compile, run all the tests, go through code review, and then eventually get it deployed to production, which in some places might take a few hours, and in others a few weeks, before you actually see the log statement in production. We asked ourselves a question: what if we could add this log without having to go through that entire cycle? And that's what I'm going to talk about now. With Pixie, we can add these log statements to your source code, and we do this primarily using eBPF, which we'll have many, many talks about in the future. If you're interested in some of the high-level details, feel free to check out my blog post.

So with that in mind, we're going to use the application from the last session, Online Boutique. Online Boutique, as I mentioned, is a shopping application built by Google Cloud that showcases microservices running on Kubernetes. In this application, you can do things like buy a vintage camera lens: add it to the cart, see how much it costs in US dollars, place an order, and buy it. One of the things that happens inside this application is that there's a checkout service, and it has a function that adds two different currency values. We're going to figure out whether we can capture the arguments to that function without having to recompile and redeploy the code.

So let's go into Online Boutique, jump into the checkout service, and then go to the pod for the checkout service. There's only one running, so we'll select that pod. One of the things we need to grab is this thing we call the UPID, which is a unique process ID that we can use to track any process within Pixie. Then I'm going to go to this little script we have, called boutique checkout trace, that contains the code necessary to trace this money function inside of Online Boutique. We take the UPID we copied and paste it in here, which feeds this pxtrace.probe function.

Let's talk a little bit about how this works. We specify a path to a Go function we're interested in tracing; in particular, we're providing a path to this function called money.Sum. Just for reference, I don't really want to pull out the source code and get into all the details, but basically Sum takes in two different currency values, which are two protobufs carrying the fixed-point value and the currency information. It returns the summed money value, plus an error, especially if the currencies are different. With that, we define a probe function which says: for each one of these protobufs, dump out the units and the nanos, and do the same for the return value. So how does this actually work? Once this function is in place, we call this thing called UpsertTracepoint, which inserts the probe into Pixie's state. Natalie will talk a little bit about how that works in the language. But at a high level, what we do is take this code, then generate all the eBPF code necessary for us to go and observe this application, while it's running, in a safe manner.
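The probe I just described looks roughly like this in PxL. This is a sketch modeled on Pixie's pxtrace API; the symbol path, column names, table name, and TTL are illustrative, and the UPID below is a placeholder you'd replace with the one copied from the UI:

```python
import pxtrace
import px

# Attach to the Go function money.Sum inside the checkout service binary.
@pxtrace.probe("github.com/GoogleCloudPlatform/microservices-demo/src/checkoutservice/money.Sum")
def probe_func():
    # For each Money protobuf argument, dump the units and nanos,
    # and do the same for the returned sum ($0 is the first return value).
    return [{
        'l_units': pxtrace.ArgExpr('l.Units'),
        'l_nanos': pxtrace.ArgExpr('l.Nanos'),
        'r_units': pxtrace.ArgExpr('r.Units'),
        'r_nanos': pxtrace.ArgExpr('r.Nanos'),
        'sum_units': pxtrace.RetExpr('$0.Units'),
        'sum_nanos': pxtrace.RetExpr('$0.Nanos'),
    }]

# Insert the tracepoint into Pixie's state for the traced process.
pxtrace.UpsertTracepoint('money_sum_probe',   # tracepoint name
                         'money_sum_table',   # output table
                         probe_func,
                         px.uint128('123e4567-e89b-12d3-a456-426655440000'),  # placeholder UPID
                         '10m')               # time-to-live

# Downstream, the captured values land in an ordinary Pixie table
# that any PxL script can query.
df = px.DataFrame(table='money_sum_table')
px.display(df)
```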
From there, we find the right node and deploy the eBPF code onto it. Once the code is deployed, we take the values that eBPF is generating and write them to Pixie tables, which you can then query with the full power of the Pixie language downstream. So with that in mind, let's go ahead and deploy this. You can see over here that the tracepoint is now deploying, and once it's deployed, the schema gets prepared so we can query it. This usually takes a few seconds. At that point, if we go into the application and actually buy something, say a vintage typewriter, these two items are going to cost us $87.12. We place an order, and you can see over here that we pulled out the request: these are the two items that we purchased, the $1.12 item and the $67 item. And just to see it again, I can pick another item: let's say we buy a barista kit, add it to the cart, and place an order. We can see that we instantly capture the $128.72 item over here. We also capture other contextual information, like what time the request happened, and along with all the basic Pixie context, like the service and pod, we capture language-specific information, in this case, for example, the goroutine ID that actually processed this request.

So that's just a little taste of what our Go tracing and code-level tracing does. In the future, we don't really expect you to go find the exact function to trace; we'd provide you code-level views where you can click in and tell us: okay, add a log over here, tell us how long this function took to execute, and give us information about the arguments. Further, we can do other kinds of dynamic tracing: for example, we can deploy code from the very popular bpftrace project to dynamically capture other eBPF information. A lot of this will be covered in future community meetings.

With that, it's a good jumping-off point to how we think about other use cases. While Pixie as a company is pretty heavily focused on building the code-level context that helps application developers debug their applications, the Pixie platform itself is pretty accessible. This has allowed members of our community to build things that monitor CI build health, go deep into network monitoring to find out about TCP retransmission events, or even build tools that help with application security. Kelsey will be talking about some of these things in a little bit.

So, we've talked a lot about how the community work happens, but what are we working on next? One of the things we've been really excited by is building more and more edge ML. What we mean by this is giving you insights by analyzing all the traffic running through your node and summarizing it using various machine learning techniques. Someone on our team put together this mock-up that reminded us of Microsoft Clippy, so we stuck it on a slide. The idea is that we can go analyze the data and tell you what we see as specific problem areas. In this particular case, we point out a high correlation with latencies on one of your services. Since we actually understand all the traffic, we can go in and tell you the specific keys causing trouble in your JSON messages, or the specific customer IDs or customer groups.
And all of this happens without you having to do any work, because our models are running all the time, trying to figure out which signals are the most important ones to surface. You probably won't see a Clippy in Pixie anytime soon, but we'll surface this in the UI in various ways.

So, how does this all work? Under the hood, we deploy these things called PEMs, which run on every single node, collecting data from the Linux kernel along with other sources. We run machine learning models inside each PEM, on 100% of the data in our data streams. We generate high-level features that we can then share with other nodes; in particular, we share them with a component we call Vizier, which is the thing that orchestrates monitoring across the entire cluster. Vizier can take this information and send model updates back down, to help the PEMs capture the information necessary to decide which messages are good versus bad. You can picture this as a simple feedback loop; there's a toy sketch of it at the end.

Cool. Well, to summarize: today, developers waste a lot of time wrangling telemetry and dealing with data, and Pixie aims to make that easy. We do this by extending the observability stack to the edge and providing a programmable interface that's easy to use. Our core focus is delivering what we call a one-click APM experience, where you basically install Pixie and you're up and running for application performance monitoring. As I mentioned earlier, members of our community have extended the Pixie platform to other use cases, like infra and network performance monitoring and application security. And part of the reason Pixie can achieve all this is that we do edge machine learning, where we can look at 100% of the data to help you understand where the issues are without overwhelming any of your systems.
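To recap the PEM/Vizier loop described a moment ago, here's a purely illustrative toy sketch in Python. This is not Pixie's actual code; every name here is hypothetical, and it only exists to show the shape of the loop: PEMs featurize 100% of their local stream, and Vizier aggregates the features and pushes model updates back down:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class PEM:
    """Runs on one node and sees 100% of that node's event stream."""
    model_version: int = 0

    def featurize(self, latencies_ms):
        # Summarize the full local stream into compact, high-level features;
        # only these features (not the raw data) ever leave the node.
        return {
            'count': len(latencies_ms),
            'mean_ms': mean(latencies_ms) if latencies_ms else 0.0,
        }

class Vizier:
    """Cluster-wide orchestrator: aggregates features, pushes model updates."""
    def __init__(self, pems):
        self.pems = pems

    def step(self, per_node_streams):
        features = [pem.featurize(stream)
                    for pem, stream in zip(self.pems, per_node_streams)]
        # In the real system, this is where cluster-wide models decide which
        # signals matter; here we just bump a version to stand in for an update.
        new_version = 1 + max(pem.model_version for pem in self.pems)
        for pem in self.pems:
            pem.model_version = new_version
        return features
```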