Good to go. All right. Welcome, everyone. We're really happy to be here today to talk about reproducing production issues in your CI pipeline using eBPF. My name's Omid. I'm a principal engineer at New Relic, and I was a founding engineer at Pixie, so I've spent quite a bit of time working on observability and getting data with eBPF. We're going to talk a lot more about that in this talk, and I'm super excited to be here. Matt?

Hey, my name is Matt LeRay. I work at a startup called Speedscale, and I hate writing tests. That's what our talk is going to be about: how you can get good testing without writing tests, which is also what Speedscale focuses on.

So, like I said, I hate writing tests, and probably you do too. In previous generations of technology, we would write these handcrafted tests where we'd try to guess what's going to break and make sure it works. The problem is we still have huge numbers of production incidents. So what I'm trying to accomplish with this talk is to give everybody a bunch of easy things you can do: GitHub repositories you can go look at, and we'll show you how to instrument your application and run realistic test scenarios using traffic replay in your continuous integration pipeline. The goal is more tests and better tests through automation, not human intervention.

So what is traffic replay? You may have heard of this before; it's been around the industry a little bit. When it comes to software quality and validation, it means treating your tests like cattle, not pets. Instead of hand-curated tests, much like your Kubernetes infrastructure, you want them continuously refreshed. You blow the tests away when they're not valid anymore, and you treat the whole thing as a system that's continuously updating, instead of something written once and cast in concrete.

There are a couple of different approaches to software quality that many people are probably using. If you were an early adopter of Kubernetes and cloud native, you've probably tried testing in prod. I like to live dangerously too; I do a lot of testing in prod, sometimes by accident. But there are some real limitations to doing that.

Companies that were very early adopters, like Twitter, came up with the idea of traffic mirroring. The idea of traffic mirroring is to apply the scientific method to testing new releases. The process goes: you have a continuous integration pipeline, you deploy the new code, but you try to limit the blast radius of a failure. You've got this new code and you're not sure if it's going to work, so you deploy it alongside two replicas of the current version and see how it performs versus those replicas. You get a comparison to see what's working and what's not.

Now, that works OK for Twitter because Twitter has a certain set of conditions that you may not have. The first is that they have billions of transactions a day, maybe trillions, with tons and tons of traffic constantly going through. So if they have a small error rate, that's OK; somebody will just hit refresh, no big deal. If you work at Deutsche Bank, it's a totally different story. You can't have those failures.
So that's one limitation of traffic mirroring. There's an additional limitation: time. Traffic mirroring only shows you what's going on in your production system at the moment you're mirroring it.

To solve these problems, the new generation of the technology is traffic replay with persistent storage. The idea is that we're going to tap into production, usually using a proxy (we'll get to that in a second), open a big pipe, and for some period of time funnel that traffic into a data store. Then we can reuse it, either on our desktop or in our CI system. You can constantly take new snapshots and constantly rerun them, so even in the middle of the night you still get good traffic in your test environment. Or say you have a one-time event that blows up production: you can capture that traffic and then use it again and again in CI to make sure that never happens again. That last approach, traffic replay, is what we're going to focus on in this talk.

OK, so if you want to do traffic replay, where do you start? Obviously, you need to record information. You need to capture some of the traffic. You need to know what every single pod in your Kubernetes cluster is saying to every other pod. If you want to know what traffic is hitting a service, you need to be able to trace that information. The challenge is: how do we do that on a production cluster in an unobtrusive way? In a safe way, in a way that doesn't break the system, so that we can get realistic traffic patterns we can then replay in our CI pipelines? That's where the eBPF solution comes into play.

So I want to introduce you to Pixie. Pixie is an eBPF-based tracing solution for Kubernetes. We'll talk a little bit about eBPF and what it is, but Pixie itself is an open-source CNCF observability platform, and among the various observability things it gives you, one is that it automatically traces your network messages. In your Kubernetes cluster, it can automatically trace all the traffic going on within the cluster. The main philosophy behind Pixie is to do that in a way that requires no manual instrumentation. We don't want to change any code; we don't even want to redeploy anything. We want it to just automatically capture all of that data so we can then get it into our CI pipelines. How is that possible? That's the magic of eBPF.

Some of you may have heard about eBPF; it's very popular these days and really growing. But I'm going to start with the basics, and whether or not you've heard of it before, you should go read more about it. You may have heard it described as a VM in the kernel, a sandboxed environment, that sort of thing. Those are great descriptions, and they're accurate, but I'm going to put them aside for a moment and just tell you what you can do with it. I like to use an analogy: I think of eBPF a little bit like a debugger. It's like a breakpoint in a debugger.
With a debugger, when you're interested in what's happening in your application, you can put a breakpoint anywhere in your code and say: when you reach this line of code, break. Then you get a terminal, and you can poke around. You can read the values of different variables, read the memory, inspect everything you want. Pretty powerful. Of course, if you attach a debugger in production and hit a breakpoint, you're going to halt the service, and that's no good. So we can't do that.

What eBPF allows you to do is set a similar breakpoint, called a probe, and then automate the reading of the variables you want. You can write an eBPF program that says: stop when you reach this line of code; collect the value of the request variable, or the fields of that variable, or whatever other information you want; push that out into user space or put it somewhere you can get it later; and then immediately resume execution of the program. It's very quick, it's very efficient, and it's safe: the kernel actually verifies that your eBPF program won't mess up the kernel. So in an unobtrusive way, and that's the key word, you can grab information about what's happening in your system and collect it for replay.

So, what kind of overhead can you expect from eBPF? Yeah, that's a great question. eBPF itself is very lightweight: unless you're probing the tightest loop, you're probably not going to notice any impact on your performance. The probe itself is so quick it's negligible. Pixie itself, because it does other processing and parsing of the data, takes somewhere between 2% and 5% of your application's resources. But the eBPF part itself is really lightweight.

I have an example here that I'll walk through quickly. Here we have a processRequests function; that's your application code, let's say. You can put a probe on the sendPong function and say: every time I reach sendPong, I want to do something. That's your trigger. And the eBPF program here is the simplest one you can think of: it just counts how many times the probe was hit. But you could also ask: what was the value of a variable? Was it an HTTP request? What was the content of the HTTP request? All the sorts of information we'd want.

OK, so eBPF seems pretty powerful. Again, I really encourage folks to go read up more about it if you're not familiar; it's super cool. The way Pixie uses eBPF in its protocol tracer is that we want to trace all the traffic in your system, and the important thing is to do it regardless of what language your services are written in. Python, Go, C, Ruby, Node.js, whatever; no matter what sort of application you're running, we want to trace that traffic without you changing anything about your code. So the best place for us to put the eBPF probe is actually in the kernel itself. You can do that with eBPF as well: you can go into the Linux kernel and say, when the application makes a send syscall, trigger an eBPF probe.
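To make that concrete, here's a minimal sketch of that "trigger on the send syscall" idea using the BCC Python bindings. This isn't Pixie's actual tracer code, just the same trigger mechanism; it assumes you have the bcc package installed and root access, and it only counts calls rather than copying out the message buffers the way Pixie does.

```python
# Sketch: count sendto() syscalls per process with an eBPF tracepoint probe.
# Requires the bcc package and root privileges.
import time
from bcc import BPF

program = r"""
BPF_HASH(counts, u32, u64);

// Fires on every entry to the sendto() syscall, whatever language or
// framework the application is written in.
TRACEPOINT_PROBE(syscalls, sys_enter_sendto) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""

b = BPF(text=program)
print("Counting sendto() calls per PID for 10 seconds...")
time.sleep(10)

for pid, count in b["counts"].items():
    print(f"pid={pid.value} sendto_calls={count.value}")
```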
And then it doesn't matter what framework your application is written in; we can trace it. Every time a message is sent, we trigger on that and collect information about what was sent. Every time a message is received, we trigger on that and collect information about the response. An HTTP message goes out? We'll get it. MySQL? We'll get it. It doesn't matter: as long as it's going through the Linux kernel, which everything has to, we'll get it. Is it in a container? Sure, doesn't matter; the container sits on top of the Linux kernel, and that traffic eventually goes through the kernel. Once the eBPF probe has collected the data, it puts it in memory and sends it up to the Pixie Edge Module, essentially up to the rest of Pixie, where Pixie parses the data into a structured format. If it's HTTP, it goes into a structured format we can query and look at later, and we'll see a demo in a little bit.

Now, some of you might be thinking: but my application uses TLS, so by the time the traffic reaches the kernel, it's all encrypted. You're going to be out of luck; how are you going to get the data? eBPF to the rescue again; you can do a lot of cool things with it. When Pixie detects encrypted traffic, instead of putting probes on the Linux kernel itself, it puts them on the TLS library. For example, if you instrument OpenSSL and probe every SSL_read and SSL_write, that's essentially the equivalent of receive and send. And OpenSSL is typically a shared library, so everything using that OpenSSL instance gets its traffic captured too, regardless of which application it's coming from, and we can replay that as well.

Talking a little more about the Pixie architecture: in Kubernetes, Pixie is deployed as a DaemonSet. When you deploy Pixie, it deploys an instance of what we call the Pixie Edge Module, or PEM; you can think of it as the Pixie agent. There's one instance of the PEM on every single node of your Kubernetes cluster, and since it's sitting on every node and watching the kernel, everything happening in your cluster gets monitored. One instance of the PEM monitors all the pods on its node, all their containers, everything; it's all getting traced.

As for characteristics: we already mentioned zero instrumentation, no code modifications; that's the beauty of eBPF, you didn't have to change anything. Pixie is also a distributed architecture: all the data is held on the edge, in the cluster itself, and there's a way to query the data to pull out just the information you want. We're not shipping volumes and volumes of data up into the cloud; the data stays local, stays on the edge, and when you run a query looking for particular traffic patterns, you get just that data out. And at the highest level, there's a scriptable, pandas-style Python interface, Pixie's PxL language, so you can write Python-like code to express your queries.
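To give a flavor, here's roughly what one of those queries looks like in PxL; a minimal sketch based on Pixie's documented examples (the http_events table and resp_status column are the documented names, though they may vary by version):

```python
# PxL sketch: pull recent HTTP traffic, add Kubernetes context,
# and keep only error responses (status >= 400).
import px

df = px.DataFrame(table='http_events', start_time='-5m')
df.pod = df.ctx['pod']              # augment with the pod name
df = df[df.resp_status >= 400]      # filter down to errors
px.display(df, 'errors')
```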
We didn't want to reinvent anything; it's pretty much pandas, so you can query your data and pull out the information you want. When you make a query, say you're sitting in the browser working with Pixie and you ask for certain data, or you click on something like the service map, the query comes down to the Pixie Edge Modules. The PEMs filter out the relevant information, do any aggregations they need to do, and ship the result back up to the cloud, where you get the information you actually need.

Now I want to do a demo. I was going to do this live, but we've been having a lot of wifi issues here today, so I've prerecorded it, and I'll talk over it as it goes. So let me start that.

All right, great. When you come to Pixie, this is the landing page. The first thing you see in the UI, at the top, is a service map. Pixie is recording all the HTTP traffic between all the different pods, and over the last five minutes in this case (that's configurable), it's seeing every pod that's spoken to another pod and showing that information on the service map. In addition, we can see the list of services and the list of pods, with things like error rates, latencies, and all sorts of other good stuff. But what we're most interested in is the traffic driving that service map, because that's what we want to get: the raw data, all of it, so we can replay it in our test environment.

Drilling down a little, you can see the Sock Shop application here: the front end is talking to the catalog service, it's also talking to the orders service, and there's some traffic going from orders to shipping. All of this is auto-discovered; again, you didn't have to change anything. You can see the rate of traffic, the latency profile, and error rates. That particular edge is in red because it has some errors, so it might be worth looking into, but that's a separate topic. You get all of this out of the box, and the key here is that it's all driven by this eBPF magic.

Pixie has this concept of scripts, so you can select different views, essentially, to get different data. There's DNS data you can look at, there's Kafka, Redis, MySQL, Postgres, whatever, but we're going to look at HTTP data in this demo. We select HTTP data, and now we see a table of the most recent traffic in the cluster, across all the namespaces in your Kubernetes cluster. There's a source column showing where the traffic is coming from; there's a front end communicating with a user pod. You can see the latency; you can see it's a GET to a customer endpoint, so it's fetching customer information; and there's a response coming back, and it was OK. You can actually see the JSON payload of the response as well. Drilling in a little, you can see the JSON body: there's an address, there are cards, there's different information in here that we're tracing. That's going to be important when we want to reproduce things in our CI pipeline. On POST requests, we capture the request body as well, same idea.
I thought that was it for the demo; sorry, there's one more thing. This whole view is scriptable, so you can pull up the script that's driving it. The first line, the one I'm about to highlight, says DataFrame: we're saying get all the HTTP events from the in-memory tables in the Pixie Edge Modules. Then we're adding some Kubernetes context, like pod names, to augment it. And you can filter for things. Here I'll do a simple filter just to show how easily scriptable the pandas-style interface is: I just want to see the requests with a response status code of 400 or higher, because I'm looking for errors in the cluster. I rerun the script, and what you see now is all the traffic that actually has errors in it. That could be very useful in certain cases as well; maybe those are exactly the things we want to pull into a CI pipeline and test against. And now the demo really is done, so I'll switch back to the slides. Okay, there we go.

The last thing I want to mention about Pixie before I hand back to Matt is how you can integrate with it, because that's going to be important when Matt pulls data into the CI pipelines for his demos. Pixie has a number of ways to integrate. There's a gRPC API you can interface with. Generally, the input is a PxL script: you provide the query that selects the sort of traffic you want to extract out of Pixie. You could ask for all the traffic that has errors, or just all the HTTP traffic, whatever you're looking for. And out the other side, you get the data. Most people might just use the UI, but when we want to automate things, the gRPC API is what's useful, and we'll see a small sketch of it in a moment. There's also a plugin system that lets you export data if you want a tighter integration with other tools. So with that, I'm going to hand it back to you, Matt.

Okay, thanks, Omid. So eBPF gives us this superpower, and I'm going to go through a couple of different use cases: one where we use the superpower for good, and one where we're a little mean to our test application. To start, put yourself in a mindset I think most of us have been in: you're at a company and you're taking over an existing piece of software, a service, let's say. And certain requirements have been met, which are: there are no tests, no one knows how the thing works, and all the people who wrote it are gone. You have some alerting in production, but it generates a lot of noise and none of it means anything; it's all the wrong alerting. So let's see if we can make a dent in that problem.

The way we're going to do that is by inserting into the pipeline some of the traffic extracted through the Pixie API. Once again, the code's on GitHub; you can go modify it and use it for your own purposes. I've made a slightly simpler test application for ease of use: it's just a gateway, a payment service, and a user service. Nice and easy to understand.
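Here's roughly what that extraction through the gRPC API looks like using Pixie's Python client, pxapi; a minimal sketch, where the API token and cluster ID are placeholders and the column selection is illustrative:

```python
# Sketch: pull recent HTTP requests out of Pixie via its Python client
# (pip install pxapi). PIXIE_API_TOKEN and CLUSTER_ID are placeholders.
import os
import pxapi

PXL_SCRIPT = """
import px
df = px.DataFrame(table='http_events', start_time='-5m')
df.pod = df.ctx['pod']
df = df[['pod', 'req_method', 'req_path', 'req_body']]
px.display(df, 'http')
"""

client = pxapi.Client(token=os.environ["PIXIE_API_TOKEN"])
conn = client.connect_to_cluster(os.environ["CLUSTER_ID"])
script = conn.prepare_script(PXL_SCRIPT)

# Each row is one captured request; this is the raw material you can
# turn into replayable curl commands.
for row in script.results("http"):
    print(row["req_method"], row["req_path"])
```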
Okay. The first thing I usually do when putting this into a new pipeline is go look at Pixie, or whatever you use, and see which requests I actually want. You can see I'm filtering by the gateway, because I only want incoming transactions at the ingress, basically right behind the ingress. So I filter that part down and get rid of everything else. Then I switch over to the API, extract that information, and convert it into a set of curl scripts; well, one script containing a bunch of curls that I can reuse both locally and in the CI pipeline. And because I like all of you, I created something called Pixie to Curl, in Golang (if you know Golang, that's like the language of Kubernetes), that you can modify for your own purposes, but it pretty much works out of the box to pull the data out and generate these big curl scripts. You can see some of the examples here from a real run.

There are a couple of pieces I'll touch on. Again, because of the wifi I'm doing this with screenshots, so sorry for hitting you with a wall of text for a couple of slides. First, you need to filter the data down. One thing to keep in mind about Pixie is that the API is a distributed query system, and it doesn't have unlimited capacity. So set your columns and start filtering it down to just the fidelity of information you need for the use case you're after. As for the actual code, the Pixie API is super easy to use; I'll save you from reading this slide. All we're doing is parsing the response we get back from Pixie and feeding it into a standard curl-generation library. Each of those requests I was showing in the Pixie UI becomes one curl statement, and then we have the output we're looking for.

Okay, so I want to stop all these new developers from breaking the app when it goes to production, so I'm going to put this into a GitHub Action. This is where you turn it into a system instead of a one-time thing. The way I do that, at least for a lot of our customers, is to say: give us a Kubernetes cluster that you can use for testing. That's it. Then we insert a step in the GitHub workflow that deploys that version of the service, runs our curl commands against it, and fails if one of them breaks or takes too long. So now we've got that step in GitHub.
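That workflow step looks roughly like this; a hedged sketch, since the repo layout, script names, and cluster-credential setup are illustrative rather than the exact files from the talk's repo:

```yaml
# Sketch of the CI step: deploy the service under test, then replay the
# curl script generated from recorded Pixie traffic. Paths and names are
# illustrative; cluster credentials setup is omitted.
name: traffic-replay
on: [pull_request]

jobs:
  replay:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Deploy this version to the test cluster
        run: kubectl apply -f k8s/

      - name: Replay recorded traffic
        # Each line in the script is a curl using --fail (non-2xx exits
        # non-zero) and --max-time (fail if it takes too long), so any
        # broken or slow request fails the whole job.
        run: ./replay/curl-script.sh
```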
With that in place, let's see what happens with our two use cases. The first use case: can we safely change the API? Again, no one knows how this thing works; the code's been around for 15 years and it's too hard to trace. So let's just try stuff; they teach you that in computer science master's degrees. Real simple: if I were giving a live demo, we'd just go make a minor API change, no big deal, and see what happens. When we commit the code, our GitHub Action runs the 400 curl commands, and it finds out that something failed. So now that developer can no longer break your production app; they're stopped from pushing. Doing a little deeper inspection (you probably have a centralized logging system, but I just did it with kubectl logs), I find this strange new entry: somebody changed the endpoint to "myuser" instead of "user", and so now I'm getting 404s. What we're finding is that when they changed that API, the old production traffic won't work; it's going to break when it ships, unless the downstream service changes too. So that's use case one. You can make a hundred variations of these, 500s, whatever, but I kept it simple for a talk like this.

Now we're going to do something that your developers, if you're the one managing the pipeline, are not going to like, because we're going to test for performance, and we're going to do it in a really cheap way. Obviously, in the real world you might have load test scripts. I'm sure you all have perfect load tests, right? Everybody? And your integration tests, I'm sure they're great. There are some great tools out there: there's k6, there's hey from Jaana Dogan, my company has one (which we mostly give away); whatever, you've got something. But those scripts are probably out of date. They probably don't get touched very often. So we're going to repurpose what we made and turn it into a load test. It turns out that's super easy because of the power of Kubernetes: all we do is open up the YAML we created, the one that includes the curl script, and change it to a parallelism of 100. And some interesting things start to happen when you do this to your existing services.

I'm going to use the Pixie interface to watch, because Pixie is a community project: it's free, you can go install it. What we find is that when we start ramping up the load in our CI pipeline, the latency of the service jumps to over a second. Now, depending on what you work on, you may think a second is great for a single service. Most people don't; they know they need to go do some optimization. The key thing about what we just did, though, is that all of it happened automatically. Every time somebody makes an MR or a PR, we go and break it if they've slowed it down too much. You can tune your limits, and now you don't have to keep up with it. You don't have to write your k6 scripts, you don't have to do anything. You're just using what you're recording from production, and you can refresh it periodically.

So those are the two simple use cases we have time for. I do want to address something I'm sure lots of folks are thinking: that's fine for the trivial little test app, but I've got all this other stuff going on in my systems. Of course you do. I don't have a quick answer for that, other than to say you can whittle away at this problem. There's some suggested reading at the end of the presentation where you can see what Facebook and Twitter and others have done to handle things like simulating third-party transactions, or replaying transactions that change state. A simple thing you can do is filter for only GETs instead of POSTs, so you're not changing the data. There are all kinds of approaches; you do have to go solve some of these yourself, but it's definitely a doable problem.

Cool, thanks, Matt. So we're going to start wrapping up here. As we do, just wanted to say: if you want to learn more about Pixie, Pixie is, again, a CNCF sandbox project, completely open source.
So you can go play around with it. The website is there, and the GitHub is there as well. We welcome contributions; if anyone wants to contribute, we'd love to have you. Also come visit us at the Project Pavilion, where all the open source projects are. Come chat with us; we'd love to hear from you about use cases like this or others. And that's it; we're right on time. Here's some other reading. We'll be up here for a little while, and if you want to weigh in on the proper pronunciation of kubectl ("kube cuddle"), I have some of these shirts as well that you can grab, because I brought a bunch. That's it for us. Thank you. Thank you all.