Hey everyone, we're going to talk about data science for infrastructure, in what we think are the three steps: observe, understand, and automate. Before I get started, a really quick overview. I'm Zane, and this is Natalie. I'm a GM at New Relic and formerly the co-founder and CEO of Pixie Labs. We were acquired by New Relic, and both of us are working on the Pixie project now. Natalie is a principal engineer on our team. We're both machine learning people by background and have spent the last decade building scalable data and machine learning systems, and we really wanted to understand how to apply our knowledge from that space to observability and infrastructure.

When we started in this space, we began to think of it as just another data problem. It's easy for machines to generate gigabytes of data very quickly, but it's really hard to get complete coverage, especially in distributed environments where it's hard to instrument all your applications. It's also hard to make sure that all the data you collect is relevant, and it's difficult to distill that information into something usable for automating workflows.

So we wanted to take what we learned from the data space and apply it to infrastructure, and we realized there are a few different problems that need to get solved. One of them, and you've probably heard this, is that collecting the right data is very difficult. We say it's half the battle here, but in reality it can be way more than half. Another takeaway from our data and machine learning days is that simple models on good, relevant data usually outperform very complex models built on skewed or inaccurate data. The other thing we learned is that it's really important to be able to audit and understand what's happening in your data pipelines, rather than having ad hoc systems running your infrastructure.

For this, we break the process into three stages. The first step is to observe your data, the next is to understand it, and the third is to automate. We'll walk through each step and see what it means.

For data-driven automation, the first step is gathering the raw data. This might be getting CPU utilization, capturing request information, or trying to understand which specific parameters in your requests impact latency. In reality, most of the time is actually spent here, even though most of the coverage goes to the next two steps.

The next thing you typically do after gathering the data is some kind of data transformation, and this gets the most disproportionate emphasis. People talk a lot about machine learning models and all sorts of sophisticated tools you can build here, but a transformation could be a simple statistical machine learning model, a regular expression, or an aggregate. There's a whole gamut of things it could be. The goal is to transform the raw data into a signal that's usable downstream. For example: take CPU utilization for a bunch of replicas of your application, and notice that one replica is extremely high relative to the others. What's going on there?
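To make that concrete, here is a minimal sketch of that kind of raw-data-to-signal transformation. It's plain Python with made-up replica names and numbers, not anything Pixie-specific:

```python
import statistics

# Raw data: recent CPU utilization (0-1) per replica (hypothetical values).
cpu_by_replica = {
    "checkout-7d9f-abc1": 0.31,
    "checkout-7d9f-def2": 0.28,
    "checkout-7d9f-ghi3": 0.93,  # this one looks overtaxed
    "checkout-7d9f-jkl4": 0.35,
}

# Transformation: flag replicas whose utilization is far above the median.
median = statistics.median(cpu_by_replica.values())
signals = [
    (replica, cpu)
    for replica, cpu in cpu_by_replica.items()
    if cpu > 2 * median  # threshold is arbitrary; tune for your workload
]

for replica, cpu in signals:
    print(f"signal: {replica} at {cpu:.0%} CPU vs. median {median:.0%}")
```

The transformation itself is trivial; the point is that the output is no longer raw samples but a short list of signals that a downstream step can act on.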
And the third step is taking those signals and doing something with them: automating. It's pretty common to generate alerts and enforce limits with existing tools, but one of the things we've been excited by, and will talk about here, is the possibility of using the Kubernetes API to connect the first two steps and close the loop. By doing that, you can take data through the entire pipeline: have Kubernetes adjust the actual application parameters, then go back into the cycle of gathering data, transforming it, and automating. You can create a full closed-loop system.

There are a lot of possible steps here, and I don't want to go into the details of exactly which metrics or which information is relevant. But quickly: for raw data, people typically look at logs, metrics, and requests. For transformations, you're looking at aggregates, anomaly detection, and regular expressions. And for signal-based actions, you're basically updating systems or doing things like allocating more resources. There's plenty more you can do, which we'll leave for reference, but it's not that important for this particular talk.
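As a taste of the "automate" end of that list, here's a hedged sketch of acting on a signal by allocating more resources, using the official Kubernetes Python client to scale a deployment. The deployment name and replica count are placeholders:

```python
from kubernetes import client, config

def scale_deployment(name: str, namespace: str, replicas: int) -> None:
    """Act on an upstream signal by adjusting a deployment's replica count."""
    config.load_kube_config()  # or load_incluster_config() when running in a pod
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

# Hypothetical: an upstream transformation decided we need five replicas.
scale_deployment("checkout", "default", replicas=5)
```

The same API that kubectl uses is available programmatically, which is what makes closing the loop possible at all.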
So there's a lot to do here. How do we actually build a system that can handle it? When we started to build Pixie, our goal was to build a data system we could deploy on clusters to really understand what's going on and then automate workflows. Pixie was built to solve these problems, and we follow three core principles. The first is to do as much automated telemetry as possible; we use a relatively new kernel technology called eBPF. There are other talks that cover it in detail, so I won't go deep here, but essentially it allows you to dynamically insert code into the kernel to observe applications. The second is to make everything scriptable and API-driven, so you can write scripts that work with the data that's captured automatically. And the third is to be Kubernetes-native, so the system understands Kubernetes entities and can operate on them.

A quick overview of the project's feature set. I forgot to mention it earlier, but Pixie is an open source CNCF sandbox project, so you can check it out. Using eBPF-based auto-telemetry, we capture application, network, and infrastructure data, including things like application profiles and full-body requests, with pretty low overhead: we say under 5%, and usually under 2%, CPU overhead. We understand all the Kubernetes entities and connect things together. And the third feature, which we'll spend a little more time on because it's relevant to the demos in the latter part of this talk, is that Pixie is 100% scriptable and API-driven. We really think of this as infrastructure as code: we can write code to generate all the metrics we need to automate, and all of it is accessible through the API. Inside Pixie, everything is a script. Our UI actually runs as scripts as well, and it's easy to integrate with downstream tools through our API; there are actively maintained Grafana plugins and plugins for other SaaS products for Pixie.

Like I said, we'll talk in a bit more detail about PxL, Pixie's language, which gives us the capability of building these data-science-like tools. We had three principles when thinking about what PxL would be. First, we wanted to be able to query data and build workflows. Second, we wanted to be able to automate additional data collection that we didn't think about a priori. And third, we really did not want to invent a new language, because everyone starts off inventing a new language and it gets very complex for users.

You can see a little snippet of PxL code on the right side of the slide (I'll show a reconstruction of it in a moment). Without going into the details: it's based on Python, and is in fact 100% compatible with Python, and it builds on a library called pandas, which is for data processing and is very commonly used in data science. That snippet says: look at all the HTTP events over the last 30 seconds, capture what information is there, and pick out some specific columns. I don't want to walk through the language in detail, but it gives you a pretty simple way to work with the data in your system.

To recap: PxL is an embedded DSL. It is still a domain-specific language, so it doesn't work like generic Python, but it is valid Python, and in particular valid pandas, which is very common in data science. It's built for data analysis and machine learning, with core capabilities like running TensorFlow models inside PxL and running aggregates and joins quickly. To go one level deeper, PxL is what people call a dataflow language: it specifies the logical flow of data, and it's Pixie's job to figure out how to optimize and execute it efficiently on your system.

So what can PxL actually do? PxL scripts can be used to transform and analyze your data. Some of the things we support are aggregates, joins, and filters, and there's a lot more. The good thing about PxL is that it's composable. It's a declarative spec, kind of like SQL: you say which steps should be done. It's fully functional with no implicit side effects, which means you can build workflows off of each other. If you get a chance to look at Pixie's UI, you'll see that we compose multiple scripts together to generate the entire user interface.

So to recap: PxL provides an interface to work with data, and it lets us construct powerful, composable workflows.
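The slide snippet I described looks roughly like this. This is a reconstruction from memory, so the exact column names may differ:

```python
import px

# Look at all HTTP events captured over the last 30 seconds...
df = px.DataFrame(table='http_events', start_time='-30s')

# ...and pick out some specific columns.
df = df[['time_', 'remote_addr', 'req_path', 'resp_status', 'latency']]

px.display(df)
```

It reads like pandas because it is valid pandas; Pixie takes this logical description and decides how to execute it on the cluster.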
The next step is a couple of demos that demonstrate those capabilities, which Natalie will walk through. The first is a Slack alert on SQL injection attacks: Natalie will explain what SQL injection is, but if there's a SQL injection attack in your cluster, we'll flag it and generate a Slack alert. The second demo will look at HTTP request throughput and auto-scale a Kubernetes deployment. These are all pretty simple examples, but our goal is to build these up and get more feedback on what we can do. With that, I'll hand it over to Natalie for the demos.

Great, thanks, Zane. Okay, sounds like this mic works. As Zane said, I'm Natalie, and I'm going to take you through some demos that are meant to be proofs of concept illustrating the principles he talked about: how we can automate more of our workflows using the raw telemetry data that exists in all of our applications.

First, we need to deploy Pixie to a cluster. This takes about three minutes, so I've created a video to fast-forward that for you. What we're trying to illustrate here is how much easier eBPF makes data collection: instead of spending a lot of time manually instrumenting your application, you can run something like px deploy, or another eBPF-based tool, and automatically start collecting raw data from your cluster.

Okay, so as Zane said, the first demo is creating a pipeline that detects a possible SQL injection attack and sends a Slack alert when one is found. Before I get into what a SQL injection attack is: for this demo we're going to use DVWA, a well-known security project, which I'll pop up for you. Oh, I think my port-forward has stopped; let me get that back up. DVWA is a test application used in the security industry that has a lot of known vulnerabilities. It is intentionally built to be insecure, so people can use it as a testing ground for running different kinds of attacks, and we're using it to demonstrate our SQL injection detection. We've deployed DVWA to our Kubernetes cluster, which is how Pixie is collecting its data.

Actually, one thing first. To show some of the data Pixie can collect, let's run a SQL query with DVWA that isn't a SQL injection, just a single ordinary SQL command. It should have executed. Now I'll use Pixie to see that the query ran. This is a very simple view in Pixie, but it collects all of the MySQL data on your cluster. If we inspect this row, we can see that Pixie captured the full SQL request I just executed. The font may be a little small, but that is the foobar123 I just ran.
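The simple view I'm using is itself just a short PxL script, along these lines. Treat the table and column names as assumptions reconstructed from memory:

```python
import px

# All MySQL requests Pixie has captured on the cluster in the last few minutes.
df = px.DataFrame(table='mysql_events', start_time='-5m')

# Keep the full request body so individual queries can be inspected.
df = df[['time_', 'req_body', 'resp_status', 'latency']]

px.display(df)
```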
So let's go back to: what actually is a SQL injection? I didn't want to come up with a formal definition, because security experts are much better at that, but the way we think about it, it's a way for people to run malicious SQL in your application in order to get information they're not meant to have access to. To take a very simple example (and it's worth noting there are many, many kinds of SQL injections, so this is not a complete picture): let's say I have an endpoint where I pass a user ID, like the one I just showed you in DVWA. I pass in user_id=123, and on the back end my application executes SELECT * FROM users WHERE user_id = 123. That's intended to give me the information for that one user. But suppose a malicious actor puts something that isn't an ID in there, say "123 OR 1=1". A naively implemented application would execute SELECT * FROM users WHERE user_id = 123 OR 1=1, and since 1=1 is always true, that effectively returns every user in the table. That would be really bad.

I'm going to talk about SQL injection detection, which we think is really important because no application is perfect and free of vulnerabilities. But it's worth noting that prevention is really, really important when it comes to SQL injection. Detection is only part of the story, and as application developers we should be thinking about security and doing prevention as the first step.

But let's say we have an application out in the wild and we want to make sure no SQL injection attacks are running against it. Once we have a list of all the SQL statements the code is running, there are two main approaches. The first is rules-based: parse all your SQL queries and make sure they don't use any syntax you've deemed prohibited, such as UNIONs. You can also use regexes to detect that kind of syntax. The complication with this approach is: what if there's a legitimate use of that syntax in your queries? The other approach is machine learning: train a model on real-world examples of legitimate queries and of SQL injection attacks. Such a model could, in principle, learn that a UNION is okay in one scenario but not in another. The major complication there is where to get the data set. You need labeled data with known SQL injection attacks that reflects your application, and that can be a real blocker for the machine-learning approach. So for this demo I'm going to use the rules-based approach with regexes. But as Zane said, PxL supports running TensorFlow models, so if you had the data set to train a SQL injection detection model, you could easily plug that in instead of the regex-based approach.
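To give a feel for the rules-based approach, here's a minimal sketch of a regex classifier. The rule names and patterns are illustrative, not the exact ones from our script (those are in the guest blog post I'll mention in a moment):

```python
import re

# Illustrative rules: each maps a name to a pattern that is suspicious
# in a request body that should have contained a plain user ID.
RULES = {
    "unmatched single quotes": re.compile(r"^(?:[^']*'[^']*')*[^']*'[^']*$"),
    "comment dashes": re.compile(r"--"),
    "union keyword": re.compile(r"\bUNION\b", re.IGNORECASE),
    "always-true clause": re.compile(r"\bOR\s+1\s*=\s*1\b", re.IGNORECASE),
}

def rules_broken(query: str) -> list[str]:
    """Return the names of all rules a SQL request body breaks."""
    return [name for name, pattern in RULES.items() if pattern.search(query)]

# Example: the injected query from earlier.
print(rules_broken("SELECT * FROM users WHERE user_id = 123 OR 1=1"))
# -> ['always-true clause']
```

The rub, as mentioned above, is false positives: a legitimate query can contain a UNION or a quote, which is exactly where a trained model could do better than rules.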
I also want to call out that, in addition to DVWA, a tool helping with this demo is sqlmap. It's a really cool command-line utility that basically tries to attack your database and figure out what vulnerabilities exist in it, and we'll be using it today.

Okay, let's get started. First off, we want the script that actually detects the SQL injection attack. The script running right here just shows you which SQL queries exist in my cluster, but we want one that runs pattern matching to detect which of them could be a SQL injection. So let's first populate the table with some SQL injection attacks using sqlmap. I'm going to point it at my DVWA instance and say yes to various prompts. Now it's running a bunch of tests, trying to attack the database and execute malicious code. The data tends to pop up pretty quickly, and what we can see here is that the SQL commands sqlmap has been running are showing up in this view.

So I've written a script. Or rather, we have written a script; we actually had the help of a security team, who wrote a great guest blog about this on our blog, to look at all of the request bodies of the SQL queries and try to label them with which rule they've broken. So instead of the script running here that just queries the MySQL data, we'll add a little more functionality to it and actually try to classify possible injections. This script has successfully given us a list of them. Some of these might be legitimate queries, but we want to flag anything that breaks the rules we've decided aren't allowed. What we can see right here is that someone, someone aka sqlmap, wrote a very suspicious-looking query. This does not look like a user ID to me, and we have the rule broken: unmatched single quotes.

But I don't want to manually run these queries every time; I want to be alerted when there's a SQL vulnerability in my application. Because Pixie has an API, we've created a Slack bot that alerts my Slack channel when something looks like it could be a SQL injection. The bot takes the exact same script I just showed you and uses Pixie's API to generate the data set, then pings Slack if we see an attack. So let's create some more attacks, and if the demo gods are with me, we should see... oh, it's detected a possible SQL injection. It flagged the unmatched single quote rule again; I guess sqlmap always runs that one first, a few times. What we have now is a full data pipeline that looks at my SQL data and pings me on Slack whenever it detects a possible injection. Here's a new one: comment dashes. That definitely looks suspicious.

To recap, we've applied exactly the pattern Zane talked about. We used Pixie to gather raw data and transform it into a signal, by collecting the raw SQL events and using a script with regexes to establish which ones might be SQL injections. And then we acted on the signal by pinging Slack when there's a problem.
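The bot itself can be quite small. Here's a hedged sketch of the shape it takes, using Pixie's Python API client (pxapi) and the Slack SDK. The PxL script name, output table name, tokens, channel, and column names are placeholders, and the exact client calls may differ slightly from what we used:

```python
import pxapi
from slack_sdk import WebClient

PXL_SCRIPT = open("sql_injection_rules.pxl").read()  # the detection script

# Connect to Pixie and run the detection script against the cluster.
px_client = pxapi.Client(token="PIXIE_API_TOKEN")
cluster = px_client.connect_to_cluster("CLUSTER_ID")
script = cluster.prepare_script(PXL_SCRIPT)

# Ping Slack for every query the script labeled as a possible injection.
slack = WebClient(token="SLACK_BOT_TOKEN")
for row in script.results("possible_injections"):
    slack.chat_postMessage(
        channel="#alerts",
        text=f"Possible SQL injection ({row['rule_broken']}): {row['req_body']}",
    )
```

In a real deployment you'd run this on a schedule or stream continuously; the point is that the same script powering the UI view also powers the alert.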
All right, let's move on to the next demo, a completely different use case: workload scaling. I think a lot of us who use Kubernetes run into these questions often. How big should my deployment be? How many pods does this service need? How much CPU and memory should each pod get? Both the size of each pod, in terms of the resources allocated to it, and the number of pods are important factors in sizing a deployment. It's also worth noting that the size my deployment needs might not be the same all the time: 2 a.m. in my time zone versus something like Black Friday are completely different situations.

There are various signals we could use to size a deployment. One of the most common is the CPU and memory usage of the pods: if a pod's CPU and memory usage is high, I should probably add more pods, because that one looks overtaxed. I might want to look at request latency: this request is taking too long; maybe the pod has some kind of resource utilization issue. Maybe I want to look at the latency of my downstream dependencies, or my number of outbound connections, or application-specific metrics, like how long a particular function is taking. There are so many things we could use to drive the auto-scaling of a deployment. (All of these slides are uploaded to this session, by the way, so you can refer to them later.)

Kubernetes is amazing here. It has such an API-driven approach that auto-scaling on these more complex factors is actually possible. Right off the bat, it supports both horizontal and vertical auto-scaling: horizontal increases the number of pods in a deployment, while vertical increases the amount of resources allocated to a pod. Built in, it can auto-scale on CPU and memory, which, like I said, are very common and very valuable signals. But it also lets you define your own custom metrics, things you know your deployment cares about. CPU and memory apply to everyone, but you may have something very specific to your application that you need to scale on. The Custom Metrics API is really useful because you can hook your own metric into the auto-scaler to add more pods, or more resources per pod.

We're going to be using a very sophisticated demo app. It's basically just an echo server, but it's enough to make the point about auto-scaling. That was meant to be a joke, people. No? Anyway, the Kubernetes SIG put together a great example of how to build a custom metrics API server, and we're really appreciative of that work, because we based this example on their API server. If you want to build your own, you should totally use that project as a starting point. We're also using a tool called hey for HTTP load testing; rakyll, aka Jaana Dogan, who's one of the board members of the Pixie project, created it, and it's really easy to use.

So let's go into it. I'm going to go into K9s, of course, and I can see I'm running this echo service on an external IP, with just one pod right now. I've deployed the echo service with an auto-scaler that uses Pixie metrics to say: I'd like more pods when my application is serving a lot of requests. I'll launch this and then show you a bit more about how it works. Let's give it some load, a lot of requests to the endpoint we just saw, and while that runs in the background, we can see containers being created in response.

So how does this work? We've defined a horizontal pod auto-scaler that says: I don't want any more than 10 pods. That max is a safety cap in case you accidentally scale to 300 pods or something crazy like that. And I want to use the custom metric I've defined, px-http-requests-per-second, with a target value of 20. What happened is that this pod saw it was getting a lot more than 20 requests per second, so the auto-scaler decided to add a lot more pods. Now the load has died down, so we can see the pods that were just spun up are terminating.
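The custom metric behind this is computed by a short PxL script, roughly like the following. This is a sketch, with the table name, window, and aggregate reconstructed from memory:

```python
import px

# HTTP requests Pixie captured over the last 30 seconds.
df = px.DataFrame(table='http_events', start_time='-30s')

# Attribute each request to the pod that served it.
df.pod = df.ctx['pod']

# Count requests per pod, then convert the count to requests per second.
df = df.groupby('pod').agg(requests=('latency', px.count))
df.rps = df.requests / 30

px.display(df)
```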
You can configure, through the auto-scaler API, how quickly scaling reacts and how many pods can be added or removed at once; I just made it really fast for the purposes of this demo. The metric provider we have is pretty simple. It takes a PxL script like the one above, which computes the average number of HTTP requests per second by pod over the time window, and it fulfills the endpoints defined by the Custom Metrics API: get metric by name, get metric by selector, and list all metrics. So in just about 200 lines of code, we've created a metric server that will auto-scale my deployment based on the number of requests. To summarize what we just saw: we collected the raw HTTP requests in Pixie, although you could do this with anything that taps into HTTP requests; we calculated the requests per second by pod; and we plugged that into the Kubernetes API to auto-scale our deployment. And with that, I'm passing it back to Zane.

Awesome, thanks for those demos, Natalie. So in these demos, Natalie showed how you can run some simple data workflows on Pixie. For the SQL injection demo in particular, there's a blog post on the Pixie blog that goes into a lot more detail if you're interested, and we're going to post a blog on the auto-scaling project next week with a lot more of the technical detail about how it works and code you can look at. "Actually, sorry, one quick thing: these examples are in our Pixie demos repo today if you want to check them out, but as you said, the blog is coming?" Correct. They're already all public, and there's a blog post coming next week that will walk through the information in more detail.

So what are we working on next? We're working on cross-site scripting detection using PxL, and really what we want is to learn from all of you about more of these use cases and what you're interested in seeing. You can find us on GitHub or on Slack: either create an issue on GitHub or send us a message on Slack, in the CNCF Slack or the Pixie Slack. And that's it. Thanks a lot for listening, and we'll take questions. Thank you.