Thank you for coming to our tutorial on building an open source observability stack. Let's introduce ourselves: my name is Hannah, I work at New Relic, and I primarily contribute to Pixie.

Hi, everyone. I'm Michelle. I also work at New Relic, I'm one of the maintainers of Pixie, and later I'll be walking you through the demo that we have at the end.

Hi, I'm Vihang. I also work at New Relic and I'm one of the maintainers of Pixie.

Hi, everyone, I'm Clemens. I am not one of the maintainers or developers of Pixie; I'm actually from VMware, and I represent more of one of you: essentially I am a user of Pixie. I will walk you through not only OpenTelemetry but also how we use these tools in one of the projects at VMware, to show you how to get observability and what to do with that data.

All right, let's get started. It's super interesting: there was a CNCF study last year that asked which observability tools people use and what challenges they see, and more than half of the respondents said they are struggling with just the sheer volume of different tools that different teams are using. This is a problem we need to address. Observability has many facets: many different things you can collect, and many ways to collect them, store them, visualize them, and so on. Part of what we want to do today is really walk you through what observability is, how you instrument your code to detect all these things, how you store the data, how you visualize it, and, in short, how you get a bit of observability in your cluster. Kubernetes is a special case for observability because there are so many open source tools, and we want to guide you down a path of understanding what the most common tools are, which ones might be specifically interesting for your environment, and how you actually use them. So we first want to walk through, at a very high level, what the tools are, and then in the second section Michelle is going to walk you through, hands-on, how you actually get this done.

One other thing that is super interesting: because there are so many open source technologies, you can ask yourself how to avoid being locked into one of them. It is very tricky if you make wrong assumptions or wrong decisions early in your project. What if you later want to change something? Does that mean you have to rebuild your entire observability stack, or can you just swap out individual pieces? Especially with closed-source solutions the lock-in can be very dangerous. Let's say you're locked into a specific vendor and the vendor increases prices twofold or threefold; that is an issue you will have to deal with. By understanding what the different tools are and what your options are, you can think ahead and remain flexible.

So, as I was saying, understanding this flexibility is super important. Let's assume you have an application and you have already instrumented it to get telemetry. Say you do distributed tracing, which is awesome, as we will see in the rest of this discussion and in the tutorial later, and you use Zipkin for looking at your distributed traces. You have a working end-to-end workflow: you visualize your traces, you can debug your system.
Everything is awesome. But let's say that, for whatever reason, you want to switch away from Zipkin. If you understand the ecosystem, you know what Zipkin does, what format of traces it receives, and what other systems in the ecosystem it works with, so you can say: well, Zipkin does distributed traces, so what else does distributed tracing? There is Jaeger, just one example, which could essentially replace that one component, and you re-plug your end-to-end workflow for getting observability within your cluster. So again, understanding your options and knowing what open source tools exist really helps you plan ahead: your system, your environment, your ecosystem.

I want to take a step back, though, before we jump into individual systems. We talk about observability, but what does that actually mean in practice? In my opinion there are three main pillars of observability that you need to know about: logs, metrics, and traces. Probably all of you know what logs are; essentially a log is a timestamped event from your system. Maybe your application emits a log line that says "hey, I have an issue processing a particular request", or your system screams for help saying it just ran out of memory or something didn't work as expected. That is a log. A metric, on the other hand, is a numeric representation of something that happened at a particular point in time. For instance: what is my CPU usage right now on a particular Kubernetes node? How many HTTP requests have I processed in the last 30 seconds? How many messages are in my Kafka queue waiting to be processed? Metrics are very useful for observability, especially, as many of you probably know, for automatic scaling and for alerting; a queue that is too long is probably a problem, and so on.

Traces are less familiar. Some of you may know them, but while almost everybody knows logs and metrics, traces are slightly different. When you think of a trace, think of a series of events that belong together and that show how one particular request is processed within your application. What is important here is that a trace might be distributed: a request comes into your load balancer, it gets processed by the load balancer, maybe forwarded to an endpoint that makes distributed calls to other services within your application, and the trace encapsulates all the steps involved in processing that single request coming into your application. A span is one single event within a trace. A span could be an individual API call that you make, or an individual query against your database, or an HTTP request from one pod to another. If you take all the spans involved in processing your inbound request together, you call that a trace; it's a combination of things. So here we have logs, metrics, and traces, and this is typically what we talk about when we talk about observability.
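To make the trace/span relationship a bit more concrete, here is a minimal sketch using the OpenTelemetry Python SDK (one of the tools covered later in this session); the service and operation names are invented and the exporter setup is omitted, so treat it as an illustration rather than a complete setup.

```python
from opentelemetry import trace

# Without further SDK configuration this tracer is a no-op, which is fine for
# illustrating the shape: one inbound request becomes one trace, and every unit
# of work inside it becomes a span nested under the root span.
tracer = trace.get_tracer("frontend")

def handle_checkout():
    with tracer.start_as_current_span("POST /checkout"):              # root span of the trace
        with tracer.start_as_current_span("call inventory-service"):  # child span: RPC to another pod
            pass
        with tracer.start_as_current_span("INSERT INTO orders"):      # child span: database query
            pass
```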
What we want to do very concretely today is show you how to tackle these logs, metrics, and traces, possibly distributed, with open source tools, and in this specific case with four different CNCF projects. We're going to be talking about Fluent Bit for logging, Prometheus for metrics, and OpenTelemetry, which covers metrics, distributed traces, and logs. And then finally, us being from Pixie, or working with Pixie, we're going to talk about the Pixie project, of course, which also tackles metrics, traces, and application profiles. What is interesting is that these are projects at different stages of their CNCF lifecycle. Fluent Bit and Prometheus, which many of you might already be using, have been around for quite a while and are very mature projects; OpenTelemetry is slightly newer; and Pixie is currently a sandbox project, and we're working on advancing it through its CNCF lifetime.

All right, so first let's focus on logging and Fluent Bit. What is the problem with logging in Kubernetes? Everybody knows logging, and it seems so simple, but there is this key problem that you have probably all hit: you look at the logs of a pod, there is an issue, and you want to know more about it. Typically the first thing you do is kubectl edit the deployment and increase the log verbosity, and of course now those earlier logs are gone, because the pod restarted. Pods in Kubernetes, as you probably know, are ephemeral, so getting access to their logs and storing them outside of the pod is super important, but it is also difficult. A typical Kubernetes application will have five, ten, fifty different components, and you can be guaranteed you'll have fifteen different log formats, because why standardize, right? So we need something that is able to handle different logging formats. I also don't want to go to different pods to get the logs; I want one central place where I can see all the logs of my entire Kubernetes cluster. And Kubernetes, as you probably know, is intended to scale massively: we could have thousands of pods on hundreds of nodes, or even more, so logging becomes a real issue in terms of performance. Whatever we do to capture these logs has to scale massively, so really what we're looking for is a lightweight logging solution, and we have one: Fluent Bit.

Fluent Bit has been around for quite a while, and we'll talk a little about the difference between Fluentd and Fluent Bit, but essentially it is a system for capturing your logs and processing them so you get a standardized view. I do not want to go too much into the details of Fluent Bit's architecture, but I think it's important to understand what it offers and what you can do with it, so you can decide whether it is the right solution for you. Maybe it's not, but understanding what it can do helps you select Fluent Bit over Fluentd or some other logging system. At its core, Fluent Bit has a very plug-and-play pipeline architecture with multiple stages that take care of your logs. First of all, it is able to consume information from different sources: it can monitor your pods by integrating with journald or by using tail to capture log files, but it can get data from many other sources as well.
The logs then go through a series of stages. First there is a parser, to make things more structured. Sometimes your logs are just simple log lines, and Fluent Bit lets you choose a particular parser, which could be a regular expression saying "the process name always comes first, then a timestamp, then a PID", and so on, because that might differ across your applications. This parser step turns the line into a structured record with metadata; it could also be a JSON parser or whatever else fits your format (there's a small sketch of the structured-logging idea at the end of this section). Next comes a filtering step, because you might not actually be interested in all the logs of your application. Fluent Bit lets you filter, say, only things that come from Python processes on a particular node. Typically you would capture as much as possible if you can, but maybe in your environment that's not possible, and Fluent Bit lets you narrow it down. The interesting piece here is that filtering is not only about removing information; it can also add information. For instance, as I was saying, we can get pod logs by integrating with tail or journald, and in the filter step Fluent Bit can also integrate with the Kubernetes API, so a log line captured at a certain timestamp gets metadata saying it belongs to this container, in this pod, in this namespace, and so on. Having that rich metadata really helps. Finally, Fluent Bit lets you buffer and route your data: you might say "if a record has this tag, send it to this log storage; if it has another tag, send it to another storage, or to all of them if you have multiple". It is incredibly configurable, which makes it easier to decide whether it is for you.

One thing we'll try to do throughout the presentation, and also later in what Michelle is showing you, is help you distinguish between the different projects that are out there, and one question people ask very often is: what's the difference between Fluentd and Fluent Bit, aren't they the same thing? It is confusing because they come from the same source; the same people worked on them, and they share the same underlying architecture. Essentially you can say that Fluent Bit is a more scalable version of Fluentd. Fluentd excels in how configurable it is: it has tons of plugins for capturing data, outputting data, filtering data, and so on. Fluent Bit, on the other hand, has fewer integrations but is incredibly scalable. So if you have a very large cluster, hundreds and hundreds of nodes, thousands of pods, Fluent Bit is something you want to look at; if you have very particular data sources, parsing, or output needs, maybe Fluentd is more for you, just because it is so vastly configurable. That's the difference: Fluent Bit is just incredibly fast and kind of the new cool thing, and that's what we're going to be showing in the demo later.
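The parser step is easiest when your application already emits structured records. As a hedged illustration (plain Python here, not Fluent Bit configuration), this is one way an app could write one JSON object per line to stdout so that a tail input plus a JSON parser can turn each line into structured fields instead of raw text; the field names are arbitrary.

```python
import json
import logging
import sys
import time

class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""
    def format(self, record):
        return json.dumps({
            "ts": time.time(),
            "level": record.levelname,
            "process": record.processName,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonLineFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger(__name__).info("issue processing request")
```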
All right, we have talked about logging; let's talk about metrics. Many of you might be using Prometheus already, so let's talk a little about Prometheus. It's clearly one of the most established systems out there for dealing with large volumes of metrics, so we want to tell you what it does, how it works, and whether it could be for you. But as with logging, let's briefly talk about what the problem with metrics in Kubernetes even is; isn't it easy? We already talked about pods being ephemeral and about the massive amount of data we have to capture, which is a problem for metrics just as it is for logging, so we need a system that scales really well. But what is also interesting is the idea of dimensionality of metrics in Kubernetes. Typically you don't just want to say "my application records this one value at a particular timestamp"; you want to be able to say "I don't care about one pod, I care about a replicated set of pods that belong to a DaemonSet, or a Deployment, or whatever: sibling containers, and I want to know about all of them". So when you capture metrics, you want to annotate them with metadata, automatically, so you can say "give me the sum of the number of HTTP requests processed across all pods that belong to the same application". We want to augment the metrics as we capture them so we can later aggregate and filter using those dimensions. And of course we need to centralize access: we do not want to go to different places to collect information about different problems, we want one place to go to. Prometheus, and the open source ecosystem that has been built around it, allows us to do exactly this: a very scalable, very configurable way to capture annotated metrics and a centralized place to visualize them.

So let's talk a little about how Prometheus works, because it will be important for deciding whether Prometheus is for you, and even more importantly, how you can use Prometheus within Kubernetes and configure the data collection. The key thing to know about Prometheus is that at its core it has two models of capturing metrics: a pull-based model and a push-based model. Let's talk about the pull-based model first, because I think it's the more unusual one, but it's the one that allows Prometheus to really scale. The idea is that Prometheus captures metrics by going to the individual pods in your environment and saying "hey, tell me about all the metrics you have to report". It uses service discovery to find out what pods are available and then scrapes each of them. The reason this pull-based system is so powerful is that if you think of the opposite, where different applications push data to Prometheus, you quickly run into the problem that many pods might all push metrics to the Prometheus instance at the same time. Prometheus is scalable, but if they all push at once, they might just take it down. With a pull-based system, Prometheus can decide the periodicity and the times at which it goes through the different pods and collects data, when it works for Prometheus. The downside of the pull-based system is that every single system it talks to must have an interface, for example an HTTP endpoint, something that allows Prometheus to pull information in a format that it supports.
Prometheus typically makes an HTTP request to an endpoint to get the metrics in a specific text format, parses them, and puts them into its storage. But that implies your pods have to expose an endpoint from which this can be fetched, so if you have a closed-source application that doesn't have one, the pull-based model of Prometheus might not work for you out of the box; we have a solution for this, don't worry. Another thing that is important to understand about the Prometheus architecture is that it is very configurable in how it stores its metrics: there are different time series databases it can integrate with for storing data, and also different options for visualizing it. Very typically people use Grafana for visualization, which of course means there is a query language underneath that you might also be able to integrate with directly; this is something you will want to look into. Also interesting: very typically you will want alerts from your Kubernetes environment. You want to be able to say "if there are more than 10,000 elements in my queue, I want an alert", and alert management is also something Prometheus can give you by integrating with other solutions; it is probably the nicest place in your architecture to implement this.

Let me very briefly talk about what happens if you can't use the pull-based model. As I said, one reason is a closed-source application that just doesn't have an endpoint from which you can fetch metrics. But there are other problems with the pull model, for example very short-lived containers. Because Prometheus decides when to pull its data, if you have, say, a Kubernetes CronJob or something that runs for a very short time, and Prometheus doesn't happen to pull right at the moment the thing runs, the data is just lost; as I said, pods are ephemeral. So how would you get that data? For this there is also a model for pushing information to Prometheus, which handles exactly this case: very short-lived processes, or systems that cannot expose an endpoint, or systems that have one but, the way your system is designed, you cannot access it. The push-based model can circumvent that problem. So I hope that by understanding the architecture of Prometheus, and the requirements it has, very often this interface from which it can pull information, you have an idea of how it works and whether it is for you. I think in very many cases it will be interesting to you, so definitely check out the details of Prometheus; it's an amazing system.
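To make the pull model concrete, here is a minimal sketch using the Python prometheus_client library; the metric names, labels, and port are invented for illustration. The application exposes an HTTP endpoint, and Prometheus scrapes it on its own schedule.

```python
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

# Metrics this process is willing to report whenever Prometheus comes to scrape it.
REQUESTS = Counter("http_requests_total", "HTTP requests processed",
                   ["method", "endpoint"])
QUEUE_DEPTH = Gauge("work_queue_depth", "Messages waiting in the queue")

if __name__ == "__main__":
    start_http_server(8000)  # serves the Prometheus text format on :8000/metrics
    while True:              # stand-in for real application work
        REQUESTS.labels(method="GET", endpoint="/products").inc()
        QUEUE_DEPTH.set(random.randint(0, 100))
        time.sleep(1)
```

For the push case mentioned above, the same library also offers a push_to_gateway helper for sending metrics to a Pushgateway, which is one common way to cover short-lived jobs.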
Alright, so we've talked about metrics; now we want to talk a little more about OpenTelemetry, distributed traces, and so on, and for this I'll hand over the microphone.

Alright, so I'm going to talk about OpenTelemetry, which is also called OTel for short, if you've heard that. One general observability challenge is that there are a ton of options. This is a picture of the observability and monitoring portion of the CNCF landscape, showing all the observability and monitoring tools out there; the open source ones are shown with a white background and the commercial vendors with a gray background. You can see there are just so many options to choose from, and it's possible that each tool here has its own custom way of instrumenting your application to generate, collect, and export data. That can make switching between observability solutions, or adding a new one, very difficult. This is the problem OpenTelemetry set out to solve.

Now, what is OpenTelemetry? This is actually quite confusing, and I hope one of the main things you take away today is a clear understanding of all the cool projects happening under the OpenTelemetry umbrella, because each piece I'm going to mention could be a project on its own. At its core, OpenTelemetry is a collection of standardized, vendor-agnostic tools to generate, collect, and export telemetry data. What does that mean? First of all, it's a universal format for telemetry data, and that includes metrics, logs, and traces; that's one part of OpenTelemetry. The second part is that it provides client libraries to instrument your app, for example OTel for Java, and OpenTelemetry provides ways to do manual instrumentation as well as automatic instrumentation. Another thing OpenTelemetry has is an API for sending and receiving data in the OpenTelemetry format, which enables you to connect OpenTelemetry-supporting applications and libraries together. And finally, there is the OTel Collector, which allows you to receive, transform, and send data in the OpenTelemetry format. This is useful because it gives you one single place to collect and process all of your telemetry, and then it's a single component that has to talk to your storage solution. One thing to note is what OpenTelemetry is not: it's not an observability back end like Jaeger or Prometheus.

I'm not going to go into the implementation details of OpenTelemetry right now, but one thing I want to demystify is the overlap between Prometheus and OpenTelemetry, because these projects are very interoperable. Prometheus is, at the current moment, more widely adopted: it has an SDK for instrumenting your code, it defines a standard metric format, and it also has a back end for storage and querying. OpenTelemetry is a newer project; it defines a standard that is a superset of the Prometheus metric standard, and it also provides standards for traces and logs. In addition, OpenTelemetry has an SDK for instrumenting your app, but unlike Prometheus, OpenTelemetry does not have a back end.
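To give a flavor of the SDK plus collector pieces just described, here is a hedged sketch of wiring the OpenTelemetry Python SDK to export spans over OTLP to a collector; the service name and endpoint are placeholders for your own setup.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Describe this service and send its spans to an OTel Collector's OTLP/gRPC
# receiver; "otel-collector:4317" is a placeholder address.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("demo-operation"):
    pass  # application work goes here
```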
So that was OpenTelemetry. Now we're going to briefly discuss Pixie. Pixie, as we mentioned, is a sandbox project (we just applied for incubation), and it is a newer project. The motivation behind it is that right now, adding instrumentation to your app is quite tedious: it takes a significant amount of time to set up trace and metric collection. This slide shows some example application code; highlighted in red is the instrumentation logic, and highlighted in green is the actual application business logic. It just goes to show that instrumentation can take a lot of time to add, so the idea is: what if we could automatically collect some of this telemetry data? I think that's where the observability field is going, and one way we're going to look at doing this is with eBPF technology. eBPF allows you to dynamically program the kernel for efficient networking, observability, tracing, and security, and we're going to talk about how Pixie uses eBPF for observability. At a high level, when you have Pixie deployed to the nodes in your cluster, it installs eBPF kernel probes that are set up to trigger on Linux system calls related to networking. Whenever any of your applications makes network-related system calls, such as send or receive, these eBPF probes can snoop the data being sent, parse it according to various protocols, and store it for querying. The main point, now that you roughly know how eBPF works, is that you can use this technology to capture full-body requests, network metrics, and even application CPU profiles. The observability ecosystem has recognized that automated instrumentation is a great way for people to instrument their apps; well, not the only way, but it gets maybe 80% of the telemetry we need off the table, and we still need custom metrics for the very specific things we're looking at. All of these projects are interoperable; Pixie exports in the OpenTelemetry format, so you can use them together to get both automated telemetry and more manual instrumentation for the specific things you're interested in.
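To show the flavor of kernel probes (and only the flavor; this is a toy sketch using the BCC Python bindings, not how Pixie itself is implemented), here is a small program that counts sendto() syscalls per PID. It assumes BCC is installed and needs root privileges.

```python
import time

from bcc import BPF  # BCC Python bindings; requires root and kernel headers

# A kprobe on the sendto() syscall: every time any process sends data over a
# socket, bump a per-PID counter. Real tools parse the payload too; this only
# shows that a kernel-side hook sees network activity without touching the apps.
prog = r"""
#include <uapi/linux/ptrace.h>
BPF_HASH(counts, u32, u64);
int trace_sendto(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("sendto"), fn_name="trace_sendto")

time.sleep(5)  # let some traffic happen
for pid, count in b["counts"].items():
    print(f"pid={pid.value} sendto_calls={count.value}")
```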
So that was Pixie. I'm going to hand it back to Clemens to discuss what this might look like in a real-world use case.

Thanks, Hannah. Alright, so as I said at the beginning, I'm from VMware. I work in the Office of the CTO, in a group called xLabs, where we try to build cool new things that are not yet part of a product, thinking ahead to what our customers and users will need down the line, and observability and security are clearly top of mind; that's hopefully why you're here. I want to very briefly talk about what we call Project Trinidad, which is what I'm personally working on and which uses many of the tools and technologies we just talked about to solve a security problem. I don't want to dive into the details, and this is not a marketing pitch, but what we essentially want to do is automatically collect OpenTelemetry information about your Kubernetes clusters and learn what is normal in your cluster from a security point of view. I want to know what the pods are, how they talk to each other, which protocols they use, which APIs are invoked from which types of pods, and which API parameters are used. You can very clearly see that there are certain patterns in how these pods interact, and we learn what is normal. Then maybe one day one of your pods suddenly misbehaves: it talks to things it has never talked to before, invokes APIs it shouldn't be invoking, sets parameters we've never seen in the past, and that could be a sign of an attack. Maybe one of your pods got compromised and an attacker is trying to extract data from your database and send it out to the internet; clearly a problem. The cool thing is that if you look inside a Kubernetes cluster, the east-west traffic is really quite standardized, and when you have lots of data, very standard patterns, and you want to find outliers, what could be better than machine learning? That is exactly what we're trying to do with Project Trinidad: we automatically learn what is normal in your environment, just to alert you that there is probably a security incident if we see something that is not.

We're using many of the tools we've seen before. Essentially we deploy Pixie at the core of your Kubernetes cluster to collect a lot of information, get it into the OpenTelemetry format, and send it to a cloud back end. We use many open source technologies along the way, and I'm presenting this just to show how you could plug these things together in a very similar fashion in your own environment. We actually use the Strimzi operator to automatically deploy Kafka for uploading data from the OpenTelemetry collector, which the OTel Collector supports out of the box. Then we use a whole bunch of other pieces: for example, we use Kubeflow and MLflow to look at the data, which we store in Elasticsearch, and which the Jaeger tooling can actually write for us, taking it straight off a Kafka queue and storing it in Elasticsearch. With Kubeflow and MLflow we learn what is normal, then we run inference to find what is not, and in the end we can even use the Jaeger UI to visualize what is going on in your environment and flag the things that are not normal. Now, maybe you don't want to do fancy things like ML for finding security-relevant anomalies, but a lot of this you can build yourself using the technologies we've talked about today, just by plugging them together. So again, understanding the ecosystem, how things interact, the formats involved, and how you can swap one piece for another really empowers you to do a lot of very cool things.

All right, I'm sure you all want to get your hands dirty and see how this works in practice; I see you all have your laptops open. So I would say: Michelle, walk us through how this works in a real environment.

Okay, thank you, Clemens.
So up on the screen you see we have a QR code, or if you're more comfortable typing, feel free to type out the URL as well. This is basically a set of demo applications we've built on top of the OpenTelemetry demo, and there are a bunch of different steps: setting up your cluster, deploying the demo application itself, and then walking through the dashboards and the data it provides. I highly encourage everybody to go check it out, but I also want to be mindful of the time we have, because, as some of you may or may not know, the Wi-Fi is very slow at this conference, and when we tried to deploy it ourselves here it took about 30 minutes. We're not going to have everybody sit there for 30 minutes, so what I'm going to do instead is walk you through what some of this data looks like, to give color to some of the tools that Hannah and Clemens mentioned earlier.

When you deploy the demo applications, this is basically the flow of what you have. First there is the demo application; I'll show you what that looks like in a moment. From the demo application we collect metrics, traces, and logs. The metrics and traces are collected by OpenTelemetry, using both auto-instrumentation and manual instrumentation, and the logs are collected by Fluent Bit, which we also mentioned earlier. These flow through to the OpenTelemetry Collector, and the collector then sends the data out to different back ends: Prometheus, which stores your metrics; Jaeger, which receives and stores the traces; and for the logs, we send them from the OpenTelemetry format to Loki. At the end of that we have Grafana, which pulls information from all of these different data sources and shows it to you in one UI so you can make sense of what's happening in your cluster.

Okay, so I'm going to change up the display, because I will need to mirror this instead of splitting the screens; just give me a moment. Okay. If you follow the tutorial I linked in the QR code earlier, you're going to be deploying a few things in your cluster. We don't need to go into what all these different pods are; just know that there are a bunch of different pods that all do different things. Some are the demo application itself, some, like the ones you can see here, are the Fluent Bit collectors, there are pods for Grafana, and so on and so forth. The next part of the tutorial is that, in order to actually see all of this in a UI, you're going to want to port-forward, so I'm going to go ahead and run that. Great, that's starting up, and I'm going to open the Grafana UI. Or actually, let me show you the demo application first. This is the demo application that comes with the OpenTelemetry demo: a simple e-commerce application where you can go and shop for a bunch of different telescopes. You can click in, view a product, add the things you want to buy to your cart, check out, and so on, and there's a lot of information you might want to track here to make sure you're always serving the best experience for your customers.
So we can go ahead and go into Grafana to see what some of that data looks like. With the demo application we've also deployed a bunch of different Grafana dashboards, which you can see here; we're going to dive into each of these and see what they can tell you about the demo application. First we're going to start with Prometheus, that is, metrics, and you can see there are a bunch of different metrics you can gather and many different ways to visualize them. For the top two, let me zoom in to make it clearer: you can collect things like system metrics, and a lot of this can be auto-instrumented by OpenTelemetry; the top two panels are actually produced through auto-instrumentation. You can get things such as CPU and memory, which help you figure out, for example, that your program might be stuck in a very long or infinite loop, or that your memory keeps growing and you have a memory leak. You can also get metrics about your network requests, so you can see the latency of your different requests or track the error rate.

Here is a different kind of metric I wanted to point out: the recommendations count. As you may have heard earlier, there are many different kinds of metrics, and custom metrics are ones that are specific to your application and its needs. In this case we have an e-commerce application that makes recommendations about which products people might be interested in, and you might want to count how many recommendations you're showing. If, for example, this count drops to zero, maybe something is broken in your pipeline. To do that you have to do some manual instrumentation; you can use libraries such as OpenTelemetry or the Prometheus client libraries, send that data over to the OpenTelemetry Collector, and visualize it in a dashboard such as this one.
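For a sense of what that manual instrumentation might look like, here is a hedged sketch using the OpenTelemetry Python metrics API; the metric and attribute names are invented (they loosely mirror the "recommendations count" idea), and the meter provider and exporter setup are omitted.

```python
from opentelemetry import metrics

meter = metrics.get_meter("recommendation-service")

# A custom, application-specific metric: how many product recommendations we
# returned. A dashboard panel over this counter dropping to zero would be a
# hint that the recommendation pipeline is broken.
recommendation_counter = meter.create_counter(
    "app_recommendations_returned",
    unit="1",
    description="Number of product recommendations returned to users",
)

def recommend_products(user_id: str) -> list:
    products = ["telescope-ot-1", "star-tracker"]  # placeholder recommendation logic
    recommendation_counter.add(len(products), {"user.id": user_id})
    return products
```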
Next we're going to look at logs. Everybody is pretty familiar with logs, and this dashboard has a lot of them, so it's a little slow, but here you can see we've collected logs from all the Kubernetes pods in your cluster. The nice thing Fluent Bit can do, as Clemens was mentioning earlier, is that you can add filters and different processing to what you're collecting, to add color to the data. Here we've added, for example, which Kubernetes container the log came from; you have the log line itself, and you also have things like which component exported it and what kind of log it is, standard out or standard error. So you can get logs from across all your different environments and pods, and view and filter them however you like, to get the most information about what's going on in your application.

Then we move on to our traces. Here you can work through the different services, so you can find which traces were generated by which service. This is currently on the feature flag service; I'll switch to the product catalog service, which is responsible for showing the different products you see on the OpenTelemetry demo website. You can actually click into a trace: this one follows a get-product request, and if we click in you get a nice view. This one's not as interesting, so I'll click into a different one afterwards, but you can see where this request originated: it started at the front end, so somebody hit the UI, they wanted to get the product, and that sent a request down to the product catalog service. We can click into that and get more information about the request. With OpenTelemetry, or other tools, you can add different attributes to these traces. At its core, a trace gives you information like who is making the request, who is receiving it, and how much time it is taking, but you can also add your own custom attributes to give you more information about what's going on. Here we've instrumented this with the name of the product somebody tried to fetch and the product ID, and you can add whatever information works best for you.

Some of these traces, let me see if I can find one, can get pretty complicated. This one, for example: you can trace that somebody made a request from the front end, which then hit the recommendation service, which then hit the feature flag service, and so it helps you dive really deep into what's happening in your application. What's really nice is that sometimes a customer is complaining, saying "hey, I'm hitting the front end and getting this product is really slow", and you don't actually know where in that pipeline the slowness is, because you're making calls across so many different microservices. So you can take visualizations like these, go see where things are taking a lot longer than you expect, figure out which microservice it is, fix it, and hopefully resolve that problem for your users.
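As a hedged sketch of that custom-attribute idea (the attribute names and product values here are invented, not the demo's exact keys), with the OpenTelemetry Python SDK it looks roughly like this:

```python
from opentelemetry import trace

tracer = trace.get_tracer("product-catalog")

def get_product(product_id: str) -> dict:
    with tracer.start_as_current_span("GetProduct") as span:
        # Timing and parent/child structure come for free; custom attributes
        # ride along with the span and show up in the Jaeger/Grafana trace view.
        span.set_attribute("app.product.id", product_id)
        product = {"id": product_id, "name": "Explorascope 60AZ"}  # placeholder lookup
        span.set_attribute("app.product.name", product["name"])
        return product
```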
So I'm now going to hand it over to Vihang, who's going to talk a little bit about Pixie.

Thanks, Michelle. So this is roughly what you get when you deploy Pixie. As Michelle said, with Pixie you can collect all of this data automatically using eBPF. This is the same cluster, with the OpenTelemetry demo deployed on it, and when you go to the Pixie UI this is the kind of view you get. Straight off the bat we can see that we have a service map. The service map is created by looking at all of the network traffic that Pixie collects using the eBPF probes we have installed; we can then parse that traffic, compute things like latencies, and understand which services are running in your cluster and how they are talking to each other. If we zoom in a little more, we can see the various demo apps running in here: we can tell that Loki is running, which does the log processing for Grafana, that there's an OpenTelemetry collector running, and a bunch of other things like that.

Let's dive a little deeper into the actual network traffic Pixie is collecting, the data that powers this service map. We can look at the HTTP data in this cluster, and you can see that Pixie manages to collect all sorts of requests. If we expand one of these, we can see that we have a lot of detail for this request: since we collect the network data, we have access to the IP address making the request, the request headers, the response headers, and the entire body. We then talk to the Kubernetes API to resolve these IP addresses and enrich them with Kubernetes metadata, so we understand which pods are talking to which other pods or services, and so on. Note that one of the cool things about using eBPF is that we can trace this data even if you're using mTLS in your cluster, because we can hook probes into the OpenSSL library and capture the request and response bodies before they're encrypted and sent over the wire. So that's one of the cool things you can do with Pixie.
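For reference, querying that captured traffic is done with PxL, Pixie's Python-like query language. The sketch below follows Pixie's documented http_events table, but treat the exact column names as an assumption to check against the docs rather than a copy-paste script.

```python
# PxL: pull the last five minutes of HTTP requests Pixie captured on the
# cluster and attach Kubernetes context to each row.
import px

df = px.DataFrame(table='http_events', start_time='-5m')
df.pod = df.ctx['pod']          # enrich rows with Kubernetes metadata
df.service = df.ctx['service']
df = df[['time_', 'pod', 'service', 'req_path', 'resp_status', 'latency']]
px.display(df)
```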
Another feature of Pixie that comes essentially for free from installing these eBPF probes is flame graphs. Let's say you're looking at the different applications you're running, wondering about performance and trying to optimize them, and you need to know where the CPU time is being spent; Pixie can give you access to some of that. Let's look at one of the nodes in our cluster: I can scroll down and see the CPU cores on this node being used and where all of the time is going. It gets broken out by the various pods and the containers running in each pod, and for any one of those containers we can zoom in further and see which functions that application was actually executing. This sort of information is why we think eBPF is a cool way to collect some of this telemetry: flame graphs and data like this are hard to collect with manual instrumentation, and Pixie lets you get it for an application when, say, something goes wrong in production and you don't have time to add instrumentation, rebuild all of your apps, and redeploy them to test again. With eBPF probes you get this kind of debuggability and observability into the things you're running, directly.

Thank you, Vihang. So we wanted to open it up for any questions anybody has about any of the particular tools we mentioned, or about observability; Clemens and Hannah, please join us back on stage for that. Also, if anyone is following along with the guide and trying to get everything deployed on your cluster over this Wi-Fi: first of all, kudos, and second of all, if you need any help or run into issues, just raise your hand and one of us will come help you out. And for anyone who is trying this at home and maybe later wants to ask questions, there's also the CNCF Slack; there's a channel, I believe it's called something like "...-kubecon-sessions", and we'll be monitoring that for any questions you might have.

All right, so are there any questions? I see one in the back; could you come up to the microphone?

Thanks. My name is Balinder, I'm a DevOps engineer, and thanks for this cool demo. I have a question about the storage in this setup. With all these applications you have integrated together, in a very busy environment, how practical is it to manage on a daily basis, especially from the storage point of view and the usability point of view? This is very cool in a demo, but I just want to understand the storage part.

Are you particularly interested in the storage of Pixie, or of all the other pieces, the logs?

All of them: the logs, Loki. I know Loki comes with its own storage engine, but I'm interested in whether, if you have built this as one application, we still have to manage storage for each of them separately.

In this case most of it is stored in Loki, so you have central storage, because yes, you definitely do not want to have to manage all these different storage systems separately.

Okay, so in this case it is one place. In a very, very busy environment, is it practical? Do you have this running in a busy system, or is this still under development or in beta, for example?

Well, the OpenTelemetry demo is of course just something we use; it is not something we have deployed as-is. But this approach is used at a very large scale: for our experiments you will clearly have dozens if not hundreds of nodes on which you want to deploy Pixie, for instance, and yes, there are things you need to look into around scaling the storage.

I think it's really important, and it's one of the things we talked about: this architecture lets you say what is important to you, because I believe that just deploying everything and capturing everything is not reasonable. There is a lot of thought to give to what is important for you to capture, what to deduplicate, and what to filter down to the things that are interesting to you, and when you do the demo yourself you can see how you can have simple ConfigMaps on your Kubernetes cluster where you say what you really care about. One of the nice things Hannah was talking about is the central piece, the OpenTelemetry Collector, which we didn't discuss in much architectural detail, but the collector is quite nice in that it is made up of multiple components that can be launched in a sequence. So you can say: you have a collector, you have a batcher, and then you have a filter. Maybe in your case you want to use head-based sampling or tail-based sampling. For those who might not be aware: head-based sampling is essentially when you generate the first span in your trace and decide right there. You might say "this is coming from the IP that is always causing the issue, so I want that specific span, that trace and all the spans connected with it". So when you create the first span, you decide: do I want to keep it or drop it?
Tail-based sampling is very different, in the sense that the decision is made at the end of processing: you can say "I'm going to keep, let's say, 1% of all my data". You might not have a good way of knowing which traces are the most interesting ones, but maybe keeping 1% or 5% makes sense for you. Batching: for example, in the project I was showing, we do the collection on your cluster, but we want to process the data in our cloud, because setting up ML is not that easy (speaking from experience with Kubeflow and so on) and you want to have it in the cloud, shared across different clusters. So uploading the data could be an issue, and you could say "maybe it's just too much, I need to batch it", and the OpenTelemetry Collector has this concept of batching prior to filtering. You can do so many cool things by understanding how these pieces work, so certainly, when it comes to volumes, always think about batching and sampling and what your sampling strategy is.

The other thing, because you asked whether we're talking about OpenTelemetry metrics or about Pixie, and please keep me honest here, is that Pixie has a really neat approach: it is not something you install in your cluster that then starts streaming all of its data to a back end, because can you imagine the volume of traffic we would see? The idea of Pixie is that it captures data and stores it locally, and then from the portal you can drill down and say "I want the HTTP requests of the last five seconds". You go and ask, on demand, for the data you need, and that is a very scalable solution. It's conceptually very similar to what I was saying about Prometheus, where you don't want everybody spamming you at all times; you want to be in control. If you think about your architecture in a forward-looking way, always keep that in mind: where are the choke points? Is pushing maybe the issue, so you can turn it around to be pull-based? Where do I do filtering and batching? And, as with Pixie, maybe you do on-demand access. There are a lot of these conceptual problems that are very hard to solve if you didn't think about them upfront; but if you think ahead you can say "I want a collector that is able to filter later", and maybe today it literally copies 100% of the inputs to 100% of the outputs, but on that Saturday morning at 3am, when your pager goes off because you have too much data coming in, it will come in very handy to just change one line of configuration and everything works better. Okay, that was a lot of information.
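For concreteness, the head-based case can be expressed directly in the SDK when the root span is created; here is a minimal sketch with the OpenTelemetry Python SDK, where the 5% ratio is an arbitrary example (tail-based sampling, by contrast, is typically configured later, e.g. in a collector, once whole traces are visible).

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based sampling: decide keep-or-drop when the first (root) span of a
# trace is created, and let child spans follow the parent's decision.
sampler = ParentBased(root=TraceIdRatioBased(0.05))  # keep roughly 5% of traces
provider = TracerProvider(sampler=sampler)
```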
I know my question was actually a complex one, and I understand there's no single answer. Where I was coming from is that we want to get to the root cause, and most of the time this information is just using up space and costing money. We want to be able to use these filters ideally at collection time in production: consume it, don't store it, just store the analytics, massage the data, and get what you want, as you say. You answered my question, which is great: filters, maybe a bit of custom development, and then we can extract what's useful for us. It's all great, but cost is always a big issue for us. So I'll be happy to give it a go.

100%. But I think what you really want is exactly what you said: stream up the relevant information, say a metric that will wake you up at 3am because it has to, and then have something like Pixie that says "all right, something is wrong here, go filter, go deep", on demand. I think Pixie's architecture allows you to do exactly that; it's an incredibly powerful solution.

That's smashing. Thank you very much.

Thank you. There's a question up front; maybe you can just use the microphone right next to you, so Hannah doesn't have to run around.

Hello. Do you think it's a good idea to show the clear text of data that should be encrypted? With eBPF we know that we can do it, but whether we should do it is an ethical question.

Sure. Yeah, this is a great question and something people are always concerned about, especially when you demo something that says "hey, we can look at the clear-text data even if you use mTLS encryption". We think about it in a couple of different ways. One of the things Pixie does to address the risk of having clear-text data is that the data Pixie collects is stored only in memory within your cluster; it doesn't really leave your cluster. When you query Pixie for the data, whatever querying you're doing, once it's filtered and aggregated, that final result does get sent to the client, but that connection is also end-to-end encrypted. Eventually, though, the answer is that you need to evaluate the usefulness of that data. Pixie does offer some tools to try to detect and strip out PII from this kind of data, so you have to weigh the pros and cons of being able to get it, and maybe you decide you don't want the clear-text data at all: you strip out the PII, or you say "I want all of the other metrics, maybe the flame graphs and things like that, but I don't care about the network traffic".

Hello, a question about Pixie. I'm not that familiar with eBPF, at least not in the use case you showed here; you showed nice trace graphs. What is your experience with multiple programming languages, and how can it help? I showed it to my developers, and they can even debug optimization problems in their applications, seeing which calls take the longest, but there are virtual machines and interpreters, like the Python interpreter, and sometimes those stack traces don't go that deep. So what's your experience, and for which languages is it helpful?

Yeah, I can talk about what Pixie can support, and maybe you can talk about how you use it, since you have experience with it. Out of the box, compiled languages are easy: as long as you have symbols in your binaries, C, C++, Rust, and Go are very simple to do. For interpreted languages, or languages with a VM, as you said, things get challenging. We do have a solution for Java, where we use an agent to get flame graphs for Java binaries as well, and we are currently investigating what we can do for Node.js. But given that the Kubernetes ecosystem is heavily invested in Go and uses a lot of Go tooling, it really works very well: most of the tooling you use in Kubernetes is written in Go, and symbols are typically included in the binaries that are shipped. At least within Pixie, when we have needed to debug production systems, the flame graphs have been super useful.

Just to add to this: yesterday there was also an excellent talk specifically about eBPF and stack walking, including what you can do even in interpreted languages, so if you haven't seen it, it would be worth a look. The cool thing, though, is that Pixie, and eBPF more generally, lets you capture data on multiple levels. I understand, of course, that the flame graphs might not be as complete for a Java or Python application as for a C application, but just the fact that we capture at the user level as well as at the kernel level gives us a ton of visibility. If you think about networking: how are you going to send a network request without going through the kernel? That is the beauty of these kernel probes. In our project, for instance, we mostly use kernel data, because from there you can capture which process is talking to which other process, and Pixie does this beautiful magic of resolving not only processes but which container in which pod in which namespace is talking to whatever other process on a different node, automatically. Then, by going deep into the payload, it parses HTTP for you, for instance, and supports several other network protocols, so you get a ton of visibility just from the kernel-level instrumentation. And then yes, mileage may vary for the user-space hooks; they might not be supported for your application, your library, or your programming language. But just at the system level you get so much telemetry that it lets you debug most things and find the core of an issue. But you're right, it may vary depending on your use case.

Thank you for a very complete answer.

Any other questions in the room? All right, I guess that's it. We'll continue to hang out here for anyone who's following the guide, so just raise your hand and wave one of us down and we can help if you run into any issues. If not, thank you all for attending, and please do reach out on the Slack channels we pointed out.