So let's get started. The first topic, obviously, is that we need to introduce people to eBPF and Cilium. Can I see hands: how many people know about eBPF? Awesome, quite a lot. Good. How many people know Cilium? I guess quite a lot, good. And how many of you also use Grafana? Awesome, okay, pretty much everybody.

All right. For those who don't know eBPF and Cilium, I will introduce the technology a bit and explain how we use eBPF for Cilium metrics, which is what we're talking about today. First of all, we work for a company called Isovalent. Isovalent created Cilium, open sourced it, and together with the community we keep developing Cilium and adding features. Today we're also showing some of the latest features we're working on with Grafana, and Anna will show those in a demo later.

So first of all, eBPF. We like to say that what JavaScript is to your browser, eBPF is to the kernel. What that means is that we make the kernel programmable in a dynamic way, without changing the actual kernel. That allows us to run small sandboxed programs based on kernel events. In the context of today, we're looking at kernel events where, for example, a pod sends a packet or a network device sends a packet on the wire, and we want to see metrics for those kinds of events. So today's focus is observing traffic using eBPF, and we also use eBPF for identity-based security and observability, which we'll talk about a bit more later.

Cilium is built on eBPF, but you don't have to be an eBPF expert to run Cilium: Cilium enables the required eBPF programs on your cluster depending on the settings you choose. For today we're obviously focusing on Hubble and Grafana. That means you will enable Hubble, you'll possibly enable Hubble UI, and you'll most likely enable Grafana metrics through Prometheus (I'll show a rough sketch of the Helm values for that in a moment).

This is the 30,000-foot view of what Cilium can do today. Cilium started with core networking features: network policies to secure workloads, transparent encryption using IPsec or WireGuard, things like BGP for on-prem deployments, and load balancing out of the box. On top of that we built an observability platform called Hubble. Hubble allows us to provide metrics and observability, integration with Grafana, and integration with SIEM platforms: we can export flows to Splunk, Fluentd, et cetera. Last year we released a service mesh solution, and this is also important in the context of today's session, because we can provide golden signals without sidecars using Cilium service mesh. And on the right side we have the runtime security piece with Tetragon, which is super powerful for observing file integrity or privilege escalation, for example, and we can also export that data to Grafana alongside Hubble.

So let's talk about some of the observability challenges. What we see a lot in the field is that customers struggle with troubleshooting, for example when a user reports slow responsiveness of their application.
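Before we dig into those challenges, here is the Hubble setup I just mentioned, as a rough sketch of Helm values. The exact option names, especially the metrics and exemplar options, depend on your Cilium version, so treat this as illustrative and check the Cilium documentation:

```yaml
# Illustrative Helm values for enabling Hubble, Hubble UI, and
# Prometheus metrics; option names may differ between Cilium versions.
hubble:
  enabled: true
  relay:
    enabled: true            # required for Hubble UI and the Grafana plugin
  ui:
    enabled: true
  metrics:
    enableOpenMetrics: true  # needed for exemplars in Prometheus/Grafana
    enabled:
      - dns
      - drop
      - tcp
      - flow
      - icmp
      # Layer 7 HTTP metrics; exemplars assume the app propagates trace headers
      - "httpV2:exemplars=true;labelsContext=source_namespace,source_workload,destination_namespace,destination_workload"
prometheus:
  enabled: true              # expose cilium-agent metrics
operator:
  prometheus:
    enabled: true            # expose cilium-operator metrics
```

With values along these lines applied through Helm, Prometheus can scrape the Hubble metrics that the dashboards later in this talk rely on.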
Maybe an application team gets reports of users saying the application doesn't respond or returns errors, and then the application team will most likely blame the network. That happens a lot. However, the network or platform team may look at the platform and say it behaves normally: they don't see any CPU contention, and the network team may not see any latency either. The point is that networking is a layered solution. The network team might only look at layer 2, layer 3, and layer 4, while the application team is only interested in layer 7, so it's hard to track where the real issue actually is. That's what we call the finger-pointing problem, and that's what we're trying to solve.

On the other hand, at scale it's really hard to identify where the problems are. Moderate clusters with tens or hundreds of nodes may run thousands of services with a lot of replicas, and in the cloud especially, if you want to follow all the logs and all the data of all those workloads based on their IPs, that will be super hard to track. That's what we call the signal-to-noise problem. Especially if you're also dealing with multiple clouds or multiple on-prem data centers, with plain IP logging it will be super hard to track where the actual issue is. Also, VPC flow logs don't have any context about the application.

So where do existing mechanisms fall short? Maybe you're looking at traditional monitoring appliances with centralized logging. These devices can become a bottleneck, and they don't have any context about the application either: they just see source and destination IPs with the related ports, but no application awareness. Like I said before, VPC flow logs are nice but unusable at scale, and again there's no application context. If you troubleshoot a host you can do really low-level troubleshooting, you may run tcpdump and so on to see what's going on, but the host doesn't have cluster-wide context and doesn't have application context. So maybe you modify your application code to instrument it and add metrics, to understand the application better, but then you only have the application context and not the underlying network context. That's also partly where service mesh came from, right? We want to abstract this code and make it reusable for multiple applications. Service mesh is widely used for monitoring workloads and providing metrics, but a service mesh may involve sidecars, which is an operational challenge and harder to operate, whereas Cilium without sidecars is more efficient, using eBPF to provide those metrics and golden signals based on application information.

In the context of Cilium it's important that I explain identity-based observability and security. This is used throughout: for applying network policies, for security, but also for observability. The way we do it is based on the labels and metadata you set on your workloads. For each unique set of labels we create an identity. This is a cluster-wide property which we use throughout the cluster to secure workloads and to observe workloads. What that means is that in this example, when a front-end sends traffic to the back-end, we identify each of them by its unique set of labels.
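To make that concrete: the "set of labels" is just the Kubernetes metadata on the workload. The labels below are purely illustrative:

```yaml
# Purely illustrative workload labels. Cilium derives a cluster-wide,
# numeric security identity from each unique set of relevant labels,
# so all pods carrying exactly this label set share one identity.
metadata:
  labels:
    app: frontend
    tier: web
```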
Therefore, they have unique identities, and when that front-end sends traffic to the back-end, we use eBPF to attach that identity to the data plane, so we can secure the traffic, but we can also observe and monitor it and get metrics. To inspect and follow the traffic we use Hubble, which provides a service map through the Hubble UI: a namespace view of all the connectivity within a namespace, but also of egress and ingress connectivity to and from the namespace, and we're able to inspect protocols all the way up to layer 7. The Hubble CLI is also very useful for advanced troubleshooting or for exporting flows as JSON to other solutions. Today we're obviously focused on both the Hubble UI and the Hubble metrics: being able to export application-level, layer 7 context to Grafana through Prometheus.

You may have heard that Grafana has invested in Isovalent. We have a good partnership with Grafana, and we're working together to provide better and better dashboards out of the box which you can use easily. The goal is that we extend what we already do using eBPF to provide metrics, Grafana helps us create meaningful dashboards, and together we make sure you get the best dashboards, not only for data operations but also for golden signals. Today we're focusing on Grafana and a bit of Tempo, but we also have more and more plans around Mimir and Loki in the future, so stay tuned for updates there.

I've mentioned golden signals multiple times, and I'd like to highlight that you can already monitor a lot of layer 7 golden signals, without sidecars and without added instrumentation, just by enabling Hubble metrics. This allows us to see, for example, HTTP request rates, HTTP durations between services, and return codes, just by enabling layer 7 metrics in Cilium and exporting them, for example, to Grafana. This happens without any instrumentation, any sidecars, and actually without any additional components, because it's provided by Cilium, which is already in the cluster to provide networking, so you don't have to install anything extra.

Besides that, we also support the pure layer 3 and layer 4 metrics: we can see TCP retransmissions, missing SYN-ACKs, DNS responses and error codes, and also ICMP echo requests and replies. We use the same pattern across the different network layers, from layer 3 to layer 7, and across different protocols such as TCP and DNS. These protocols are completely different, but we can always detect network issues using the same pattern: for TCP we look for requests with the SYN flag that are missing responses with the SYN-ACK flag, for DNS we look for queries without responses, and so on. Thanks.

And if the golden signals I just mentioned are not enough and you do want to instrument your applications, you can still use OpenTelemetry or other means to instrument them. That allows us, using the same metrics, to export those metrics to Grafana and Tempo, and then we will see exemplars. This red arrow, for example, is pointing at an exemplar in this specific dashboard. Anna, can you add something about that?
So how it works is this: distributed tracing requires applications to propagate trace headers, and if your application is instrumented and propagates trace headers, for example in HTTP requests, then Hubble parses those trace headers and includes them in the metrics it produces as exemplars. Thanks. This allows you to query those exemplars with Tempo, and then you have, for example, an API request with distributed tracing enabled, so you can see the spans of that trace, see where time is spent, and, for example, also the specific error codes for a specific service.

I also want to highlight that Cilium and Grafana provide out-of-the-box, ready-to-use dashboards. You can go to the Grafana marketplace and find those dashboards, ready to use based on the released version. The first one I want to highlight is especially powerful for data operations, if you want to monitor the health of your cluster: we have a specific dashboard for Cilium agent metrics. This is related to per-node performance, meaning traffic being transported from a node, so things like NIC performance, throughput, latency, and things like BPF map pressure on a host. There are also operator metrics, which are cluster-wide: the operator keeps track of identities and of BPF maps across the cluster, so you can monitor the health of your cluster, including things like IP address allocation, whether we still have enough space when using IPAM. And then there are the Hubble metrics, which are the focus of today's demo, to monitor all these application performance metrics.

I also want to highlight another talk on Wednesday about network policies and how to deploy them in enterprise environments. A lot of these enterprises not only want a zero-trust environment and want to secure it, they also struggle to deploy network policies at scale, and they want confirmation, for compliance reasons, that they are matching specific flows. So we also developed a policy verdict metrics dashboard, which gives you metrics and data at the cluster or namespace level showing whether your network policies are matching all the flows in your network. This can give you confirmation that you are securing all flows in your namespaces and in your cluster.

So, enough talking. I'm going to hand over to Anna to show a live demo of how this works, and also some of the new features.

Okay, are we ready for the demo? I will switch to my terminal. For the demo we will use the OpenTelemetry demo application. Maybe many of you have heard about the microservices demo from Google; there is a similar application provided by the OpenTelemetry project, with many services in different languages, but also instrumented with OpenTelemetry tracing in particular, which is what we will be interested in. So let's take a look at how it looks in the cluster. Okay, will it work? Yes, it works on conference Wi-Fi, great. As we can see, we have several pods running in a namespace. We also have an Ingress here. The cluster is running Cilium with Cilium service mesh features. Cilium service mesh is a feature of Cilium; it's not really anything extra you install, but a set of features that comes with Cilium that you can enable, and one of them is the Cilium Ingress.
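As a rough sketch, an Ingress handled by the Cilium ingress controller looks something like this; the backend service name and port are taken from the demo application and should be treated as illustrative:

```yaml
# Illustrative Ingress for the demo; setting ingressClassName to
# "cilium" lets the Cilium ingress controller handle it.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: otel-demo
  namespace: otel-demo
spec:
  ingressClassName: cilium
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend-proxy   # assumed demo service name
                port:
                  number: 8080         # assumed demo service port
```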
So we have the Cilium Ingress here, and when we created the Ingress in Cilium, we also got a so-called CiliumEnvoyConfig, which defines the routing that happens in our Ingress. Let's take a look at the CiliumEnvoyConfig we have here. As you may know, Cilium uses Envoy for its layer 7 capabilities, and also for service mesh on the nodes without sidecars. The Cilium ingress controller programs the Envoy configs so you don't have to manage them; you just create a simple Ingress resource, and this is an example: we use this Ingress resource for accessing the demo application. So this is the CiliumEnvoyConfig that was generated, and we can see it's actually very simple: it routes all traffic, on all paths, to the frontend-proxy that we can see here.

All right, let's get back to the browser and take a look at Grafana this time. Earlier this week we released the Hubble Grafana plugin. Grafana has this plugin system that allows you to configure many different data sources and to plug in different panels and mix and match them in one dashboard, so you can aggregate all your observability data together. With the Hubble data source plugin there are some dashboards that come with it. This is one of them, the HTTP connectivity dashboard, and here we can see what we just saw in the terminal, but visualized: the HTTP service map for our OpenTelemetry demo application. I will zoom in a little bit. Here we can see two sources of traffic. One of them is the load generator, so that's traffic generated inside the cluster, and the other one is the Ingress, so that's the traffic we just generated by visiting the website through the Cilium Ingress. Thanks to that, we can easily separate the different sources of traffic and see if one of them is causing actual problems. In the service map we can also see statistics about requests per second and latencies, and some of them here are suspiciously high, for a reason.

Also in the dashboard, apart from the service map, we have a few graphs. These are standard Prometheus time series with what you can call golden signals, or what software engineers often like to call RED metrics: request rate, errors, and duration metrics for the HTTP traffic in this namespace. We can see that the traffic here is pretty steady.

The OpenTelemetry demo application has this feature flag service which lets you enable features, and some of these features actually generate errors, so we can use that here. I enabled a feature to generate product catalog errors for a specific product, and if we take a closer look at the service map, we can see these red edges, which indicate that some requests are indeed failing. If we look at the bottom, there's the errors graph, where responses with 500 status codes are plotted, and we can see that there are indeed errors, and also between which services. And we have exemplars here: the dots you can see over the graph are exemplars. As I mentioned earlier, if the application is traced and propagates trace headers, then Hubble parses these trace headers and includes them as exemplars in the flows and in the Prometheus metrics it produces. So if we click over here, we can query this individual trace in Tempo.
I have one opened here. This is the trace visualized in Tempo, and in that trace we can see many more details. First of all we can see latencies, so where the application spends most of its time, but we can also see attributes, additional metadata attached to this trace, which is often very useful when we're debugging errors. We can see that this trace confirms there are errors, but we can also see, for example, the full HTTP URL, which includes the product ID. The feature flag we enabled causes all requests for this specific product to fail, and from the exemplars we can see that all the traces for the failing requests are for the same product ID, so there must be a problem with this specific product.

Okay, let's get back to the OpenTelemetry demo application. The application itself is a shop, a shop with telescopes and other astronomical equipment. We have a cart here, and let's try to place an order... and it's not working. We could blame the conference Wi-Fi, but that's actually not the case here: it's not working, and we can't really see in the HTTP metrics or the service map why. But Hubble metrics provide visibility on multiple network layers, and this is where they are really powerful. Here we have a different dashboard that shows network policy drops. Having network policies in Kubernetes clusters is generally considered good practice; it's often required by security teams and often required for compliance. This application also has network policies, and we can see that some of the traffic is actually being denied by a network policy.

Now, drops in the network can happen for many different reasons. This can be what engineers often like to call "a network issue": I can't see in my HTTP dashboard what's going on, so I'll call it a network issue — I do that myself. It can indeed be many different issues with the underlying network, and then the drop metrics will tell us the reason and which layer is problematic, and the application team can show that graph to the network team and ask them to fix the network. Here, though, we can see that the reason for the drops is a policy deny, so there's no way to blame the network layer for this one: the problem is with a network policy. We can see that there are drops between the checkout service and the cart service, so it does make sense that we couldn't check out in the application.

These are very simple Prometheus graphs, but what you would normally want to do is probably create an alert on such drops, because network policy drops most commonly mean one of two things: either malicious traffic is happening in the cluster, or there's a misconfiguration. Hubble also has the Hubble CLI and the Hubble UI, which can be used for more detailed investigation, and by more detailed I mean extremely detailed, because in the Hubble UI we can see a service map again, a slightly different one, but we can also see individual flows streaming in. These are literally the individual flows handled by the Linux kernel, again from all layers of the network, and some of them have layer 7 info while others don't. One feature of the Hubble UI that I like is this filter: we can filter flows by verdict, and here I've filtered the flows by the dropped verdict.
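Coming back to the alerting idea for a second: a hedged sketch of such an alert as a PrometheusRule could look like the following. The metric name and reason label (hubble_drop_total, POLICY_DENIED) are assumptions based on Hubble's drop metrics, so verify them against what your cluster actually exposes:

```yaml
# Sketch of an alert on policy-denied drops; metric and label names are
# assumptions and depend on how the Hubble drop metric is configured.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hubble-policy-drops
spec:
  groups:
    - name: hubble
      rules:
        - alert: NetworkPolicyDrops
          expr: sum(rate(hubble_drop_total{reason="POLICY_DENIED"}[5m])) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Network policies are dropping traffic (possible misconfiguration or malicious traffic)"
```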
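We'll check the actual network policies in the terminal in a moment. As a rough sketch, the two patterns involved — an allow policy with an HTTP rule that gives Hubble layer 7 visibility, and a deny-only policy — look roughly like this in Cilium. All names, selectors, and ports are illustrative, not the demo's actual manifests:

```yaml
# Illustrative allow policy with an HTTP rule; matching traffic goes
# through the Envoy proxy, which is what produces layer 7 visibility.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: l7-visibility
spec:
  endpointSelector: {}            # all endpoints in the namespace
  egress:
    - toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - {}                # match (and allow) any HTTP request
---
# Illustrative deny-only policy, similar in spirit to the demo's
# misconfiguration: it denies the selected traffic without allowing anything.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: deny-checkout-to-cart
spec:
  endpointSelector:
    matchLabels:
      app: cartservice            # assumed label
  ingressDeny:
    - fromEndpoints:
        - matchLabels:
            app: checkoutservice  # assumed label
```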
So here we can see what I like to call a negative service map. It's a map of services that are trying to communicate but can't, because their packets are dropped, and again we get a very detailed view of the many network flows between these two services that got dropped.

Let's take a look at the terminal once again and check the network policies we have here. There are a few network policies in this namespace. One of them is there to allow our traffic and to give us layer 7 visibility. For the HTTP service map dashboard I showed earlier, we need to enable layer 7 visibility in Cilium; this can be done with annotations, but a very convenient way to do it is through a network policy. So we have this network policy that is literally providing layer 7 visibility: it's not really denying any traffic, it's allowing HTTP traffic. And then we have this suspicious network policy that says "deny". Cilium network policies can either allow traffic or deny traffic, and this is an example of a deny-only policy. We can see that it is indeed a misconfiguration, probably a user error: there is a network policy that denies traffic between the cart service and the checkout service. All right, I think that's it for the demo.

So this shows that just by creating a simple layer 7 visibility network policy, you trigger all these layer 7 metrics. We can see all the layer 7 protocols in Hubble, in both the UI and the CLI, and out of the box we get these golden signals without instrumentation and without added tools. In this case it was a deny policy, which is obviously there for demo purposes, but in practice you will definitely see new applications or new versions of your applications. You may have secured an application, but with a new version a new protocol or a new port gets introduced, and both Hubble and Grafana allow you to effectively troubleshoot such changes, and any issues between nodes and between services, using those dashboards. Thank you, Anna, again for the demo, great job.

Cool. So, some calls to action for next steps if you want to try this out yourself. We have some excellent demos configured in a hands-on lab environment; we already have 21 labs you can try, and these are all about Cilium, Hubble, and Grafana. If you're just getting started, do the getting-started labs, but we also have some specific Hubble and Grafana labs you can try out. If you would like to set it up yourself, we have all the documentation on docs.cilium.io. If you have any questions, want to contribute, or are struggling with a specific metrics setup, feel free to join the Cilium and eBPF Slack, where we have Hubble and Grafana channels, which are the best places for asking questions about observability specifically; we're happy to help you there. If you want to know more about how eBPF works, go to ebpf.io. Our colleague Liz Rice has written an excellent book explaining how we use eBPF for networking, security, and observability; you can download her e-book if you didn't get the chance to see her at our book signing sessions.

With that, I'm happy to take questions. There are two mics at the sides, and we're also happy to stay a bit longer if you want to ask us questions directly. So thank you so much for joining this session.