Good morning, I'm Anna Kapraszczyńska and I work at Isovalent. Today we will be talking about networking, and also about observability. Let's start with networking, because at Isovalent the most important thing we work on is the Cilium project, which uses eBPF for networking in Kubernetes, and not only there.

Let's start with a short Networking 101 in three slides. It will be very basic, just to make sure we are all on the same page and using the same terminology. When people talk about computer networks, they usually refer to the OSI model. In this model there are seven layers, from the physical layer up to the application layer. Each layer builds on the one below it: the idea is that layer 4 takes data from layer 3, adds its own protocol on top, and so on. It is a model, and it doesn't map exactly onto reality. Nobody remembers what layer 5 is. But many people remember what layers 3, 4 and 7 are, because these are very commonly used terms.

So let's look at those three. Layer 3 is the network layer, where we operate on IP addresses, IPv4 and IPv6. Layer 4 sits on top of it; this is the transport layer, and the two principal protocols at this layer are TCP and UDP. And layer 7 is the application layer. There are many layer 7 protocols; some of the best known are HTTP, Kafka and DNS. Layer 7 sits on top of layer 4, so HTTP and Kafka use TCP underneath, and DNS uses UDP.

Now, layer 7 has many protocols because they serve completely different application needs, but people may wonder why there are two protocols at layer 4. And this is my favorite joke illustrating the difference between TCP and UDP. In TCP, the communication looks like the other side confirming everything back to you: "Do you want to hear a joke about TCP?" "Yes." "You want to hear a joke about TCP?" "Yes, I want to hear a joke about TCP." It's super verbose.
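The contrast in the joke can be sketched with plain sockets. This is a minimal, self-contained example on loopback; all names, ports and messages here are invented for illustration, not taken from the talk.

```python
# A minimal sketch of the TCP vs. UDP contrast, using Python's standard
# socket module on loopback.
import socket
import threading

# TCP: connection-oriented. connect() performs the three-way handshake,
# and delivery is acknowledged, ordered, and retransmitted if lost.
tcp_server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp_server.bind(("127.0.0.1", 0))              # port 0: let the OS pick one
tcp_server.listen(1)
tcp_port = tcp_server.getsockname()[1]

def serve() -> None:
    conn, _ = tcp_server.accept()
    conn.sendall(b"hello over TCP")
    conn.close()

threading.Thread(target=serve, daemon=True).start()

tcp_client = socket.create_connection(("127.0.0.1", tcp_port))
tcp_data = tcp_client.recv(1024)               # b'hello over TCP'
tcp_client.close()
tcp_server.close()

# UDP: connectionless. sendto() just fires a datagram; on loopback it
# arrives, but UDP itself promises no delivery, ordering, or deduplication.
udp_receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_receiver.bind(("127.0.0.1", 0))
udp_port = udp_receiver.getsockname()[1]

udp_sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_sender.sendto(b"hello over UDP", ("127.0.0.1", udp_port))
udp_data, _ = udp_receiver.recvfrom(1024)      # b'hello over UDP'
udp_sender.close()
udp_receiver.close()
```

Note how the TCP side needs a listener and an accepted connection before any data moves, while the UDP side simply sends and hopes.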
But thanks to this verbosity you get reliability: TCP is resilient to, for example, packets arriving out of order, being duplicated, or getting lost. UDP is not like that; UDP is just "here is a joke about UDP", and I don't care whether it reached you. It is obviously faster, but less reliable. OK, so that was Networking 101 in three slides.

These days we have Kubernetes, and Kubernetes networking has a reputation for being very complicated; it is a notoriously tricky topic, often pictured as a tangled rope you constantly trip over. But really it comes down to a few primitives that Kubernetes defines, like Services and Ingress, and a few components deployed in the cluster that implement them and provide basic network connectivity. There is the CNI plugin, responsible for basic pod connectivity. There is kube-proxy, providing layer 4 load balancing for Services. There is an optional ingress controller, providing layer 7 routing and load balancing. And there is CoreDNS, providing DNS.

I think one of the reasons Kubernetes networking feels confusing is that there is no single way to do it; all of these components are pluggable. You choose a CNI plugin. Kube-proxy has a default standard implementation, but some CNI plugins can replace it. There is CoreDNS, but you often have to configure it. So there are many, many different ways to set this up. And these different solutions will provide different capabilities, including different observability capabilities, so different visibility into what is actually happening inside the cluster when it comes to networking. So today I will show what can be achieved with Cilium for Kubernetes.

OK, so let's introduce our characters. Well, not that new really: eBPF, Cilium and Hubble. eBPF is a word you will probably hear many times this week.
It's a very hot technology these days, which I like to think of as a kind of plugin system for the Linux kernel. eBPF allows you to write code that is injected into the Linux kernel in a safe way, and the main use cases for it are observability, networking and security.

Cilium is a CNI plugin, but a very rich and comprehensive one. There are many features. There are layer 4 and layer 7 network policies; network policies act sort of like a firewall for pods inside a Kubernetes cluster. There is kube-proxy replacement, so an implementation of Services and layer 4 load balancing. There is also Ingress and Service Mesh, powered by an Envoy proxy built into the Cilium agent; these provide the layer 7 implementation and load balancing. And Cilium uses eBPF as the underlying technology for networking: while other plugins are usually based on iptables, Cilium does networking with eBPF.

And there is Hubble; in a moment we will see Hubble in action, but Hubble is basically the observability layer for Cilium. There is a CLI and a UI, and it also exposes Prometheus metrics.

One point before we get into Hubble details. eBPF is widely used for observability, and the canonical way to do that is to have an eBPF program that hooks into some event in the Linux kernel. It can be a syscall, but it can be many other things. It records what's happening and reports back to user space. There are multiple tools these days that do some variation of that; one of them is, for example, Tetragon. Tetragon focuses on security events, so it's a security observability tool. And this is not how Hubble works. Hubble uses eBPF for observability, but very indirectly: Hubble is the visibility layer for Cilium, so it basically piggybacks on Cilium. Cilium uses eBPF for networking, and Hubble just watches what Cilium actually does. So Hubble collects flows, and you may ask: what is a flow? A flow is a network event.
A flow describes some network transmission. It's a very rich event: it can contain information from all the networking layers, from layer 2 to layer 7. It also contains Kubernetes metadata if it runs on Kubernetes; it actually doesn't have to run on Kubernetes, but if it does, we have all that data. We can also have things like a trace ID in the flow: if the applications are instrumented with trace context propagation, then Hubble will automatically extract trace IDs into individual flows.

And Hubble also exposes Prometheus metrics. Here you can see an example configuration of how metrics can be enabled in Helm. Different groups can be enabled independently; the groups correspond to different protocols, plus a few other categories. Labels are also enabled separately, so you can control cardinality that way. And that is it: no additional component is needed beyond the CNI plugin you need for connectivity anyway. You automatically get metrics about network activity and can configure them to your needs. There are also a few other options you can enable for convenience, like a ServiceMonitor or automatically created ConfigMaps with Grafana dashboards. And there are exemplars: if an application is instrumented with trace context, then Hubble will extract the trace ID into flows, and the metrics will actually contain exemplars pointing to those traces. This is enabled with the metrics options.

So let's look at how Hubble actually works. We have the Cilium agent, this blue box here. The Cilium agent runs on every node and injects eBPF programs into the node. The green box represents the eBPF programs that are injected into the kernel and provide network connectivity; we call this the datapath. eBPF stores its state in eBPF maps. eBPF maps are basically like hash maps; there are different kinds of maps, but essentially they are a way to store the state of eBPF programs.
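As an aside, the eBPF maps just described can be inspected on a node with the generic bpftool utility, assuming it is installed and you have root access; this is not specific to Cilium, but Cilium's datapath state lives in maps like these.

```shell
# List all eBPF maps currently loaded in the kernel on this node.
bpftool map show
```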
And Cilium reads events from these eBPF programs through the perf ring buffer. There are also a few other components inside the Cilium agent. There is the Envoy proxy, doing layer 7 routing and providing layer 7 visibility; it's optional, you can enable it if you need layer 7 routing and visibility. There is also the identity cache, which is used for identifying the endpoints that are communicating with each other; in Kubernetes that will be Kubernetes metadata. And there is Hubble, this purple box over there. Hubble runs inside the Cilium agent and collects all of that: data from the proxy for layer 7 visibility, from the identity cache for Kubernetes metadata, and from the Cilium monitor, which collects events from the eBPF datapath, for the basic network visibility. Hubble parses the packets it collects and stores them in a ring buffer, a ring buffer implemented in Go. Then it exposes the network events in two ways: a gRPC endpoint with raw flows, and metrics. The gRPC endpoint can be queried by either the UI or the CLI, through a component we call Hubble Relay, which is a centralized component collecting data from all the nodes. And the metrics are OpenMetrics-compatible, so they can be scraped by Prometheus.

All right, so now we have an overview; let's debug a networking problem. I will walk through a problem with communication between services. I don't want to do it live, because I don't trust computers that much, especially when it comes to networking, but I will walk through the debugging scenario. So we have a demo app deployed to a cluster; or maybe not a new one, maybe an old app redeployed, a new version of an app deployed to the cluster. It's a very, very simple app: there is a frontend pod with a frontend service, and two worker pods with a worker service.
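As a sketch, the frontend half of a setup like this could look as follows in a manifest. Every name, label, variable and image here is hypothetical; only the shape matters, in particular the fact that the worker's address is passed in via an environment variable, which is where a typo can hide.

```yaml
# Hypothetical frontend Deployment for a demo app like the one described.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  namespace: demo-app                        # invented namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: example.com/frontend:latest # placeholder image
          env:
            - name: WORKER_URL               # invented variable name
              value: "http://worker:8080"    # service name and port
```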
The user talks to the frontend service; the frontend pod talks to the worker service, calling it by name, worker, on port 8080. And if we try to call the frontend service as a user, let's say with a port-forward to localhost, we get an error. We get an error saying that the frontend can't talk to the worker.

If we open the Grafana dashboard we have, we can see something like this: a spike in the graph of DNS errors. This is something we could get alerted on, if that's suitable in the given environment. But we can also query the Hubble CLI for more details. We can query the Hubble CLI with filters, filtering by various Kubernetes metadata, like namespace; and in this case we identify apps by labels. It could also be by workload name, but here we query by labels. And we query the DNS responses going to this app: these are only the responses coming from CoreDNS in the kube-system namespace.

And we can see some interesting things there. First of all, we can see how DNS actually happens in Kubernetes. We see that there are both A and quad-A queries, so IPv4 and IPv6 DNS queries. We can also see that there are queries for different domains: there is the domain worker.demo-app.svc.cluster.local and then its shortened forms, and we can see that we are getting these non-existent domain errors. And if we look closer, we can spot that we actually made a typo, and the worker's domain is misspelled. So yeah, we are calling an incorrect domain. Let's fix it. This is configured in the frontend deployment: the frontend deployment has an environment variable with the incorrect address, so we can easily fix it.

But when we fix it and call the frontend service again, we get exactly the same error. What now? We can take a look at another Grafana graph, one showing drops. What are drops? When some network connection is happening, in the Linux kernel it can be either forwarded, in this case by the eBPF implementation, or dropped.
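Flows with either verdict can be filtered directly in the Hubble CLI. The namespace below is a hypothetical name for the demo app, and flag spellings should be checked against `hubble observe --help` for your version:

```shell
# Show only flows that the datapath dropped, for example due to a
# denied network policy.
hubble observe --namespace demo-app --verdict DROPPED
```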
And here we see a spike in dropped connections with the reason "policy denied". Policy denied tells us that the connection was denied by network policies. Network policies generally act like firewalls for pods, and it's generally considered good practice to configure network policies for connectivity in the cluster. A spike in drops due to policy denied could mean that there is an attack on the cluster, but more often it means that we have misconfigured resources: for example, some policy is missing, or there is something incorrect in the resources, in the labels for example.

So if we try to investigate with Hubble, we can query the Hubble CLI again with the same filters and the verdict "dropped", and again we see some interesting things. We can see that there are retries; it's not only one request, there are retries, so there are several different connections being made. That's because the network is generally considered unreliable, and when we make TCP connections there are retries built in. We can see a few retries happening with exponential backoff: the first retry is after a second or so, the second one after two seconds, and then there are further ones four and eight seconds later. And we see, colored in red, the flows that were dropped with "policy denied".
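To illustrate the kind of misconfiguration that produces "policy denied" drops, here is a hypothetical NetworkPolicy for a scenario like this one; all names are invented for this sketch. A single wrong label in `matchLabels` is enough to silently drop the traffic:

```yaml
# Hypothetical egress policy meant to let the frontend reach the worker.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-egress        # invented name
  namespace: demo-app          # invented namespace
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: worker      # a typo here would deny the connection
      ports:
        - protocol: TCP
          port: 8080
```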
Let's take a look at the network policy for the frontend, and we can see that, oh well, we made the same typo in the network policy for the frontend, because we were just copy-pasting. So the network policy doesn't allow the frontend to connect to the proper worker service. Let's fix it. And when we fix it, we again get exactly the same error. What happens this time?

We have another graph in Grafana that shows us missing SYN-ACKs. In the TCP protocol, as we saw earlier, the way communication starts is that first something is sent with the SYN flag, and what we expect after that is a response with the SYN and ACK flags. The TCP protocol has this concept of flags with which the whole communication, the whole TCP life cycle, is controlled and managed, so that we know when something unexpected is happening. The way we expect communication to start is the so-called three-way handshake: a request with SYN, a response with SYN-ACK, and then another request with ACK, and then the communication continues. Here we have requests with SYN but no SYN-ACKs coming back.

If we investigate that with the Hubble CLI again, we can filter by protocol and look at all TCP traffic. So this is the entire TCP traffic that involves our demo app, and we can see that it indeed sends a few requests with the SYN flag. Again we see retries; it's not just one request, there are retries. And there is nothing obvious telling us what's wrong: there is no word "error", nothing red as in the previous examples. But it tells us exactly what is happening. It tells us that one of the frontend pods is trying to reach one of the worker pods on port 8000. Now, well, if you remember, we configured our frontend deployment to hit worker on port 8080, and worker is exposing port 8080. So where did this port 8000 come from? We didn't ask for it. Well, it came from the Service.
SYNs without SYN-ACKs can mean a few things, but a very common one is that the port is simply incorrect: the app is not listening on the port we are calling. And here we can see the mistake. The worker Service is configured to listen on port 8080, but it binds to port 8000 on the pod. This is the mistake, and if we fix it, we finally fix our problem, and the frontend and worker pods can at last communicate.

So yeah, that was a quick walkthrough of what's possible. We used the Hubble CLI here a lot, and we love the CLI, but we thought: wouldn't it be convenient to query flows directly from Grafana, given that we already have metrics in Prometheus that we can graph on Grafana dashboards? And we are doing that. If you are familiar with the Grafana UI and you paid attention to the previous slides, you might have noticed something suggesting that we can already do that: when I showed the table with flows, that table was actually a table in Grafana.

Grafana has this concept of plugins: there are data source plugins and panel plugins. There are many built-in data source plugins, the most common ones being those for Prometheus, for Tempo, for Loki, but you can write your own data source plugin. And my team and I thought: hey, we have this Hubble thingy, let's write a Grafana plugin for Hubble that would allow querying flows directly, with very granular filters. So yeah, this is a sneak peek into what's coming, because this plugin is not released yet, but it will be: we are working on a Grafana data source plugin for Hubble.

Grafana provides great tutorials and libraries for developing plugins. After following the tutorial you get a scaffolding where, even without much frontend knowledge, it's kind of clear what you need to do. The way backend data source plugins look is that there is a frontend component written in TypeScript and React, and a backend component written in Go using the Grafana plugin SDK, which allows
us to, for example, query the gRPC endpoints that Hubble exposes. There can be multiple underlying data sources in one data source plugin. What I mean by that is that we can actually connect, for example, the Hubble gRPC endpoint, Prometheus and Tempo inside one data source plugin, which is super convenient for correlating things: in our case, correlating metrics, traces and raw flows via the trace ID. And all the visualizations in Grafana are available and can be handled by our own plugin.

So here is a quick preview of how the plugin looks. We can query flows with very granular filters, the same way we would do it in the CLI. You can also see this mysterious entry, "service map"; let's keep it mysterious for now. If you are curious about what's up there, feel free to hit me up during KubeCon.

To learn more and to get started with Hubble, I would recommend following the tutorials in the Cilium documentation, in the observability section. Also, Isovalent and Grafana announced a partnership today, so if you want to see more use cases of Hubble and Grafana for debugging various issues, including layer 7 or hardware issues, there is a blog post available and also a hands-on demo available on GitHub. I encourage you to check them out. Thank you very much.