Hello everyone, welcome to Cloud Native Live, where we dive into the code behind cloud native. I'm Annie Talvastro, a CNCF ambassador and a senior product marketing manager at Camunda, and I'll be your host tonight. Every week we bring a new set of presenters to showcase how to work with cloud native technologies. They'll build things, they'll break things, and they'll answer your questions, so join us every Wednesday to watch live. This week we have Chris from Tigera here with us to talk about amazing topics, as always. Another exciting thing happening in the cloud native sphere: remember to register for KubeCon Europe, and now is really the time to secure your spot, so get over there. And as always, this is an official live stream of the CNCF, and as such it is subject to the CNCF code of conduct. Please don't add anything to the chat or questions that would violate that code of conduct; basically, please be respectful of all of your fellow participants as well as the presenters. With that, I'll hand it over to today's speaker to kick off with your introduction. Chris.

Yeah, hi there, thank you. My name is Chris Tomkins, and I'm a lead developer advocate at Tigera. Tigera makes a piece of software called Calico, which is an open source CNI for Kubernetes. I did a previous session on Cloud Native Live, and there are lots of other sessions you can find on YouTube about Calico generally, so in the interest of time and seeing the things we want to see today, I won't go much further into that. My job is to help the open source community hear about everything we do with the product, and likewise to make sure that our teams and developers are hearing from the open source community. I sit in that middle position.

Yeah, that's a great spot to be in. So let's get started: briefly, what's the Calico eBPF data plane?

So I did a previous session on this, so I'm going to be really quick on it today. If we can pop up my slide; I don't know if I need to do that or you do. That should happen now. Yeah, there we go. Perfect. People will be pleased to hear that I'm not going to walk through every part of this diagram, but what we're looking at here is the flow of traffic through a Linux node. This diagram is courtesy of Jan Engelhardt; you can see it on Wikipedia. What it shows is that as a packet comes into a Kubernetes node, or any Linux host, it comes in on the left-hand side, works its way through this complex flowchart, and leaves on the right-hand side. The reason I've got this diagram up is that there's a fair amount of complexity going on here. People who are familiar with computer networking will know that you've got the link layer, the network layer, and the protocol and application layers, and most of the packet processing happens in the colored boxes in the middle. What eBPF is, is a technology that allows code to be attached to hooks in front of and after this flowchart. I don't know if you can see my mouse pointer or not. Yes, I think you can, right? Yes, we can. Great. So with eBPF, we can attach our networking code at these hooks before, and these hooks at the end of, the packet flow, and actually do the networking outside of this flow. In doing so, you avoid needing to go through most of the complexity in the middle of the diagram.
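To make the hook idea concrete for readers, here is a minimal sketch, not taken from the talk, of how a compiled eBPF program can be attached at the TC (traffic control) ingress hook of an interface, the same class of hook the Calico eBPF data plane uses. The source file name and section name are placeholders:

```bash
# Compile a small eBPF classifier (hypothetical source file) to BPF bytecode.
clang -O2 -g -target bpf -c my_prog.c -o my_prog.o

# Add a clsact qdisc, which exposes ingress/egress eBPF attachment points.
tc qdisc add dev eth0 clsact

# Attach the program at the ingress hook, before the kernel's netfilter flowchart.
tc filter add dev eth0 ingress bpf da obj my_prog.o sec tc

# Inspect what is attached.
tc filter show dev eth0 ingress
```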
So just to restate it: what we're seeing here is a Linux node. We use eBPF, and we've reimplemented Calico's Kubernetes data plane in eBPF as one choice, one option basically.

Perfect. So what advantages does it bring?

The cool thing is that because you're attaching at the start and the end, you're avoiding some of that complexity. I've really wanted to avoid having a lot of slides, and I should say up front that I've only got, I think, three more slides after this, and then we'll dive straight into a live demo. But these key benefits apply across all environments. You basically get improved performance, and that happens because when you attach code at those eBPF hooks, you're running the code inside the Linux kernel, with the performance you would expect to gain there, and you can cut out bits of code that you don't need to run. So you get better performance, which means less CPU or more throughput, which are opposite sides of the same coin. You get native Kubernetes service handling, which we'll talk about more in a moment. And then you get source IP preservation and direct server return. I'm going to dive into all of these, but only superficially, because I've done much longer talks on all of these things, and the previous Cloud Native Live session discussed them. If people want to see these in detail, they should check out those older sessions, but they need to be mentioned today.

So, really briefly, the data plane replaces kube-proxy functionality. Let me take a step back: in Kubernetes there's a component called kube-proxy, and it runs on every node. Its job is to manage the services that allow traffic to flow in and out of the cluster. So when we rewrote the data plane in eBPF, we had to replace kube-proxy functionality. But instead of that being a negative, we actually re-implemented a ton of the code and improved upon it. As well as the performance improvements I mentioned before (I'm going to jump forward to here), you can see that there are three different ways to implement the kube-proxy functionality. One is iptables mode, the traditional way of doing it, in blue. Then there's the IPVS kube-proxy implementation, in yellow. And then there's our eBPF implementation of the functionality; technically it's not kube-proxy, but it's the same functionality. What we're seeing in this graph is that as you add more services, with IPVS and eBPF the connect time remains constant regardless of how many services you have, and the eBPF data plane is even faster. But with the iptables data plane, the old way of doing things, connection setup got increasingly slow as you added more services to your cluster. So you've got that TCP connect-time advantage, and to be really frank, if you only had a couple of services you wouldn't notice it and you wouldn't care; but if you have a large number of services or a lot of session churn, then you really start to care about this.

And then the last big advantage is this one: source IP preservation. As an external client comes into your cluster (at the bottom of the diagram we've got two Kubernetes nodes; this could be 50 Kubernetes nodes, depending on how many are in your cluster), the client's traffic hits kube-proxy. What we're seeing here is the kube-proxy way of doing things, without our eBPF data plane.
And you can see that the first thing that happens is that kube-proxy destination-NATs and source-NATs the traffic. It does the destination NAT to make sure the traffic gets forwarded across to the service pod correctly, and the source NAT is required to make sure the return traffic comes back through kube-proxy. So the traffic gets forwarded on to the service pod, and as you can see, as a result of the destination NAT and the source NAT, the pod never sees the IP address of the external client. The side effect of that is that if, say, you have an auditing requirement to capture your client's source IP address on the service pod, you wouldn't be able to do that with the kube-proxy implementation. I'm speeding through this a bit, but like I said, if people want the same content at a more peaceful pace, I'd suggest going back to that previous session.

When you enable Calico eBPF, instead of that source NAT happening, the BPF code on the Kubernetes node forwards the traffic across without needing the source NAT. That basically means the service pod that's actually serving the client's workload does see the real IP address of the external client, which means, for example, you could block a certain country or a certain set of users (I'll avoid giving the obvious example at the moment) using this information. So that's it at a really high level. Those are the advantages: performance; source IP preservation; direct server return, which I alluded to here, where the return traffic doesn't have to go back via the ingress node, and that has advantages in terms of latency and throughput; and finally, lower latency to services on connection setup.

Perfect. Really great, extensive advantages there. And everyone, as I said in the chat as well, leave all of your questions throughout the presentation in the chat box of your streaming service, and we'll get to them during the presentation as well as at the end. But Chris, what targets are then available for monitoring?

Okay, so given what we've just said, we're going to build a cluster that looks like this in a moment. I'm actually going to build it live, and we'll probably see it explode. The crux of this is that we're replacing the data plane, and any time you have a distributed system, you want to be monitoring all the components of it, right? Both to capture logs and so on when things go wrong, but also proactively: distributed systems are complicated, so we tend to proactively monitor as many components as possible. So when I was thinking up this session, what I wanted to capture was this idea: what components can we actually monitor in this data plane? What data can we capture? And how do we go about doing that? In a moment, when we build this cluster, you'll see that it's got four nodes and there'll be some service pods running on each node. And because we're using the eBPF data plane, on each node the logic that actually takes traffic arriving on a service and redirects it to a pod, either on the same node or on another node, is implemented in eBPF, using BPF maps to store the data that's needed to do that.
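As an aside for readers who want to poke at those programs and maps themselves, here's a minimal sketch (not part of the demo) using the standard bpftool utility on a cluster node; the pin path and the "cali" naming are assumptions based on Calico conventions and may differ between versions:

```bash
# List all eBPF programs currently loaded in the kernel on this node.
bpftool prog list

# List all eBPF maps; Calico's maps typically carry a "cali" prefix.
bpftool map list | grep -i cali

# Calico pins its TC maps on the BPF filesystem (exact path is an assumption).
ls /sys/fs/bpf/tc/globals/
```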
Now, there are three components that can be monitored here, specifically for the data plane, and I'll show you how to set up all three. The first is Typha. To be honest with you, I should have had this on the diagram and it slipped my mind to add it. In any Kubernetes deployment you have the Kubernetes API, and that's something lots of components within the cluster need to talk to. Any time you have lots of things talking to one thing, you have to be conscious of whether that one thing is going to become overloaded. So in Calico there's this daemon called Typha; a couple of instances of it run on different worker nodes, and it's basically a fan-out service that sits between the Kubernetes API and the resources that need to talk to it. That's the first thing we can monitor with the eBPF data plane on Calico: Typha. The second one is calico-kube-controllers. Those are, again, pods that run in the cluster, and their job is to perform actions based on the cluster state. Maybe a new pod gets created; it's calico-kube-controllers' job to set up the networking to match the desired state. We'll be able to monitor that as well. And the final one is this agent called Felix, which is our agent that runs on every node. Felix is the thing that's responsible for locally programming the eBPF data plane. So those are the three things that we're going to set up monitoring for: Typha, calico-kube-controllers, and Felix.

Great. There are a few questions from the audience already. Oh, okay, great. First of all, really great to see the enthusiasm; thank you, Jonathan, for saying this seems very awesome. I agree. The first question is from Chintika, hopefully not destroying the name too badly: will service meshes complement eBPF, or does eBPF provide a similar control plane, making the service mesh concept obsolete?

No, it doesn't make the service mesh obsolete, but there's a larger conversation we could have, and I have had it several times. The most recent time we talked through this in detail... we do this session called Calico Live, and if you look back at the most recent Calico Live, I think that's where I was discussing it. Oh no, excuse me, it wasn't; I knew I was wrong, that's why I was hesitating. It was DevSecOps London Gathering: we had a chat about this on their podcast, which will be released soon. But in short, a service mesh adds a lot of extra functionality above and beyond the CNI. They can complement each other, but this doesn't replace that functionality. What this replaces is the functionality of the traditional iptables data plane that we run. If you want to dig into this in more detail, if you look on Calico's YouTube channel or the Tigera blog, you'll find articles discussing the different data planes and their strengths and weaknesses, because we offer more than one choice, basically.

Perfect. And then a question from Jonathan as well: does eBPF replace kube-proxy, or is eBPF an implementation of kube-proxy? Yeah, I was a bit vague about that, so I'm glad he picked me up on it. As you'll see when we do the actual demo in a moment, we literally turn off kube-proxy, and the same functionality happens inside the data plane instead. So it no longer needs to run as a separate pod; it replaces it.
So the eBPF implementation isn't kube-proxy, but it does the same job, with some improvements like the source IP preservation, essentially. I guess I could address one more thing, which is that depending on the nature of your Kubernetes cluster, how you turn off kube-proxy varies. In the case of what we're going to do today, and this is a good opportunity to talk about what we're going to build: as you can see, we're going to use Google Cloud. We're going to build a four-node cluster with the pod subnet 192.168.0.0/16, and 10.240.0.0/24 for the nodes. In the case of GCP, because this is just vanilla Kubernetes, we're going to turn off kube-proxy by stopping its daemon set. But if you run something else like k3s, where kube-proxy can't be turned off that way because it's not running as a separate component, there are different ways to disable that functionality. The Project Calico documentation for eBPF, which I think I may have a link to at the end if I remember correctly, tells you how to turn it off in different deployments. So the overall goals are the same, but how exactly you do it depends on the kind of cluster.

Perfect. So would this be a good moment to ask: what's the process for getting access to these metrics?

Yeah, it would be ideal. So let's dive in. I'm going to do the demo now, but before we do, this is the last slide, I think. There is an oddity, and I thought I would point it out at this point. Here we have three nodes; I don't know why I've shown three nodes here rather than four as on my previous diagram. I apologize, it's a bit unnecessarily confusing. Usually the job of a Kubernetes service is that traffic hits the service and then gets directed to whatever is behind it. But when we use Prometheus, a time series database that we're going to set up in a minute, we don't use the service like that. We use the service only to discover the locations of the Felix agents. What I've tried to show here with the dotted lines and the crosses is that we don't actually use the service for polling; we only use it to discover where the things we want to monitor are, and then the polling traffic goes directly to the Felix agent on each node.

Perfect. There's actually a question from the audience on that as well. Oh right, yes, there are so many; it's great to see a lot of people engaged. So there's a question: how do you install the agents in GKE or any other managed Kubernetes?

Well, actually, we're going to dive straight in now, so watch this demo and see if it answers that question, because I think it probably will; if not, I can take it at the end. So I should have a terminal here. Let's see if this works. Now, I was apologizing to Annie before we started that I'm having a minor problem with my IT setup today, so I'm not on my usual setup, which means I'm going to be looking down here a lot more than normal. I apologize for not looking at the camera while I'm talking; hopefully you can put up with my side profile. So, to start with, I've built this four-node cluster, and I built it literally 45 minutes ago, as you can see. It's got a control plane master node and three workers, the same cluster I showed in the diagram, with exactly the same details. But you can see that the status is NotReady, and that's because I wanted to show the whole process, including enabling the eBPF data plane on this cluster.
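The talk doesn't show how the cluster itself was built; for readers who want a comparable vanilla cluster on GCP VMs, a kubeadm-style sketch matching the subnets mentioned above might look like this (all of it is an assumption, not the speaker's actual setup):

```bash
# On the control-plane node: initialize vanilla Kubernetes, reserving
# 192.168.0.0/16 for pods so it matches the Calico IP pool used later.
sudo kubeadm init --pod-network-cidr=192.168.0.0/16

# Copy the admin kubeconfig so kubectl works for the current user.
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# On each of the three workers: join with the token kubeadm printed
# (address, token, and hash below are placeholders).
# sudo kubeadm join 10.240.0.10:6443 --token <token> \
#   --discovery-token-ca-cert-hash sha256:<hash>

# Nodes will show NotReady until a CNI such as Calico is installed.
kubectl get nodes
```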
But I'm going to go through the eBPF enablement part quite quickly, because we've already done it in a webinar; although it's nice to see it again, I didn't want to spend too long on it. So if we look at the pods running on the cluster, we can see that the nodes are not ready, and we have two DNS pods in a Pending state because there's no CNI. There's no networking on this cluster at all yet: no Calico, no other networking.

Just checking, though, are we seeing the correct screen? Oh, are you not seeing it? Oh, I'm so glad you told me. No, you're not, are you? Thanks so much for pointing that out. Hold on, let me just try sharing again and see if I can fix it. Yeah, we only see the slides. Let me see if I can figure out how to fix that; I think I need to stop sharing. Great catch from the audience as well; Sam was saying it too. Yeah, and I'd done the whole thing without you telling me. There, okay, I think I've removed that. Let me try resharing. Which window is that? We're getting a tip to try refreshing the browser as well; that could help. You should see a terminal that you can share now, I'm hoping. I think maybe our production room can have a look. Yeah, that's what we need. I thought I was sharing a screen, but it turns out I was sharing a window. No, we're perfect now. Yes, perfect.

Okay, so I'll just redo the bit I did before. Here we go: we've got a four-node cluster, a master node and three other nodes. Everything's NotReady, which is what we expect, and the reason is that there's no networking on this cluster yet, no Calico, no other CNI. We can see the pods, and the DNS pods are Pending because they have no networking; everything else is Running, and there's no mention of Calico anywhere yet. So the first thing we do is put Calico on. Like I said, I'm just going to blast through this because I've done it in a demo before, and I don't want people to feel like I'm going over old ground. All I did was create the Tigera operator resource, which runs as a pod; the Tigera operator's job is to bring the cluster into conformance with the networking that we mandate. Then I have this other piece of YAML here, which is an Installation resource, a custom resource. It tells the Tigera operator that we want to install Calico, that we want to use this IP addressing and block size, and VXLAN; VXLAN is the right encapsulation to use with the eBPF data plane. The only other important customization at this point is this Typha metrics port. I mentioned that one of the three things we're going to monitor is Typha, and by including this key-value pair we're telling Calico that we want Typha, the fan-out agent for the Kubernetes API, to serve metrics on port 9093. So with all that said, I'll now feed that YAML to the cluster.
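A sketch of what that pair of steps might look like, based on the description above; the operator manifest URL and the exact field values are assumptions, so check the Project Calico docs for your release:

```bash
# Install the Tigera operator (version in the URL is a placeholder).
kubectl create -f https://docs.projectcalico.org/manifests/tigera-operator.yaml

# Ask the operator for a Calico install with VXLAN and Typha metrics enabled.
kubectl create -f - <<EOF
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Expose Typha's Prometheus metrics on this port.
  typhaMetricsPort: 9093
  calicoNetwork:
    ipPools:
      - cidr: 192.168.0.0/16
        blockSize: 26
        encapsulation: VXLAN
EOF
```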
Excuse me, I just had a missed call from my daughter. I'm going to watch that, and if she calls back I may have to take it, because she's only 12 and she's on her way home from school.

I'm sure the audience will understand. If she calls back, pick it up, of course. Thank you, yeah, exactly. I'll tell you what I'll do, actually: I'll ask my younger daughter to call her, and they can talk to each other. One second. Okay, I'm back; sorry about that. No worries. I was just saying there are a few audience questions, but we can get to them at a natural pause. Yeah, let's push on for a moment and then come to those. Hopefully everything's under control over there; I'm going to keep glancing over there now, excuse me.

Okay. So what I've done is feed that manifest we saw before to the Tigera operator, and as a result the operator has gone away and created the calico-node pods (you'll notice there's one per Kubernetes node), it's created Typha, the fan-out agent, and it's created calico-kube-controllers. If you recall, these are the three things we're going to set up monitoring for: we're going to monitor calico-kube-controllers, we're going to monitor Felix, which is part of the calico-node pod, and we're going to monitor Typha, the fan-out agent. And while I was dealing with that call, those things were coming up anyway, so we didn't really lose any time.

So we've now enabled Calico on this cluster, which is why the DNS pods are running, but we haven't yet enabled eBPF, so I'm going to do that now. One nuance of this is that we need Felix, the Calico agent, to talk directly to the Kubernetes API, because it usually reaches the Kubernetes API through kube-proxy, and you can spot the problem, since we're about to take kube-proxy away. So what we do is have a quick look at the existing config map for kube-proxy and find out where the Kubernetes API server lives: here's its address and here's its port. Next, I prepared some YAML beforehand that looks like this: we're going to apply a config map in the tigera-operator namespace that tells Calico to talk directly to the Kubernetes services endpoint, and we've given the details here. So if I apply that now... good, okay. You can see that as soon as we've applied that YAML, the Tigera operator starts restarting the other components, essentially to reconfigure them. Now, I do this last command somewhat superstitiously, and I'll probably get told off later for it by my colleagues: a long time ago, you needed to restart the Tigera operator pod to make it re-read this config map we created. I'm fairly sure you no longer need to do that, but I still do it out of habit, so I apologize. And now we need to wait 60 seconds, so it's a good time to take one of those questions, shall we?

Yes, perfect. First, to quickly answer Rohit about the terminal screen looking a bit blurred: I'm not experiencing the same issue, so maybe try refreshing your browser or closing some tabs. Then, on to the technical questions. We had one: is it possible to deploy both kube-proxy and eBPF in parallel in order to migrate? In other words, how would you replace kube-proxy with eBPF on a running production cluster?

That's a really good question. The good news is that enabling the eBPF data plane is non-disruptive.
I always say the same thing when I discuss this: in theory, you could enable the eBPF data plane on a running cluster, and as long as you met the prerequisites you wouldn't have any outage. As it changed over, any new flow would start to use the new data plane and any old flow would keep using the old one. In practice, I've been doing networking for 20 years and I'm too cautious, and I think: why make your life difficult? It's best to use the data plane from the start. Having said that, depending on your appetite for risk, it is possible to switch over non-disruptively, and as long as you follow the sequence I'm going to go through now, it will switch over seamlessly. Actually, I'll address that point as I go. At this current point, we've told Felix to talk directly to the Kubernetes API, and then we've restarted Felix. That doesn't disrupt anything, because Felix programs the iptables data plane, the old data plane, so restarting it has no effect on performance or production services.

Now that we've stopped the Felix agent talking to the API via kube-proxy, the next step is to remove kube-proxy, and that's what we're going to do now. I mentioned earlier that this can be done in different ways depending on your deployment; in this case, I'm going to patch the daemon set for kube-proxy. A daemon set is just a construct that tells Kubernetes to run something on every node. So what I'm actually saying here is: take the kube-proxy daemon set, patch it, and modify it so that it only runs on nodes which are not running Calico, which is the same thing as saying I don't want it to run at all, right? Because all the nodes are running Calico. Up here you can see we were still running kube-proxy; if I run this command and then immediately run the get-pods command again, you can see, yeah, we were quick enough: we managed to catch kube-proxy terminating. And then the last step to enable the eBPF data plane is to run this command, which patches the Installation resource and merges in a new bit of config specifying that we want to use the eBPF data plane. And that's it. You can see calico-node restarting again: one of them has restarted and the other three haven't yet, but they will in a moment.

So just to take a second to discuss where we are and make sure we're all on the same page: we're now running a four-node cluster with the Calico eBPF data plane and VXLAN encapsulation. Kube-proxy is gone, Felix is talking directly to the API, and we have two instances of Typha, the fan-out agent, running with their metrics port enabled, because the original point of this session is to show the metrics. So we've turned metrics on for those two, and next we're going to show you how to actually see those metrics.
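Pulled together, a sketch of the switch-over sequence just described; the API server address and port are placeholders you'd read out of the kube-proxy config map, and the field names should be checked against the Project Calico eBPF docs:

```bash
# 1. Tell Calico where the API server is, so Felix can bypass kube-proxy.
kubectl create -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubernetes-services-endpoint
  namespace: tigera-operator
data:
  KUBERNETES_SERVICE_HOST: "10.240.0.10"   # placeholder address
  KUBERNETES_SERVICE_PORT: "6443"          # placeholder port
EOF

# 2. Disable kube-proxy by pinning its daemon set to a node selector
#    that no node matches.
kubectl patch ds -n kube-system kube-proxy -p \
  '{"spec":{"template":{"spec":{"nodeSelector":{"non-calico":""}}}}}'

# 3. Switch the Installation resource over to the eBPF data plane.
kubectl patch installation.operator.tigera.io default --type merge -p \
  '{"spec":{"calicoNetwork":{"linuxDataplane":"BPF"}}}'
```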
Did you want to address any questions before I dive into that, Annie? Yes, there was one that fits here: right at the beginning of the demo there was a question from Jonathan. Is there a repo with these commands, or a manual for installing this on our clusters?

Yes, yes, there is. Go to the Project Calico docs page. I remember it as docs.projectcalico.org; there's actually a new URL now, but that old one will redirect you. In the documentation you'll find pretty much the exact steps that I'm taking now, and it tells you how to do it on different cluster types as well, so you just have to identify your cluster type and it will tell you how. There's also a blog post: if you search the Tigera blog from about six weeks ago, I did a post with similar content to what we're doing today, so if you prefer to consume it in blog form, take a look at that.

Perfect. Jonathan, thank you so much, and from my side, of course, thank you to all the people asking questions; keep them coming. Yeah, it's actually really nice to get questions, especially working from home; you know people are there.

Cool, okay, so Typha's running. Let's have a look. You can see that all the calico-node pods have restarted, which is intentional, and we're running two Typha pods, and these are the IP addresses of their nodes. Now, Typha runs with host networking: it doesn't have a pod IP address of its own, it uses the host's IP address. So, this command is quite long: because we're running in Google Cloud, we're going to SSH to the controller node and run curl against the IP address of the first node that's running Typha, on the port we specified. All we really want to do here is show that there are some metrics there. Okay, a ton of metrics came back, so all we're really doing at this point is proving that Typha is there and responding with metrics. That's cool.

The next thing we do is create a service to expose them. Recall what I said about how, unlike a normal Kubernetes service, this one is going to be used to discover the agents, not to poll them; that's done separately, and I'll show you how later on. So I take some YAML and apply it directly. All I'm doing is creating a Typha metrics service in the calico-system namespace, saying it should address any pod with the label k8s-app: calico-typha, and pointing it at the metrics port. If we look at the service, we'll see it's now there, 29 seconds old, with a cluster IP. And if we take that cluster IP, put it into this command, and again run it from within the cluster, we can see that we're able to hit the service and get some metrics back. Just to reiterate: when we actually monitor this, this is not how we'll do it. We'll never hit this service for metrics, and it actually illustrates the problem nicely: if you hit the service, you might end up on one Typha instance or you might end up on the other, but we want to monitor both, right? So we don't poll the service, because we don't want to monitor one Typha instance or the other; we want both. But it is a good way of testing that the service is there and working. Which it is. So: Typha component done.
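For reference, a sketch of the kind of Service being described; the selector and port follow the Calico conventions mentioned in the talk, but the name is an assumption, so double-check against your install:

```bash
# A discovery-only Service in front of the two Typha instances.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: typha-metrics-svc
  namespace: calico-system
spec:
  selector:
    k8s-app: calico-typha   # matches the Typha pods
  ports:
    - port: 9093            # the typhaMetricsPort set in the Installation
      targetPort: 9093
EOF

# Smoke-test it from inside the cluster (substitute the real cluster IP).
kubectl get svc -n calico-system typha-metrics-svc
# curl <cluster-ip>:9093/metrics
```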
We'll do the same thing now for calico-kube-controllers. Oh, I don't have calicoctl installed. One sec; I had to rebuild my laptop a few days ago, so I just need to install it, which won't take a second. Okay, great. This isn't really part of the demo, this is just me installing a tool I should have had in the first place: calicoctl, Calico's command-line tool, which I was going to use in a moment.

All right, and there are a few questions coming in as well. Oh, great; when there's a good spot, let me know and we can take them. Yeah, let's just get calicoctl working first. What did I do wrong there? Ah, I see: I just found an error in our documentation, which I will fix as soon as we finish this call. Okay, actually, I think I can sidestep installing calicoctl for now anyway. Am I going to need it? Let me just check the port. Okay, good, I can sidestep it right now. All I was going to do was run calicoctl to find the port on which calico-kube-controllers serves the metrics that Prometheus wants to collect, and luckily I already know the answer: it would have output port 9094. So what I'm doing is creating another service, this time the kube-controllers metrics service, the same idea but with port 9094. And let's just test the service, so we get its IP address.

Now, I thought your audience would probably be happy to hear that my eldest daughter just arrived home. Perfect, that's good to hear. Yeah, I was slightly worried, because she doesn't usually call on her way home. Right, okay. So, here we go: we test the service, doing the same thing as before, going via the controller node and checking that the service responds. And it does. Okay, so that proves we have that one done.

The last thing is to turn on the metrics for Felix, the per-node agent. Here we go, set enabled to true... now, is this going to work? I might have to fix my calicoctl problem after all, because I think I'm going to need it. Just give me one second; sorry about that. Don't worry, with live demos there's always something that happens. Yeah, exactly, it's fine. One sec. I'm actually going to guess the URL and hope I get it right. I'm not 100% sure if I'm misreading the docs because I'm hurrying, or if they're wrong; if they are wrong, I'll fix them after this session. I have a feeling I might actually be reading them wrong, because there are several sections in the docs that look similar. Anyway... yeah. So you either discovered a thing to fix, or this is the regular kind of live-demo moment? I don't know; I've done demos where something didn't work and only later realized I'd done something completely wrong. Yeah, even when you've done loads of live demos, you still find that mid-demo you don't read things quite as carefully as you normally would. Anyway, I have calicoctl now, so we're good; we can move on.

Just to remind ourselves where we were: for Felix, what we need to do is slightly different. We need to turn on its Prometheus metrics, because they're not on by default, so we patch the Felix configuration and enable Prometheus metrics. Those will be on port 9091 by default, so we can then hit any node we like. Again we switch to the controller node and curl the first node (any of the nodes would work) on port 9091. And here we go, we've got some metrics. So now we're getting to where we can start to see interesting visual things. I'm going to quickly create a service again, and again, we won't be using that service for the actual polling, but we will test with it.
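A sketch of those two steps, the kube-controllers Service and the Felix patch, as I understand them from the talk and the Calico docs; the service name is conventional rather than verbatim from the demo:

```bash
# Discovery Service for the kube-controllers metrics endpoint.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: kube-controllers-metrics-svc
  namespace: calico-system
spec:
  selector:
    k8s-app: calico-kube-controllers
  ports:
    - port: 9094          # default kube-controllers Prometheus port
      targetPort: 9094
EOF

# Felix metrics are off by default; enable them cluster-wide with calicoctl.
calicoctl patch felixConfiguration default \
  --patch '{"spec":{"prometheusMetricsEnabled": true}}'
```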
So this one's called the Felix metrics service, and there's its IP. Great. So now we've got all three kinds of metrics that we needed, on ports 9091, 9093 and 9094, and we can start to pull them into Prometheus.

We'll create a namespace called calico-monitoring; that's simply creating a namespace, with nothing in it yet. Then we need to create a service account and a cluster role for Prometheus and bind the two together. This is quite a lot of YAML, so let me just grab it, paste it in, and then we can discuss it. If we scroll back up... oh, I copied in a little bit more than I meant to. Never mind, that's okay. So here you can see I copied in some YAML and created a cluster role: this is Kubernetes role-based access control. We created a user called calico-prometheus-user, and this user can look at the metrics URL and do a GET on it. We created a service account, and then we bound those two things together, so we've just set up some role-based access control. Then I accidentally went one step further and also applied this config map, which is the config map for Prometheus that stores its configuration file. Prometheus is the time series database that's actually going to scrape the metrics endpoints we created, gather that data, and store it. I won't go into the detail of the configuration; if people want the exact detail, the best place is probably a combination of our documentation and the blog post from about six weeks ago that I mentioned.

So that's the Prometheus configuration file, and now we create a Prometheus pod. This pod is going to read that configuration file. We're creating the pod, again in the calico-monitoring namespace, labeling it appropriately, telling it which service account to run with (the one we created before), using a vanilla Prometheus image, and feeding it the config from the config volume. It will respond on port 9090.

This would be a good point to... actually, not quite. Let's go one step further and then I'll stop to see if there are more questions. You can consider that we had those three components we want to monitor, and now we've added an extra layer above them: a time series database layer. So now we create a service to allow things to connect to that time series database, which is Prometheus. We create a new service, select the Prometheus pod, and make it available on port 9090. Let's just test that the service is actually working. At this point I'd usually show you in a browser, but because we're sharing just a window and not the whole screen, that won't be easy; let's just curl it for now, and that will have to do for the moment. So, to reiterate what I've done here: I've created this Prometheus time series database and a service for it, and now I've created a port-forward from my local laptop into the cluster running in GCP, forwarding port 9090 on my laptop to port 9090 on that service in the cluster. Finally, if I run this curl against my local laptop on port 9090, I should see... yeah, here we go: a response from the server, which shows that Prometheus is running.
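A condensed sketch of the Prometheus pieces just walked through: RBAC, a deliberately minimal scrape config, the pod, the service, and the port-forward test. The config in the demo is longer (it uses Kubernetes service discovery rather than a static target), and the names and the example node IP here are illustrative assumptions:

```bash
kubectl create namespace calico-monitoring

kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-prometheus-user
rules:
  # Allow Prometheus to GET the /metrics endpoints it scrapes.
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: calico-prometheus-user
  namespace: calico-monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: calico-prometheus-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: calico-prometheus-user
subjects:
  - kind: ServiceAccount
    name: calico-prometheus-user
    namespace: calico-monitoring
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: calico-monitoring
data:
  prometheus.yml: |
    scrape_configs:
      - job_name: felix          # one static node IP as an example target
        static_configs:
          - targets: ["10.240.0.11:9091"]
---
apiVersion: v1
kind: Pod
metadata:
  name: prometheus-pod
  namespace: calico-monitoring
  labels:
    app: prometheus-pod
spec:
  serviceAccountName: calico-prometheus-user
  containers:
    - name: prometheus
      image: prom/prometheus
      volumeMounts:
        - name: config-volume
          mountPath: /etc/prometheus/prometheus.yml
          subPath: prometheus.yml
      ports:
        - containerPort: 9090
  volumes:
    - name: config-volume
      configMap:
        name: prometheus-config
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-dashboard-svc
  namespace: calico-monitoring
spec:
  selector:
    app: prometheus-pod
  ports:
    - port: 9090
      targetPort: 9090
EOF

# Forward the service locally and confirm Prometheus answers.
kubectl port-forward -n calico-monitoring svc/prometheus-dashboard-svc 9090:9090 &
sleep 2 && curl -s localhost:9090/-/ready
```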
I'd like to be able to show this in a browser, so shall we have a quick go at fixing the screen sharing, so that people can see all my windows, not just the terminal? Yes, we can see our backstage. Okay, great, fantastic. So let me see if I can do that now; apologies while I figure it out. Okay, so I think if I remove that source and try to re-add a screen share... no. So we have a technical problem; I'm just going to share my terminal again. The problem is basically that for some reason it won't share my entire screen, including the screen I'm using for my personal notes and so on, which I don't really want to share anyway. So we're going to have to make do. This is okay, we can work through it; it just means we won't actually be able to see the graphs, which is a bit of a shame, but we're a little low on time anyway, so I think we'll just push on. So: what we've done is create a Prometheus time series database that is now scraping the endpoints we created, and it's listening on port 9090. Are there any questions we should answer before we move on, do you think?

Yes, there are three questions that have come up so far. First: in standard Calico, it uses BGP without any encapsulation, if I understand correctly. Why does the eBPF data plane need VXLAN?

Yeah, that's a really good question, and the answer is: it doesn't. To take a step back and not talk about eBPF for a moment: with the traditional iptables data plane, Calico can use BGP with no encapsulation, or it can use VXLAN, or it can use IP-in-IP encapsulation, and which of those is the correct choice depends on your environment. In most environments the correct choice is VXLAN, but any of the three is possible. Once you switch to eBPF, you can still use BGP with no encapsulation, and that will work just fine, or you can use VXLAN encapsulation. But you shouldn't use IP-in-IP, because its performance will not be as good as VXLAN's. So just to reiterate: once you move to the Calico eBPF data plane, you want either no encapsulation with BGP, or VXLAN encapsulation; either of those options will work fine.

Perfect. And the next one was: does eBPF replace my networking, like the CNI that comes by default from my cloud provider? Let me just read the question myself... oh, I see. Okay, I think I understand the question. It depends, unfortunately; that's the only real answer I can give in the time we have. If you go to the Calico website, there are several courses that go into a lot more detail about this if you want to dig into it. But let's say you're deploying your cluster in AWS: one option is to use AWS's EKS service, which is an entirely managed service. It's Kubernetes, but you don't care about any of the workings of it; you just let Amazon make it work for you. That's not what I'm doing today: I'm running in Google as it happens, but instead of using their managed service, I've built my own cluster, so eBPF is in this case replacing the data plane. If that's still not clear, I made a video a few months back called, I think, "The Importance of Data Planes", which goes into a lot more detail about what a data plane is, why you should care about it, and how you can switch between them.
Similarly, if you have a particular environment, say an AWS environment, we actually have a free course that will give you a ton of information about the various options and their pros and cons. One thing to keep in mind, though, is that all of these are advanced options: if you just deploy Calico on a vanilla Kubernetes cluster, it will work. What we're doing here is a more advanced configuration to bring out some of the benefits, so none of this is essential; it's optional.

Perfect. And the last question so far: is it possible to get the IPAM block addresses in eBPF? I think there were issues in API version one.

Hmm, I'm not 100% sure on that one, so maybe we can follow up on it, on the Calico Users Slack or on LinkedIn. I don't want to give the wrong answer, and I prefer to give no answer rather than a wrong one. So if the poster can reach out to me on Twitter, Calico Users Slack, or LinkedIn, I will find out the right answer and come back to you. I'm not totally sure about that one, to be honest.

Perfect. So head over to Twitter for Chris to get a clarification there. And if you have anything left to finish the demo, we can obviously go there.

Yeah, actually, let's do the last part of the demo, because we've only got five minutes left, and it will take just a couple of minutes. So, remember how this went: we turned on eBPF, we turned on the metrics endpoints on the various components, we added a service in front of each of those, and then we added a time series database, which is Prometheus. The last thing we need to do is add a visualization tool, and for that we're going to use Grafana. Obviously there are more options than just Grafana, but it's by far the most popular one I'm aware of. So we stick Grafana on the front of this. Grafana is just a visualization tool that lets you take the data in the Prometheus time series database and visualize it in ways that are useful. We create a config map, called grafana-config, and it tells Grafana where to find the URL for the Prometheus data source; I'm not sure it tells it anything else important at this point, I think that's the key thing. So that's the config. Then we can supply some default dashboards: we're applying a manifest from projectcalico.org (the version in that URL needs to change to match, 3.22 in this case), and that gives us some vanilla dashboards. Sadly we're not going to be able to see these dashboards because of the screen sharing issue, but you'll have to take my word for it: they're very beautiful. And the last thing to do is create Grafana itself. So here we are: we create a pod in the calico-monitoring namespace using the vanilla Grafana image, we create some mount points, and we pass in the config volume... where is it... config-monitoring... oh, grafana-config, yeah, sorry. So we take the grafana-config we created a moment ago, pass that in, and mount it in a volume.

So if everything has worked properly, if we look at all of our pods we should see the things we're monitoring, which are here (I highlighted the API server, but we're not monitoring that; I don't know why I highlighted it), plus Prometheus, the time series database, and Grafana, the actual visualization tool. If all of that has worked correctly, which hopefully it has, we should be able to start a new port-forward from my local laptop, on port 3000 this time, and if we hit that URL, the Calico Felix dashboard, then, if we were doing this in a proper browser, we'd be able to log in and see the graphs for the cluster. Sadly we can't at this point; luckily we're just about out of time anyway, so I think I would have slightly run over if I had been able to do that. So yeah, that's all I can demo today within that limitation. Are there any other questions you wanted to cover?
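A sketch of the Grafana pieces described above: the data-source config and the pod. The field names follow common Grafana provisioning practice and reuse the service name from the earlier Prometheus sketch; treat them as assumptions rather than the demo's exact YAML:

```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-config
  namespace: calico-monitoring
data:
  # Grafana data-source provisioning file pointing at Prometheus.
  prometheus.yaml: |
    apiVersion: 1
    datasources:
      - name: calico-prometheus
        type: prometheus
        access: proxy
        url: http://prometheus-dashboard-svc.calico-monitoring.svc:9090
---
apiVersion: v1
kind: Pod
metadata:
  name: grafana-pod
  namespace: calico-monitoring
  labels:
    app: grafana-pod
spec:
  containers:
    - name: grafana
      image: grafana/grafana
      volumeMounts:
        # Mount the data-source config where Grafana auto-loads it.
        - name: grafana-config-volume
          mountPath: /etc/grafana/provisioning/datasources
      ports:
        - containerPort: 3000
  volumes:
    - name: grafana-config-volume
      configMap:
        name: grafana-config
EOF

# Forward Grafana locally, then open http://localhost:3000 in a browser.
kubectl port-forward -n calico-monitoring pod/grafana-pod 3000:3000
```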
Yes, there are a few; we'll have to be quite fast, but let's get to at least one of them. There was a question: what's the difference between Calico and Cilium?

Sure. I'm going to take that in the context of eBPF. Calico and Cilium are both CNIs for Kubernetes that use eBPF to implement their data plane. I won't speak to the details of Cilium, because I'm not on their team and I don't know the details as well as they would, but essentially we're both using eBPF to create a data plane for Kubernetes. In the case of Calico, we have other choices that allow us to suit other environments, not only eBPF: for example, you might choose our iptables data plane because you want battle-tested code that has been run in production for approximately five years. So: both Calico and Cilium implement a Kubernetes data plane using eBPF. I think that's the best answer.

Perfect. Then time for the last question, and after that we'll wrap up. So, the last question of today: where can I get, Chris, the custom resource YAML?

Oh, good question. The blog post that I mentioned... I don't think I have a way to... oh, actually, I can put it in the chat, can't I? Let me share the URL in the chat, one second. Yes, you can send it in the chat and we'll get it over to the streaming service chat. Fantastic. One second while I find it; now I've put some pressure on myself to actually be able to find it... here we go. Perfect. It's always high stakes. Well, it's exciting, isn't it? So, will I be able to post it in the other chat? Yeah, if you can pop that into the main chat. Essentially, that's a blog post that goes through most of the same things we've covered today, so you'll find the YAML files in there, in a written form that may work well for people who prefer that.

Perfect. We got it posted, and it's time to wrap up. Thank you so much. Yeah, thank you; it was a really packed session, but we got there, and that's the main thing. Thank you so much for helping me. Of course. And as always, thank you to everyone for joining the latest episode of Cloud Native Live. It was great to have this amazing session on setting up monitoring for Calico's eBPF data plane, and really amazing interaction today; thank you everyone for commenting, sticking with us, and bringing a lot of good comments and questions to the table. Loved the interaction today. And as always, we bring you the latest cloud native code every Wednesday, so tune in next week as well.
We have a great session coming up then as well. So thanks for joining today, and see you next week. Thank you so much.