Okay, hi, welcome to our talk "Trust is good, control is better", a short story about network policies. And a short introduction about ourselves first. I'm Maximilian, I'm working as a cloud platform engineer at inovex. Currently, at least in my current project, this mostly means working with Kubernetes: we're building Istio on top of an existing Kubernetes cluster for our customer. I started with Kubernetes one and a half years ago with my master thesis, from which the project we are talking about today sparked. And I'm basically doing the same, just a little bit longer. So what can you expect from our talk? First, I'll show you some challenges when using network policies, because that's not trivial. You'll see different aspects of testing and validating your setup, mostly focused on Kubernetes obviously, and you'll also see different testing strategies and how you can use different tools to solve different problems. Just a little question to the audience, a little bit of morning sports: just raise your hand. Who of you is using Kubernetes? Okay, I think nearly everyone. Who of you uses network policies in Kubernetes? Okay, not so many people. Who has ever had issues with network policies because they blocked something? Okay, great. We also had some fun and some challenges, so that's why we are here and why we built the thing Maximilian will talk about. I'll first talk a little bit more about the theoretical part: why should I test my network policies, what aspects are there, and how can you test them?
Everybody who knows Kubernetes, or has had the opportunity to use it, knows that it's a pretty complex system with many adjustment screws: you have thousands of flags with which you can provision your cluster and its components. The interesting question always is: do all these flags work together, or have I maybe set a flag that is ignored or just doesn't work? And the main question is: how do I know whether everything works, or still works? The next funny part is that Kubernetes doesn't implement network policies itself. A network policy is just a resource in Kubernetes, and the CNI plugin, for example, has to read the network policies and implement them, or not, because network policies are not part of the CNI specification; they are just part of Kubernetes itself. And the next, not so funny part is that you don't get any feedback. For example, if you use Flannel, or you set up a GKE cluster without network policy support enabled, and you create a network policy, you don't know whether that policy is enforced or not, because you don't get any feedback; the resource is just there, or not. Also, network policies have evolved over time, so they gained more and broader features. And the other thing is: how do you know which part of the Kubernetes network policy API is actually supported by the CNI plugin? Maybe it only supports a small part of the Kubernetes network policies, but how do you know whether it supports
everything? The next funny thing: this is just a little excerpt from the official list of network plugins for Kubernetes, from the Kubernetes documentation, and these are not even all the plugins that are listed there; these are just the plugins with a logo. You can see there are pretty many, and what all these things do differs a little bit. Some tools use iptables for the network policies, for example, some use ipset, some use eBPF, like Cilium, some use nftables, and some do, I don't know, some other stuff maybe. So it's pretty interesting to see how differently these things are solved. The next thing: here we have two policies, and policies are not always easy to read. Who can spot the difference? Nothing is missing; both policies are valid. The left one, without a dash, is an AND combination, and the one with the dash, this one here, is an OR. Both policies look pretty similar, just a little dash, which can easily be overlooked, but they have a completely different meaning in the end, and it's not that trivial to see such things. And actually, if you have looked into network policies, these two are pretty trivial compared to the options you have for how to write network policies.
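To make the dash difference concrete, here is a hypothetical pair of ingress rules (the labels are made up for illustration). In the first variant, namespaceSelector and podSelector form one entry in the from list, so both must match; in the second, the extra dash makes them two entries, so either one is enough:

```yaml
# Variant 1: one 'from' entry. Traffic must come from a pod that is
# BOTH in a namespace labelled team=ops AND itself labelled app=monitoring.
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            team: ops
        podSelector:
          matchLabels:
            app: monitoring
---
# Variant 2: the extra dash creates two 'from' entries. Traffic from any
# pod in a team=ops namespace OR from any app=monitoring pod in the
# policy's own namespace is allowed.
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            team: ops
      - podSelector:
          matchLabels:
            app: monitoring
```

This AND-vs-OR semantics of a single selector entry versus a list of entries is defined by the Kubernetes NetworkPolicy API itself.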
Yeah, the next funny or interesting part is that you have many components. For example, you normally have kube-proxy when you're running Kubernetes, or you don't have it, if you use something that replaces kube-proxy. And if you switch the kube-proxy backend mode, for example from iptables to IPVS, do you know whether your network policies still work, or whether the routing is now different and the addresses maybe look different? Or if you update your CNI plugin to another version, do you still know whether the network policies still work, or whether they maybe only seemed to work before because of a bug and just happened to fit your context? This was the case in the early days of pod security policies, for example, where there were some bugs in the validation. So it's a pretty interesting and complex world. There are also conformance tests for Kubernetes, which are pretty nice, at least to know that you have a roughly working Kubernetes cluster, but the conformance tests currently don't cover any network policies or any RBAC rules. There are some ideas for conformance tests with different testing profiles, so that you could test network policies with the conformance tests, but this is just something for the future, maybe, if someone implements it. Now the next question is: what can we test? I think everybody will agree that testing is probably a good idea if you are running in this complex world and have a more or less big or complex setup. In general we thought you can look at two areas. On the left side, from your point of view, we have conformance: for example, we can say every user has to create a default deny policy if he creates a network policy.
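Such a default deny policy is standard Kubernetes; for ingress it looks like this:

```yaml
# Deny all ingress traffic in this namespace: the empty podSelector
# selects every pod, and since no ingress rules are listed, no
# incoming traffic to those pods is allowed.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
    - Ingress
```

Any further policy in the namespace then explicitly opens up only the traffic that should be allowed.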
This is more like conformance or governance around your cluster; it's not really testing or looking into how the policies actually work. On the right side you have the actual implementation. There you can validate, for example, whether the network policy control plane is actually synchronized across your Kubernetes cluster, because there can be some delay, and depending on how large your cluster is, that delay can be pretty big. And on the far right you have the data plane in the end, where you will probably check whether your network policies are actually taking effect. There are different approaches: you can do active testing or passive testing, but Maximilian will talk about this a little bit more in depth now. Yeah, so I'm taking over now, and I want to tell you how to actually test policies. I'm going back to the same diagram again. For conformance, so what users can or should submit, we can just query the control plane, the Kubernetes API, and check that the policies that are in the cluster fit our company policies. I'm going to name OPA later on too, but we've already heard about Open Policy Agent, which does exactly that, so you could use it for checking that your users conform to your company policies when using network policies. For synchronization, we can passively check CNI logs. With this pattern of control plane and data plane, for e.g. Calico, which I know best of all the CNI plugins, it would look like this: there is an agent running on each node, and you can check these agent logs to see whether it picked up the policy and then generated the iptables rules that bend the traffic to fit the policy. That would be entirely passive, because we're just reading logs. We could also apply policies to get some reaction, and maybe even measure how long the synchronization takes, and when it works and when it doesn't.
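As a rough sketch of that passive log check, assuming a typical Calico installation (the DaemonSet name, namespace, and log wording all depend on your setup):

```shell
# Hypothetical sketch: read the per-node Calico agent (Felix) logs and
# look for messages about policy updates. Resource names vary by install.
kubectl -n kube-system logs daemonset/calico-node -c calico-node \
  | grep -i policy
```

Whether a matching log line counts as "policy synchronized" is something you have to decide per CNI plugin and version; that is exactly the fragility of the passive approach.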
Yeah, and for effective policies we can passively just read traffic, using tcpdump or reading logs again, or check the iptables rules that are e.g. generated by Calico. Or we can actively, using curl or something, generate traffic that mimics the traffic of the pods that is limited by the network policies. And that's something I'm going to talk about more in depth now, because this is the approach we chose. The upside of this is that we are basically testing the whole process: if we are testing pod-to-pod traffic, we are also testing that the control plane synchronized the policies; otherwise they wouldn't work. So there are some technical approaches to generate this traffic. One pretty transparent one would be just taking a pod spec. Let's say we have the pod foo, it's affected by a network policy in some way, and we want to test whether it can reach another pod. Then we can just take the pod spec, copy the whole pod, and replace the container image with a test image, and this test image could just run curl or wget or whatever, to check that traffic can or cannot go out of this pod. Or, if we are using Docker in our cluster, we can just attach to an existing pod. You probably know that when the kubelet starts the containers for your pod, it starts a pause container and your actual workload container or containers, and the pause container also sets up the network namespace. So you can just use docker run --net container:<pause container>, with that really long container name you've probably seen before, and attach your own testing container, which then in turn uses curl or whatever you choose to test your traffic. But that has one downside.
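The Docker variant just described might be sketched like this (all names are hypothetical; the k8s_POD_ prefix is the naming pattern Docker-based kubelets commonly use for pause containers, and the image and target URL are made up):

```shell
# Find the pause container of pod "foo" in namespace "default".
PAUSE=$(docker ps --format '{{.Names}}' | grep '^k8s_POD_foo_default_' | head -n1)

# Run a throwaway test container inside that pod's network namespace
# and probe whether the backend service is reachable from there.
docker run --rm --net "container:${PAUSE}" curlimages/curl \
  -s -o /dev/null -w '%{http_code}\n' http://backend.default.svc:80
```

The test container shares only the network namespace, so from the network policy's point of view the probe looks exactly like traffic from pod foo.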
For each test case you want to execute, you have to run a new Docker container to attach to this pause container, and it obviously only works with Docker. But the upside is that it's really simple if you're using Docker, because you can just use the CLI, it does all the work for you, and you have no problems with privileges or anything. And if you want to go even deeper, even lower towards the OS, you can directly attach to a network namespace. So again, each CRI (Container Runtime Interface) runtime will create this pause container, and we can just use any existing pod in our cluster, which then requires privileges to see the other processes and to run privileged, and we can attach to the network namespace, using e.g. nsenter, from our test pod to the pause container of the pod under test. I hope it's not too abstract; I'm also going to show a little bit more about that soon. So, how would this look in practice? We could use different tools; I already named wget and curl. Maybe you just created your network policy in your cluster, and you only want to know: is traffic denied like I expected it to be? Then you can just exec into your pod, the pod that traffic should be allowed from, and use curl or wget, and you'll see either "okay, traffic got through, I'm seeing my backend service's page or whatever" or it didn't work. The downside of these tools is that you have to provide them in the container of each application you want to test from. There's also an existing tool called netassert, it's on GitHub, I've put a link here. What you do there is provide a configuration in the format specified on the GitHub page, and netassert reads this configuration.
So in this case it would test from a deployment in the default namespace named foo to a deployment in the same namespace named bar, and it would test that TCP port 80 is open. And what netassert then does for you is this Docker approach I explained: it runs a container, this container runs in the Docker network of the pause container of the foo pod, and it can in turn use nmap to probe whether traffic to the bar pod, through the bar service, is possible. So this works pretty well; you just have to provide your own test config YAML, and you also have to have Docker and SSH access, because that's how netassert starts its containers for testing. And then it also collects the results for you, obviously, and prints them to standard out. Our issue with that was that we didn't use Docker in our usual cluster setups, so we developed our own solution. It's called Illuminatio, it's also on GitHub, you can check it out. What we also wanted to solve is the part of test case generation. I mean, it's partly useful to say "okay, these are my expectations as an operator, this is what the network policies should do", but Illuminatio also gives you its opinion on what they should do: it parses the network policies and generates test cases like "go from pod foo to pod bar and tell me whether traffic is possible", or "tell me that traffic is possible, because it should be allowed". What we do in the background is create a DaemonSet, which launches a pod on each node, and this pod uses nsenter to enter the network namespace of the pod
we want to generate traffic from. So with this approach we basically don't interfere with any production traffic at all, because we just hook into the network namespace of an existing pod. We get traffic out, but there's nothing coming in; it's not like we have to deploy an additional pod, which would then probably accept production traffic. The same is also true for netassert, because it runs a separate container. Yeah, so I also want to talk a little bit about this test case generation part, because that's quite complicated. So I want to look at this network policy. You've probably all seen one, because most of you raised your hands. This is just a simple network policy that takes effect in namespace default, and it selects pods with some label selector. How network policies work internally is that every pod that is selected, so every pod in namespace default that is matched by this label selector, is isolated from all traffic except the traffic that is defined in ingress or egress rules. Yeah, so this is quite easy to parse and to generate test cases from, because now we have this paradigm: my pod is isolated, except for this traffic. So we can test: a pod that matches the conditions in this ingress rule should be allowed to reach our Prometheus pod, and a pod that doesn't match the conditions should not be allowed to connect to this Prometheus pod. But if we have multiple policies, it gets complicated very fast: maybe our Prometheus pod also has another label that is used in another network policy, and now we suddenly allow some traffic that wasn't intended initially. Yeah, and another thing, which I've already named: we want to generate two kinds of tests. We don't only want to test that traffic is allowed, because the default in Kubernetes is that all traffic is allowed, so if network policies never took effect, we wouldn't even recognize it. We also have to basically invert what is
contained in these ingress rules, and test from pods that don't match this label selector, or these conditions in the ingress rules, to be sure that traffic is also denied from pods which should not be allowed. Yeah, and for multiple policies it gets complicated fast. Maybe you can spot how these two policies interact: the left policy only allows traffic from app=grafana, and the right policy allows traffic from an empty namespaceSelector, so all namespaces, combined with podSelector team=ops. In this case our traffic would be allowed, because our namespace does match the condition, so whatever namespace the pod is in, it matches the condition, and it also has the team=ops label. But maybe we initially only meant that Prometheus should only be accessible from Grafana, and this other policy just overrode what we intended to do. So one upside is that Illuminatio also generates test cases that clearly state these intents, at least as we programmed it to, so you at least get an overview of your cluster: maybe your test cases all pass, but the generated test cases still look wonky, and you can see that with Illuminatio. Yeah, so we prepared a short demo. Where's my terminal? Let's see. Okay, so we've got two clusters here, and both are on Google Kubernetes Engine. As already mentioned, when you create a cluster on GKE you have to set a flag specifically to enable network policy. So we have one cluster, currently selected as context, that has exactly that: network policy enabled. We've got, oops, sorry, okay, so we've got some namespaces, let's start with that, set up for our demo. The other namespace is also labeled, and we've got a network policy. This network policy allows ingress from namespaces that are labeled team=operations.
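A hypothetical reconstruction of the demo policy as described (the application label and policy name are guesses for illustration):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: web        # made-up label for the demo application pod
  policyTypes:
    - Ingress
  ingress:
    - from:
        # No dash before podSelector: both selectors must match, i.e.
        # only pods labelled type=monitoring that run in namespaces
        # labelled team=operations are allowed in.
        - namespaceSelector:
            matchLabels:
              team: operations
          podSelector:
            matchLabels:
              type: monitoring
```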
So our other namespace meets this condition, and the policy allows traffic only from pods in such namespaces that also have the label type=monitoring, and it allows that traffic to pods in namespace default that have the selected labels. We should have one pod meeting that condition there. So if we now run Illuminatio, yes, okay, we see it generates four test cases. This one is the easiest: we just want to know that from namespaces labeled team=operations, from pods of type=monitoring in such a namespace, we allow traffic to our application. And the other ones are different variations of traffic that is not allowed. The top one: we have another namespace, not labeled team=operations, but with the matching pod label, so this one is not allowed. And we have one where both labels do not match, and one where the namespace label matches but the pod label doesn't. The syntax here: the asterisk means all ports, because network policies can also specify ports, so any port should work, and the minus means it should not be allowed. So in this case, our network policies were enabled on GKE and our tests were successful, so our policy works exactly as we expected. And how we tested that: Illuminatio generated one namespace called illuminatio, where the DaemonSet was deployed, and its pods did all that stuff with nsenter in the background and used nmap to check whether ports are open or not open, and then it also generated extra namespaces for testing. So this is currently how we implemented it, but if you do not want namespaces to be generated or anything like that, this could also be changed in the future.
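Under the hood, the nsenter step the runner performs could be sketched roughly like this (purely illustrative; the real runner discovers the pause process via the container runtime, and it needs hostPID plus privileged mode for this to work):

```shell
# From a privileged pod with hostPID: find the pause process of the
# source pod and run the nmap probe inside its network namespace.
PAUSE_PID=$(pgrep -f '/pause' | head -n1)   # illustrative lookup only
nsenter --net --target "${PAUSE_PID}" -- \
  nmap -Pn -p 80 my-backend.default.svc.cluster.local
```

Because only the network namespace is entered, the probe is routed and filtered exactly as if the source pod itself had opened the connection.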
So we are very happy if you want to submit a patch for that, or to make it optional. Yeah, but currently the behavior is: we generate a new namespace whose labels are different from team=operations and test from that one. And yeah, in namespace other we also have some dummy pods that are generated and either meet our condition of being labeled type=monitoring or don't meet it. So I can also show the labels here. Yeah, so one has type=monitoring, the bottom one, and the other one has, basically, not type=monitoring. Yeah, that's basically how Illuminatio works, and just to show that it works and that we didn't cheat, I'm also going to run it on my other cluster, where everything should be set up the same. I get a warning because I shouldn't use my kubeconfig like that, according to GKE. And so the same test cases are generated, so the setup is probably the same, and then our tests are failing for denied traffic, because, like we stated before, if network policies aren't enabled, all traffic is allowed. So now Illuminatio detects: okay, traffic wasn't denied.
Okay traffic wasn't denied that's bad and Complaints so Okay time to wrap up So I've shown you some tools Illuminatio and NetAcerd mainly that you can use So and you can use them to test like use whatever fits your setup and to test whether your policies take effect If you also want like to check this user policy interaction So users should maybe only be allowed to submit a network policy and namespace where default deny policy is applied To not get any confusion and then you can use something like open policy agent or specifically open policy agent I think this tool has quite some hype and it is also deserved hype because it works very well It's if you understand Rigo the weird language they ship with It's pretty easy to use and if you just want to validate your policies in any way without user interaction You can also use open policy agent run once against your cluster or you can use cube audit that has at least one check I believe that checks that network policies are in your cluster and every namespace I think default deny policy is what they check and You could also which is not mentioned on this slide run the Cuba needs and to end tests Specifically for network policies, so they are not part of the conformance suit But you can still run them and they test some network policies in their own setup So I'm giving back to you for a short recap So yeah the first recap or the first thing that we noticed is That we or you always should test your assumptions for example, so if you create a network policy It's it's nice to trust that a CNI plugin will implement them but in the end it's better from our perspective to actually test it and not test it just once and Not just tested multiple types for example, so regression testing can make your life easier since the cubanese ecosystem is Pretty complex or you have the multiple component components for example They need to interact in some kind of a way and Request and trashed regression testing makes it easier to see okay 
some update, for example, broke my setup, or it behaves completely differently than before. Also, from our perspective at least, network policies are still hard to get right. Also, we are currently just focusing on the Kubernetes network policies. Plugins like Calico, for example, or Cilium also have another type of network policies, their own internal policies: they can allow global network policies, or Calico policies, for example, and those behave somewhat like the Kubernetes network policies. And it's pretty sad, from our perspective, that you don't get any feedback on how and whether they are implemented. The next interesting thing, and it's not really a bug, it's more of a feature to me: we tested this with Weave, and in the default setting the behavior was that traffic to services is affected by the network policies, but if you call a pod directly, the network policy didn't take any effect, because I think in Weave the default is to not masquerade the traffic. So that's just something that's not as expected: if you talk to the service, everything works fine, but you could call the pod directly, which is not that hard, and then the network policy doesn't take any effect, not like it should be. So yeah, a little bit of outlook on where we want to go with Illuminatio. We are probably looking at ephemeral containers, which are now in alpha in 1.16. This could help us make it a little bit easier to just spawn a container inside of a pod, and then Illuminatio wouldn't need as many privileges as it currently needs, because to do an nsenter into a network namespace you obviously need some kind of privileges. If you're running in a more or less restricted environment with pod security policies, and you want to have least privileges, this is probably not the best idea, but currently it works, and it was the easiest way to implement it for our
current kind of setup. Yeah, so in the end our main goal for the future is to write some kind of conformance test suite, to test for example the CNI plugins in your Kubernetes cluster and just check how much of the network policy functionality they implement, and which kind: do they only implement the basic stuff, or also the advanced stuff, for example? Yeah, and in the end we also want to add more network policy types, network in general, so that we for example also support the Calico policies, the Cilium policies, or maybe the Istio policies. Istio as a service mesh has a pretty big hype currently, because you can set up mTLS with policies and so on, but I believe that not so many people actually test whether these policies are actually working as they should. And that's always an interesting part, because if you use some policies, or if you have a firewall, for example, and you believe it should be working, but it doesn't work, that's really bad. So yeah, feel free to try it out or give us some feedback. It's probably pretty opinionated by us, from our experiences, but feel free to ask us anything now or after the talk. For example, feel free to come to us and talk about the experiences you made with network policies or testing in this Kubernetes world; we're pretty happy to get any feedback. Yeah, and that's our contact. Any questions? Thanks. I've noticed that in the demo you create new pods with the correct labels to be able to test. Is there a risk that normal traffic to a service is redirected to those new pods and breaks normal traffic? No, but maybe I didn't explain this well enough.
So the idea is: the pods that we create, I mean, they don't have any labels that should be matched by any service themselves. Those pods just run, basically waiting for the test cases to be put there as a ConfigMap, and then they attach, using nsenter, to the actual workloads. So the pods never have any labels that should be selected by a service. They also don't run in a namespace together with any of your production services; they run in a separate namespace. Through the Linux magic, basically, that is a network namespace, we can still enter the network namespace of the pod we want to test from, and then create traffic as if we were a container in that pod, without actually having any traffic routed to us, because we don't open any ports or anything. I think the main point here is that only the client side gets generated, and not the server side, mostly. Yeah, sorry, maybe I misunderstood the question. Thanks. So, is it safe to run this in a production cluster? From our experience, yes, it should be, and if you see anything that's not safe, please tell us. But like we said, we only generate the client side, and we only generate the client side if it's needed: if there are, for example, pods that already match the criteria, we don't need to generate a pod; we just use the pods that are actually there. Hey, cool talk, thanks a lot. From my experience of breaking network policies, the bugs that I found were when you update a condition: the policy is static, but you add or remove labels from either the client or the server pods. Obviously you can't do that on a running cluster, but have you tried to do this in offline or destructive tests of clusters? That would be awesome. Thanks for the feedback. And I noticed you use Python and nmap. Do you require those as dependencies on the node, on the host?
Currently what we do is spawn a so-called Illuminatio runner pod, and inside of that pod Python and nmap are already installed, so you don't require anything on your host. Do you require SSH on the host? Nope, no. That was one requirement from our side, because in our current production setup we don't have any SSH access to the clusters, and we also didn't want an Ansible style where you always have to SSH to the machines. Also, our idea was that with the Illuminatio runner on each node, the runners can run independently, so for large clusters it should be a little bit faster than doing parallel SSH, for example. That's why we built it like this. Thanks. Any more questions? Then thanks for listening.