Cool, all right, so welcome to our talk, everyone. We know it's in the last session of the day, so we're between you and the booth crawl, and we appreciate everyone coming by. So yes, welcome to "Whose Packet Is It Anyways?"

Awesome, thanks Kevin. My name is Doug Jordan, and I work at Airbnb on our Istio-based service mesh called AirMesh. I focus on extending the mesh to virtual-machine-based workloads, TCP, and other ingress use cases. Previously at Microsoft, I worked on our bare-metal control plane, where we adopted Linkerd. My handle on GitHub, Twitter, and other socials is usually dwj300, and as you can see from the photo on the right, I enjoy cycling and climbing mountains, and here's a photo of me doing both of those.

Yeah, so my name is Kevin Leimkuhler. I am a software engineer at Buoyant, the creators of Linkerd. I've been working there four years now and have worked on all the control plane components as well as the proxy. My social handle on GitHub and Twitter is kleimkuhler. You can also reach me on the Linkerd Slack for any questions you may have after the talk or while watching this recording.

Thanks, Kevin. Here's a quick agenda of what we'll be talking about today. First, Kevin will walk us through the "how": specifically, how a TCP packet gets routed through the mesh. Then I'll discuss TCP debugging and walk you through a real-world example, breaking out tcpdump and Wireshark.

Cool. So we'd like to start out with an overview of how a packet is routed in a service mesh. The things that we cover here will help lay the foundation for understanding some of the debugging steps Doug will take us through later.
I'd also like to call out that we're going to try to keep this as service-mesh-generic as possible. While both Doug and I have had a lot of experience with Istio and Linkerd, the concepts we talk about today are generally shared between the two, as well as some other service meshes.

So this talk is ultimately about debugging traffic in a cluster with a service mesh, so I just want to make sure that we're on the same page about what a service mesh is and the common architecture of one that we, or you, may be debugging. A mesh provides key properties; today those tend to be the four listed here: observability, for things like logs and metrics; routing, things like traffic splitting and endpoint discovery; security, think mTLS and authorization policies; and reliability, for example transparent retries of HTTP requests, circuit breaking, things like that.

In order to provide these features, the service mesh needs to intercept traffic into and out of the pod. So how do we achieve this? This leads us to the service mesh architecture. Here we see the sidecar proxy model; this is the model that most meshes follow these days.
It's worth noting because some meshes, including the previous Linkerd version, ran a proxy per node. In the sidecar proxy model, each pod gets its own proxy container. Inbound and outbound traffic can be redirected through this container, which is how some of those features I just discussed are implemented. Multiple pods may be injected across multiple nodes, and the grouping of all injected pods makes up what we call the data plane.

The second part of the sidecar proxy model is the control plane. In Linkerd, this consists of the components that inform each of the proxies in the data plane: Linkerd has a destination component used for routing decisions, an identity component used for assigning TLS identities to the proxies, and a policy component for determining who can talk to whom. Depending on the mesh, there may be different components.

So the sidecar proxy model injects a proxy container into each pod, and somehow inbound and outbound traffic from the other containers ends up going through that proxy. In order to understand how that's happening, let's take a look at what a container actually is.

The first thing to know is that Linux doesn't actually have containers; it has namespaces. Namespaces partition the kernel resources such that different sets of processes see different sets of resources. This means that each process or group of processes is associated with a namespace and can only see the resources within it. The isolated resources depend on the namespace, and we'll see what some of those are next. So here we see that, by using namespaces,
We can create a pod with multiple containers. The blue boxes represent containers within the red-bordered pod, and the red-bordered box represents a single network namespace. There can be multiple network namespaces on a host, which is how we end up with multiple pods on a node. Looking at the blue boxes, each process ID is associated with a network namespace, and can be a container. Each of those process IDs can also be associated with, for example, a separate mount namespace, so that they see their own file system.

So now that we have an idea of a higher-level representation of a container and a pod in Linux, we're going to build one up from scratch and focus on the parts important for routing a packet. First we start off with a host, which we are referring to as a node. Next we create a network namespace on the host; you'll recognize the red-bordered box from the previous slides. Again, this virtualizes the network resources so that processes using this namespace only see those network resources, not the ones on the host or in other network namespaces. Upon creation, we have the loopback interface for local traffic as well as a virtualized eth0 interface for traffic into and out of the pod. Additionally, we have a private set of IP addresses, a routing table, a socket listing, a connection tracking table, a firewall: all the network-related resources. Then each container we create is a process that shares this view of the network resources; you'll recognize the blue boxes from the previous slides. And finally, we return to the fact that we can create multiple network namespaces on a single host and end up with multiple pods on the node.
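The namespace membership described here is directly inspectable on any Linux box: /proc exposes each process's namespaces as symlinks, and two processes share a network namespace exactly when those links resolve to the same inode. A minimal sketch (Linux-only):

```python
import os

# A process's namespace memberships are visible as symlinks under
# /proc/<pid>/ns/. Two processes are in the same network namespace
# exactly when these links point at the same inode, which is how all
# containers in a pod end up sharing one view of eth0, iptables, etc.
def net_ns(pid="self"):
    return os.readlink(f"/proc/{pid}/ns/net")

print(net_ns())  # e.g. "net:[4026531840]"; the inode identifies the netns
```

The same directory also lists the mount, PID, and user namespaces mentioned above; swapping "net" for "mnt" or "pid" inspects those instead.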
I want to reinforce the idea that each network namespace has its own view of network resources. This is highlighted by the fact that here we have four containers running, two of which bind to port 8080 and two of which bind to port 3306. These bindings don't conflict because they take place in separate pods, which means that they're in separate network namespaces. Each pod is then also given an IP address that is reachable from other pods.

In the last few slides, within each network namespace I've been highlighting two specific resources with green boxes: iptables and sockets. So why are iptables important for this talk? Well, since iptables are unique to the network namespace, we know that all containers within that namespace observe the same iptables configuration. They are responsible for the redirection of certain packets to the sidecar proxy before arriving at an application container or leaving the pod. If a packet matches any of the configured rules, they reroute that packet so that it is redirected to the proxy container. But how are they actually doing this?
Here we get into the fact that every IP packet has a header, and we can see what that looks like here. A packet header is the part of a packet that precedes its body. Each row that we're looking at is a 32-bit word that encodes all sorts of addressing information for that packet. This can include things like the total length of the header and data, a packet identifier, and, most importantly for this talk, the source and the destination. I've highlighted the destination address with the red box, because this is the field in the packet header that iptables is rewriting. iptables has determined, based off some piece of information, say in this case the destination, that this packet matches a rule, and that it is responsible for rewriting the destination to a different one: the proxy, in this case.

Returning to this picture, we can see now that instead of going directly to the original destination, inbound and outbound traffic first passes through the proxy container. For inbound and outbound traffic, we have to consider the fact that the proxy will have different behavior depending on the direction of traffic, so the proxy will actually bind a separate port for each direction. Here the proxy container is going to bind port 4143 for inbound traffic and port 4140 for outbound traffic. iptables will have separate rules that rewrite the destination address field to these ports depending on the direction of the packets. Also note that these ports are not arbitrarily chosen; these are the actual ports that we use in Linkerd, for example. So this is great: iptables have done their job and the proxy container is now receiving traffic. We have another issue though.
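The destination rewrite just described can be sketched against the IPv4 header layout on the slide. This is a toy stand-in for what netfilter does in the kernel, with made-up addresses, not real iptables code: it overwrites the destination field (bytes 16 through 19 of the fixed 20-byte header) and recomputes the header checksum.

```python
import ipaddress
import struct

# Toy illustration: rewrite the destination field of a 20-byte IPv4
# header, the same field a REDIRECT/DNAT rule rewrites, then fix up
# the ones'-complement header checksum.
def rewrite_dst(header: bytes, new_dst: str) -> bytes:
    fields = bytearray(header)
    fields[16:20] = ipaddress.IPv4Address(new_dst).packed  # dst = bytes 16..19
    fields[10:12] = b"\x00\x00"                            # zero the checksum field
    s = sum(struct.unpack("!10H", bytes(fields)))          # sum 16-bit words
    s = (s & 0xFFFF) + (s >> 16)                           # fold carries
    s = (s & 0xFFFF) + (s >> 16)
    fields[10:12] = struct.pack("!H", ~s & 0xFFFF)
    return bytes(fields)

# A minimal header: version/IHL, TOS, total length, id, flags/fragment,
# TTL, protocol (6 = TCP), checksum, src 10.0.0.1, dst 10.0.0.2.
hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 1, 0, 64, 6, 0,
                  ipaddress.IPv4Address("10.0.0.1").packed,
                  ipaddress.IPv4Address("10.0.0.2").packed)
new = rewrite_dst(hdr, "127.0.0.1")
print(ipaddress.IPv4Address(new[16:20]))  # 127.0.0.1
```

In the mesh case, the rewritten destination would be the loopback address plus the proxy's inbound or outbound port (rather than a bare IP as in this sketch, since iptables REDIRECT also retargets the transport-layer port).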
Remember that the packets' destination addresses were rewritten. So even though the proxy is now receiving this traffic, it still needs to ensure that it all ends up at the originally intended destinations. This is where the other network resource I've been highlighting in previous slides comes into play: socket tables. When a connection is opened, we can examine the listing in the socket table that corresponds to that connection. Of the things tracked in each listing, the one that we care about here is the original destination for that connection. The proxy calls into libc's getsockopt function and gets the original destination for that socket. Using that, we ensure that the traffic goes to the pre-iptables destination.

I've been talking about iptables having rules that match traffic the mesh cares about and ensuring that destinations are rewritten. The presence of these rules is also a responsibility of the mesh, and there are a few ways that these can be added to the pod. The most common way to handle this is using an init container. An init container runs before any of the pod's application containers start. The mesh is responsible for injecting this init container, similar to how it's responsible for injecting the proxy container. The init container will run to completion and add the necessary rules to iptables for that pod. If we're just adding rules, though, why do we need a separate container?
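The original-destination lookup described above, where the proxy asks the kernel for the pre-iptables address, can be sketched as follows. SO_ORIGINAL_DST is the Linux netfilter socket option (value 80, not exported by Python's socket module), and the sample sockaddr_in bytes at the bottom are fabricated for illustration:

```python
import ipaddress
import socket
import struct

SO_ORIGINAL_DST = 80  # Linux netfilter constant; not in the socket module

def original_dst(sock: socket.socket):
    """What a proxy does after accepting a REDIRECTed connection: ask
    the kernel's conntrack for the pre-iptables destination. The call
    returns a 16-byte struct sockaddr_in."""
    raw = sock.getsockopt(socket.SOL_IP, SO_ORIGINAL_DST, 16)
    return decode_sockaddr_in(raw)

def decode_sockaddr_in(raw: bytes):
    # struct sockaddr_in: family (host byte order), port (network byte
    # order), 4-byte IPv4 address, 8 bytes of zero padding.
    _family, port = struct.unpack("=HH", raw[:4])
    return str(ipaddress.IPv4Address(raw[4:8])), socket.ntohs(port)

# Decoding a hand-built sockaddr_in (example data: AF_INET, port 9092):
raw = struct.pack("=HH4s8s", socket.AF_INET, socket.htons(9092),
                  ipaddress.IPv4Address("10.0.0.2").packed, b"\x00" * 8)
print(decode_sockaddr_in(raw))  # ('10.0.0.2', 9092)
```

A real proxy would call original_dst on each accepted inbound or outbound connection and open its upstream connection to the address returned.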
Rewriting iptables requires elevated permissions, a problem that Doug will cover in his side of the talk, and we don't want to give the long-running proxy elevated permissions for something that really is only going to happen when the container starts. So an init container helps solve this by separating the need for permissions between containers. The init container is run, the application containers have started, and now they observe the iptables configuration required for the service mesh, so traffic that we care about is redirected to the proxy container.

Another way to solve this issue is to use a CNI plugin. Without getting into the details, it ensures that each pod has properly configured iptables. One of my co-workers, Alex, gave a talk yesterday at ServiceMeshCon about how Linkerd's CNI plugin is implemented; if you're interested in that, I would definitely recommend checking out the recording.

Awesome, thanks Kevin. Now that we understand a bit more about how packets actually flow through a service mesh, I'm going to walk through how to debug a real TCP stream.
I encountered the need to capture TCP packets when debugging an issue with our Istio-based service mesh at Airbnb, called AirMesh, specifically within our custom metadata exchange. For this talk, we're going to rely on a few key assumptions. Note that they aren't requirements per se, but they will make talking about and visualizing the underlying networking model much simpler. First is that you're using a CNI (container networking interface) such that each pod has a unique IP address directly from the VPC that's routable from other pods on the network, for instance the AWS VPC CNI. Second, we will be using the CRI-O container runtime, but all the commands shown today could easily be modified for containerd. Last, and probably most importantly, having SSH, specifically root access, to the nodes. We'll discuss how to avoid this requirement at the end of the talk.

The use case that brought about this debugging involved talking to Apache Kafka, so I'd like to start with some quick context about what Kafka is and a little bit about how it works, for those who may be unfamiliar. Kafka is a distributed messaging queue. At a very basic level, a client can either produce a message to a topic or consume one. Internally, Kafka consists of brokers, which are just instances of the service; topics, which are like categories or feeds; and partitions, which are, well, you know, partitions of the data within a topic. Kafka uses ZooKeeper to run leader election at the partition level, so every partition will only have one leader, shown here on the slide in green. When a producer wants to send a message to a topic, it'll first compute the partition ID, often using a consistent hash function. Then it will use its internal metadata of the cluster state to write that message to a broker that contains the leader of that partition. When a consumer consumes that message, it's actually polling the Kafka broker for recent messages on that particular partition. Now that we know a bit more about
Kafka, we need to explain where our service mesh fits into the picture, to help us trace some packets. In order to send produce or consume requests, Kafka's client first needs to discover the initial state of the world, i.e. all the brokers, topics, and partitions. To do this it uses a special request called a metadata request. This request happens on client initialization and can be routed to any broker in the cluster, as they all share the same view of the world thanks to ZooKeeper. Once the client has this information, it will then route requests to specific brokers based on topics. In our case, we'll use our service mesh for that initial metadata request, because we don't actually care which broker responds to it; we just want to route it to a healthy node, something that the service mesh is really good at. After that, all subsequent produce or consume requests will go directly to the specific brokers, thus skipping the service mesh entirely. This is because Kafka's client wants to be in control of exactly which broker it's talking to, as it knows the internal mapping of partitions to topics.

The issue we encountered was during this initial metadata request. So in order to reproduce the issue, let's use kafkacat. kafkacat is a CLI used to talk to Kafka and is incredibly useful when you don't want to spin up a heavyweight JVM just to check the state of the cluster. Here we use kubectl exec to run a command in a pod, in this case a pod called pod-test, against our app container. We'll specify the following arguments to kafkacat: -L for metadata listing, and then -b to specify the broker we want to talk to in this case.
It'll be the address of our Istio service, or our service mesh service, called kafka.service, on port 9092, which is Kafka's default TCP listen port. And we get this error, right: fatal error at some line of the metadata list, "broker transport failure". What does that mean? Well, after a bunch of Googling and lots of code spelunking through Kafka's code base, all I could find out is that the request is malformed. But what does that tell me? Isn't our service mesh supposed to give us all this rich observability? Well, yes and no. For HTTP, the telemetry emitted by a service mesh is incredibly useful: we have response codes, logs, and even response flags in the case of Envoy. As shown on this slide, there are 25 unique flags for HTTP requests that are emitted in the logs when things go wrong. But for TCP connections, we only have seven flags, and they're extremely generic. Without response codes or detailed response flags, it can be really difficult to find the exact cause of some of these issues.

So what do we do? We break out our favorite tool, tcpdump, and then look at the packets in Wireshark. But how do we do this in the context of Kubernetes, specifically Kubernetes with a service mesh? We're going to take a look at two different packet captures. The first one is what the client, kafkacat, is actually seeing, and the second one is what the proxy is seeing. As Kevin explained earlier, all the containers in a pod share the same single network namespace, and they use iptables to rewrite packets destined for meshed services to the proxy's inbound port on the loopback interface. So to capture what kafkacat is actually seeing, we can just tcpdump on that loopback interface. Once we have that capture, we can take a second one, this time to see what the sidecar proxy sees, by running tcpdump against the host on the virtual interface for the pod. We'll talk more about that later. Let's start out simple.
We'll just run tcpdump in the main container. We'll run kubectl exec against our pod, the same pod-test, and we'll run it against our application container. And we get an "executable file not found in $PATH" error. Well, that's obvious, right? We don't include tcpdump in our container images. We do this for a variety of reasons, and hopefully these are kind of obvious: things like we run distroless images, we want a minimal security footprint, and in general we don't include debugging utilities in our images so that we can reduce our security surface.

So let's just install it, right? If it were a VM, just apt update, apt install, and call it a day. But it's not that simple: doing this in Kubernetes at runtime in the container would require both root and a package manager, so it's probably just easier to bake it into the image at build time. In order to bring in extra dependencies alongside our main container image, we've gone ahead and modified our deployment spec to run an additional sidecar container. We'll use the netshoot image, which already has tcpdump installed. As a reminder, all containers share that single network namespace, so we can just run tcpdump in any of those sidecar containers to capture traffic on the loopback interface and observe everything that's going on in the pod. But again, it still doesn't work: "You don't have permission to capture on that device." What gives? Hmm, turns out it's because we're running our containers as non-root users. Using non-privileged containers is becoming more and more common, and to be clear, this is a good thing. But how do we capture on an interface without it?
We have two options. We could add the NET_ADMIN and NET_RAW security capabilities to our pod's security context, but that would be less secure and doesn't really help with a just-in-time access model; if we only need temporary access, this would be too heavy-handed. Alternatively, if we have, or can get, sudo SSH access to the nodes, we can take a packet capture directly from there. So let's go do that.

Okay, but how do we actually do that? Many of you folks know how to do this, but as a reminder, we first are going to get the node name from the pod object. We'll run kubectl get pod and specify this JSONPath to get the node name; in this case, the pod that we're trying to look at is running on node-1. Now that we know the node's name, let's get its IP address. We use kubectl get node with node-1 and then this JSONPath to get its internal IP address. Note that in our situation we want the internal IP address because it's accessible on our corporate VPN, as opposed to, say, a public IP address, which may not be. And the node is running internally on our 192.168.1.9 network, so let's use that.

While we're here, we're also going to grab the container ID, and I'll explain why we need that in a second. This can be any container in the pod, but since we know we have our sidecar proxy injected, we'll just use that one. So we'll run kubectl get pod, pass this to jq, look at the container statuses, and finally select the one whose name is "proxy" and then get its container ID. It's kind of a long command, and we see our container ID is cri-o://94ad..., which makes sense, right? We're using CRI-O. Then lastly, let's SSH to that internal IP.
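The jq selection above, pulling the proxy's container ID out of the pod status, amounts to this small sketch. The pod JSON below is a fabricated fragment shaped like `kubectl get pod -o json` output, with the truncated container ID kept as-is:

```python
# Made-up pod status fragment, shaped like `kubectl get pod -o json`.
pod = {"status": {"containerStatuses": [
    {"name": "app",   "containerID": "cri-o://11aa..."},
    {"name": "proxy", "containerID": "cri-o://94ad..."},
]}}

# Equivalent of:
#   jq '.status.containerStatuses[] | select(.name == "proxy") | .containerID'
def container_id(pod: dict, name: str) -> str:
    return next(c["containerID"]
                for c in pod["status"]["containerStatuses"]
                if c["name"] == name)

print(container_id(pod, "proxy"))  # cri-o://94ad...
```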
Awesome. As Kevin mentioned earlier, each pod has a unique netns, so in order to run tcpdump within it, we need to first find it. Now that we're on the node and have the container ID, we can get the network namespace by using crictl, which is the CLI used to talk to the CRI-O container runtime. We'll run crictl inspect on the container ID and specifically select the path of the namespace of type "network" using jq. Note this namespaces slice has all the namespaces, not just network, right: the mount, the pid, the user, etc. But we'll get the network namespace; in this case, it looks like /var/run/netns/149 and so on, good.

Now that we have this network namespace, let's run a quick sanity check. If we run ifconfig from inside this container's network namespace, it should see the same IP as the pod's IP, right? Because ultimately it is the pod. Our pod is running on 100.116.95.102, so we'll see if we can see that same IP address. To run a process inside a particular namespace of any type, not just network, we can use the nsenter command. Here we'll run nsenter with the --net argument and specify our network namespace, that /var/run/netns/149 and so on. And look, if we look at the output, we see the inet address of 100.116.95.102, which is our pod's IP address. So we know we're on the right path.

Now that we've found and verified the network namespace, we can finally run tcpdump, the moment we've all been waiting for. Again, we'll use nsenter to enter the network namespace and run tcpdump with the following options: -i lo for loopback, because again we want to capture the traffic between the application container, kafkacat, and the proxy; -s0 to capture all packet sizes; -nn to not resolve port or host names; and lastly, -w capture.pcap to write it to a file. Pretty standard. With this tcpdump capture we can actually open it in Wireshark. So here it is, and we're going to look for something very specific.
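Stepping back briefly: the jq filter used a moment ago to pull the network namespace path out of the crictl inspect output is equivalent to this selection. The JSON below is a fabricated fragment shaped like CRI-O's inspect output, with the truncated path kept as-is:

```python
# Made-up fragment shaped like `crictl inspect <container-id>` output.
inspect = {"info": {"runtimeSpec": {"linux": {"namespaces": [
    {"type": "pid"},
    {"type": "mount"},
    {"type": "network", "path": "/var/run/netns/149..."},
]}}}}

# Equivalent of:
#   jq '.info.runtimeSpec.linux.namespaces[] | select(.type == "network") | .path'
def netns_path(inspect: dict) -> str:
    return next(ns["path"]
                for ns in inspect["info"]["runtimeSpec"]["linux"]["namespaces"]
                if ns["type"] == "network")

print(netns_path(inspect))  # /var/run/netns/149...
```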
We want to make sure we observe this metadata exchange. I don't have time to go into the full wire protocol of Kafka, nor do I fully understand it, but if we look at the first packet after the standard TCP handshake, number 137, we can see the client ID in the body's plain text: rdkafka. This is the default client ID set by librdkafka, which is the underlying library that kafkacat is using. Note that the reason we can actually see this body in plain text is because the capture is taken between the proxy and the application.

Next we'll look at capturing data from the external side of the proxy. If we want to observe the external traffic, which is encrypted between pods, we need to know which interface to listen on if we want to avoid getting hit by a firehose of packets; i.e., we could capture everything on the host, but it would be pretty hard to find the signal in the noise. Since we're using the VPC CNI, there is a route in the host routing table for each pod IP to the virtual interface created for that pod. So we can use ip route show to look at the host routing table and just grep for the pod IP. As we can see, the route for our pod's IP address, the one ending in .102, is using the virtual interface, the eni... ending in 097.

Using that, we can finally run tcpdump on the host to capture traffic outside the proxy. This is, again, traffic that's transiting the pod's boundary. We use the same tcpdump command as before, except this time we'll listen on this interface instead of loopback, and we won't run it inside the network namespace. So we run just vanilla tcpdump, -i with the ENI, the virtual interface ending in 097, and the rest of the options are the same as before.
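The "ip route show, then grep for the pod IP" step can be expressed as a small parser. The route table text and interface name below are invented examples shaped like the AWS VPC CNI's per-pod host routes:

```python
import re

# Made-up host routing table, shaped like `ip route show` output with
# the VPC CNI's per-pod routes; the interface name is invented.
routes = """\
default via 192.168.1.1 dev eth0
100.116.95.102 dev eni097example scope link
"""

def device_for(pod_ip: str, routes: str) -> str:
    """Find which virtual interface a pod IP is routed through, i.e.
    the interface to hand to `tcpdump -i` for an outside-the-pod capture."""
    for line in routes.splitlines():
        m = re.match(rf"{re.escape(pod_ip)}\s+dev\s+(\S+)", line)
        if m:
            return m.group(1)
    raise LookupError(f"no route found for {pod_ip}")

print(device_for("100.116.95.102", routes))  # eni097example
```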
As you can see, we've captured 579 packets. I won't open this pcap up now, as the traffic is encrypted using the proxy's TLS certificate and won't be very useful to us. I'll leave it as an exercise for the attendee to figure out how to get the certificate needed to decrypt this traffic. Spoiler: you can use the debug port of your proxy.

After going through all these hoops and hurdles just to take a simple packet capture, there must be a better way, and the good news is there is. Ephemeral containers are a relatively new feature of Kubernetes that allow launching a new container that shares some of the Linux namespaces of the running pod; in our case, we want to share the network namespace, which is done by default. There is one major limitation, however: they currently do not support changing the security context of a running pod. So if you launch a pod with a restrictive security policy that prevents NET_RAW or NET_ADMIN, you still can't use ephemeral containers to run tcpdump. But for argument's sake, let's say we don't have that restrictive security policy; how would we use ephemeral containers, assuming the security context allowed it? Well, the kubectl debug command will attach a new container to the running pod using a provided image. So we can just run kubectl debug against our pod, again pod-test, and pass in a custom image, in this case netshoot again, as we know it has tcpdump installed. The same arguments apply as before, and we can capture packets.

Okay, so ephemeral containers are cool and all, but they don't help us get around this core issue of not being able to set a different security context within an ephemeral container. I have good news: I was kind of lying before, this issue has actually just recently been fixed upstream and will be coming to the kubectl debug command in a future version. Going forward,
I hope that we can integrate ephemeral containers into other open-source tools like ksniff, a kubectl plugin that runs tcpdump and uses Wireshark to start remote captures on pods.

All right, so for a quick recap. Thanks. Yeah, so: Linux doesn't have containers, it has namespaces, and these isolate certain resources from each other. The network namespace isolates the network resources, so when multiple containers run in a pod, they all observe the same network resources. iptables rewrites the packet headers: in order for traffic to be redirected to the proxy, iptables will rewrite the destination part of the packet header, and then the proxy looks at the socket table. Since the destinations on the packets were rewritten by iptables, the proxy needs to ensure that the original address is used, so it looks that up in the socket table, which has a corresponding entry for the connection. And on the debugging side: TCP observability is extremely limited outside of things like response flags. To capture traffic within a pod, you can use nsenter on the host and run tcpdump on loopback. To capture traffic leaving the pod, you can run tcpdump on the host against the virtual interface for the pod. And lastly, ephemeral containers are here to save us from all this friction.

If this work is interesting to you, or you're passionate about Kubernetes or service meshes, at Airbnb we're actively hiring; please go to airbnb.com/careers to learn more. Cool, and yeah, if you enjoyed the presentation, Buoyant runs a Service Mesh Academy you can sign up for; these are monthly hands-on workshops for real-world production users, and it is free. We also work on a service-mesh-as-a-service product that manages Linkerd for you on your own clusters; you can book a demo or come find us at the Buoyant booth. Yeah, we'd love to talk.

Well, that was a great session. If you