Okay, hello everyone. I'm Sylvain Baubeau, working for Red Hat, and I'm replacing my colleague who was supposed to give this presentation. Today we're going to talk about Skydive. Last year we did a presentation about Skydive; I don't know how many of you saw it. Did someone? No? Okay, one person.

So what is Skydive? Skydive is a real-time network topology and protocol analyzer. Basically, it collects the whole network topology of your infrastructure, and it allows you to capture and analyze traffic.

The reason we started this project is that networking is obviously very complex. Right now, on your cloud, you could have OpenStack and, on top of it, Kubernetes, each with a different SDN, so the complexity is huge. It also changes a lot: VMs and containers are created and deleted all the time, so it's very dynamic. You often make use of tunneling (VXLAN, GRE, Geneve), which can make troubleshooting complicated. And troubleshooting is the main use case for Skydive; there was a complete lack of open-source tooling for it. You were basically stuck with `ip netns` and the usual toolbox, but that's not enough.

So our goal was to design software that is agnostic to the SDN, and in fact to any platform. We are not tied to OpenStack, but we can work with it; same for Kubernetes. We want to be able to do this troubleshooting and analysis in real time, but also post-mortem. For example, a user created a VM, he had connectivity issues, he deleted everything, and we had to find a way to go back in time and see what happened.
And we wanted something very lightweight and easy to deploy, because if you have an issue in production, you don't want to deploy very complex software. So it's a single binary: you put it on the machines and you're good to go, basically. It can really be seen as a toolbox; you can use it any way you want.

One very common use case is simply visualization, just to be able to see what's in your infrastructure. You can barely see it here, but that's your network topology. Here you can see, for example, a top-of-rack switch, and on each port of this top-of-rack switch there is a machine, and inside the different nodes there are the physical interfaces, the network namespaces, the Open vSwitch bridges, and so on. Of course you can zoom in, zoom out, and restrict the view, because it can be huge on a real infrastructure. You can click on every node and get precise information: the MTU, the names of the containers, and so on.

Another thing you can do with Skydive is capture traffic, to be able to troubleshoot. On this screenshot, the yellow node on the left is the one we are capturing traffic on, and when you click on it, you get all the flows; that's the table on the right. You can see the different TCP flows with the source and destination IPs, and you can get even more information on a specific flow: we look at the link layer, the network layer, the application layer. We get the metrics, so the number of packets and bytes for this flow, the start and stop times, and we also measure the RTT, so different information on the flow.

We also compute something very useful, which is the tracking ID. For example, if you have two VMs talking over SSH, the traffic goes through different interfaces, and sometimes through tunnel links, so it can be encapsulated.
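To make that idea concrete, here is a minimal Go sketch (not Skydive's actual code) of an identifier that ignores any outer tunnel headers and hashes only the inner 5-tuple, so the same flow gets the same ID whether it is seen raw on a VM interface or encapsulated on a tunnel endpoint. The `fiveTuple` and `packet` types are invented for the example:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// fiveTuple models just the fields needed for the sketch.
type fiveTuple struct {
	SrcIP, DstIP     string
	SrcPort, DstPort uint16
	Proto            string
}

// packet is an optional stack of outer (tunnel) headers plus the inner flow.
type packet struct {
	Outer []fiveTuple // VXLAN/GRE/Geneve encapsulation layers, if any
	Inner fiveTuple   // the original flow
}

// trackingID hashes only the innermost layer, so encapsulation does
// not change the identifier of a flow.
func trackingID(p packet) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s|%s|%d|%d|%s",
		p.Inner.SrcIP, p.Inner.DstIP, p.Inner.SrcPort, p.Inner.DstPort, p.Inner.Proto)
	return h.Sum64()
}

func main() {
	ssh := fiveTuple{"10.0.0.5", "10.0.0.7", 51234, 22, "tcp"}

	onVM := packet{Inner: ssh} // seen on the VM's tap interface
	onTunnel := packet{        // same flow, inside a VXLAN tunnel
		Outer: []fiveTuple{{"192.168.1.1", "192.168.1.2", 4789, 4789, "udp"}},
		Inner: ssh,
	}

	fmt.Println(trackingID(onVM) == trackingID(onTunnel)) // prints "true"
}
```

With an ID like this, every interface that reports the flow can be matched up, which is what lets the UI highlight all the nodes where one SSH session was seen.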
So we compute what we call a tracking ID, which allows us to follow this SSH traffic across all the nodes of your infrastructure. Here we selected one flow, and all the yellow nodes are the interfaces where this traffic was seen.

As we collect all the interface metrics and the flow metrics, we are able to graph them. We developed a Grafana plugin for this; it's available directly from your Grafana installation, from the Grafana marketplace, something like that. This plugin talks directly to the Skydive API, and you can draw, say, the bandwidth for a specific VM, or for a user of your cloud, or whatever you want.

So here is a demo of Skydive in action. It's a pretty big infrastructure: the yellow part is the top-of-rack switch and all the nodes. You can expand the different elements, in this case a namespace, and create a capture. For a capture, you select a source node and a destination node, and it will capture the traffic on the path between those two interfaces. Here we created an ICMP BPF filter. So that's it. And then we can also inject traffic; it's the same mechanism, you select nodes and so on.

Okay, so now a very short slide on the architecture, at a very high level. At the center of Skydive there is a graph engine: basically, we create nodes and edges, and we store information on the nodes as metadata. As I said, since we want to do post-mortem analysis, every change to this graph is archived, so we are able to recreate the graph as it was at a specific point in time: two years ago, what did my network topology look like? This graph is populated by probes, which we will see in more detail later. And the way you interact with this graph and get information from it is through the API, which accepts a syntax called the Gremlin language.
The Gremlin language is a graph traversal language, and it looks like what you see at the top. Here is a very basic query done on the command line: we just ask Skydive for the nodes that have a specific name, in this case my Ethernet interface, and it gives you all the matching nodes as JSON.

Going back to the use case I showed you before, with the tracking ID and my SSH traffic highlighted: there is the Gremlin query corresponding to that. We identified a flow, we got the tracking ID of this flow, and then we asked Skydive for the nodes that have seen this flow. Same for Grafana: the Grafana plugin accepts a Gremlin query, so basically you can graph anything you want. Here we are graphing the ICMPv4 traffic, aggregating all the flows.

Now a more detailed architecture slide. Skydive has two components. The first, on the left, is what we call the agents. They have to be started on all the machines of your infrastructure, so your compute nodes or your Kubernetes nodes. They have probes and collect their local topology, and they send this topology to one or more analyzers, which aggregate all those local graphs into one big graph. The analyzer serves the API and the web UI, and it stores everything in a database; it's a pretty common design. In our case, we mainly support Elasticsearch.

So on the agents, where do we get the information from? First, we talk to netlink, which gives us information about the interfaces. We also talk to ethtool to know which features are supported by an interface or card. We collect all the network namespaces that exist on the node. We talk to Open vSwitch using the OVSDB protocol. We talk to Docker, to Kubernetes, to Neutron and so on.
And even to sockets, which is a probe I will describe later.

Now let's see what's new since last year, because what I just described is basically what we already had. We added a way to capture traffic using DPDK, for high-performance use cases. We are now able to capture traffic on a specific Open vSwitch port; we could already capture on OVS, but only on the whole bridge. We fetch the routing tables and the ARP tables of the nodes. We worked closely with the IBM team on adding support for the POWER architecture, so this is available for OpenStack and there is also a Docker image for POWER.

We also have improvements on the deployment side of Skydive. We have a nice Ansible library to deploy it on your infrastructure. There is also a blog post describing how to install Skydive using the Ansible network modules, so that Skydive can be populated with information about your switches, obtained through LLDP for instance. And we have a very sexy RBAC mechanism.

Another feature we have since last year is workflows. With Skydive, a typical thing you want to do is check the connectivity between two machines. Basically, you would select a node, capture the traffic at the right point of your infrastructure, generate some traffic, and then query the Gremlin API to see if the flow you injected was seen properly. Workflows are a way to automate those actions. The bad side is that you have to write them in JavaScript. The nice thing is that you can run them almost everywhere: in your browser, or as a separate Node.js program if you want a separate system interacting with Skydive, and there is also a JavaScript engine embedded into Skydive.
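The connectivity check just described boils down to three API calls. Here is a Go sketch that only builds the request sequence without sending it; the paths and payload fields are illustrative assumptions, not Skydive's exact REST API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// step is one HTTP call of the connectivity-check workflow.
type step struct {
	Method, Path string
	Body         map[string]string
}

// connectivityCheck returns the three calls described in the talk:
// start a capture, inject traffic, then ask Gremlin whether the
// injected flow was actually seen.
func connectivityCheck(src, dst string) []step {
	return []step{
		{"POST", "/api/capture", map[string]string{
			"GremlinQuery": fmt.Sprintf("G.V().Has('Name', '%s')", src),
		}},
		{"POST", "/api/injectpacket", map[string]string{
			"Src": src, "Dst": dst, "Type": "icmp4",
		}},
		{"POST", "/api/topology", map[string]string{
			"GremlinQuery": fmt.Sprintf(
				"G.V().Has('Name', '%s').Flows().Has('Network.B', '%s')", src, dst),
		}},
	}
}

func main() {
	for _, s := range connectivityCheck("vm1-eth0", "vm2-eth0") {
		b, _ := json.Marshal(s.Body)
		fmt.Println(s.Method, s.Path, string(b))
	}
}
```

A workflow is just this sequence plus a check on the last response, which is why it can run equally well in a browser, in Node.js, or inside Skydive's embedded JavaScript engine.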
So you can have Skydive execute your workflows for you. Once you create a workflow, it appears in the web UI and you get a nice way to trigger it.

We also have a new Kubernetes probe that was created a few months ago, so it's still at an early stage. It basically synchronizes the Kubernetes resources into the Skydive graph. We support many Kubernetes resources: namespaces, services, pods, and some others. Simply importing these resources is not enough, though, so we also create links between them. For example, you can see which pods are part of a service, or which pods a network policy applies to. That's the application layer, obviously, but you can also go down to the physical layer: from your service to your pod, to your Docker container, to your veth and your Linux namespace, all the way down to the machine.

It's pretty easy to deploy Skydive on Kubernetes: it uses a DaemonSet, so it runs on all your nodes, and there is also a Helm chart available. That's what it looks like here. It's pretty messy, but you can see minikube, then the services and the pods, and on the right side the links to the physical machine and its Docker containers.

What's new as well: a way to capture traffic using eBPF. Last year we had PCAP, AF_PACKET, and OVS capture, but for better performance we created this probe. It's split in two parts, kernel space and user space. The kernel space is obviously where the eBPF bytecode runs: it's attached to a socket, so we get the skb from the kernel, and we maintain a flow table on the kernel side. When we see a packet, we try to find which flow it maps to.
So we compute a session key, which is basically a hash of all the layers we saw in the packet, and we also maintain counters, so we are able to give you metrics on those flows. Periodically, this kernel-side flow table is synchronized with the user-space side, and at that point we compute a few more things, like a UUID and a tracking ID, and we do the mapping to the topology.

Using it is very easy: it's just another capture type, you simply select it. This is the kernel part of the probe; it's really, really simple, about 300 lines of C, so it's really tiny. At the end you can see that we compute the hash using the FNV hash function. Really simple stuff.

Regarding performance, it's a bit like comparing apples and oranges, but I'm going to do it anyway. Why apples and oranges? Because the eBPF flow probe is not feature-complete: there is no support for tunneling, no TCP reassembly, no IPv4 defragmentation, so it's still a work in progress. And it's not very easy to measure the performance, as the kernel-side work is not accounted to the Skydive process.

This table summarizes the pros and cons of the different capture types. With AF_PACKET, you basically have support on every kernel, and there is no limitation: all the Skydive features are supported. But the overhead is huge. Then, not really a capture type by itself, you can restrict the amount of captured traffic just by specifying BPF filters; this way you don't get everything, so no packet metrics and such, but you can still do really useful things. And then there is eBPF: you need a recent kernel and there is no classification, but the overhead is really, really small.

So here is a small benchmark. We were using iperf, and Skydive was pinned on a single core.
When we generated traffic with an AF_PACKET capture, at 4 gigabits we started to see packet drops. If we specify a BPF filter (I did not put the filter on the slide, but it matches only specific TCP flags), we were able to capture and analyze 15 gigabits. And with eBPF it was around 20 gigabits: since what we do is really simple, there is almost no overhead.

One bottleneck was the way Skydive returns flows when you query them: it was returning them as JSON objects, and the serialization was just killing it. So we switched from JSON to protobuf over WebSocket, and that reduced the query time a lot.

Another use of eBPF in Skydive is what we call the socket info probe. Basically, we want to see which process is talking to which process, and which container is talking to which container; it's meant to mimic the equivalent of the `ss` command. To do this, we first used /proc parsing, which was unreliable. Then we found a very nice library, tcptracer-bpf, which uses eBPF to do this, and we integrated it as a probe in Skydive. So as metadata on the host, we have the list of all the active sockets: who is listening, who is connected to whom. And you can write Gremlin queries: we can ask which host is running the HTTP process, or who is talking to the 10.0.0.10 address on the HTTPS port. And when you select flows, you can also go back to the sockets that generated those flows.

With this, we can show you a nice view that we call the flow matrix. The nodes here are from an OpenStack deployment, and you can see which process is talking to which, for example Neutron talking to OVSDB, and so on.

And now for the roadmap: we plan to add hybrid capture, and what we mean by hybrid capture is that we want to capture only the first packets of a flow.
The idea is to do a deep analysis of those first packets, looking even at the application layer, at what is inside the packet, and then for the subsequent packets to do just a lightweight capture with eBPF, so we don't have to fully analyze every packet.

We also want to increase our use of eBPF. Right now, the way we retrieve the network namespaces is just by parsing /proc, so we are not notified of new network namespaces. And we are investigating using eBPF to get retransmission counters from the Linux stack.

Another thing we did this last year: we thought it could be useful for some projects to have a standalone graph engine, with the full history. So we extracted it from Skydive into a dedicated component called Graffiti, which makes for a nice pun, and you can integrate it into your own tool. There is also an LLDP probe which is being developed right now.

And that's pretty much it, if you have any questions.

What is LLDP? LLDP is a link-layer protocol used to discover the topology. Basically, your switch sends packets every few seconds, so just from your machine you can know that you are connected to this specific switch, on which port, and the speed of the link. That information is used to discover the topology.

Hi. Is there anything specific required to run on Kubernetes? Is it tested across different providers, or is it expected to just work? It's expected to just work, yeah. What do you mean by providers? Yeah, so it works on OpenShift, because we tested it. Basically, it's supposed to work.

Okay. Thank you very much.