Good afternoon, everyone. I'm Himmel, co-founder and CTO of Canopus, a startup based in Australia, so it's been a long way coming here. I'm going to be talking about building an application visibility and analytics platform for telcos.

Hello everybody, my name is Thomas Graf, CTO and co-founder of Isovalent. I live on the other side of the world, in Switzerland, where there is great skiing. Isovalent is the creator of Cilium, which recently graduated as a CNCF project. We provide Kubernetes networking, and I will be talking about application and network observability, specifically for Kubernetes.

Great. Before we start, can I ask how many people here are from the telco network operator side? Awesome, a lot. Cool, let's get started.

Let me start with a few customer stories. There is a tier one telco I've been working with in Asia that has a problem with WhatsApp experience. WhatsApp is not working well, the customer experience is bad, and customers are complaining, but the operator doesn't actually know why. It's very easy to change operators in Asia, a matter of a few minutes, so customers are churning, the operator can't figure out why, and the result is lost revenue.

There is another tier one operator whose complaint is that they have plenty of traditional tools to measure network KPIs, latencies, throughput, radio metrics and so on, but nothing to measure actual application experience. When they make a network change, they don't know how it affects the YouTube experience, how YouTube was working before and after. We're helping with that.

Then there are two other operators, a tier one in Australia and one in North America, who want to understand the streaming behavior on their networks: which applications are trending, Netflix, Prime, or local content; how many gamers are on the network and how much they are gaming, so the operator can craft streaming bundles into a plan, or launch low-latency gaming for its gamers. By the way, excessive gaming was recently recognized by the WHO as gaming disorder.

And there is a mobile operator we've been working with who is trying to monetize their data through an ad tech platform and the ad exchanges they run, and also to build partnerships with enterprises and financial institutions for things like telco credit scoring or even fraud detection.

All of this is happening to solve one problem in the telco industry: diminishing returns on massive financial investments in infrastructure, and 5G is a classic example.

So let's talk about data. What do we need? A mobile operator needs to understand what applications are flowing on the network, what the end-user experience is, and how it changes as users move from one cell to another or from one region to another. And when I talk about application experience, I mean the actual end-user experience: the resolution of a video, glitches in a shooting game, conferencing dropouts or stutters in the audio while you're on a call. And not just the application metrics: we also need the underlying network metrics and radio metrics to be able to perform root cause analysis, to see what anomalies are happening and why.

A fixed-line operator needs to know what devices are in the house. Are the IoT devices behaving properly?
Are they exhibiting anomalous behavior or launching an attack? Are my gamers playing on consoles, do they have VR headsets, or are they playing on PCs? Operators also want to build subscriber behavior personas they can combine with real-time activity and use with ad tech. When I'm at the airport looking at a hotel's website, that's the right time to send me an SMS selling a roaming plan, or an ad for discounted hotels.

So all of this comes under applications, devices, subscriber behavior, and network metrics, and collecting all that data at terabit scale, in real time, is bloody hard. Why is it so hard? First, terabit speeds mean too many packets. We've been told repeatedly by operators that traditional DPI appliances consume racks and racks of space, on the order of hundreds of servers, which makes it simply unaffordable to track data-plane traffic at that scale.

The second problem is traffic encryption. Not only is the payload fully encrypted today, the headers are increasingly encrypted as well, with DNS over HTTPS and TLS 1.3. We solve this by looking at the behavior profile of the application, what its time-series profile looks like, almost like developing an ECG or heartbeat of a network flow. On that profile we are able to build machine learning models that identify an application and also its experience.

And the third problem is that all of this needs to fit into very different architectures. We're talking about distributed fixed-line networks, cable networks, virtualized CMTS networks, 5G NSA and SA. And there are restrictions on what can and cannot be deployed to monitor these networks.

So let's talk about how we took a five-year journey solving these problems, focusing on the scale and architecture problems. How many of you have heard of programmable switches, or P4, or Tofino? That's awesome. The way we started to solve this problem was by recognizing that compute is very expensive, so we took the white-box route: we used Tofino on a programmable switch. That gives you a match-action paradigm. You write a program in a language called P4, and you can take any field in a packet, match on it, and define the action to be taken for that packet or flow. Once compiled, it all runs at line rate, at 6.4 terabits per second, without any issues. So we can express actions like: if you get a packet or flow you don't know what to do with, send it to the CPU, where all our application and business logic, traffic classification, experience measurement, machine learning models, can look at it. Or: stream time-series telemetry, the behavior profile of that flow, instead of sending all the packets up, so our models can work on just the flow's profile. Or: give me application-specific telemetry. This flow is a Netflix flow; I don't need all the packets of a three-hour movie, I just need enough to keep classifying the resolution of the video.
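To make the match-action idea concrete, here is a minimal sketch in plain C rather than actual P4; the field names, actions, and toy flow table are illustrative assumptions, not Canopus's program. A real P4 program expresses the same logic as match-action tables that the compiler maps onto the Tofino pipeline so it runs at line rate.

```c
/* Illustrative sketch (plain C, not actual P4) of the match-action
 * paradigm described above. All names and the toy hash table are
 * assumptions for illustration only. */
#include <stdint.h>
#include <string.h>

struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

enum flow_action {
    PUNT_TO_CPU,       /* unknown flow: send packets up for classification */
    STREAM_TELEMETRY,  /* known flow: stream only its time-series profile  */
    APP_TELEMETRY,     /* classified flow: keep just enough telemetry, e.g.
                          to keep tracking video resolution               */
};

struct flow_entry {
    struct flow_key  key;
    enum flow_action action;
    uint64_t         packets, bytes;
    int              in_use;
};

#define TABLE_SIZE 65536
static struct flow_entry table[TABLE_SIZE];

/* Toy hash, for illustration only. */
static uint32_t flow_hash(const struct flow_key *k)
{
    return (k->src_ip ^ k->dst_ip ^ ((uint32_t)k->src_port << 16) ^
            k->dst_port ^ k->proto) % TABLE_SIZE;
}

/* Per-packet step: match on the 5-tuple, account the packet, and return
 * the action the pipeline should take. Unknown flows default to
 * PUNT_TO_CPU so the classifier on the CPU can decide. */
static enum flow_action match_action(const struct flow_key *k, uint32_t len)
{
    struct flow_entry *e = &table[flow_hash(k)];

    if (!e->in_use || memcmp(&e->key, k, sizeof(*k)) != 0) {
        e->key = *k;           /* simple overwrite-on-collision policy */
        e->action = PUNT_TO_CPU;
        e->packets = 0;
        e->bytes = 0;
        e->in_use = 1;
    }
    e->packets++;
    e->bytes += len;
    return e->action;
}

int main(void)
{
    struct flow_key k = { .src_ip = 0x0a000001, .dst_ip = 0x08080808,
                          .src_port = 51234, .dst_port = 443, .proto = 6 };
    return match_action(&k, 1500) == PUNT_TO_CPU ? 0 : 1;
}
```

The key design point is that the control plane (our application) only updates the action for each flow, while the per-packet work stays in the switch pipeline.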
By doing just this, with our application talking to the switch 10,000 times a second, and it is very hard to talk to a switch from outside at that frequency, we were able to suppress 80% of the traffic and only had to process 20% of the packets on the CPU, which gave us five times the scale of a solution deployed entirely on CPUs. That's how we built a terabit-scale platform for analyzing traffic flows: the machine learning models on the CPU, the telemetry collection on the P4 programmable switch.

But there are a few issues here. The first is that P4 is very constrained in what you can write: it doesn't even have a for loop, and the arithmetic expressions you can use are very limited. There is limited flow table capacity, so you can't suppress all the traffic. The other problem is specialized hardware. We depend on the supply chain of a chip coming from Intel, and that specialized hardware cannot be deployed in cloud-native architectures, in the 5G SA world, where you can't just deploy a switch and tap the fiber. To solve these challenges, we started looking at how to move toward a cloud-native approach, and that's what came next. Nevertheless, using this approach we were able to build, in just three rack units and two servers, a deep analytics platform that works at 400 gigabits per second line rate and is deployed in many production networks. As the figure on the left shows, there is a massive gap between how much packet processing you can achieve in programmable switches versus CPUs, so there is always a need to find the right balance.

So next we looked at how we could use an emerging technology, eBPF, to build a cloud-native solution using similar principles to the ones we employed here. I'll let Thomas introduce why eBPF is the next star here.

Excellent, thank you. I was very fortunate to have been involved in the creation of eBPF back in 2014. Before eBPF, the answer to software packet processing, for observability for example, was typically something like DPDK: simply punt the problem into user space. And that was fine until containers came around, until cloud native started to happen, because user-space processing is really incompatible with containers, which consume the kernel OS and system calls directly. eBPF was the answer to this: we gained an almost generic virtual machine for packet processing, with bounded for loops, with practically unlimited expression capabilities, with the ability to run in the kernel, in the XDP flavor even accessing the DMA buffers of the NIC, while remaining completely generic. That was very powerful for clouds, where we could not make assumptions about the underlying infrastructure or hardware. That's why eBPF has become the defining technology for networking and observability in the cloud-native space.

Great, thanks, Thomas. Before coming to eBPF, while we were building the programmable-switch path, we also tried a lot of different methods for the packet processing running on the CPU alongside the P4 switch. We worked with DPDK for a very long time. With these applications running entirely in user space, we could take the fast path from the NIC and process all the packets in user space using DPDK and SR-IOV. The problem was that we were massively dependent on a third-party library, and we were also constrained to choosing NICs that worked with DPDK.
And in some of the deployments, where we were deployed as virtualized functions, there was a host operating system, a hypervisor on top of that, and then another operating system, a whole stack of operating systems, which made the performance we needed very hard to get. You can work around it, but it's a lot of work, and you have to write very complicated code to interface with an existing library. We also worked with another library, a pretty cool one from Intel as well, called CNDP, Cloud Native Data Plane. It takes a different path: instead of processing all the packets in user space, you create a fast path in the kernel itself, and CNDP uses XDP to get fast-path access to packets in kernel space. But we had similar problems: we had to rely on an external library and interface with it to process the packets.

Now, eBPF solves all of these problems for us. What we did was apply the same principles. What you see on the right is XDP packet processing. As soon as packets arrive on the interface, XDP makes them available to our code before they reach the rest of the kernel stack. We run an eBPF program on them, written in a restricted, C-like language, which gives us much more flexibility in the logic we can express. So we were able to extract very fine-grained, 100-millisecond telemetry, complicated telemetry that we previously had to compute entirely in our app containers when we were working with the switch. And we share all that information with the app containers through shared state called flow maps, which are accessible from kernel space as well as from user space.

Think of it as similar to what we did before. Previously, instructions came from the user-space side, telling the switch what to do in packet processing; now that happens in eBPF, and the user-space application just reads what it needs to know. It reads the flow attributes needed by the machine learning models, not entire packets. On the occasions when we still need to send packets up, we can do that through a shared ring buffer to the app containers, but there are limitations there: that part doesn't really scale, and we're still figuring out the best way to scale it. But we shouldn't rely on it anyway; not many packets should be going to the app containers.

This resulted in four key benefits for us. It gave us bigger, more flexible flow state in the flow maps, without the capacity limits we had before: we can put millions of entries there and share all that state. We didn't have to compromise much on packet processing speed either. And it fits beautifully into cloud-native architecture: if a telco customer has a Kubernetes-based architecture, they can run Canopus's full system workloads on their existing infrastructure. We call it "run anywhere", without any hardware dependency. The pattern looks roughly like the sketch below.
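Here is a minimal sketch, in libbpf-style C, of that pattern: an XDP program that keeps per-flow state in a shared "flow map" and punts only the first sighting of an unknown flow to user space through a ring buffer. This is not Canopus's actual code; the map names, the telemetry fields, and the simplification of assuming IPv4 without options are all illustrative assumptions.

```c
/* Sketch of XDP flow telemetry with a shared flow map and a ring
 * buffer for rare punts to user space. Illustrative only. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct flow_key {
    __u32 saddr, daddr;
    __u16 sport, dport;
    __u8  proto;
};

struct flow_stats {
    __u64 packets;
    __u64 bytes;
    __u64 last_seen_ns;   /* basis for inter-arrival / behavior profiling */
};

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 1 << 20);        /* millions of flows fit here */
    __type(key, struct flow_key);
    __type(value, struct flow_stats);
} flow_map SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 24);        /* occasional punts only */
} punt_rb SEC(".maps");

SEC("xdp")
int xdp_flow_telemetry(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    struct flow_key key = { .saddr = ip->saddr, .daddr = ip->daddr,
                            .proto = ip->protocol };
    if (ip->protocol == IPPROTO_TCP) {
        struct tcphdr *tcp = (void *)(ip + 1);   /* assumes no IP options */
        if ((void *)(tcp + 1) > data_end)
            return XDP_PASS;
        key.sport = tcp->source;
        key.dport = tcp->dest;
    } else if (ip->protocol == IPPROTO_UDP) {
        struct udphdr *udp = (void *)(ip + 1);   /* assumes no IP options */
        if ((void *)(udp + 1) > data_end)
            return XDP_PASS;
        key.sport = udp->source;
        key.dport = udp->dest;
    }

    struct flow_stats *st = bpf_map_lookup_elem(&flow_map, &key);
    if (!st) {
        /* First packet of an unknown flow: record it and notify user
         * space through the ring buffer so the classifier can decide. */
        struct flow_stats init = { .packets = 1,
                                   .last_seen_ns = bpf_ktime_get_ns() };
        bpf_map_update_elem(&flow_map, &key, &init, BPF_ANY);

        struct flow_key *evt = bpf_ringbuf_reserve(&punt_rb, sizeof(*evt), 0);
        if (evt) {
            *evt = key;
            bpf_ringbuf_submit(evt, 0);
        }
        return XDP_PASS;
    }

    /* Known flow: update the shared flow map. User space only reads
     * these attributes, never the packets themselves. */
    __sync_fetch_and_add(&st->packets, 1);
    __sync_fetch_and_add(&st->bytes, (__u64)(data_end - data));
    st->last_seen_ns = bpf_ktime_get_ns();
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```

User space would open the same maps via libbpf and read flow attributes on its own schedule, which is exactly the "reading what it needs to know" pattern described above.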
There were quite a few challenges in moving to this model as well. We had to do a lot of work in the application on balancing and reducing the number of packets sent up to the app, which means more complicated logic; a lot of business logic had to be implemented in kernel space in eBPF too. But it's great, because when an eBPF program runs, it's guaranteed to run to completion. And ultimately we were able to hit 80 to 90 gigabits per second per 24 cores, and close to 200 gigabits on a single server, which still gives us massive scale for deploying traffic analytics. All right, I'll let Thomas talk about Cilium.

Yes, great, thank you very much. We've already seen an impressive use of eBPF for observability, where the observability is gained from a middlebox perspective, a box you can deploy somewhere in the network. What I'm covering next is observability gained specifically through Cilium and Hubble. Hubble is Cilium's observability layer, giving you observability in any Kubernetes cluster. Cilium is eBPF-based networking, service mesh, load balancing, firewalling and policy, and it is adopted by the majority of Kubernetes platforms today, be it GKE, Anthos, EKS Anywhere, or AKS; they all use Cilium underneath in some way to provide the Kubernetes networking. In today's presentation I want to focus specifically on the observability portion, even though Cilium of course covers a wide variety of use cases, including SRv6, service chaining, micro-segmentation, overlays, BGP, and so on.

We see network observability as having evolved from the old middlebox stage that Himmel talked about at the very beginning through two new stages. The first is using eBPF for packet processing and protocol parsing directly in the stack, traditional observability done in a new place; that's very much what Hubble does. We use eBPF to do protocol parsing: HTTP, TCP, UDP, DNS, gRPC, whatever it is. And we have now introduced a new way of gaining network observability using Tetragon. For those of you who have heard of Tetragon, you may be confused, because Tetragon is runtime security; so how does runtime security play in here? Tetragon allows us to look at network data and protocol data at the socket level, independent of any network packets. And this gives us the flexibility and ability to gain network telemetry without even parsing network packets, since the socket layer of the Linux TCP/IP stack already does that work. So, as we'll see later on, we can extract latency numbers, we can extract and parse HTTP data, without adding any delay to the network processing itself. That's what I call out here as eBPF- plus application-based network observability.

That really gives us a compelling observability platform: Hubble doing inline packet processing and protocol parsing, and Tetragon doing passive telemetry by instrumenting the TCP/IP stack, the networking stack of Linux itself, which all container platforms use today. Underneath, both data streams get fed into your regular Prometheus and Grafana dashboards, a visualization stack you can run yourself. The underlying technology, eBPF, is used in both projects.
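To make the socket-level idea concrete before the examples, here is a hedged sketch in eBPF C of passive RTT telemetry: a sockops program that reads the kernel's own smoothed RTT estimate on every RTT sample, without parsing a single packet. This is not Tetragon's actual implementation; the map layout and names are assumptions, and a real deployment would attach the program to a cgroup (for example with bpftool cgroup attach).

```c
/* Sketch of passive, socket-level RTT telemetry in eBPF (sockops).
 * Illustrative only; not Tetragon's code. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u64);     /* socket cookie */
    __type(value, __u32);   /* smoothed RTT in microseconds */
} srtt_map SEC(".maps");

SEC("sockops")
int observe_rtt(struct bpf_sock_ops *skops)
{
    switch (skops->op) {
    case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
    case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
        /* Ask the stack to call us back on every RTT sample. */
        bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RTT_CB_FLAG);
        break;
    case BPF_SOCK_OPS_RTT_CB: {
        /* srtt_us is the kernel's smoothed RTT estimate (<<3, in us);
         * we just record the latest value per connection. */
        __u64 cookie = bpf_get_socket_cookie(skops);
        __u32 srtt = skops->srtt_us >> 3;
        bpf_map_update_elem(&srtt_map, &cookie, &srtt, BPF_ANY);
        break;
    }
    }
    return 1;
}

char LICENSE[] SEC("license") = "GPL";
```

Because the data comes from state the TCP stack already maintains, this adds essentially nothing to the datapath, which is what makes the passive approach attractive.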
So let's look at a couple of examples. We'll start out pretty basic: layer three, layer four network protocol statistics, whether it's DNS dashboards, TCP/IP statistics, network policy drops, and so on. You can visualize this over time: what type of flows have you seen, what percentage is TCP versus UDP, and so on. Then we can take the next step and go a lot further, parsing HTTP and gRPC to understand what HTTP calls have been made, all the way to extracting OpenTelemetry data, where instead of just seeing flow logs of HTTP or gRPC calls you see span data, which gives you detailed tracing of how long particular microservices took to respond.

We can then look at the distributed view of it all: how microservices talk to each other, chained or hop by hop, what latency each step introduces at the HTTP level and the TCP level, and visualize and correlate that. For example, HTTP response error rates: how many of my microservices are returning errors, and how does that correlate with CPU usage on that node, and so on. All the examples you see here are based on Grafana dashboards.

And last but not least, we can go really, really deep and look at advanced telemetry. We can figure out smoothed round-trip times for individual connections, we can detect TCP zero-window events, we can detect when an application has not received data for, say, 100 milliseconds, and alert on that in real time in eBPF. The rule on whether to alert is built directly into eBPF, which can evaluate and enforce such rules with incredibly low overhead.

So we saw a combination of both: the protocol parsing of Hubble and the passive telemetry collection of Tetragon. The combination of the two is really powerful. As an end user of these dashboards, of this observability data, you don't actually know which part is providing the data; it's seamlessly integrated into the system.

If you want to learn more about eBPF in particular, there is a fascinating eBPF documentary premiering this Wednesday at KubeCon, where you can learn about the creation story of eBPF all the way back to 2014, how Netflix and Google and Facebook are all using it, how Isovalent got founded as part of it, and so on. And we have many interactive sessions and demos this week; you can find us in the expo hall. There's a Cilium OSS experience center where you can try out the dashboards I've just shown you in interactive labs. And with that, I think we have a bit of time for questions. Thank you very much.

Great, thank you. I also want to say I'm quite happy to do a real-time demo after the talk. Whoever is interested, just find me and I'll walk you through the real system.

You mentioned that with eBPF we can substitute the service mesh. What will the evolution be for that kind of thing, in order to reduce the complexity of implementing a service mesh?

Yeah, so as part of Cilium, Cilium offers a service mesh that uses a combination of eBPF and Envoy. It's a sidecar-free service mesh, and it will use an eBPF-based datapath whenever possible.
For example, mTLS, multi-cluster routing, topology-aware routing, all of that can be done using eBPF only, and for things like HTTP authorization we use an Envoy proxy, but never a sidecar, which leads to better performance. That's part of the Cilium service mesh offering.

It's on, it's on. What about security with eBPF, when applications are accessing something that's running essentially in the kernel?

That's a very good question. That was the primary concern before eBPF got merged, and we fought very hard for multiple years to convince everybody that we could merge eBPF into the Linux kernel. Today, eBPF is used to inspect every single packet going into a Facebook data center. eBPF has a very well-documented security model that describes how the verification of eBPF works, ensuring that eBPF programs are safe to run and always run to completion, as well as the capability system that controls who gets to install eBPF programs. So eBPF is not open to everybody: you need particular system privileges, the capability called CAP_BPF, to actually load eBPF programs. Any other questions?

From your perspective, considering that today we are leveraging service mesh tools like Istio, what is your view on the pace at which telcos should move from the tools we have today and replace them with this one? To me it looks like a source of complexity, and I think telcos have to understand how to pace that evolution.

Yeah, it's a great question. Service mesh was initially adopted mostly by application-driven platforms, where application teams had very high demands on what a service mesh would have to do: retries, circuit breaking, OpenTelemetry, and so on. Telcos typically come to us saying, I want a bit of service mesh, but it's usually much more networking focused, with very clear performance indications and demands. My advice would be: if you look at service mesh, first ask yourself what you want from a service mesh. What do I actually need? Is it just mTLS? Is it just encryption, and I don't care whether it's mTLS? Is it layer seven observability? Is it multi-cluster routing? Is it resilience for applications? Based on that, a lot of customers will find that Cilium alone is already sufficient, and they will be happy with that. And there were certain use cases not covered yet, which is what we added in Cilium service mesh; mutual authentication was one of them. Cilium service mesh itself, I'd say, is ready. The one feature that is not yet stable is mutual authentication, which we released this summer and expect to mark stable in April next year.

Just to add to that, from the application data analytics perspective we were talking about, you don't actually have to replace anything, because that works out of line, on a copy of the traffic, away from the inline network path. It could run as a completely different service altogether. Thank you.