OK, now it's time. We are going to present one story: essentially, how we used several Cloud Native Computing Foundation projects to build a telemetry pipeline, mostly with a network automation use case in mind. The key elements are Kubernetes and some operators, including the Strimzi one. The idea is that we will introduce ourselves and the context, then describe the project and the platform supporting it, and then cover some details about the solution, as well as some of the problems and issues we had along the way; we will use a slightly different format for that part. Before closing, we will also share some highlights about the impact in our organization. But before that, Fernando.

Hi, everyone. I hope you are all having a great conference so far. I'm Fernando, and I work at Fastly, basically enabling Kubernetes for Fastly's control plane applications. Kubernetes is not a new space to me; I started in its early stages, thanks to companies like Bitnami, where I had a great experience. I also had the pleasure of contributing to the Kubernetes open source community with bug fixes, features, and even custom controllers, like a secrets manager controller. So that's me. Now I hand back over to Dani.

And yes, this is Daniel, just an engineer at Fastly. I've been doing lots of DevOps work, mostly in the Barcelona area, so I'm also from Spain. I also love open source and try to contribute as much as I can.

Let's also introduce our company, Fastly, a bit. Maybe you already know about us. Our mission, in the end, is to make internet users happy. How do we try to do this? Mostly by having a distributed architecture where our customers can execute code as close as possible to the end users and deliver content, with performance and reliability in mind. So yes, we have a relatively large network defined by software, we let our customers run applications at our edge, and the idea is to make our customers as autonomous as possible, so they can build whatever they need on our platform and benefit from our infrastructure.

Some numbers: if you are more used to requests per second, our daily request volume works out to an average of around 10 million requests per second. So the scale is fairly high. That became quite clear almost one year ago, when we had a one-hour outage and we were all over the news. Scary, but also interesting at the same time.

This is more or less how our network looks today. We have many points of presence across the globe, and we try to put them close to our end users. The main point here is that the scale of the network is quite high, and as a consequence we need more than just the data plane in our fleet: we also need a control plane to orchestrate and manage this large network. The control plane is actually where most of the automation magic lives, and this story is about one of those automation services, Autopilot. Autopilot is probably not a very original name, I think I have already heard about an "autopilot" three times at this conference, but this is our Autopilot.

To introduce the problem it solves, let's take a look at how our POP networking looks at a high level. The idea is that we have devices, network devices and compute devices, and they are connected to several other ASNs. Some of those ASNs act as transit providers for us.
So essentially, when we send them traffic, they can route it anywhere on the internet. We also have other relationships with other ASNs, so we may have direct links, peering relationships, with them. The key message here is that in order to deliver traffic from our platform there are multiple links available, so many routes for a given destination, and in the end we need to make decisions about the best way to forward traffic to our end users and also to the origin servers. That is what Autopilot tries to solve: bearing in mind performance, bearing in mind capacity, how hot our links are, we need to route traffic accordingly.

Automation here is not a new thing. If you browse our engineering blog you will find previous iterations of this, and if you also work on a large-scale network at a big tech company, maybe you have your own solution for this. Who is behind this latest iteration of Autopilot? Essentially the NCO team, which sits within Network Systems at Fastly. Initially we had six engineers focused on the initiative, with mixed skills: software engineers, network engineers, and also SREs like me. And here is one of the key messages: in the end there are many possible implementations for something like this, but we need to bear in mind the team that will support and extend the solution, and that was also a key item in our design.

So, Autopilot at a very high level. We are not reinventing the wheel; we follow the typical pattern of measuring things, then computing whether we need to push a network change, and finally the routing manager, a wrapper around our routing infrastructure, applies the changes to our fleet.

Let's focus on the yellow box, the telemetry pipeline. What is that telemetry pipeline doing? It is essentially consolidating many inputs, telemetry data from many sources. We have system information, metrics mostly from our Prometheus infrastructure, that describe the status of the links and the capacity of the links we have. Then we have flow data, and that is super key for this problem: we sample the traffic we move in our fleet in order to know the destinations of the packets we send across our network (we are sampling, of course). We also consider performance data, even at the application level, and we even probe the different links we have in order to know the performance and status of those links in our network.

So how does the pipeline actually look? This is the portion that handles the sFlow data consumption. Essentially, we have network devices, switches in our case, which emit sFlow information through an sFlow agent. We encapsulate that data in a DTLS tunnel, so everything is UDP, and that will be relevant later. Then, in the actual pipeline, the first stage enriches and aggregates the data using pmacct, which is another open source project, and finally pushes everything to a Kafka topic that acts as a buffer (a rough sketch of that kind of collector configuration follows at the end of this section). On top of this we have an API that maintains an in-memory view of the network state; it offers a gRPC API that other components consume, including the controller, which is the one implementing the magic. In order to enrich that telemetry data we also need to know the routing state, and we get it from the route manager service. And finally, as I mentioned, we also collect and consolidate metrics from our Prometheus infrastructure, essentially counters and other data coming directly from our switches.
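To make that first pipeline stage a bit more concrete, here is a minimal sketch of what an sfacctd configuration (the sFlow collector daemon from pmacct) pushing flows into a Kafka topic can look like. This is not our production configuration: the broker address, topic name, and aggregation primitives are placeholders, and the key names should be checked against pmacct's CONFIG-KEYS documentation for the version you run.

```
! Minimal sfacctd sketch: receive sFlow samples, aggregate flows, publish to Kafka.
! Broker, topic and aggregation values below are placeholders.
sfacctd_port: 6343
plugins: kafka
aggregate: src_host, dst_host, src_port, dst_port, proto, src_as, dst_as
kafka_broker_host: kafka-bootstrap.telemetry.svc
kafka_broker_port: 9092
kafka_topic: telemetry.flows
kafka_refresh_time: 10
```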
So, in order to run the services in the architecture I just described, we need to process data at a certain scale. To give you an idea, with the sampling we are doing it is about processing hundreds of thousands of packets per second. We need a runtime to manage these components and services, to deploy new versions, and so on. And of course, as probably everyone wants, we want to run this in many different locations, as close as possible to our points of presence, so we need to put it in several places. We also wanted the solution to be cloud agnostic, so we didn't want to couple it to any particular cloud provider; given that our control plane is very diverse, we wanted to be able to run it also in some dedicated data centers we have for control plane workloads.

And, surprise: when we started this project, we didn't have a proper standard for running control plane workloads. For the data plane, everything is super standardized, every component we put in our fleet uses common frameworks and well-known patterns, but for the control plane it was a different story. We had a mixture of different infrastructures and different cloud providers. Even in the Network Control and Optimization team we were already using Kubernetes, but many Kubernetes clusters in Google Cloud, mostly dedicated to individual workloads. And regarding Kafka, Fastly was already using Kafka for other use cases, but we didn't have a proper production-ready setup. There was actually some opposition within the team to bringing Kafka into the project scope; there was a perception that it was complicated, complex, and we were a small team, so managing something like that could be a challenge.

Now I'm handing over to Fernando, who will describe the Kubernetes posture from the platform point of view, and how our internal Kubernetes offering evolved to match these requirements.

Thank you, Dani. Kubernetes at Fastly circa 2020 pretty much looked like this: you want a cluster, you get a cluster; you want a cluster, you get a cluster; everyone gets a cluster. Teams were pretty autonomous; they would own cloud projects or cloud accounts and create their own infrastructure. The problem is that this led to a very fragmented, underutilized, and expensive infrastructure. Like Dani mentioned, we had several clusters hosting maybe only one service. And since it mostly came from individual initiative, there was no standard way of creating the clusters, configuring them, or even deploying to them: some folks would use Terraform, others the cloud console directly, others Pulumi; some people would deploy with Helm, others with raw YAML, others with Kustomize. There was also a maintenance burden, because the engineers deploying on Kubernetes mostly didn't have the experience to administer Kubernetes clusters.

So Fastly decided to create a Kubernetes team. Initially we were four: two folks in New Zealand, one in Canada, and myself in Spain. The goals were to build a shared Kubernetes platform for the control plane, standardizing infrastructure and application deployment.
And to scale it eventually cross-region and cross-cloud, and also to elevate developers' ability to deploy faster and scale. The latter brought the project name, which is also not very original: Elevation.

So let's talk about Elevation a bit. Elevation's first version was indeed pretty simple. We started with one cluster per stage and only one region, but we designed everything to scale cross-region, so if we needed more regions we would just create more clusters. We standardized on Helm via Flux CD, using a GitOps pattern, because GitOps was not new to Fastly even before Kubernetes. And we deployed to a single cloud provider, but the whole stack was designed to be cloud agnostic, because we were envisioning that some people would request other cloud providers or even bare metal clusters.

To accomplish that, we did things like implementing authentication with our own identity provider, not tying it to the cloud IAM. We use Harbor as the container registry, not GCR, not ECR, or anything like that. We use Vault for secrets management instead of plain Kubernetes Secrets. We use ingress-nginx rather than the cloud ingress. And the certificate manager is not the cloud one either, not Google Cloud's or AWS's certificate manager or anything like that: we use Jetstack's cert-manager to issue certificates from Let's Encrypt and also from our recently released root CA. Our observability stack is pretty common, Prometheus, Grafana, Splunk, Fluentd, so again open source, not tied to any cloud provider.

As soon as we released the first version, we started having the first customers, and we tried to gather feedback. The most important feedback we got was: hey, we need more regions, we need more cloud providers, we would like bare metal clusters, because we are hosting latency-sensitive applications and we want to be closer to this POP, or this Fastly deployment, or this cloud service. So, we got it. The other thing was reducing the onboarding overhead, because onboarding at the beginning was a bit convoluted: folks would register their namespace and register a predefined service account, and then we would have to configure Vault so that service account could read from a specific path in Vault. A bit convoluted, like I said. Also, the Autopilot team wanted proper Kafka support in the company without needing to go and create more VMs, peer with the VPC, or anything like that; they wanted something in-cluster, and they asked for help. There were also other teams that were completely new to Kubernetes, Helm, and all that stuff, and when they tried to get onto our platform they were like, oh geez, this is overwhelming, can you folks build more abstractions for engineers like us, more used to networking than to Kubernetes? And finally, we provided a service mesh in the first version but we were not providing much visibility into it, so that was a reasonable ask too.

So we did, and with that came the second version of Elevation, which is more or less how Kubernetes looks at Fastly today. It became clear pretty quickly that with four members we wouldn't be able to scale properly, creating clusters cross-region and cross-cloud and whatnot, so we got two new members who have been very helpful, one in the US and one in the UK. So this is what we did.
Clusters in Google Cloud, in AWS, and on bare metal, spanning three different continents and regions and so on and so forth. Today we have around twenty clusters, and we will probably be creating three or four more in the coming weeks. One of the key messages I want to share is that we started relying a lot on operators, both existing operators and operators we built in the team, and that was a huge win for self-service, for making teams more autonomous, and for automation.

This is one of the examples: we built a custom controller to configure HashiCorp Vault. We also use a tool called Kyverno, maybe you're familiar with it, which is a policy management tool. Now, when people register a namespace in Elevation, under the hood Kyverno creates a KV v2 secrets engine that gets mounted automatically for that namespace. And when a service account gets created, there is no need to register the service account or anything like that: Kyverno creates a Vault role and a Vault policy using our Vault controller. That solved the whole onboarding issue, and we kept relying on more controllers.

We also provided some abstractions. We created a library chart and a default chart that encompass all the best practices, like making sure you have pod disruption budgets and pod anti-affinities, so you don't have to care about which annotation to put where, which ingress class to use, which cert-manager issuer, and so on. With a simple YAML values file, people can deploy an HTTP application with TLS and ingress routing (a hypothetical sketch of such a values file appears a bit further down).

We made some observability improvements too. In Elevation, and I think this is a pretty common pattern, whenever you deploy your containers we provide a default dashboard, called the workload overview, where you get CPU, memory, and so on. But we were missing a lot of service mesh visibility, so we built a bunch of Linkerd panels into our default dashboard. And of course we enabled the Linkerd UI and let teams run linkerd tap, linkerd top, and all that on their namespace to troubleshoot, because that is really important for teams like Autopilot, which runs a network service. Other than that, GitOps is cool, but people were like, okay, so this thing got merged, now what? Is it deployed or not? So we created a bunch of Grafana tables and dashboards, this is just an example, so they can confirm the thing got deployed correctly. And of course, we talked about using an operator in the cluster: we agreed that Strimzi looked like a strong, well-maintained operator to provide self-service around creating Kafka clusters, so teams could be very autonomous with that.

Now I'm going to talk a little bit about the user experience in Elevation today, what teams like Autopilot are actually experiencing. We have an initial exploratory phase where RBAC rules and policies are relaxed; these are basically development clusters. We also have a playground project in our container registry, Harbor, so people don't need CI to push an image: they just go ahead, develop their thing, iterate, and push their images to the Harbor playground project we allow in our development clusters. There is no enforcement on using Flux, GitOps, or anything like that either.
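Going back to the default chart mentioned a moment ago, here is roughly what a values file for that kind of chart could look like. This is a hypothetical sketch, not Fastly's actual chart: the key names, registry path, and hostname are made up for the example, and the real chart's schema is internal.

```yaml
# Hypothetical values for a "default chart" style deployment: the chart itself
# would fill in the PodDisruptionBudget, pod anti-affinity, ingress class,
# cert-manager issuer annotations, and other best-practice defaults.
image:
  repository: harbor.example.internal/fastly/sample-api   # placeholder registry/path
  tag: "1.4.2"
replicas: 3
service:
  port: 8080
ingress:
  enabled: true
  host: sample-api.example.internal   # placeholder hostname
  tls: true                           # the chart wires this to the cert-manager issuer
resources:
  requests:
    cpu: 100m
    memory: 128Mi
```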
They can use the default chart that we provide, their own Helm chart, raw YAML, Kustomize, whatever they need to iterate and get their application running, and then move on to the next phase. Obviously we have a Vault cluster per Kubernetes cluster, so they go and put their secrets there if they need to. Once they feel comfortable, they go to the next phase, which is build and then deploy. They push their Dockerfile and their Helm chart, if they have one and are not using the default chart, and a Jenkins job kicks off. Jenkins contacts HashiCorp Vault, using the Harbor plugin we built in-house, to issue ephemeral Harbor robot account tokens. Jenkins then signs the images, packages and signs the Helm chart if there is one, and pushes everything to the Fastly project in Harbor, which is the only one allowed in the production and staging clusters. After that, they can just go ahead and deploy using Flux: they create a pull request against our HelmRelease repository, and some teams even got quite innovative and created their own Jenkins pipelines to automatically generate a HelmRelease from a Docker image tag, updating only the values they need. Then Flux synchronizes and deploys the HelmRelease to the Kubernetes cluster. And now I hand over to Dani to explain the good, the bad, and the ugly of all of this.

Correct. So yes, we are now going to present some details about the solution in three different buckets, which reminded us of this film. A film, by the way, that I didn't know, but it was shot in several locations in Spain, including Almería, which is to the south of, let's say, Valencia.

Let's start with the first bucket, the good things, and one of them is configuration management for this solution. This piece of YAML, and I hope the folks at the back of the room can read something, is just an example that comes from the Strimzi documentation (a trimmed sketch of that kind of manifest follows at the end of this section). It shows how to create an example Kafka cluster: you define several properties, including sizing, some defaults, even some properties of the ZooKeeper cluster backing Kafka, and other operators you can plug into the setup.

But we didn't stop there. As Fernando mentioned, we use Helm charts to push workloads to the Elevation Kubernetes clusters, so we also created one for the telemetry Kafka, and that chart exposes some values. Some of them are sizing matters, which don't abstract the cluster that much, but the important piece is the autopilot sites value at the bottom: the list of POPs that a given Autopilot deployment is expected to support. We have equivalent properties in the other Autopilot services. In the case of Kafka, the idea is that if you declare a specific POP you want to support in a cluster, you automatically get the topics you need for that POP's telemetry data, you get a write user with the right grants and a read user, and the operator stores the credentials directly as Kubernetes Secrets. Then the processes emitting the data, like sfacctd from pmacct, can pick up those secrets and start pushing data to the topics straight away, and the same goes for the readers: they can pick up the credentials from a predictable location, and that's it. Everything is packed together.

So the first key point in this section is that we don't have a separate tool or configuration management system for the Kafka cluster: it's just another Helm release, the same as for the other applications.
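For anyone who has not seen it, the kind of manifest Dani is referring to looks roughly like the example below, trimmed down from the Kafka cluster examples in the Strimzi documentation. Exact fields depend on the Strimzi and Kafka versions, and the names and sizes here are illustrative, not Fastly's values.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: telemetry-kafka          # illustrative cluster name
spec:
  kafka:
    replicas: 3
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true                # TLS listener, used with TLS-authenticated KafkaUsers
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
    storage:
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
  zookeeper:                     # the ZooKeeper ensemble backing Kafka
    replicas: 3
    storage:
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
  entityOperator:                # enables the topic and user operators
    topicOperator: {}
    userOperator: {}
```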
In fact, in the past I have promoted tools like Pulumi a lot, also externally, but not needing a different tool at all to manage your Kafka cluster is even better than the best of those tools.

A few more things about the Helm chart in the middle, which, as I mentioned, is also a good abstraction opportunity: the sizing knobs, which are very coupled to the Kafka setup, could even be removed. Now that we run the solution in many POPs, we can predict how big the Kafka cluster needs to be from the number of POPs we are supporting, so we could drop those values and in the end just keep the list of POPs we are going to support.

Regarding portability, these are just a few commands. This is the Helm releases repo Fernando was mentioning: we have several HelmReleases as YAML files in a given folder, one folder per cluster, and in the end, deploying Autopilot to a different cluster with the same configuration is as simple as copying those HelmRelease files over. In fact, that slide was already outdated with respect to where Autopilot was running a few weeks ago; we have many different clusters, some in AWS, some in GCP, and everything looks more or less the same from a configuration perspective.

Regarding pluggability, mostly focused on operational tooling, I think the Strimzi operator does a great job exposing operational flows directly as Kubernetes primitives. If you want to browse which Kafka topics you have, you can just use kubectl and list those objects. The same goes for other maintenance operations, like triggering a rolling upgrade of the cluster: the Strimzi operator maps this to an annotation on the StatefulSet for the cluster, so to trigger it you just put that annotation in place and the magic happens (a small sketch of that appears right after this section). Then there is pluggability with the observability systems: the Strimzi community maintains some great Grafana dashboards, and it's not just that we have Grafana displaying all the Prometheus metrics, it's that all the telemetry services can share a common dashboard where you can easily correlate information from the services around Kafka with the Kafka metrics themselves, and that is super powerful for identifying and correlating problems you may have in your setup. The same applies to logs: we use Splunk, we forward everything to Splunk, and in a single query we can check the logs of, in this case, the telemetry component, the one consuming the data, together with the logs from the Kafka cluster, and easily correlate events.

So, as a summary of this section: Kafka in our setup is not a special thing, it's just another workload in the cluster, and we use the same tooling and the same processes to manage Kafka as for the other workloads we have in the telemetry pipeline.
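As a rough illustration of that annotation-driven flow: Strimzi documents a manual rolling update that is triggered by annotating the Kafka StatefulSet (or, in newer versions, the StrimziPodSet), and the cluster operator rolls the broker pods on its next reconciliation. The resource name below is illustrative, and in practice you would apply it with kubectl annotate rather than editing the object by hand.

```yaml
# Annotate the Kafka StatefulSet so the Strimzi cluster operator performs
# a rolling restart of the brokers on its next reconciliation loop.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: telemetry-kafka-kafka          # illustrative <cluster-name>-kafka
  annotations:
    strimzi.io/manual-rolling-update: "true"
```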
Then, the things that are not so great in this setup. One is the fact that an operator is an extra component you need to install and maintain in your cluster, compared to some software-as-a-service solutions, and that actually brings up an interesting topic, which is ownership of the operator. Right now the operators in the Elevation clusters are maintained by our platform team, Fernando's team, and teams like Network Systems just consume them. However, we feel something like this may not scale well: the moment other teams start consuming or creating Kafka clusters, just coordinating those teams to perform a Kafka upgrade, which normally goes hand in hand with an operator upgrade, can easily become a full-time job. So we expect that at some point, if this solution is adopted by other teams, we may need to involve those teams in supporting it. And now, Fernando.

They're ugly, like me. So, you know, when you have Kubernetes in multiple clouds and on bare metal, and you have a service mesh, UDP, and BGP, it's bound to happen that you'll find these dudes just eating popcorn and laughing in your face.

The first very interesting challenge we found is that, as you know, the Autopilot telemetry API receives a constant UDP flow from the switches, non-stop. We noticed that when a pod got restarted and rescheduled onto the same node, somehow the UDP packets started getting black-holed. My teammate and friend Danny Kuchinski thankfully discovered a bug in kube-proxy: basically, when you go from one endpoint to zero endpoints, kube-proxy flushes the conntrack entry corresponding to the load balancer external IP and the pod IP, but the flow keeps hitting the node if the pod is rescheduled onto the same node. So a new conntrack entry was created before the iptables NAT rules got applied, and that conntrack entry had the load balancer external IP but not the pod IP. Very interesting; we found that bug, they fixed it pretty quickly, so that's done.

This one is also a good one, very interesting, because it was a whole team effort and I appreciate that. We hit a hard limit on AWS security group rules, and when we reached out to our account team there was nothing they could do. Basically, the problem is that the AWS load balancer controller creates an excessive number of security group rules per load balancer: one inbound rule on the node security group per client traffic port per allowed source range, and source allow lists are something Autopilot uses a lot, plus one rule for the health check on each subnet in the VPC. One of the recommendations was to switch to ALB, but this is not HTTP, so not very helpful, or to switch to the classic ELB, which is deprecated, so not an option either. We also needed to preserve the source IP, because otherwise we couldn't perform authentication based on IP and things like that, so we didn't want to implement something like the PROXY protocol. So we came up with a creative software solution, which basically was: let's disable the security group rule creation altogether, and let's use Kyverno to create a Calico global network policy that allows or blocks the traffic. So we did that.

But, you know, in Kubernetes there is the owner/dependent rule: a LoadBalancer Service cannot own a global network policy, because the global network policy is cluster-scoped and the Service is namespace-scoped. So Kyverno couldn't garbage collect the global network policy when the LoadBalancer Service got deleted. So we implemented our own custom Service controller, based on an annotation, only to garbage collect the Calico global network policy.
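To give an idea of what that generated policy can look like, here is a minimal sketch of a Calico GlobalNetworkPolicy allowing a UDP service's traffic only from known source ranges. The selector, port, and CIDR below are placeholders, and in the setup described above the real policy is generated by Kyverno from the Service definition rather than written by hand.

```yaml
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: allow-telemetry-udp              # illustrative name
spec:
  # Placeholder selector: match the endpoints receiving the load balancer traffic
  selector: app == 'telemetry-collector'
  types:
    - Ingress
  ingress:
    - action: Allow
      protocol: UDP
      source:
        nets:
          - 192.0.2.0/24                 # placeholder source range (flow exporters)
      destination:
        ports:
          - 6343                         # placeholder UDP port
```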
This was rolled out to production three or four weeks ago, it's working fine, and we're not expecting more issues in that regard. So, handing over to Dani again.

Yes, let's quickly review the impact of this setup on our teams. The first box actually represents the first PR we had in the telemetry repo, and it's the bootstrap: gopherapp is an internal tool to bootstrap Golang projects, so it just creates the project structure. The second one represents the moment we enabled the telemetry pipeline in a non-production cluster, and it only took about four weeks to get the full solution working in that non-production POP. Considering that the team was quite small and was also maintaining all the other services and projects, that was pretty nice. Reviewing the history of the repo with the Helm chart for the Kafka setup, there have been some changes in the last months, but they are mostly connected to Kafka upgrades and operator upgrades, not to scaling the solution: we were essentially able to scale to all the POPs just by adding additional values. And in fact, if you take a look at the Helm releases repo, where we have the actual workloads running in production, you can see that around ten engineers have already contributed to the Kafka configuration, which quietly means these engineers have been creating Kafka topics and Kafka users without even knowing it.

So yes, we have the impression that Autopilot pretty much manages itself, and given that, we are now working to consolidate other telemetry processing onto the same architecture. The NetOps teams, network operations, use this type of network data to operate the network every day, and there are even other use cases, like capacity planning and building other solutions on top of this information. And it's not only the telemetry pipeline consolidation: we are also migrating other workloads we had in those dedicated GKE clusters to Elevation, so we can focus on the other challenges we have.

So let's start closing the session. The main points: we needed a scalable platform to run network automation at Fastly, and Kubernetes is a great fit for this; in addition, our very specific setup, with operators like the Strimzi one and some other configuration choices, is helping us in projects like Autopilot and its telemetry pipeline. So thanks for being here. I think we don't have much time for questions, but maybe we can take one. I don't know if there is a microphone for this; otherwise just ask and I will try to repeat the question.

Do we use mutual TLS for that? Not exactly; well, that piece is just DTLS. And yes, perfect. So the question is how the Kafka consumers and producers authenticate against the Kafka cluster, in other words, how we distribute the certificates across the different workloads in the cluster. Here Strimzi is our friend. When you declare a KafkaUser, which is a specific object, a custom resource Strimzi brings to the cluster, there are different user authentication types; we use TLS-based authentication, which means Strimzi creates the credentials and stores them as Kubernetes Secrets. Then, given that the workloads using Kafka live in the same cluster, the other workloads can pull those credentials directly.
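To make that answer a bit more concrete, a Strimzi KafkaUser with TLS authentication looks roughly like the sketch below; the user, cluster, and topic names are illustrative, and the ACL details vary per use case. The Strimzi user operator issues a client certificate and stores it in a Kubernetes Secret named after the user, which is what workloads in the same namespace pick up from a predictable location.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: telemetry-writer                  # illustrative user name
  labels:
    strimzi.io/cluster: telemetry-kafka   # must match the Kafka cluster name
spec:
  authentication:
    type: tls                             # Strimzi issues a client cert into a Secret
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: telemetry.flows           # illustrative topic
          patternType: literal
        operation: Write
```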
Correct, yes, we are using the shared Kubernetes offering from the platform team, and we have a dedicated namespace for the Autopilot services, so yes, all these services live in a single namespace. Thank you. So, thanks.