Hey everyone, welcome, and thanks for joining our talk about eBPF and Cilium at Sky.

First, some context: who is Sky? Sky started out as a satellite broadcast company headquartered in London, UK, and has expanded into a much larger multinational company with a presence in many other countries. Likewise, Sky has expanded into many areas beyond satellite broadcast, and one of those areas is OTT, over the top, which is online video streaming. OTT is the part of Sky which we belong to.

There are two of us involved in the presentation today. I'll quickly introduce us and then get on with the content, because 20 minutes goes by quite quickly. First of all, I'm Sebastian Duff, or Seb. I've been with Sky for just over six years, originally as a software engineer before moving into delivery, and I'm now responsible for the Core Engineering department, which I'll cover a bit more in the introduction. With me during the presentation is Antony Contois, who is a principal engineer in Core Engineering at Sky. Antony has a strong background in both SRE and software engineering and joined Sky in 2016. I'd also like to mention CECG, a consultancy who have played an important role in our journey to build a mature platform-as-a-service offering and in our work with Cilium and eBPF. Joseph Sanyal from CECG was originally going to be part of the presentation with us, but due to timing he was only able to be involved in the preparation, not the presentation itself.

So this talk might be a bit different from the others. Rather than going into the deep technicals of how we're leveraging eBPF, we'll be focusing on the delivery aspect and how we leverage technology to gain a high level of confidence, mitigating risk to the platform and to the business. In the first section of the presentation I'll give a brief overview of what we in Core Engineering do, and then I'll hand over to Antony to talk first about why we chose Cilium and eBPF, and then about pipelining as a form of risk mitigation.

Core Engineering is a department within the Global Video Engineering and Apps division. Global Video Engineering and Apps is responsible for client libraries, backend services, and a portion of the clients as well, supporting the video and playout for Sky's OTT propositions. Some of the most recognizable platforms and propositions we are part of at Sky are Sky Go, NOW (formerly NOW TV), Peacock, and soon SkyShowtime. In Core Engineering we build a multi-tenanted, Kubernetes-based platform-as-a-service offering, which hosts about 90% of the application workload. The platform is built as a white-label product so that it can be built once and deployed many times to support the different organizations and propositions. As the underlying platform for these high-profile propositions, we have very large and complex requirements, including being highly available, hybrid cloud, multi-region, and active-active, all at high scale and low latency.

To be able to operate efficiently at scale, we have a number of important engineering principles which we follow for everything we do. I won't go into all of them in this presentation, but I will mention two golden rules which we follow, as they have a very large impact on the way we do things and on how things have been designed.
So the two golden rules are: Tenant A cannot negatively impact Tenant B, and no tenant can negatively affect the platform. Another of the key principles which we follow is that we treat every tenant and environment as production. That means, for example, we treat the development cluster which teams use as our first production environment. Although there aren't any Sky customers using that environment, there are 90-plus development teams depending on it, so any disruption can have a huge impact on their ability to deliver, and on the business as well.

And although we are a platform team, we measure our success by the success of the tenants, the teams using the platform. Our view is that one might have the best, most perfect platform in the world, but if people are struggling to adopt it, then it really isn't a successful platform. And that really comes through in how we actually implement a lot of the capabilities which we have.

On the slide we have some interesting stats which give a brief view of the scale we're working at. The multi-tenanted platform currently supports 13 departments with over 90 teams, which is about 1,000 engineers using the platform. These teams are using a wide variety of different technologies, so our goal is to provide a consistent interface for everyone, and we largely achieve this through Kubernetes, but we also build custom tooling and libraries for teams to leverage as well. And on this slide we have a snapshot of some of the interesting technical stats: we have over 300 unique applications deployed to the platform, with more than 60,000 replicas running across all environments, and to support the required scale we have performance-tested our central services, such as ingress, to 1 million TPS.

That's enough from me as an introduction to the platform. I'll hand over to Antony to talk about why we chose Cilium and eBPF.

Hi, everyone. I'm going to talk about why we chose Cilium at Sky and how we've been mitigating the risk with the help of CECG, the Core Engineering Consulting Group. First of all, at Sky there are a lot of applications running on top of the platform and on top of Kubernetes, in a multi-tenanted architecture. By default on Kubernetes you've got a flat network where every pod can talk to every other pod. So, with the help of Kubernetes network policies and Cilium network policies, we want to restrict network communication within the cluster, from pod to pod. We also want to allow or block our tenants' access to external endpoints, for example databases, so a specific tenant is only able to talk to its specific databases, and so on. We also want to block malicious IPs defined by the security team; those are defined at the cluster level, and we want to make sure tenants cannot override them. And we want to move towards a least-privilege access model, where every single tenant defines its full network flow.

Due to the high scale and requirements at Sky, we had scalability concerns using iptables, so we decided to embrace Cilium and eBPF. Cilium essentially injects eBPF programs into the kernel to interact with the network stack. And Cilium is Kubernetes-aware: it knows the full topology and the IPs, and it injects that information into BPF maps, sharing the data between the eBPF programs and the Kubernetes context.
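To make that policy model concrete, here is a minimal sketch of the two layers just described, assuming the deny policies available from Cilium 1.9. The names, namespaces, labels, CIDRs, and port below are hypothetical placeholders, not our actual configuration; the key point is that deny rules take precedence over allows, which is what stops a tenant's namespace-scoped policy from overriding the cluster-wide block.

```yaml
# Cluster-scoped deny, owned by the platform: tenants cannot override it,
# because deny rules take precedence over any namespace-level allow.
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: deny-malicious-ips          # hypothetical name
spec:
  endpointSelector: {}              # applies to every endpoint in the cluster
  egressDeny:
    - toCIDR:
        - 203.0.113.0/24            # placeholder for the security team's blocklist
---
# Namespace-scoped, least-privilege allow, owned by the tenant:
# this workload may reach only its own database, nothing else.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: orders-to-db                # hypothetical name
  namespace: tenant-a               # hypothetical tenant namespace
spec:
  endpointSelector:
    matchLabels:
      app: orders                   # hypothetical workload label
  egress:
    - toCIDR:
        - 192.0.2.10/32             # placeholder database address
      toPorts:
        - ports:
            - port: "5432"
              protocol: TCP
```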
Sharing state through BPF maps gives us more efficient load balancing and network-policy propagation. We heavily rely on Cilium's deny functionality to block, at the cluster level, every IP which seems to be malicious, and we also use host network policies. And with the low overhead of eBPF, we get nice observability features through Hubble and cilium monitor.

In order to embrace this kind of new technology, we want to mitigate the risk before going to production, so I'm going to show how we've been mitigating the risk by automating the tests. You've got a Git repository; you push, and a build starts on our CI agent. We run all the localized tests first, for example linting, dgoss, and vulnerability scanning, and when everything has passed, the Docker image is built and published to the test repository. Once the test image is available, it's pulled and reused by all our tests, the non-functional testing and functional testing, and if those pass, it's promoted to the next repository, which is the extended-test repository.

As part of those tests, we've got two main groups. The first is functional testing, where we rely heavily on the connectivity test suite provided by Cilium. It's essentially a bunch of pods deployed in the cluster doing DNS, HTTP, and connectivity checks, and making sure the Cilium network policies and Kubernetes network policies behave as expected on the running cluster. We've added some additional functional tests on top of that. For example, we make sure namespace network policies cannot override the cluster-wide deny policies. We make sure Cilium identities are limited to specific labels, to limit the number of identities inside the cluster; in our case we limit identity computation to the namespace label, so every single namespace has a one-to-one mapping with a Cilium identity (there's a configuration sketch of this after this section). And we make sure that, for example, when we apply a network policy, existing TCP connections are blocked. So that's the first main group: functional requirements and testing.

Then we've got the non-functional testing, where we aim for a 30-minute fast feedback loop. What we want to exercise is the full network, making sure everything works, because Cilium interacts with the whole network stack. Exercising the full network path essentially means having a load injector sending load through a backend, with multiple kinds of communication happening: pod to pod, via the service IP, cross-cluster communication, and through the internal and external ingresses. We rely on Kubernetes hostnames to target the services, so we exercise the DNS and UDP flow by resolving Kubernetes hostnames through the service IP and CoreDNS. And we talk to the backend using HTTP and gRPC over TCP, exercising the TCP stack, which is what the tenants use on our platform.
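Here is the configuration sketch mentioned above for limiting identity-relevant labels. Cilium supports restricting which labels feed into identity computation via the `labels` option in its ConfigMap; the exact value below is an assumption of how a namespace-only filter would look, not our verbatim config.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  # Only the namespace label contributes to identity computation, so each
  # namespace maps one-to-one to a Cilium identity (illustrative value).
  labels: "k8s:io\\.kubernetes\\.pod\\.namespace"
```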
While the full load runs, we include resilience and chaos testing, which essentially means deleting a Cilium agent pod, the Cilium operator, one of the Cilium etcd members, and also the backend. It's really important to delete the backend which is receiving the load, to exercise a rolling deployment or the deletion of a pod, which is business as usual on the platform; that forces the load injector to re-establish and recycle connections. We try to target the worst-case scenario: we know the maximum number of identities we want to support on a cluster, and we try to reproduce that number in our test environment. (A sketch of how pod deletion like this can be automated follows below.)

We also have a dedicated test to simulate a migration from Cilium 1.7. For example, we deploy 1.7 on an existing cluster, run the load, and while the load is running and the chaos testing is happening, we deploy the new version, which was Cilium 1.9, to make sure there is no disruption. Obviously it's very hard to cover every single use case in tests; that's why we automate all of them, so that we can scale, and every time an issue is reported, we add a regression test to make sure the issue is not going to happen again.

As part of the non-functional testing, we've got four different tests. In the first one, we simulate identity churn: essentially creating and deleting pods, which produces identity churn, and watching the identities being injected into the BPF policy map. We simulate, or create, around 5,000 identities, and we've noticed a small edge case. During all four tests the chaos testing is happening, so you've got Cilium agents, the Cilium operator, Cilium etcd members, and the backend being restarted. The edge case is with Cilium agent restarts: when the agent restarts, you see a small increase in drops in the metrics, but it doesn't affect the client thanks to TCP retries. We've been working closely with Cilium and the Isovalent team to get it measured and fixed upstream. The second test is exactly the same, but without Cilium agent restarts, and there we tolerate zero drops. Eventually the second test might go away and we'll keep just the first one, tolerating zero drops, once a new version fixes that edge case. The third test simulates Cilium network policy re-creation: deleting a Cilium network policy flushes the BPF map and all the information inside it, and re-creating it adds all the data back into the BPF policy map. For this we use the cluster entity, which gathers all the identities inside the cluster and allows them all to talk to the target (see the policy sketch below). We run that with identity churn, and the fourth test is the same without identity churn, to isolate which scenario has the impact.

To give you a bit more insight: we rely heavily on metrics while the load is running, we've created alerts, and we gate the tests on them, failing a test if any alert is generated.
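The talk doesn't name the tooling that drives this pod deletion, so purely as an illustration, here's how the cilium-agent kill could be expressed with Chaos Mesh, one of several chaos tools; the tool choice, resource names, and namespace are assumptions.

```yaml
# Illustrative only: Chaos Mesh is an assumption, not the tool named in the talk.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-cilium-agent           # hypothetical name
  namespace: chaos-testing          # hypothetical namespace
spec:
  action: pod-kill                  # delete the pod and let the DaemonSet recreate it
  mode: one                         # pick a single random cilium-agent pod
  selector:
    namespaces:
      - kube-system
    labelSelectors:
      k8s-app: cilium               # the Cilium agent DaemonSet label
```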
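For the policy re-creation tests, a minimal policy like the following sketch matches what's described: selecting the backend and allowing ingress from the `cluster` entity means every identity in the cluster lands in the BPF policy map, so deleting and re-applying it flushes and repopulates the map. Names, namespace, and labels are hypothetical.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-from-cluster          # hypothetical name
  namespace: nft-backend            # hypothetical namespace
spec:
  endpointSelector:
    matchLabels:
      app: backend                  # hypothetical backend label
  ingress:
    - fromEntities:
        - cluster                   # every endpoint in the cluster, i.e. all identities
```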
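And as a sketch of what gating on alerts can look like (the talk doesn't show its actual rules, so the metric window and threshold here are illustrative assumptions), a Prometheus alert on Cilium's drop counter might be:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nft-promotion-gates         # hypothetical name
spec:
  groups:
    - name: cilium-nft-gates
      rules:
        - alert: CiliumPacketDrops
          # cilium_drop_count_total is exported by the Cilium agent;
          # the zero-tolerance threshold mirrors the second NFT test.
          expr: sum(rate(cilium_drop_count_total[5m])) > 0
          for: 5m
          labels:
            severity: gate          # hypothetical label used to fail the test run
```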
On the left of the dashboard you can see pod creation: we create pods to generate identity churn, and at peak we're creating around 10 pods per second. You can see the pod count, roughly around 1,500, which matches the identity delta, the number of identities, as we also delete some of them; that's the identity churn, so it goes up and down. You've got the BPF map operations, showing all the operations happening on the BPF maps, and the Cilium drops. And you can see the four tests, with the load injector at around 2,000 TPS. We monitor the load-injection latency, making sure there is no increase, and the CPU and memory give us the ability to size the DaemonSet properly for the worst-case scenario.

When both sets of tests have passed, we promote the artifact, the Docker image, to the extended-test repository. Every single day at 6 p.m., we run what we call extended NFT, extended non-functional testing, which is essentially the same non-functional test that normally runs for 30 minutes, but over a longer period, up to a maximum of 16 hours. In our Cilium use case it runs for eight hours, which gives us a good amount of time to surface any memory leak or any issue that only appears over time.

When everything has passed, the image is promoted to the different organizations, one for NBCU and one for Sky, and they're deployed to different clusters; that's why we've got different organizations. We then deploy to what we call pre-dev, which is one specific environment. At Sky we've got multiple environments: pre-dev, dev, stage, and prod. They are all developer-facing, meaning the dev teams deploy applications on top of them, except pre-dev, where only the infrastructure people deploy applications and exercise the network, making sure there is no issue.

For all four environments, we've built what we call continuous load, which is essentially load injection running 24/7 on the cluster and targeting a backend, with chaos testing happening on those backends at the same time. It exercises the full network flow: internal and external ingress and Kubernetes services. We gather all the metrics and define alerts, for example latency increases, HTTP errors, and packet drops, and that's how we gate promotion from one environment to the next, through the alerts. At 8:30, the promotion mechanism promotes to pre-dev, then to dev, and so on, if no alert has fired. And because we've got multiple regions, we can stagger the deployment across multiple clusters: at 10 a.m. we deploy the first region, then the second one at 12 for the same environment, and if anything happens on the first one, we can stop the deployment to the second.

Thank you. That was a brief overview of how we've been leveraging Cilium and eBPF at scale, and thanks to CECG, the Core Engineering Consulting Group, for all the help. We are hiring at Sky, and so is CECG, so if you have any questions, please let us know.