Amazing. Good morning, everyone. Hello. In this session, I will walk you through how our team designed a multi-tenant cluster to host different types of workloads while ensuring security across different tenants and applications. My name is Ahmed Bebars. I'm a software engineer at The New York Times on Delivery Engineering, and I'm so excited today because this is my first session as a speaker at KubeCon. As you can see from these little icons, I'm a proud dad. I write a lot of Go for Kubernetes operators, applications, and other things, and I also enjoy building on top of AWS. I'm also a scuba diver, so if you have recommendations, please hit me up, but only warm places. I don't like cold water. So let's dive in.

Before we start, let me tell you a little bit about The New York Times. Our mission is simple: we seek the truth to help people understand the world, and we are doing this by aiming to build the essential subscription bundle for every English-speaking, curious person who seeks to understand and engage with the world. The New York Times news and journalism is the most recognizable product we have, but we also have other products: Games, which includes the Crossword, Spelling Bee, and Wordle; Cooking, for amazing recipes; Wirecutter; Audio; and The Athletic as well.

To get on the same page, here are a couple of distinctions to explain the upcoming sections. When I refer to the platform team, I'm referring to Delivery Engineering; that's my team, where we build and operate the platform. When I refer to a team, that's an engineering team: our amazing engineering teams across the company who are building the applications and products I just mentioned on the previous slide.

Here's our agenda for today. I'm going to cover a few topics: why we are building an internal developer platform at The New York Times; how we designed our container runtime platform; why we chose Cilium as a CNI; day-one operations, meaning our Cilium setup and how we are using it; day-two challenges, and that's the important part; and what we have learned so far.

So, starting with the developer journey. Developers have a journey, much like a customer journey, and here I try to identify the steps most developers go through, from collecting their business requirements until they deliver their application. As you can see, there are a few steps a developer goes through: it starts with an idea, followed by design, and then the other actions each developer needs to take to deliver their application. But notice that I call these a few distinct steps, because this is where we can help: we can find the similarities between these unique steps and give developers a seamless experience.

Let me give you an example. Here are a few colors; think of them as color palettes, and the ask is to mix them together. Imagine how your team would do this. I'm going to pause here for about 10 seconds. It's going to be awkward while we go through this exercise together. Imagine mixing all of these together on one canvas, and how it would look from your perspective. That's awkward, I know. Okay, here we go. You can see the results: every team paints its own perspective, its own unique image. We gave them the same tools, but each built its own process around them.
So the goal here is not to limit innovation, because we know we have very smart people, but to help them onboard to the process and make that process seamless and easy for them. The goal is to help engineering deliver their products to our subscribers, and we designed a platform to do that. It starts by creating an application template with the resources your application needs to get deployed. Then we give you the space to innovate, code, and build all of the business logic you need; then our centralized CI/CD system, where you can build and deploy your application; then our runtime, which is the focus of my talk today; then, of course, ingress, to get traffic into your application; and, most importantly, observability across all of it, because you need to monitor every one of these steps.

So we talked about why we built an internal developer platform. Now I'm going to talk about how we orchestrated our runtime setup. We experimented with different setups for our cloud accounts, and we found that a multi-account architecture is the one we wanted to go with, for a few reasons. We can group workloads with a common business purpose into distinct accounts, avoiding dependencies and conflicts. We can apply different security rules to different accounts: development, production, non-production, sandbox. We can also limit the scope of impact, achieve resource independence, and manage cost across accounts with different methods of cost allocation.

When it comes to building Kubernetes clusters, there's often a dilemma between two options, and most of you may have been through this. Do we use many single-tenant clusters, so each team gets independence: "I can do my own stuff, I don't have to worry about others," no noisy-neighbor problems, dedicated resources? Or do we go with multi-tenant clusters, which carry less operational overhead, still give teams the space to innovate, and let us optimize for cost?

So we started to look at the main design considerations for building our Kubernetes clusters. Since we already have isolated accounts by default, we began with network isolation. It's crucial to ensure that each workload has boundaries from other services, and that those boundaries are never crossed unless crossing is explicitly defined; by default, all communication between services is denied. Another factor we looked at was role-based access control: it's essential to have proper access so that when a tenant is onboarded to a multi-tenant cluster, they don't gain access to resources they don't own. And last, resource management: if we run a multi-tenant cluster, we want to optimize for cost and make sure everything runs efficiently.

After careful consideration of our design requirements, we came to the conclusion that multi-tenant clusters fit our use case, and we recognized this approach would help us achieve our goals and support creating our runtime environment. To do that, we started to build Kubernetes clusters across multiple regions for disaster recovery, making sure workloads can fail over between regions efficiently. And it's important to understand that there's no one-size-fits-all: this design worked for us; it may not work for you, and you may have to figure out a different design.
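As a rough illustration of that deny-by-default baseline, a minimal Kubernetes NetworkPolicy looks something like this; the namespace name is a placeholder, and this is a sketch rather than our actual manifest:

```yaml
# Deny all ingress and egress for every pod in a tenant namespace
# unless another policy explicitly allows the traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: tenant-a        # placeholder tenant namespace
spec:
  podSelector: {}            # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```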
So we're talking about network isolation, but we're still running multi-tenancy, so we need the right networking tools in place to ensure no boundaries are being crossed. That's where we started to look into different CNI options, and Cilium was the one we decided on.

First, let's talk about performance. What you see here is from the CNI benchmark post on the Cilium blog. A few components are removed in favor of Cilium, and that of course means there's a performance improvement to be had: using eBPF for routing means we shift traffic filtering into the kernel, which speeds up processing. Cilium published this originally, and there are other CNI benchmarks that show the same result; you can read the original benchmarks on the Cilium blog for a more detailed comparison between the options.

On top of that, one of our other considerations was network isolation, and we want the right things in place for it. Cilium network policy is a great extension of the Kubernetes NetworkPolicy API: it brings policies at L3, L4, and also L7, with support for DNS and identity for services. There are also policy enforcement modes. The default mode is good for most cases: everything is allowed until a policy selects an endpoint, and from then on only what you specify is permitted. There are other modes, like "always," where everything is restricted by default, which you can enable for more secure environments. And there are cluster-wide policies, which streamline the process of applying something across all of your applications, regardless of where an application is hosted and no matter which namespace you are in.

Then observability is important. You have probably heard about Hubble; here's a small diagram from the Hubble UI, where you can see the service map and how traffic is routed between different services. But the most important part of the Hubble UI is the flows themselves. If you look deeper into a flow, you'll see full, rich information about every single packet traversing between services, and that's important for building an understanding of how traffic flows between services. It also comes with a lot of metrics out of the box: endpoint states, traffic dropped, traffic allowed, all the kinds of metrics that let us observe our clusters more closely.
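To give a flavor of the policies mentioned a moment ago, here is a minimal sketch of a cluster-wide Cilium policy combining L3/L4 rules with an L7 DNS rule. The FQDN is hypothetical, and this follows the upstream CiliumClusterwideNetworkPolicy API rather than our production policies:

```yaml
# Cluster-wide policy: the empty endpointSelector applies it to every
# endpoint in every namespace.
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-dns-and-one-fqdn    # hypothetical example policy
spec:
  endpointSelector: {}
  egress:
    # L3/L4 rule plus an L7 DNS rule: pods may query kube-dns, and the
    # DNS proxy records lookups so the FQDN rule below can match them.
    - toEndpoints:
        - matchLabels:
            io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    # FQDN-based rule: allow HTTPS to one (hypothetical) external host.
    - toFQDNs:
        - matchName: "api.example.com"
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
```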
So how are we using Cilium? Like many organizations, we use Terraform for cluster provisioning, but there are a couple of wrinkles. EKS, which we build on top of, ships with its own CNI and with kube-proxy, so we need to remove those, install Cilium as the CNI, and then run the rest of the process. There are two ways to do that: provision the cluster and do the steps manually, or script it in some way. We don't prefer to do it manually, so we created a small script that does all of that for us when a cluster is provisioned.

After the cluster is provisioned, we provide some base Helm values and a script that gets everything sorted, installs Cilium for us, and marks all the nodes as ready. To walk through what the script needs to do: first we remove the AWS CNI; after that we remove kube-proxy; and then we install our CNI configuration, which, as you can see here, includes an ENIConfig, some subnet tags, and some masquerading configuration.

You have all heard about IP exhaustion, and that's why we use ENI mode for Cilium on top of EKS. There are a couple of things here. First, on that side you can see the 10.x IP space; this is what gets attached to the Cilium agent and the nodes themselves. Because we are running a multi-account architecture, if we kept using the 10.x space we would run out of IPs, and that would happen quickly. So we started to provision additional subnets in the 100.64 space, which let us communicate and run more pods, and because we are in a shared environment, we need to ensure we can scale.

A couple of things to unbox here: that's another subnet, but you can see a couple of ENIs, and you can see multiple IP addresses and prefixes. One limitation when you run EC2 instances is that there's a cap on the number of ENIs you can attach to an instance, and depending on which instance type you run, that can be problematic for shared clusters. One of the things Cilium supports is IP prefixes: with the same number of ENIs, instead of having a single IP or a few IPs per ENI, you can have multiple /28 prefixes, which lets you expand, I believe, 16x the number of pods you can run on a node, and we run different node sizes as well.
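As a sketch of what such base Helm values can look like for Cilium on EKS in ENI mode with kube-proxy replacement and prefix delegation (value names follow recent versions of the upstream cilium/cilium chart; the API server endpoint and subnet tag are placeholders):

```yaml
# Illustrative base Helm values for Cilium on EKS in ENI mode.
ipam:
  mode: eni                          # pod IPs come from ENIs in the VPC
eni:
  enabled: true
  awsEnablePrefixDelegation: true    # /28 prefixes instead of single IPs per slot
  subnetTagsFilter:
    - "usage=pods"                   # placeholder tag marking the 100.64 subnets
routingMode: native                  # no overlay; the VPC routes pod traffic
kubeProxyReplacement: true           # eBPF replaces the removed kube-proxy
k8sServiceHost: example.eks.amazonaws.com   # placeholder API server endpoint
k8sServicePort: 443
```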
So how does the process start? You get an account, and from there it's completely automated: a CRD gets created, and we have an in-house operator that takes that CRD and converts it into Kubernetes resources. You start with a default namespace, roles, and everything that ties back to your cloud account and gives you access. More importantly for this talk, there's the Cilium network policy: we start with a network policy that only allows traffic back to your own account. Because we are running shared clusters, we want to ensure isolation; all of our accounts are connected because we are shared, but we still want to ensure your services can only talk back to your account unless you need to extend those capabilities. There are other possibilities with Cilium network policy as well. For example, if you are in AWS you are familiar with the IP address 169.254.169.254, the instance metadata service; we want to make sure we block it, and we also block other IP ranges you are not allowed to talk to unless we give you access.

So far so good; everything is great. And then comes the calm before the storm, which is where our day-two operations kick in. Everything is running okay, we are onboarding services, everything is great. Before I tell you what happened and how we survived it, let me tell you about our ingress model. It's a critical service: all of our traffic goes through Envoy and then on to the upstreams. We had just onboarded this critical service into our cluster, and traffic was flowing, but then this happened: traffic was still coming in, but something was off, and we couldn't understand what was really happening. We looked through it, and it seemed like one of the Envoy pods wasn't working. We couldn't blame Envoy for certain, but we weren't sure, so we removed the pod, everything worked, and we moved forward, still not knowing what exactly the problem was.

So we started to investigate. First: is it Envoy? Envoy could have a misconfiguration, and that's why the pod was not behaving as expected, because once we removed it everything worked as expected. But then we discovered it might also be DNS: we looked into that specific pod and figured out it was not getting answers to all of the DNS requests it made. It was a weird situation, because DNS didn't fail every time; sometimes yes, sometimes no. So it seemed to be more of a networking issue, and that's when we decided to look further into Cilium to understand what was really happening.

The bug hunt started, and we tried to use all of the tools at our disposal to understand where the problem was. Cilium gives us a sysdump tool, which dumps everything, from logs to configuration, into one archive. It's huge, so don't try to read all of it, but it's thorough. We went through it, reading and understanding how things work, and we found a couple of things, like the CiliumNode CRD, which is how Cilium manages IPAM resources in a Kubernetes cluster. As you can see here, it has the ENI addresses, the pool of IPs it's using, and all of that kind of information. And what we found is that there's a missing IP.
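Heavily trimmed, a CiliumNode object looks roughly like this; the node name, ENI ID, and addresses here are made up, but it shows where the per-node IP pool lives:

```yaml
# Trimmed, hypothetical CiliumNode resource. Real objects carry far
# more ENI and status detail.
apiVersion: cilium.io/v2
kind: CiliumNode
metadata:
  name: ip-10-0-42-7.ec2.internal        # placeholder node name
spec:
  ipam:
    pool:                                # IPs the agent may hand out to pods
      100.64.12.33: {resource: eni-0abc1234}
      100.64.12.34: {resource: eni-0abc1234}
status:
  ipam:
    used:                                # IPs currently assigned to pods
      100.64.12.33: {resource: eni-0abc1234}
```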
Here's what happened: we had an Envoy proxy with an IP in the 100.64 space, but when we looked at the node, that IP wasn't there. So something was wrong, and we were trying to understand what was really going on. It turned out this was the sequence: the IP gets attached to the pod, and then it gets marked as an IP that should be released. Cilium has a feature that allows it to release excess IPs. So a pod gets created and an IP gets attached, and then Cilium decides this IP should be removed and starts to release it. By that point the pod had already started successfully and passed all of its probes, but then it lands in a bad state because it can't communicate. To fix it, you can simply go and add the IP back and everything works, but that's not going to work for us; we can't go and add an IP back at 2 a.m. every time. You can also simply disable the excess IP release, but we are talking about IP exhaustion: we want to make sure we release all unused IPs rather than keep them, because we are running on a shared IP pool even within the cluster, so all excess IPs should go back. I don't know if you can read the logs here, but you can see the pod creation time, and then the IP being released afterward. Disabling the release was the intermittent fix: okay, let's turn it off for a second, make sure everything is stable, and then move forward.
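If you deploy Cilium via Helm, the toggle looks something like this (value name per the upstream chart; shown here as the temporary mitigation, not our long-term setting):

```yaml
# Stop-gap mitigation: keep ENI IPs attached instead of releasing excess
# ones, trading some IP-exhaustion pressure for stability while the
# upstream race was being fixed.
eni:
  enabled: true
  awsReleaseExcessIPs: false   # re-enable once the fixed release path lands
```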
But then we went to look into the code to understand what was really happening, reading through it and trying to identify where the bug was, and that's where the community helps. Looking into the code, we found that Datadog had already contributed a fix for this; I think they have a talk today, though I'm not sure if it's about the same thing. Basically, there was an edge case that happens when an IP is being assigned and released at the same time. What the fix introduces, and this is from the GitHub pull request for the issue, is a sort of handshake around excess IPs: when a pod gets terminated, there's a handshake between the Cilium agent and the operator to update the cache and make sure the IP actually gets released and is no longer available in the cache for the agent to assign again. And that was our hunt.

So what have we learned so far? We learned that eBPF is powerful: we started by using it through Cilium as our CNI, for networking and all of these network policies, and we are looking forward to other opportunities where eBPF can be helpful. Observability is very important to us, to understand all of the traffic coming from our clusters and platform. And in the end, we appreciate the community: we found something that wasn't working for us, and other people had already contributed the fix upstream and solved the problem by the time we upgraded to the next version. I believe that's about all. I'm joined here by my colleagues; we're going to have other talks from The New York Times about Argo and how we deployed it at The New York Times, and we're also going to talk about OPA later, on Thursday I believe, and using it for testing and GitOps. That's all. Thank you.

Q: Testing, one two. Hey, Marco Hemkin here. Just a quick question about your pod network: you're not using an overlay network for the pods, you're using the VPC network for the pods?

A: That's correct. We're using a secondary CIDR in the VPC. A CIDR gets assigned to the node from the original CIDR that spans our network across accounts, but we have a special secondary CIDR, the 100.64 space, that is assigned just to the pods. So when traffic goes outside the account, it goes out with the 10.x space, but inside the clusters it always has a 100.64 address.

Q: Hello. For multi-tenant purposes, do you plan to use different private CIDR addresses for the pods? You seem to use only the 100.64 subnet for all the pods; do you plan to split it to get truly isolated CIDR IPs for your tenants?

A: I'm not sure I heard the question; do you mind repeating it, please?

Q: For the pods, you've got a 100.64 CIDR address plan. Do you split it and isolate it per tenant, or do you just use the 100.64 range for all the pods?

A: If I understand the question correctly, do we split the CIDRs between tenants across our clusters? No, we don't. If you have your own account, you get your own CIDR in your VPC, but in our runtime environment we have big CIDRs that we use, and we just give you space at whatever address we need to. Then we use your namespace and the identity of the service to understand where you are going and how we can ingress and egress traffic to your services.

Q: So does this mean that Cilium does NATing on individual nodes? Normally when you use a CNI or network plugin, it's a one-to-one mapping: the IP addresses are allocated while the cluster is built. But some CNIs do NATing inside, so when you spin up multiple nodes you're not dependent on a fixed network. Does Cilium use the same kind of strategy, NATing the IP addresses, so that when you troubleshoot you see a different IP address, not the actual one?

A: If I understand the question, you are asking whether Cilium is NATing our traffic from pods to other accounts? I referred to that earlier. All of our nodes get 10.x address space, so if traffic crosses boundaries into a different account, that's the IP address that will show up: you'll see a 10.x IP on the other account. But if traffic is moving between pods, you will only see the 100.64 addresses. So Cilium is not NATing across accounts; it's NATing at the nodes themselves. We see traffic flowing between pods on a specific range, the 100.64 space, while the 10.x space is what crosses accounts, and of course if we're crossing the public internet there will be public IP addresses on the way out.

Q: Okay, thank you.
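A sketch of the masquerading side of that answer, again using upstream Helm value names: inside the native-routing CIDR, pod traffic keeps its 100.64 address, and anything leaving that range is SNATed to the node's 10.x address.

```yaml
# Illustrative masquerading configuration (value names per recent charts).
routingMode: native
ipv4NativeRoutingCIDR: "100.64.0.0/10"   # pod range routed natively, no SNAT
enableIPv4Masquerade: true               # SNAT to the node IP outside that range
```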
Q: Another question. You said you removed the kube-proxy DaemonSet and related components while deploying EKS. After the cluster is implemented and live, if there are issues with pod-to-pod connectivity on any worker node, does that mean AWS will not take responsibility for supporting it? Is it on us, the developers, to find the root cause and provide the fix, or go to the community and check with people? That's time-consuming. And does Cilium also have a licensing model where you can get support?

A: If I hear you correctly, you are asking: once we remove kube-proxy and other components from EKS, what's the role of AWS and Cilium here? I can't speak for AWS, but since we removed things that were provisioned out of the box, they wouldn't support that configuration as-is. For Cilium, we basically support it ourselves: we understand the process, we understand what's happening, and we work things out with Cilium. I know that Cilium has commercial offerings as well. I think I'm out of time, so thank you all.