for a little bit about who I am. So my name is Karsten, as Liz nicely stated. I'm a senior systems engineer and I work at IKEA Retail, the Ingka Group; it's a franchise, for those of you who don't know. I'm working in our hybrid private cloud team, mainly focusing on on-prem data centers and also service mesh technologies and things like that. I'm a firm believer in KISS, keep it simple, stupid; it makes your life easier, at least when you work in infrastructure. I've been working on different things around this, and lately I've joined the Cilium contributor list.

So who are we? The Ingka Group operates 379 IKEA stores in 31 countries. We have more than 170,000 co-workers, and in fiscal year 22 we had a total revenue of around 42 billion euros, we saw some 680 million visitors to our stores, and also 3.8 billion visitors to our website. The team that I'm working in is the digital team within Ingka, and we create the digital part of the IKEA experience for the many people, and that includes customers and co-workers, financial systems, et cetera. We're currently around 3,000 people, and if you're interested, we are hiring, so you can visit our website on joininca.com.

So let's get into it; we have some topics to discuss today. Like most enterprises, we started out on-prem, having been built for operational purposes rather than consumption. So it involved a lot of change processes: if you wanted a VM, it could take weeks; if you wanted network to that VM, it also took weeks, et cetera. If you look at it, it's a typical case of how you used to work before cloud, and team members, and developers especially, would like something better than that. So our modernization journey has moved us towards the public cloud, but in an old company like IKEA, you just don't move everything. We had a no-lift-and-shift policy, so we had to create new services around things in order to get them up and running. One of the unwanted designs that could come out of this was having the front end deployed to the cloud and all the backend, including the data, sitting in the data centers, leaving traffic patterns that were not really desirable. It also created a much larger failure domain, since you're more or less stretching the data center from on-prem to the cloud. So we needed a cloud-like experience on-prem to unlock the development teams.

It's pretty easy to set up a Kubernetes environment on-prem these days through VMware or Nutanix or other platforms that support running VMs, but we didn't really have a good VM experience; as I just told you, it was pretty involved. And then we would also like to get out of some of the classical networking challenges, where you're always pointing fingers at someone else, because it's always the networking team to blame, or the firewall, or lately we have heard a lot about DNS, right? All of these problems we wanted to get out of, to have a more consumable networking infrastructure. There's always the case that it's kind of an illusion that you don't have any dependencies, because you have dependencies on everything, right? You have your dependency on the laptop, as I just showed when it didn't work. You also have the internet, upstream and downstream systems, et cetera. So you have a lot of dependencies, and it might not be as easy to get out of them.
So in order to get the foundation right: we have critical infrastructure like any other company or place, and the foundation needs to be right, otherwise it's not really going to work out for you at all. The first thing is that we run our infrastructure as hyper-converged. That means that we run networking, storage and compute all on the same box, and then we run software on top of that. So we can basically run non-branded hardware in this setup, and we get out of dependencies like vendor lock-in and other things. Now we're just, quote unquote, dependent on software. This stack is based on Kubernetes, Cilium, Bird, which is a routing daemon, and then Ceph for storage. We also manage this through some of the other CNCF projects, with Metal3 and Cluster API. We're doing that with GitOps, tied together with Argo CD. But we're going to focus on the networking piece in this talk.

So the network is one of the key things to get right: if you cannot get the bits across the wire, then what's the point? We really needed to look at that, and also to get out of the blame game, so that at least we have visibility into our network for our development teams. We needed a vendor-neutral networking stack, so we could build around it and choose software instead of proprietary hardware and protocols. Good performance is always a must; also, when you move into Kubernetes and microservices, you get a lot more east-west traffic. We needed easy scale-out. We are a small team and we intend to stay that way, so the scaling characteristics of the network should be the same as the software that we try to deploy: it should be scale-out rather than scale-up. We're not really interested in buying new switches whenever a faster switch comes out with 500-gig networking or so; we wanted to scale out. We wanted easy consumption, so things like ingress, load balancing and network policies, all software defined. And then we needed to integrate it into our legacy systems. It's a modernization journey, not a full replacement, so we need the environments to coexist for a long period of time. And then also, as previously mentioned, IPv4 exhaustion can happen in our environment. We have the roughly 400 stores and different cloud providers and other things that we run, so we have a lot of consumption around the network.

For the implementation part, we chose to go with BGP and ECMP to get redundancy and scalability. ECMP is equal-cost multi-path, and it enables you to run multiple NICs and add more NICs if you have a scaling problem. If you want to read up on high-scale networking, there are different articles around from Facebook and other cloud providers on how they do networking. And then, since we are at CiliumCon, we're going to talk about all the Cilium features that we use in our setup. We have Cilium deployed. We also use Hubble to get visibility. We're using the integration between Cilium and BGP, and also the new LB IPAM feature; I think it came in 1.13. We're also using the ingress capabilities of Cilium Service Mesh, as well as the egress gateway. And we're using cluster pool IPAM v2, as it has a lot of needed functionality around saving IPs when you run a setup like ours.
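As a rough sketch, that feature set maps onto Cilium Helm chart values along these lines. This is illustrative rather than our exact configuration, and key names and accepted values shift between Cilium releases, so treat it as a starting point and check the chart documentation for your version:

```yaml
# values.yaml -- illustrative sketch only, not the exact production values
kubeProxyReplacement: strict        # newer releases use a boolean true/false instead
bgpControlPlane:
  enabled: true                     # Cilium BGP control plane, peered with the local Bird daemon
ipam:
  mode: cluster-pool-v2beta         # cluster pool v2: small per-node CIDR slices, returned when unused
  operator:
    clusterPoolIPv4PodCIDRList:
      - "10.0.0.0/16"               # example pod CIDR, not the real one
    clusterPoolIPv4MaskSize: 26     # example per-node slice size
ingressController:
  enabled: true                     # layer 4/7 load balancing via Cilium ingress (Envoy)
  loadbalancerMode: shared          # one load balancer IP shared by all Ingress resources
egressGateway:
  enabled: true                     # predictable egress IPs per namespace/policy
hubble:
  relay:
    enabled: true
  ui:
    enabled: true                   # flow visibility, including ingress and egress traffic
```

The nice part is that all of this is plain declarative configuration, which is what lets us manage it through GitOps.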
And then let's go into the assembly part. Some of you might recognize the format: we have an IKEA cloud native assembly manual. It has Cilium, and Cilium brings Envoy. And then we're using Kubernetes. For this assembly, you would need Git and a bunch of your favorite programming language. We chose to use Argo CD; you could bring Flux CD or another GitOps tool if you'd rather use that. We use Metal3 and also Ansible. And if you have problems assembling this, we are on Slack, so maybe we can answer some of your questions.

So you have your IKEA bookshelf, and then you start by putting in some switches and wiring them together. We need to put in our servers as well and connect them to the switches. We use Ansible to set up our switching infrastructure, setting it up with some VRFs and some BGP configuration. A VRF is a way to virtualize your switch environment, so you can run multiple routing tables; they look the same, but they can coexist with this technology. And then we're putting in the Cluster API config to get our Kubernetes environment up and running. Metal3 is used together with the kubeadm bootstrap provider from Cluster API, and we're using the user data to put in our Bird configuration. Since we're running hyper-converged, we need the servers to become routers before we can actually install Kubernetes, so we're putting the Bird configuration on top of that. And then we're also adding the BGP control plane when we install Cilium, peering with localhost, as we have Bird, the routing daemon, running on all nodes in our environment. So we peer Cilium with localhost, as you can only have the one peering out of a given server.

It's always nice to take a step back and look at what you've built so far, and maybe take a coffee break. We can see that we have now set up our Bird peering with our leaf switches and with Cilium. We can also see from our management setup that the pod networking is actually present on our management server and that we have the ECMP environment, so there are two routes for the same network. All is good.

Let's move on to the LB IPAM, which is always enabled; all you need to do is add a pool to Cilium. So we have done that, and we are running a single pool in this case. We also enabled Ingress in the same step to get both layer four and layer seven load balancing capabilities. We are using the Cilium Helm chart, and that comes with built-in capabilities for enabling the Hubble UI. I guess some of you will recognize it, but we're also using cert-manager to provision our certificates in this case. And then again, we'll have a look at what we have now. Like I said previously, KISS is always good, so we keep it to a single pool, as we don't have a strict need to run multiple pools or to let our tenants consume different types of pools; we have just the single one. Again, it has redundancy with the ECMP functionality, and we can also get better performance from that: if we run out of bandwidth, we can always just add multiple paths to our BGP environment and then we will have more capacity. And for Ingress, when you enable it, you will also get the ingress load balancer set up. We use the shared ingress mode, as it saves on IP consumption: when you use shared, you get a single IP that is used for all Ingress configuration. With that approach you can of course also run into problems if you override the same configuration, so you have to be somewhat careful when you do this. But anyway, it's an effective way for us to consume these load-balancing features.
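As a minimal sketch, the localhost BGP peering and the single load balancer pool can look something like the following. The ASNs, addresses and labels are made-up examples, and some field names differ between Cilium releases (for instance, newer versions of the pool resource use `blocks` instead of `cidrs`):

```yaml
# Peer the Cilium BGP control plane with the Bird daemon running on the same node.
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: bird-localhost
spec:
  nodeSelector:
    matchLabels:
      bgp: enabled                    # example label applied to all worker nodes
  virtualRouters:
    - localASN: 64512                 # example private ASN for Cilium
      exportPodCIDR: true             # announce the per-node pod CIDR
      serviceSelector:                # announce LoadBalancer IPs handed out by LB IPAM
        matchExpressions:
          - key: does-not-exist
            operator: NotIn
            values: ["never"]         # documented trick to match every Service
      neighbors:
        - peerAddress: "127.0.0.1/32" # Bird listens on localhost on every node
          peerASN: 64513              # example ASN for Bird
---
# A single pool of LoadBalancer IPs for the whole cluster (KISS).
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: default-pool
spec:
  cidrs:                              # called "blocks" in newer Cilium releases
    - cidr: "192.0.2.0/24"            # example range, routed towards the leaves via BGP/ECMP
```

With the shared ingress mode, all Ingress resources then share one IP from this pool.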
Before we had this in Kubernetes, we had external load balancers, and that was not really a nice integration path for observability, because you have disconnected systems with different ways of handling traffic patterns and other things. Hubble has a nice integration for this, so you can actually see your flows showing up in Hubble.

And then we're moving on to egress, one of the key features that we also need. We run multi-tenant environments, and when you do that in Kubernetes, usually you're using the node IP ranges to exit your cluster for external traffic. It's not really feasible to allow all-or-nothing towards our other on-prem environments. The egress gateway allows us to create policies; in this example, we have set up a demo namespace that gets an IP, and we're routing everything, a 0.0.0.0/0 CIDR range, so that all traffic leaving the cluster is matched by this policy. This gives us predictable IP addresses towards our end consumers, which is really needed when you are setting up your legacy firewalls for protection; those are usually IP-based, so we can actually do this. We can also be more fine-grained: if you want to run more egress gateways, you can easily do that.

And for the IP exhaustion, as I said, we enabled cluster pool v2 in order to get the ability to limit the amount of IPs that is actually used. In our case, we are using pretty large worker nodes, so we can run more than 500 pods on each node. In order to assign enough IPs to those nodes, you would need a pretty large network segment attached to every node. That can leave a lot of waste, especially because we're also running KubeVirt in order to run VMs in these environments. If you have a node that will run maybe 10 VMs instead of 500 pods, you have a lot of IP waste if you're not running cluster pool v2, which enables you to slice your network into much smaller segments, and Cilium will make sure that it gives your IPs back into the greater pool, so it brings a lot of flexibility. So we solved major things with this integration.
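As a sketch of the kind of egress gateway policy described above (the namespace, labels and IP are illustrative examples, not the production values):

```yaml
# Route all external traffic from the "demo" namespace out via a fixed, predictable source IP.
apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: demo-egress
spec:
  selectors:
    - podSelector:
        matchLabels:
          io.kubernetes.pod.namespace: demo   # match every pod in the demo namespace
  destinationCIDRs:
    - "0.0.0.0/0"                             # match all traffic leaving the cluster
  egressGateway:
    nodeSelector:
      matchLabels:
        egress-gateway: "true"                # example label on the gateway node(s)
    egressIP: "198.51.100.10"                 # example source IP that legacy firewalls can allow
```

The legacy firewalls then only need to allow that one source IP, instead of every node IP in the cluster.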
Yeah, I spoke way too fast; I guess I became a little nervous. But that means we have plenty of time for Q&A. Thank you very much.

And we have about five minutes for questions, if you have questions.

Hi, my name is Sangamitra. I have been using Kubernetes for the last three, four years or so, and regarding networking: I see that you are creating a load balancer and other stuff. So is this new engine the ingress, is it going to replace it? Or is it the new Calico for the network policies? So what is the viewpoint of Cilium, what is the scope you are targeting?

I couldn't really hear the question.

Is it going to replace the market of Nginx, what we use in Kubernetes for creating ingress and egress, or is it about creating network policies, like we have Calico, Azure CNI, VPC CNI and so on? So what is the scope of Cilium? Are you going to incorporate these two combinations for the overall networking of the Kubernetes architecture? That's my question.

Your question is whether we're using Nginx as ingress, together with...

I want to replace Nginx ingress.

So currently we are leveraging the Cilium ingress capability, so we don't have any Nginx running for ingress. We're also using Cilium's network policies, so you wouldn't really need to have Calico for that either.

So that means that whether my platform is in Azure or AWS or Google Cloud, I can get rid of Nginx and I can get Cilium for both network policies plus the ingress and egress traffic, right?

So, someone is nodding. I haven't been using Cilium that much in the cloud, so I wouldn't be able to answer, but the people that have are nodding: you can use Cilium for those capabilities. Usually with cloud providers you have the ability to consume the cloud provider's load balancer, so for some use cases that might be more interesting, but we're using Cilium for that piece.

And you specifically said that you are using the Cilium ingress controller from Cilium Service Mesh, but are you also using the service mesh capabilities, or are you using something else for service mesh, or no service mesh at all?

Currently we're only using the Cilium ingress controller from the service mesh. We have a quite involved, I would say, service mesh story that Cilium doesn't support yet. There was a comment earlier around cluster mesh that will at least solve some of our problems with using the Cilium service mesh, so that might be interesting to look into, for us at least.

Thanks.

But then again, you get features with Hubble like this observability, and that's usually one of the key things people are using a service mesh for, at least initially: they want observability in their microservices environment. So you actually get that for free just by installing Cilium and using Hubble.

In terms of effort and timelines, how long would setting up something like this take, and what are the difficulties, maybe with the people in the organization, the network people: how would they accept a project like this, let's say?

Yeah, so to pull this off, it has been a multi-year journey, especially with the people. For us it hasn't been an easy path. I wouldn't use the word shadow IT, but maybe a little bit. You have to get the right people in the organization and get them interested in this piece, because as we're removing some of the obstacles, we're also getting into "this is our domain, so why are you bringing this functionality into your area?" So we have spent a lot of time on that. It's also why we're using Git, because then you can use pull requests and you can have code owners, so you can actually give back the responsibility for key infrastructure pieces. We can do that with code owners in Git, and then you can include, for example, a networking team as the people involved in those areas, so they keep the responsibility for that area. That makes it easier. It also requires quite a bit, because usually those teams, at least at Ingka, haven't been used to working with Git and pull requests; they have been used to other tools from other vendors, so there has been a lot of learning in different teams in order to get into this setup.

Oh, I'm out of time. Thank you very much again for your talk.