Hey, so for our next session, please welcome Timur Solodovnikov, who's going to be talking about the use of the Cilium CNI in ClickHouse Cloud. So let's give Timur a big round of applause.

Thank you for joining this session. Before talking about Cilium and networking, I want to share some interesting facts about our journey. It took only six months between hiring the first engineer on the data plane team and opening our cloud for private preview, so it was really important to find the right technology for each piece of our stack. And after running Cilium for more than a year in our cloud, I can tell you that Cilium is a great choice for our platform.

About me: my name is Timur Solodovnikov. I'm a site reliability engineer on the infrastructure team. I joined ClickHouse in June 2022, right after the cloud opened for private preview. Our team is involved in multiple projects related to infrastructure improvement and management.

For those who don't know, what is ClickHouse? ClickHouse is an open source, column-oriented, distributed database. It was open sourced in 2016 under the Apache 2.0 license, and since then it has gained huge popularity. A lot of companies use it for analyzing big data; that's what ClickHouse was actually built for back in 2009, analyzing huge amounts of data. People usually use it for analytics, dashboards, and ad hoc queries for near-real-time analytics. And we are building a cloud on top of ClickHouse.

I want to show a quick demo of the user experience. In our cloud, you can create a ClickHouse instance in seconds. We currently support AWS and GCP: eight regions in AWS and three in GCP. You just need to pick the right region and set a few settings. For example, we support autoscaling: you can define minimum and maximum limits for an instance, and the instance will be scaled up and down based on your workload.
We also support scaling to zero, which means that if you don't have any activity on your instance, the instance will be paused and you will not be paying for your compute resources. You can connect to ClickHouse Cloud using standard tools such as the ClickHouse CLI. We also provide a lot of drivers and connectors; most programming languages have connectors to ClickHouse. And we have the ClickHouse SQL UI, a UI that's part of ClickHouse Cloud where you can build queries and visualizations, and it can even build queries for you using generative AI.

OK, about our stack. Currently we have the cloud deployed in AWS and GCP; next year we will open it for Azure. We use managed Kubernetes services and Terraform for infrastructure as code. Our compute and storage are separated: in AWS we use S3 for storing ClickHouse data, and in GCP we use Google Cloud Storage. We use Istio for ingress; Istio is the single point of entry to our cloud. And we use ArgoCD for managing our Kubernetes manifests.

During the initial design we had some requirements for our networking stack. One of the most important was performance. ClickHouse is a blazing fast database; it can easily saturate the network bandwidth of a compute instance, so it was really important that the CNI shouldn't be a bottleneck for the database. The networking stack also had to be easy to debug: our company is relatively young, and we don't have network engineers, mostly software developers and SREs. Another really important requirement was network isolation. We host multiple instances, and it's really important to have isolation on the networking layer as well. We reviewed multiple options, for example multiple AWS accounts, VPCs, subnets, or multiple Kubernetes clusters, but those options were not really good for us because they would add a lot of maintenance burden and are just hard to manage.
So we decided to start with a single Kubernetes cluster with multiple namespaces and network isolation using network policies. Initially, we actually started with Calico, and it worked for us, but later we switched to Cilium because Cilium simply had better performance compared to Calico. One of the reasons is eBPF. Another advantage of Cilium is its network policies, the Cilium network policies. To be fair, Calico later also added eBPF support, but we still decided to go with Cilium.

What is a ClickHouse instance? We run our compute in Kubernetes. A ClickHouse instance is a set of pods deployed in a single namespace. We have our operator that creates all the Kubernetes objects, such as the ClickHouse server pods and the Keeper pods (Keeper is our implementation of ZooKeeper). Our management plane also creates an IAM role and attaches it to the ClickHouse pods; this role has read-write access to an S3 bucket. And the operator creates the network policies. I'm skipping a lot of information here, but I just want to highlight some items. In general, we allow access between pods within the namespace. We also allow inbound connections from certain pods, for example for monitoring, and from Istio for ingress. And we allow outbound connections from the namespace. Why do we need that? Because our customers can ingest data from the internet. It could be, for example, a Kafka cluster deployed on the internet: customers connect to this Kafka and load data into the ClickHouse server. But we want to make sure we block certain connections. For example, we block access to the CIDRs that we use in our VPC, and we block access to the AWS metadata service. So using Cilium network policies, we achieve network isolation.

I'm going to talk about inbound connection handling and how it works in our cloud. As I said before, we use Istio for ingress.
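As a rough illustration of the egress rules just described, a Cilium network policy that allows traffic inside the namespace and internet egress while carving out our VPC CIDRs and the AWS metadata service might look something like this sketch (the namespace name and CIDRs are invented for the example):

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: instance-egress
  namespace: instance-abc123        # hypothetical instance namespace
spec:
  endpointSelector: {}              # all pods in the namespace
  egress:
    # allow pod-to-pod traffic within the same namespace
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: instance-abc123
    # allow outbound to the internet (e.g. a customer's Kafka cluster),
    # but block our own VPC CIDR and the AWS metadata service
    - toCIDRSet:
        - cidr: 0.0.0.0/0
          except:
            - 10.0.0.0/8            # hypothetical VPC CIDR
            - 169.254.169.254/32    # AWS instance metadata service
```

The real policies our operator generates are more involved; this only shows the allow-within-namespace plus allow-internet-except-CIDRs shape.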
As you can see on the schema, we actually run Istio in a dedicated Kubernetes cluster. One of the reasons we decided to deploy Istio separately is to protect the Istio pods from potential bugs: if, for example, a customer escaped their pod, they wouldn't get access to the Istio pods, certificates, and secrets. How it works: when a customer establishes a connection, it lands on the Istio pods, and using the SNI header, Istio forwards the connection to the proper database backend. Everything works fine. You might wonder how this works with network policies, because we need to allow access from the Istio pods to the database server. It's really easy: we use the Istio labels to allow inbound connections. And everything is OK, right? Normally that wouldn't work across two clusters, but thanks to Cilium cluster mesh, which we run between the proxy cluster and the data plane cluster where we deploy the database pods, we can write network policies as if the Istio pods were deployed in the same cluster. It's a really, really cool feature for us. There are other cluster mesh features that we do not use. For example, pod IP routing: we don't need it because we install Cilium in ENI mode, which means our pods receive IP addresses from the AWS VPC, so all routing and forwarding is done by AWS, not by Cilium.

OK, how do we install Cilium? We use Helm for the installation. First, we delete the standard networking DaemonSets created by AWS, and we create a certificate for establishing trust between the proxy cluster and the data plane cluster; the certificate is managed by cert-manager. Next, we install Cilium on the data plane cluster, and as part of this installation we expose the Cilium cluster mesh API server through an internal load balancer. We also set the cluster name and ID, which must be unique within the cluster mesh. As the next step, we install Cilium on the proxy cluster, which is the cluster where we run our Istio pods.
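For the data plane installation step just described, the Helm values might look roughly like the following sketch. The key names follow the upstream Cilium Helm chart and can differ between chart versions, and the cluster name, ID, and annotation here are illustrative, not our actual configuration:

```yaml
# values.yaml sketch for the data plane cluster
eni:
  enabled: true                  # ENI mode: pods get IPs from the AWS VPC
ipam:
  mode: eni
tunnel: disabled                 # native routing; AWS does the forwarding
cluster:
  name: data-plane-1             # must be unique within the cluster mesh
  id: 1
clustermesh:
  useAPIServer: true
  apiserver:
    service:
      type: LoadBalancer
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-internal: "true"
```

The proxy cluster gets an equivalent values file with its own unique cluster name and ID.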
We also expose the cluster mesh API server through an internal load balancer, and we set the cluster name and ID. We then create an ExternalName Kubernetes service pointing at the internal load balancer that was created in the previous step, so the proxy cluster can connect to the cluster mesh API server through this service. As the last step, we create an ExternalName service in the data plane cluster pointing at the internal load balancer in the proxy cluster, and we add the cluster mesh configuration as part of this step. So that's it. Unfortunately, as you can see, there are circular dependencies: load balancers are created in both the data plane cluster and the proxy cluster, and that's why we do this installation manually; it's hard to automate. It is possible to set up cluster mesh using the Cilium CLI, but we prefer to just use Helm.

OK. We use Cilium network policies in AWS, but unfortunately we cannot use them in GCP. One of the reasons is that we use managed Cilium in GCP, and with managed Cilium in GCP you cannot create Cilium network policies, so we have to create standard network policies. There are some limitations, though they don't really affect us. For example, with Cilium you can create cluster-wide network policies that apply to the whole cluster, while standard network policies are scoped to a namespace. Cilium network policies also support L7 protocols, such as DNS, Kafka, and HTTP; with standard network policies you can only use L3 and L4, and pod labels. Cilium network policies can target services; standard network policies can't, but as a workaround you can use the pod labels from the service's selector, so that's also not a big deal for us. Cilium also provides entities. An entity, as I understand it, is a predefined set of IP addresses that can be used in network policies. For example, you can target the kube-apiserver, the internet, or the host. There are multiple entities.
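To illustrate entities (keeping in mind we can't use these on managed Cilium in GCP): an egress rule toward the Kubernetes API server can name the predefined kube-apiserver entity instead of hard-coding addresses. A minimal hedged sketch, with a made-up pod label:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-to-apiserver
spec:
  endpointSelector:
    matchLabels:
      app: example-app           # hypothetical label
  egress:
    - toEntities:
        - kube-apiserver         # predefined entity; no addresses needed
```

Cilium resolves the entity to the API server's addresses itself, so the policy doesn't have to change when those addresses do.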
I didn't put all of them here. With standard network policies, you instead use CIDR notation.

OK. As I said before, it was really important to have tools that allow us to debug the network layer, and Cilium provides a really nice feature in Hubble UI. In Hubble, you can observe traffic in real time and see which flows are dropped and which are forwarded. A really nice tool.

Now I'm going to talk about problems we've had in our cloud. One of them is ENI limits. Because we run Cilium in ENI mode, Cilium needs to know the limits of the AWS instance types. As you know, each EC2 instance type has its own limits: how many network interfaces it has and how many IP addresses you can attach to each. Our Cilium version didn't have information about certain instance types. It was really easy to fix: in our case, we just ran a command we found in the Cilium source code and passed the generated information through Helm variables.

Another interesting problem: after optimizing our compute resources, our cluster autoscaler started to expand and shrink the cluster more aggressively. So what happened? In one ClickHouse namespace, one of the pods had problems connecting to the other pods, and initially it was really hard to figure out what was going on. What we did to fix the problem was drain the node where the pod was deployed, and that helped, but of course that's not a long-term solution. After debugging, we found the following. As you can see here, this is the IP address of the pod, and from the IP cache we got the information that this IP address had identity 4. We also got information about the endpoint itself using this command, and as you can see, it had the correct labels here. But when we checked what identity 4 is, we found that it actually carries the label reserved:health. That was the clue for us, because usually this label is attached to a Kubernetes node, not to a pod.
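The identity debugging just walked through maps to roughly these commands, run from inside the cilium-agent pod on the affected node (the IP address here is invented, and the output format varies by Cilium version):

```shell
# what identity does this node's ipcache hold for the pod IP?
cilium bpf ipcache get 10.12.34.56
# in our case this returned identity 4

# inspect the endpoint itself; its labels looked correct
cilium endpoint list | grep 10.12.34.56

# look up what identity 4 actually is
cilium identity get 4
# -> reserved:health, which normally belongs to the cilium-health
#    endpoint on a node, not to an application pod
```

The mismatch between a correct-looking endpoint and a reserved identity in the ipcache is what pointed at a stale node-deletion mapping rather than a policy mistake.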
That's how we figured out that something was wrong with the identity mapping. And we found a bug; fortunately, it was already closed. The bug was related to incorrect node deletion. In our case, we just updated Cilium, and that was it for us. Also, unfortunately, cluster mesh is not available in GCP or Azure if you use managed Cilium. I don't know why; maybe someday it will be implemented.

Overall, Cilium is a great choice for our platform. It helps forward a huge amount of data in our cloud, and we are growing. So far, we are happy with Cilium. If you want to try ClickHouse or learn more about it, please check our booth, M14. You can install ClickHouse using this command and run it locally, or you can sign up for ClickHouse Cloud and get $300 in credits. So check it out; ClickHouse is a really fast database for analytical workloads. Also, we are hiring; please check our website. We need talented engineers: we are looking for SREs and software developers. Thank you.

Thank you, Timur. If you have questions, we have the mic in the middle of the room there, so please just go up, form an orderly queue, and ask your questions. Thank you.

Hi. I have to ask: it sounds like there are some downsides when you select a vendor-provided managed Cilium solution, and certainly no guarantees of interoperability at any point in the future. Just curious what drove your organization's decision toward the managed service instead of being able to, I guess, carve your own destiny? Yeah, sure. We wanted to optimize our operations and move the burden of upgrades and patching to the cloud provider. That was the driving factor for us. So is that a comment about the administrative burden for the AWS deployments that you have? Yeah, kind of. It's not really painful to upgrade Cilium within a release, so far, but we wanted to try another way. And maybe we will change this in the future.
Maybe we will migrate to bring-your-own CNI. Ask one more. Sure. So if there were no future of managed services being interoperable across clouds, what would you do right now, if you knew that? Could you please rephrase the question? If you were told today that you will never have the ability to use what you might call vanilla Cilium across multi-cloud with managed Cilium services, what decisions would you make now based on that information? Of course. We would install our own version of Cilium manually and use that; we wouldn't use managed Cilium in that case. OK. Thanks very much.

Hi. Thanks for your presentation. Could you please take us back to the page with the IP address? Yes, this bit, yes. So could you explain how you fixed this problem again, please? You ran that AWS describe command and then? Yeah. So we faced this problem when we tried to use our cloud with different instance types. Basically, we started using new types: we created a new node group in Kubernetes, and pods couldn't start on this node group. That's the error we had. We found that Cilium didn't have information about the ENI limits for those instance types. I found this command in the Cilium source code, ran it, and it generated this configuration, which I then just passed through Helm variables. That's it. So essentially, what we have here means that this r6id.12xlarge has eight interfaces, I think, and you can attach up to 30 IP addresses. It's just information for Cilium about how to operate with those instances. Thank you. I asked that question because we are running into a cluster that has run out of IP addresses; Cilium tells you, I don't have enough IP addresses to spin up a new pod. It's something we've been dealing with for the past few days, and I'll see if this is going to help us. Thank you.