All right, I guess we can start the session. Thank you for coming to the Cilium maintainer track session. This is Scaling Up and Scaling Out Networking, Observability, and Security with Cilium. I'm Bill Mulligan, and I work at Isovalent. My name is Jeff Chen, from Trip.com Group. I'll be doing the first half of the presentation in English, giving an introduction and overview of Cilium, and then Jeff will do his half in Chinese, talking about how Trip.com uses Cilium at scale. So with that, let's get started.

Just a show of hands, who here uses Cilium right now? OK, cool. You're part of a great community. We have over 100 different companies that have publicly stated that they're using Cilium right now, and we have 45 public case studies. If you haven't added yourself to the list on GitHub yet, please do, so we can keep track of who's using Cilium.

Cilium is a project originally created by Isovalent and later donated to the CNCF. It's an eBPF-powered solution for networking, security, and observability. It's the CNI for almost every cloud provider in the cloud native ecosystem right now and almost all the Kubernetes distributions, so if you're running Kubernetes right now, you can use Cilium as your CNI. In fact, all the major cloud providers have now picked Cilium for networking and observability in their Kubernetes platforms.

If you aren't familiar with eBPF, that's OK. It's a Linux kernel technology, so very low level. What sets it apart, and really what makes Cilium powerful, is that it makes the Linux kernel programmable in a safe and efficient way. One analogy people like to use is what JavaScript is to the browser: before JavaScript we had flat, static web pages, and with JavaScript web pages suddenly became interactive. That's exactly what eBPF does for the kernel. It lets us flexibly and dynamically add new abilities and functionality to the Linux kernel, which allows us to do a lot of new and interesting things. That matters because the Linux kernel is about a 30-year-old technology, so updating it for the cloud native world is difficult, and eBPF helps enable that.

If you aren't familiar with Cilium: eBPF is the underlying technology, but there are a lot of different parts of the ecosystem. There's the Cilium CNI, which provides high-performance, scalable networking. There's Cilium Service Mesh, which is sidecar-free and also provides Ingress. There's Hubble for network observability, and Tetragon for security observability and runtime enforcement. That's the overview, and I'm actually going to go back to the beginning of the project and walk through how each of these pieces came to be.

Originally, Cilium began as just container networking, providing a flat layer 3 and layer 4 network based on eBPF, first for plain container clusters, even before Kubernetes, and then for Kubernetes. As we grew and got more users who wanted more enterprise features, we started to add capabilities, and the first was around network security. People want to be able to secure their clusters, so we added things like network policy at layer 3, layer 4, and layer 7, and encryption. After that, we realized that if you have a distributed computing environment, you also need to observe what's going on in that environment. Where are things going wrong?
Where are the packets flowing? Where are they not flowing? That's where Hubble came in. Hubble is the observability component of Cilium. It provides things like network flow logs, metrics, and a service map in the Hubble UI, and it allows easy troubleshooting by platform teams and application developers.

After that, we realized people aren't just running single Kubernetes clusters. As you'll hear from Trip.com, they're actually running multiple Kubernetes clusters. So we came up with Cluster Mesh, which lets us network multiple Kubernetes clusters together. As you run more and more clusters, you also need to start locking them down, because you're running more sensitive and more important workloads. That's when we added Tetragon, the security observability and runtime enforcement part of Cilium, which can monitor things like syscalls, file access, and network activity. And then you also need to get traffic into your clusters, which is where the Cilium load balancer came in; it's a very high-performance load balancer based on eBPF. And there's also Cilium Service Mesh, which provides the layer 7 functionality around networking, observability, and security. Once again, that's the theme of Cilium: networking, observability, and security in a cloud native way.

Cilium Service Mesh is different from a lot of other service meshes you'll see out there because it is in large part powered by eBPF, and that's what makes Cilium itself, and also the service mesh, really scalable and performant. So whenever possible, we do all the traffic routing with eBPF and there's no proxy in there. When we can't do that, we fall back to a layer 7 proxy, which is one proxy per node rather than a sidecar-based service mesh.

Now, looking forward: that's the history, so where are we going next? 1.14 was just recently released, and the big highlights of that release were mutual authentication for network policy, enabling what a lot of people use mTLS for when you combine mutual authentication with encryption; a SPIFFE integration to do the authentication; day-two operational enhancements to make it easier to run, especially when you're running at significant scale, like we'll hear about; Grafana dashboards in the Hubble UI; and integrations with Istio Ambient Mesh and Cilium Mesh.

To talk about the really big highlights: mutual authentication is added as just a network policy. It's not something else you need to bolt on; it's really just a two-line change to add mutual authentication into your network and get that enhanced identity. The SPIFFE integration provides the identity for the authentication part, and the day-two operations work is a lot of that troubleshooting and dashboards through the Grafana integration. And the last component that recently came out is Cilium Mesh. Up until now we've really been talking about Kubernetes-based environments, but your IT infrastructure extends far beyond just Kubernetes.
What Cilium Mesh allows you to do, through transit gateways, is add other elements of your network infrastructure, like bare metal machines, VMs, or other things you're running on your network, give them a Cilium identity, and treat them like the rest of your infrastructure that's networked by Cilium. If you want to learn more and get started, the Linux Foundation just recently launched an Introduction to Cilium training. I know this is an extremely fast overview, because I want to give plenty of time to Jeff to talk about what they're doing at Trip.com, but this is a great resource to get started. There's also a labs page on the Cilium website with more hands-on labs. And with that, I'll hand it over.

Okay. Good afternoon, everyone. I'm Jeff Chen from the Trip.com cloud network team. Over the next ten or so minutes I'll share how Cilium runs in Trip.com's production environment, along with some of the challenges we've hit and how we solved them.

I'm sure most of you are already familiar with Ctrip; overseas users can book flights and hotels through trip.com. Our cloud team is responsible for developing and operating the company's infrastructure, serving about ten thousand engineers.

A quick history of Cilium at Trip.com: we started evaluating Cilium in 2018. The trigger at the time was that our existing network solution and CNI couldn't keep up with the rapid growth of our Kubernetes clusters. After a period of evaluation we put Cilium into production, and over the following years we gradually migrated our online applications, as well as middleware like Redis and Elasticsearch, onto the Cilium network. This year we have also moved a large share of our business workloads onto Alibaba Cloud, which I'll come back to later.

Our current scale is roughly 20,000 nodes in total, including physical servers in our data centers and virtual machines on public cloud. Those nodes run about 350,000 pods. We have around 3,000 Cilium network policies, and we see about 200,000 Hubble events per second.

A quick look at our deployment architecture. The network mode is direct routing: in the data centers Cilium advertises the Pod CIDRs via BGP, and on public cloud the pods take their IPs directly from the VPC subnets, so they are directly routable as well. On every node, Hubble network events are shipped through Kafka into our internal ELK stack and visualized in Kibana. Network policies are created and pushed down centrally through KubeFed, so every cluster has a consistent set of policies. For multi-cluster, we use a design called KVStoreMesh instead of the community Cluster Mesh; I'll go into the details later.

To sum up what Cilium has given us: first, cluster scalability. Our largest cluster has run as many as 6,000 nodes, and the data plane has been very stable, with more than four years in production. Cilium also brings us Kubernetes-native features like network policy and load balancing, plus the extra observability that Hubble provides. Overall, Cilium gives us a consistent experience, for both users and administrators. Previously we might have had to run a different network solution or CNI in our own data centers and on each cloud vendor; with Cilium we have one solution that delivers these capabilities everywhere.

Take a look at this chart. The data is the monthly air passenger volume published by the Civil Aviation Administration of China, and it's a pretty good reflection of the state of the travel industry. You can see that after the pandemic broke out in 2020 there was a crash, from about 50 million passengers down to 8 million, followed by violent swings over the next few years as the pandemic came and went. This year we finally saw a rebound, with a record-breaking August.

Those swings in the business put new requirements on our infrastructure. When business is down, we need to control cost effectively; when it swings sharply, we need autoscaling; and in periods of rapid growth, we need a large pool of resources that can be put into production quickly to support the business. With that in mind, we run a lot of business workloads on Alibaba Cloud, and with that came a fairly large Cilium cluster. Today that cluster is around 3,000 to 5,000 nodes, autoscaled by the cluster autoscaler, with roughly 10,000 to 40,000 pods in total. Scaling out a node takes about one minute, covering the pod being created, the autoscaler triggering an ASG scale-out, the node booting and bootstrapping, and finally the pod reaching the Running state.

Along the way we ran into some challenges. First, our scale-out performance was limited by some performance bottlenecks in Cilium. Second, at larger cluster sizes we also had control-plane stability problems. And since we were fairly early users of this networking mode, we hit quite a few bugs. These bug fixes and performance optimizations have all been submitted upstream by now.

Let's look briefly at how Cilium assigns an IP to a pod on Alibaba Cloud. First, when the Cilium agent starts, it creates a CiliumNode resource. The operator watches that resource and calls the Alibaba Cloud API to create ENIs and allocate IPs. Then comes a fairly expensive step that lists the ENIs, essentially an ENI resync. The new IPs are synced back into the CiliumNode, and the Cilium agent can hand them out to pods. Once IPs are in use, the agent has to report that usage back to the CiliumNode on the API server, and that reporting runs on a 15-second interval, which also hurts our scale-out performance.

We made some optimizations on the operator side. When we reached around 1,500 ENIs, we found that listing all the ENIs took 17 seconds. After investigating, it turned out there was an unnecessary iteration over subnets inside that path; removing it brought the time down to about 3 seconds, cut the number of API requests a lot, and reduced the pressure on the Alibaba Cloud API. As the cluster kept growing, our ENI count reached 8,000 and even 13,000, and at that point even the optimized full listing of ENIs was still very slow. There wasn't much more to optimize on the Alibaba Cloud side, so we went back to that part of the Cilium code and realized that one of those ENI listings doesn't actually need to walk all the ENIs; it only needs to fetch the ENIs attached to a particular node. With that change, the time is now fixed at roughly 2 seconds.
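To make the shape of that operator-side change concrete, here is a minimal Go sketch (not the actual Cilium operator code) of the difference between paging through every ENI in the account and asking the cloud API only for the ENIs attached to one node. The cloudAPI interface, its method names, and the fakeAPI demo client are hypothetical stand-ins for the real Alibaba Cloud SDK calls.

```go
package main

import (
	"context"
	"fmt"
)

// ENI is a trimmed-down view of an elastic network interface.
type ENI struct {
	ID         string
	InstanceID string
}

// cloudAPI is a hypothetical stand-in for the cloud SDK client used by the
// operator; the real call names, filters, and pagination differ.
type cloudAPI interface {
	ListAllENIs(ctx context.Context) ([]ENI, error)                           // pages through every ENI in the account
	ListENIsByInstance(ctx context.Context, instanceID string) ([]ENI, error) // server-side filter: one node's ENIs only
}

// slowResync reflects the original shape of the code path: fetch every ENI,
// then filter client-side. Cost grows with the total ENI count.
func slowResync(ctx context.Context, api cloudAPI, instanceID string) ([]ENI, error) {
	all, err := api.ListAllENIs(ctx)
	if err != nil {
		return nil, err
	}
	var attached []ENI
	for _, eni := range all {
		if eni.InstanceID == instanceID {
			attached = append(attached, eni)
		}
	}
	return attached, nil
}

// fastResync reflects the optimized shape: push the filter to the cloud API,
// so the cost depends only on the few ENIs attached to this node.
func fastResync(ctx context.Context, api cloudAPI, instanceID string) ([]ENI, error) {
	return api.ListENIsByInstance(ctx, instanceID)
}

// fakeAPI is an in-memory implementation so the sketch runs standalone.
type fakeAPI struct{ enis []ENI }

func (f fakeAPI) ListAllENIs(ctx context.Context) ([]ENI, error) { return f.enis, nil }

func (f fakeAPI) ListENIsByInstance(ctx context.Context, id string) ([]ENI, error) {
	var out []ENI
	for _, e := range f.enis {
		if e.InstanceID == id {
			out = append(out, e)
		}
	}
	return out, nil
}

func main() {
	api := fakeAPI{enis: []ENI{
		{ID: "eni-1", InstanceID: "i-node-a"},
		{ID: "eni-2", InstanceID: "i-node-b"},
		{ID: "eni-3", InstanceID: "i-node-a"},
	}}
	slow, _ := slowResync(context.Background(), api, "i-node-a")
	fast, _ := fastResync(context.Background(), api, "i-node-a")
	fmt.Println("slow path found:", len(slow), "ENIs; fast path found:", len(fast), "ENIs")
}
```

The point is simply that the resync cost stops depending on the total ENI count in the account (8,000 to 13,000 in the case described above) and depends only on the handful of ENIs attached to the node being resynced.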
On the agent side we also made some optimizations. One is the 15-second resync interval I just mentioned: we added a flag to make it configurable. Beyond that, we've hit a few bugs along the way, for example occasional problems at agent startup, some race conditions that could delay a node coming up, and we contributed fixes for those as well.

Another issue worth noting is that today kube-scheduler does not consider whether a node has any available IP addresses when it schedules a pod. That means a pod can land on a node and then wait for the CNI to allocate an IP from an external source, and if that node's IP capacity is already exhausted, the pod can be stuck in that state indefinitely. This has been raised in the Kubernetes community, but nobody has fixed it. AWS EKS works around it by configuring a max-pods flag on the kubelet when the node starts, but that's hard for us to manage: in our private cloud we may have different Pod CIDR sizes, and different cloud providers have different instance flavors with different IP counts. Our solution uses Kubernetes extended resources plus a device plugin; in principle it's similar to NVIDIA's GPU device plugin. The device plugin reads the available-IP count from the Cilium agent and writes it into the node status on the API server as a resource, and pods carry a request for that resource when they're created, which gives us IP-aware scheduling. We also have a webhook to make sure every pod carries that resource request.

Another problem when the cluster is very large is the classic control-plane restart storm. Suppose the API server has a failure or a performance degradation that isn't resolved quickly. Our Cilium agents will keep trying to reconnect, and after a while they may give up and restart. When they restart, they send a lot of list/watch requests, and we know list/watch requests pull a very large amount of data from the server. If those requests all arrive together in a burst, the result can be that the control plane keeps getting knocked over again and again. Solving this class of problem requires exponential backoff, and you also have to introduce randomness into the intervals, in other words jitter. Right now, if you run the Cilium agent as a DaemonSet, the pod restart policy caps the backoff at five minutes, which isn't enough for a large cluster, and it doesn't support jitter. Without jitter you get something like the chart on the left: the interval between each wave of requests grows, but there's no downward trend because every wave is still highly concentrated. With jitter you can spread the requests evenly across a window of time and give the server enough room to work normally. Our approach is to wrap the Cilium agent startup in a script that implements this kind of restart backoff (there's a small sketch of the backoff-with-jitter idea right after my summary below), and we launch the agent with docker-compose rather than as a DaemonSet. On one hand, that gives us complete control over the restart behavior; on the other hand, it gives us much better canary control when we roll out upgrades. Most importantly, we want the Cilium agent, the component providing the underlying network, to be decoupled from Kubernetes as much as possible. To take an extreme example: if you start the Cilium agent as a DaemonSet pod, and that pod happens to depend on a webhook that is exposed through a Service, while the Service itself depends on the Cilium agent, you can end up with a circular dependency. We want to avoid that kind of situation as much as we can.

The restart storm I just described also has a multi-cluster version. If you use the native Cluster Mesh design, every Cilium agent watches the Cluster Mesh control plane of every cluster. The consequence, at very large scale, is that if one cluster's control plane goes down, the failure can cascade and eventually take down the whole control plane. Our answer was a design called KVStoreMesh: each agent connects only to its local Cluster Mesh etcd, and the metadata from remote clusters is brought in by an operator that syncs the data from every cluster's Cluster Mesh etcd, so each agent only ever talks to its own store. The community now has an implementation of this as well; it shipped in 1.14 as a beta feature.

For extreme situations we've also prepared some emergency procedures, for example a network policy downgrade: if the control plane is completely down, we can operate directly on the eBPF maps and insert an allow-all rule so that, at a minimum, business connectivity is preserved.

Let me sum up some of the pitfalls we've stepped in. First, the kernel version: adopting Cilium very likely means upgrading your kernel, and our experience is that a later patch version is generally the more stable choice. Second, as the cluster scales out, keep a close eye on your cloud API quotas; that's easy to overlook. The ARP table capacity is also worth watching. I call it out specifically because Cilium runs a heartbeat health check between nodes, and with many nodes each one can end up maintaining a fairly large neighbor table. Next, you need to manage the restart behavior of all of your clients. My example was the Cilium agent, but it's not just the Cilium agent: in Kubernetes, the kubelet or any other agent deployed at scale that pulls a large amount of data at startup needs to consider this, or it can fall into the same restart storm. And finally, you need an emergency plan for Cilium failures. That's all from me, thank you. Any questions?
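Before the questions, here is the backoff sketch referenced in the restart-storm discussion above: a minimal, self-contained Go illustration of exponential backoff with full jitter. It is not Trip.com's actual wrapper script; the restartAgent function and all the timing constants are made-up placeholders.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextDelay doubles the wait after every failed start, caps it well above the
// kubelet's 5-minute CrashLoopBackOff ceiling, and adds random jitter so that
// thousands of agents do not hammer the API server in synchronized waves.
func nextDelay(attempt int, base, max time.Duration) time.Duration {
	d := base << uint(attempt) // exponential: base, 2*base, 4*base, ...
	if d > max || d <= 0 {     // cap (and guard against shift overflow)
		d = max
	}
	// Full jitter: pick a uniform value in [0, d) so retry waves spread out.
	return time.Duration(rand.Int63n(int64(d)))
}

// restartAgent is a placeholder for however the agent is actually launched
// (e.g. by a wrapper script); here it simply fails for the first attempts.
func restartAgent(attempt int) error {
	fmt.Printf("attempt %d: starting cilium-agent...\n", attempt)
	if attempt < 3 {
		return fmt.Errorf("apiserver still overloaded")
	}
	return nil
}

func main() {
	base := 1 * time.Second  // short base so the demo finishes quickly
	max := 30 * time.Minute  // far beyond the DaemonSet restart policy's 5-minute cap

	for attempt := 0; ; attempt++ {
		if err := restartAgent(attempt); err == nil {
			fmt.Println("agent is up")
			return
		}
		wait := nextDelay(attempt, base, max)
		fmt.Printf("start failed, backing off for %v\n", wait)
		time.Sleep(wait)
	}
}
```

The two properties that matter are that the cap can sit far above the five-minute ceiling a DaemonSet gives you, and that the jitter spreads the agents' retries across the whole window instead of letting them arrive in concentrated waves.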
Hello, I'm from the Intel service mesh team, and I have a question about Cilium. In your roadmap you mentioned that you will integrate with Ambient's ztunnel. From a functionality perspective, I think ztunnel has some overlap with Cilium, so I'd like to hear your comments about the integration and the future.

Yeah, thank you. I think Istio and Cilium Service Mesh both need to work together, so we're exploring different ways that they can. We don't have any specific plans; we're going to be driven by community interest and by what people want and need from both of the projects. We're looking at how it can interact with what Cilium does so that we obviously don't have conflicting things in the network, because that's not great. I know there have been some issues around the conflict between the Istio CNI and the Cilium CNI, so hopefully we can resolve those for people who want to run both Istio and Cilium in the same cluster.

So first you mentioned that in the roadmap Cilium will be integrated with Istio Ambient's ztunnel. Does that mean you will provide a CNI that includes both the current Cilium CNI and the ztunnel CNI, an enhanced CNI?

Yeah, I guess it's the same answer as last time: we're not making any predictions about the future of the roadmap. In the Cilium project we're really driven by what end users want and need, so we're going to have the integrations that people demand. If there's something you'd like to see, please come comment on GitHub and join the community developer meetings, all those things.

I think my question will be in Chinese, sorry. Which pain points are you mainly using Cilium to solve? For example, how do you handle layer 7; do you use Envoy together with Cilium?

We use Cilium mainly because our previous solution couldn't support such a large scale, whether in IPAM or in forwarding performance. For layer 7, actually not yet; we still use our existing setup rather than Cilium for that.

The second question is about security: will you use Cilium for security, what solutions are you considering, or how do you think about which parts of security it should cover?

One more thing for the room that isn't public yet, but because you're here, you get to know a little bit early. At KubeCon Chicago, in just over a month, we're actually going to be launching the eBPF documentary. It's going to tell the history of how eBPF was created and where it's going in the future. The official trailer will be launching on Monday, but because you're here today, you get to find out a little bit early. This is for me, yeah. Very nice.

Is eBPF used for the layer 7 mesh, and what's the roadmap for the future from layer 4 to layer 7?

Okay, yeah. Maybe I can go back to, it's going to take a second, yeah, this slide. Obviously we believe in the power of eBPF; Cilium was built from the ground up on eBPF, and we've found eBPF to be very performant and scalable. So anything that we can do with eBPF and keep in the kernel, we will do. There are some things that are a little too complex because of the limitations of eBPF, and that's when we offload to layer 7, which is an Envoy proxy running in user space. But if you look at the history of networking, we've really seen a lot of the networking stack move into the kernel: the experimentation happens first in user space, and once we figure out a good way, or the best way, to do it, that implementation moves into the kernel. We saw it first with TCP/IP, and I think we're going to see a similar thing with service mesh. More and more network functionality will move into the kernel, and we're going to see that with Cilium Service Mesh. I think that's kind of the roadmap.
If you want to talk about more specific things, I would once again say join the developer meeting, and there's also an APAC-friendly developer meeting, or jump into some of the issues on GitHub if you think there's something missing or something you'd like to see.

Thanks a lot for the great presentation. I have a question for both Jeff and Bill. We have been using a different solution than Cilium, which I'm not going to name in front of the camera, but something we ran into is that when a node or pod is not gracefully shut down, the IP allocation gets leaked. So for Jeff: have you folks run into similar situations with Cilium, and did you do anything about it? And for Bill: is there any kind of requirement on how the node or the pod should be shut down? And have you, in the community, heard of situations where IP allocations get leaked and effectively used up, so that you have to assign new IPs to the nodes to make the whole thing work again? Thanks.

Yeah, we have run into IP leaks, and I guess there are many causes for that. What we found recently is that when you create an ENI on a public cloud, you can run into issues, from performance problems or provider-side changes, where the creation falls into a rollback path, and if that rollback fails, the ENI can simply leak. Right now Cilium has a backoff mechanism in its IP allocation for public clouds, so it will slow the procedure down, but you still have to do some monitoring to find these issues, I would say.

Yeah, and I guess the second half of that: we're aware of some of this, also around the reuse of IPs and so on, and that's something we're working on for the future. So I would say look for updates in the 1.15 release cycle, which should probably be coming out early next year.

Hi, you just mentioned Hubble: have you done any optimization for Hubble, including persistence for the Hubble metrics? Because without persistence the retention seems very short; only one day is not enough to do much with it.

I'd also like to ask: if native Cilium uses Cluster Mesh for multi-cluster interconnection, then for network policy, do I need to apply the same policy in multiple clusters? I might need to apply it in two clusters at a time, right?

Yes, as I mentioned just now, we create network policies through KubeFed; it's a federated network policy, so you have a single place in the center that pushes the same policy to every cluster.

Any more questions? I would like to ask about maintenance. You just talked about some problems in upstream Cilium and the optimizations you've done. How do you handle the gap between your version and the community version? For example, when the community version is upgraded, do you keep carrying your changes forward?

Yes, in certain circumstances we do upgrade. In fact, most of the changes that can be upstreamed we have already submitted to the community; the internal-only parts are relatively few. Otherwise every upgrade would be painful.

Okay, I understand. Thank you.

Okay, time's up. If you have any more questions, come by the Cilium booth tomorrow morning from 10:30 to 2:30. I'll be happy to answer any questions you have. Thanks for coming today.