So good afternoon, everybody. Nice to have you all here. Today we are going to have a discussion about connecting and securing services across hybrid and multi-cloud, and I have a nice set of colleagues here to talk it through with. So let me start with a quick introduction of myself and of the team as well. My name is Rania Muhammad. I'm a solution architect at Google. I have a passion for service mesh, SOA, integration, and digital transformation, and I'm super into the multi-cloud and hybrid cloud story. Christian, please.

Yeah. My name is Christian Huening. I'm from Hamburg, Germany, and I've been in the cloud-native space since roughly 2015 or 2016, doing mostly private cloud setups since that time. I worked at a university first, actually building up a compute cloud there for research, and then moved over to the finance industry, where I spent the past five years and led the team that first created a bare-metal cluster and then switched it over to a multi-cluster hybrid cloud setup across GCP, AWS, vSphere, and OpenStack. And now, very recently, I switched over to the defense sector, so we are working on doing roughly the same, on a larger scale, for BWI, the IT service provider of the German Armed Forces. Piotr?

My name is Piotr Szczechniak. I work at Google. I'm an engineering manager managing the GKE Kubernetes networking team, and I'm based in Warsaw, Poland. I consider myself a Kubernetes veteran. I joined the project in January 2015, so before it was popular, or at least this popular; the number of users back then was probably comparable to the size of the audience in this one talk. I have worked in various areas of the project, and two years ago I switched to the networking space, managing the teams there. So I am quite new to networking and still consider myself on the learning curve.

Thank you, Piotr. Ricardo?

All right, so next, I'm Ricardo. I'm a computing engineer at CERN, and I've been at CERN for a long time. I joined initially to help develop what we call the grid computing infrastructure, and I was a software developer for storage and computing systems. But over time there was a lot more interest in big data and large-scale infrastructure as the cloud appeared, so the focus at CERN changed from building our own systems to collaborating in communities like Kubernetes, and previously OpenStack as well. Since 2015 I've been doing Kubernetes internally, and I'm responsible for some of the platforms we have for containers and Kubernetes for different services. I also do a bit of machine learning, where things get really interesting for us in terms of doing hybrid cloud to get access to accelerators that we don't necessarily have on-premises. These days, this is mostly what I've been working on.

Thank you, Ricardo. Last, but not least, Ronald, please.

Hello, my name is Ronald Kahl. I work at bol.com, and I've been there for 10 years as a system engineer. Bol.com is the biggest online retailer in the Netherlands and a very well-known brand. I started 10 years ago, when we worked in our own data center, and about five years ago we also started moving workloads into the cloud. I've been working on that ever since, and especially on connecting the two, because we have a hybrid environment, so there were lots of challenges there that we had to solve.

Thank you, Ronald.
So as we speak about hybrid and multi-cloud, Ronald, would you share with us your point of view on the top two or three network and security challenges for multi-cloud and hybrid cloud, please?

Yes. A network-related challenge that we faced is that in our data center we already had a lot of technologies deployed for load balancing, and it was already quite complicated. Then we moved some of our workloads to GCP, and GCP introduced a whole new set of load-balancing technologies. And with every technology that we introduced, we also introduced new DNS naming conventions. So for our developers it became pretty hard to understand which naming conventions to use to reach which servers, et cetera. That was one of the big challenges that we had to solve. On the security level, the problem we had is that in the data center a lot of the connectivity is protected using IP-based security, right? Firewalls, IP-based rules. That works if all the workloads have a static IP, because then you can identify them. But in GCP, and especially in Kubernetes, pods have ephemeral IPs; they change all the time. And securing that properly when mixing those two was a real challenge for us.

So Ricardo, what do you think? What are the challenges from your point of view, the top ones?

Maybe I'll focus more on the networking part. Being a research lab, we even publish a lot of our data as open data, so for us the complicated challenges are mostly on the networking side. When we go hybrid, the goal is mostly to run workloads, not so much, right now, to run services. And here the challenge is really to stay flexible in the choice of clouds, but also of regions. We've seen, when we try to burst for things like GPUs, that we don't necessarily have the capacity or the types of accelerators we need in all regions. So being flexible, being able to get resources from different regions or even clouds, is quite important; for specialized accelerators this is also the case. And this is a challenge, because we also need to move around lots of data, so often we need to peer with these regions to get fast, dedicated networking, and matching that with the flexibility we want to achieve is not necessarily easy. The other challenge we have, which is networking- but also cost-related, is that we have workloads that generate a lot of data; a single job can generate multiple tens of terabytes, and often we have to bring that data back. And clouds are very good at charging for egress. So when you have to bring the data back, this has to be carefully managed and negotiated. It's not networking per se, but it's related.

Thank you, Ricardo. Piotr, what do you think?

So with the shift from on-premises data centers to cloud and multi-cloud and hybrid environments, the problems didn't disappear; they just moved somewhere else. We moved from cables and wires connecting computers to connecting data centers. We moved from securing the data center from the external world to, to some extent, securing the connections between those data centers; from a typically local, fully controlled, consistent environment to a distributed, heterogeneous infrastructure that is not fully controlled by the user. As a user, you have limited capabilities to control what is going on inside the cloud provider's environment.
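To make the contrast Ronald describes a bit more concrete, here is a minimal sketch of the label-based approach Kubernetes offers natively: a NetworkPolicy that selects workloads by labels rather than by IP addresses, so it keeps working even though pod IPs are ephemeral. The namespace and all names are invented for illustration.

```yaml
# Illustrative only: allow traffic to the "checkout" pods solely from pods
# labelled app=storefront, whatever (ephemeral) IPs those pods happen to have.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: checkout-allow-storefront
  namespace: shop                  # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: checkout                # the workload being protected
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: storefront      # identity expressed as labels, not IP ranges
      ports:
        - protocol: TCP
          port: 8080
```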
And last but not least, the applications moved from being colocated and tightly coupled to a very distributed model where they run across the whole world.

Thank you, Piotr. And Christian, what would be your thoughts?

I would say all of the above. What we found in the finance sector is that you often have to deal with connecting to banks that use source-IP-based routing and that kind of thing, but also with setting up Kubernetes the same way, or roughly the same way, across these various providers. The networking always works a bit differently, especially when you have requirements around egress and how that works, or different AZs if you go to regional clusters. So it was in that area. And then, of course, the usual latency aspect: what to put where, can we use services in one area or one region of the world and reuse them from the other deployments, and all these considerations. And of course naming, but we can talk about that later, I guess.

Thank you. So with that in mind, and based on your experience, how should we secure and govern the communication between services running in multiple clusters, or even running on distributed, different infrastructures?

Yeah, for us this came up with, I mean, there was a talk at the beginning of this week by my colleague Caroline about how we did monitoring in the end. We decided to go with a service mesh solution based on Linkerd. We used its cluster mesh technology, because we said exposing all of these things to the internet, or using something that would connect all these clouds, plus OpenStack, plus vSphere, plus the bare-metal setup, wasn't really feasible. So we needed a connecting technology that would be the same in all of these areas, that has a shared trust anchor, that has a shared identity framework you can use to say "this is all that we talk to", and that also has good latency. That's why, in the end, we decided to use that approach, where we were also able to define granular policies, saying we want the communication to only go in this direction, but not back. So for example, we would use a single Grafana in the central cluster and just connect it to the Prometheus instances in the other clusters for short-term observability, but we would also stream data or use services in the other clusters. And the framework's policies allowed us to selectively prevent a certain customer request in Germany from reaching a cluster in, let's say, Spain.

Thank you, Christian. Piotr, what would be your thoughts?

So a service mesh, like Linkerd, or Istio, which I guess my colleagues will be talking about, is a natural choice for such a situation. So I would like to bring up maybe one more solution that is available in Kubernetes, which is the Gateway: the Kubernetes Gateway API and its Gateway implementations, in particular the multi-cluster Gateway that is built on top of multi-cluster Services. The Gateway API is a portable solution: in Kubernetes there is a unified definition, a unified API, and then various cloud providers and vendors provide their own implementations of this Gateway API. But using the same API, you can deploy applications across multiple platforms. It's also role-oriented.
It's designed in a way that takes into account both the needs of the cluster administrator or infrastructure operator and the needs of the user. And it's very rich when it comes to the features it offers; it supports, for example, more sophisticated load balancing at the API level, HTTP header matching, and many other features. This is the next generation of Ingress, the L7 standard in the Kubernetes world. When it comes to its status, currently it's more about multi-cluster within one platform, but it's a relatively new project, and I'm pretty sure that over the years it will evolve to also support multi-cloud and hybrid use cases.

Thank you, Piotr. What about you, Ricardo?

I would actually second what Piotr just said: the challenges are there for multi-cluster as well. The problems we've been facing are related to not having an easy way to express across multiple clusters what we are used to expressing within one cluster, like RBAC, namespaces, or network policies. Not being able to define these policies in the way we are used to means there are different projects building additional things on top, which poses problems for everyone, because it's another layer to maintain just to express the same policies, and it's more error-prone. So I think somehow standardizing the multi-cluster situation will help a lot in this area, also for hybrid clouds.

Thank you. What about you, Ronald?

Like Christian said, for us introducing the service mesh was a solution for ensuring that we had a more consistent security model across our hybrid setup. In our case we used Istio. We have a lot of autonomous teams that are responsible for managing the security of their own services, so the Istio authorization policies, for example, they have to configure themselves. And we have some guardrails in place, using OPA Gatekeeper, to ensure that they configure proper authorization policies and don't accidentally open up their service to the whole company, et cetera. It can happen. So for us, the Istio service mesh allowed us to move away from IP-based access towards identity-based access.

Thank you. So maybe, Ronald, you can walk us through how to manage the communication, the traffic, the connections between services, based on what you have been experiencing at bol.com, in order to really manage such a thing across multiple clusters running on distributed, different infrastructure, like the hybrid or multi-cloud story.

Yeah, so like I touched upon in the first question, we had a lot of different load-balancing technologies, and that made it really hard for users to figure out how to configure their service to connect to the right service. So we introduced a new naming convention that applies to our service mesh. We're not using the Kubernetes service names; we actually use a kind of alias. But that alias works across our Kubernetes environment and also in our data center, so it doesn't really matter where the service runs: the name stays the same. And if the service moves from the data center to the cloud, the name stays the same. So no more changes to dependencies, no "I need to change my hostname", et cetera. The service mesh allowed us to do that, and that was a very powerful feature for us.

Thank you, Ronald. What about you, Ricardo?

Yeah, so I mentioned that we don't have the traditional service deployments.
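As a small aside on the Gateway API features Piotr mentions, here is a minimal sketch of an HTTPRoute that uses HTTP header matching to steer traffic. The Gateway name, hostname, header, and backends are all hypothetical and only meant to show the shape of the API.

```yaml
# Illustrative only: requests carrying "x-canary: true" go to a canary backend,
# everything else to the stable backend, through a shared Gateway.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-route
  namespace: shop
spec:
  parentRefs:
    - name: shared-gateway          # hypothetical Gateway managed by the infrastructure operator
  hostnames:
    - "store.example.com"
  rules:
    - matches:
        - headers:
            - name: x-canary
              value: "true"
      backendRefs:
        - name: store-canary
          port: 8080
    - backendRefs:
        - name: store-stable
          port: 8080
```

The split between a Gateway owned by the platform team and HTTPRoutes owned by application teams is also what the role-oriented design refers to.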
There's a lot of focus on service mesh; for us, what we want to offer is a way to hide the complexity of multi-cluster and multi-cloud from our users, and we are deploying workloads that run very large numbers of jobs and things like this. For a few years now we've been looking at multiple ideas for this. We tested what was called Federation v1, which was a really nice API, similar to Kubernetes, but it had a lot of limitations. Then there was v2, which was maybe overly complex for what we wanted to achieve and needed a lot of customization to be usable. There are projects like Admiralty. And internally, if you've heard of the virtual kubelet, we actually implemented a backend that talks to a Kubernetes API behind it, so we would hide full Kubernetes clusters running in the cloud behind a virtual node. All of these things we've been trying. I think the path being followed now, this idea of having a cluster mesh in addition to the service mesh, with some sort of gateway in between for the hybrid connectivity, is something we are looking forward to testing further this year. And hopefully this will be the last one we have to go through before we have the dream deployment.

Thank you, Ricardo. What about you, Piotr?

So maybe I will echo what was said on the previous topic: network policy is an important aspect here. In addition to the policies supported at the service mesh layer, there is also the L4 NetworkPolicy defined in open source Kubernetes, which is supported, for example, by Cilium. From my Google Cloud perspective, we offer network policy as part of Dataplane V2, which is our Cilium-based solution for GKE.

Thank you, Piotr. What about you, Christian?

We also found the network policy problem interesting. Implementing that with the built-in policies was really hard to do, and also hard to expose to developers. So we are really curious, especially now, about the new Gateway API policies that are coming along, and also the new Linkerd release. We didn't get to try that out, but it looks promising, so I would look at that. The other thing that was interesting service-wise is that the open banking stack we deployed was capable of being multi-tenant, but some customers demanded to have it isolated, as a dedicated stack. And this whole new multi-cluster and hybrid cloud setup allowed us to adhere very precisely to the specific requirements of the customers, while at the same time reducing cognitive load for the teams by centralizing certain components or control planes and not duplicating them anymore. And that's where the service mesh also helped a lot in having these specific connections.

Thank you, Christian. So, Ronald, would you please share with us the experience of bol.com with using custom resource definitions and Config Connector, and how that helped in implementing self-service?

I can. So for those who are familiar with Istio, it has a component called the egress gateway, and we use that for filtering all outbound traffic and making sure that certain applications are able to reach a certain external party while others are not. But the egress gateway is a shared component, so there's a lot of configuration going in there, and naturally we're not allowing individual teams to make modifications to that configuration.
So I've introduced a custom resource definition, which we call ExternalSite, which a team can request or add to their infrastructure repository. The ExternalSite resource in itself does nothing, but it provides metadata for another process to generate the egress configuration based on the external sites. There's another part involved here: external sites need to be approved by our security team, for compliance reasons and to make sure that no data is exposed to the wrong parties, et cetera. So the security team needs to create an ExternalSiteApproval custom resource. Those two are linked, and there's a controller behind them that makes sure that once the approval is there, the state is set to approved. And once an ExternalSite is approved, the configuration is generated. This is all done automatically, so the team only needs to go to security and say, "I need access to this site", and that's it. We also used to be in a flow where we had to actually do the work for them to make sure the external configuration was set up, but that's now done automatically using custom resources and controllers.

Super interesting, thank you. Maybe, Ricardo, would you please share with us the challenges that CERN has been facing, or is facing, in integrating HPC workloads with cloud-native workloads, especially in a distributed cloud?

Yeah, I'm always happy to talk about that. It's not specific to CERN; it's really research workloads in general. I always advertise this: I also help out with the research end-user group in the CNCF, so if you have similar requirements, we're happy to have more people helping. I think the main problem has been that Kubernetes was, from the start, always focused very much on services. The Job API had some limitations and was very specific to one type of job. The scheduler is also lacking some of the features that traditional HPC workloads require, like queues, or the ability to co-schedule workloads, gang scheduling, things like this, fair share. These are limitations that are quite important, and this is really more Kubernetes-related. If we start looking at using cloud resources, the main problem is that a lot of these workloads need a lot of data, and specifically if we're talking about hybrid deployments, moving data around can be complicated. Data gravity is a real thing. For some of the research users, and specifically at CERN, we actually built mechanisms on top to deal with this in a good way. The traditional HPC workloads are really focused on something like a supercomputer with data locality, and this makes it harder to go hybrid as well. So I think that's a big challenge, and it's not Kubernetes-specific; it's more the hybrid part.

Thank you, Ricardo. So Piotr, would you please share with us how you and Google envision the ability, the benefits, and the challenges of chargeback for workloads running in hybrid and multi-cloud?

Yeah, sure. To explain what chargeback means: chargeback is the ability to allocate the infrastructure cost to specific workloads or specific applications. I think this is super important for many companies. I can see two use cases here. One use case is attributing, mapping, the cost to the service, to the application, and mapping it to the specific user that owns those applications; this is for internal budgeting purposes.
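To make Ronald's ExternalSite flow a bit more concrete, here is a hedged sketch of what the two linked custom resources he describes might look like. The API group, field names, and values are all invented for illustration; bol.com's actual definitions may differ.

```yaml
# Illustrative only: a team requests access to an external party...
apiVersion: platform.example.com/v1alpha1
kind: ExternalSite
metadata:
  name: payments-provider
  namespace: team-checkout
spec:
  host: api.payments.example.net   # the external site the team wants to reach
  ports:
    - 443
---
# ...and the security team approves it separately. A controller watches both,
# sets the ExternalSite's state to approved, and only then generates the
# shared egress gateway configuration.
apiVersion: platform.example.com/v1alpha1
kind: ExternalSiteApproval
metadata:
  name: payments-provider
  namespace: team-checkout
spec:
  externalSiteRef:
    name: payments-provider
  approvedBy: security-team
```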
The second use case is for companies that are building SaaS solutions on top of Kubernetes, where you want to attribute the cost of a running application to a specific end user. In both cases, the problem is quite challenging. There are plenty of questions that need to be answered up front, like how to attribute those costs: do you want to charge for the actual usage or for what is requested? What do you do with the spare capacity that nobody uses? Internally at Google, for example, we have a very advanced economy behind that. In the cloud and Kubernetes ecosystem, there are, fortunately, multiple solutions in that space addressing those problems, like Cast AI and Kubecost, to name a few, and cloud providers offer their own solutions; for example, GCP offers GKE cost allocation. Why do we care about that? One reason is that cost allocation, cost attribution, is the first step towards cost optimization, and that is probably one of the hottest topics these days when it comes to infrastructure cost, given the macroeconomic situation in which we are all operating.

Super. Thank you, Piotr. Christian, can you give us your thoughts there, please?

My thoughts, yeah. Maybe two things. One thing with multi-cluster that was, I think, a very good thing for the company: if you take Kubernetes as the foundational layer for everything, which we did early on, even on bare metal, and you base your entire stack on it, and then you find a way to replicate it, with Cluster API or, in our case, SAP Gardener, which is a tool that does not exactly the same thing but is in the same realm, then you suddenly have the opportunity to really deploy your stack wherever the customer wants it. So we had a customer that said, "I want it on AWS", and we said, OK, then we deploy this onto AWS into the account they provided. That was a very good thing. A very surprising thing for us, which we just didn't really think about, was that by creating all these clusters and environments, spreading them around, and having to somehow come up with names for them, we totally forgot that we had other teams, like the support teams, for instance, who look into tools like Kibana and Elastic and other things, and who all of a sudden had to deal with all these names and where data originated from. So there was a whole bunch of organizational changes needed in the processes behind that, when going from what was essentially a big bare-metal multi-tenancy cluster setup to this hybrid multi-cluster world, that we overlooked a bit. So a word of advice would be to take that into account early on: you have to take all these other departments with you on that journey.

Thank you, Christian. Ronald, what are your thoughts?

On the cost part, yeah, we are very cost-conscious. For our cloud spend we have a budget and a target, so OK, this is the spend for this year, and we need to try to stay within that budget, of course. We have a lot of metrics collected to see which team is using which resources and what those resources are costing them. We try to find anomalies.
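As a brief aside on how this kind of per-team cost attribution is usually expressed in Kubernetes: chargeback tooling such as GKE cost allocation or Kubecost generally keys on namespaces, labels, and the resources a workload requests. A minimal sketch, with all names invented for illustration:

```yaml
# Illustrative only: a per-team namespace carrying a cost-center label,
# and a workload with explicit resource requests that cost reports
# can be attributed against.
apiVersion: v1
kind: Namespace
metadata:
  name: team-checkout
  labels:
    cost-center: checkout          # hypothetical label used in chargeback reports
---
apiVersion: v1
kind: Pod
metadata:
  name: pricing-worker
  namespace: team-checkout
  labels:
    cost-center: checkout
spec:
  containers:
    - name: worker
      image: registry.example.com/pricing-worker:1.0   # hypothetical image
      resources:
        requests:                  # requested resources are often what gets charged back
          cpu: "500m"
          memory: 512Mi
        limits:
          cpu: "1"
          memory: 1Gi
```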
If huge spikes in spend are detected, teams are informed and asked, OK, what's going on? Did you make a mistake in your BigQuery query, for example, or have you spun up too many pods, things like that? So yeah, for us that's an important part of cloud, the insight into the costs. And all the shared infrastructure that we operate, for example NAT gateways and the traffic that flows between different Kubernetes clusters, and the Kubernetes clusters themselves, those costs are spread over the teams, if possible based on a ratio, so on usage, but if that's not possible, then we just divide them by the number of teams.

Super, thank you all. So Christian, would you please share with us what kind of operational changes are required, from an organizational perspective, in a multi-cloud or hybrid cloud environment?

Yeah, I think I touched on that already. What I said, essentially, is that there are these changes in the other departments where you really have to ramp up that multi-cloud consciousness. Naming was a really hard problem for us. We even thought about writing a naming tool to name the clusters, because you somehow have to have this in DNS, you have to have it in HashiCorp Vault or whatever you use for secrets, you have to reflect it in your cluster management solution. You can come up with something using IATA codes, but that only goes so far. Somebody suggested using the IKEA catalog names, because they are short, distinctive names. So that was a thing. But cost, of course, is also a thing, because all of a sudden you have to somehow organize all these different providers. And even in on-prem settings, it's a completely different setup when you have hardware costs that you somehow have to translate into cost per use. So that becomes very hard to navigate.

Thank you. Piotr, would you please share with us your thoughts on how to govern and integrate workloads distributed across different infrastructures? And how can we have a consistent story and flow across those infrastructures?

Yes, this is a good question. The key to the answer is two words: one is abstraction, the other is standardization. We need to take this very complex problem and try to decompose it into subproblems by providing abstractions at multiple levels. Starting from the very low level: we have multiple data centers, multi-cloud and potentially on-premises, that we want to connect together, and we need a secure, reliable, high-bandwidth, high-throughput solution for that, to ensure that the private network of one data center, of one entity, is able to talk to the private network of the other entity. At Google, for example, we have a couple of solutions here: various flavors of Interconnect, and Cloud VPN. With those things in place, we bring all those distant entities into a single place from a logical perspective; they are now working together, they are closer together. Then there is still different infrastructure in all those places; they are connected together, but the infrastructure is different. So Kubernetes is, of course, the answer for abstracting the infrastructure, providing a consistent, unified API for deploying applications and workloads on top of various flavors of infrastructure, heterogeneous infrastructure, across multiple data centers.
And then on top of that there is the service mesh layer, which offers a solution for abstracting the services level, abstracting the services and the various applications, offering a secure and observable way to deal with those problems. I see that we are running out of time, so I'll try to keep it short. Thank you.

Thank you, Piotr. So thank you all for the great discussion, but I would love to get final thoughts from each one of you. Let's start with Ronald, please.

Yeah, so if you have a hybrid situation, multiple clouds, a data center in the mix, et cetera, the connectivity and security challenges that you have can probably be solved using one of the open source service meshes that are available. So don't be scared of them. I would say start small and build out from there.

Thank you, Ronald. Ricardo?

Yeah, so I will also mention that we all have similar use cases, and we have this opportunity of being in a huge community and being able to work together instead of each solving our own problems in a corner. So the best solution for everyone is to continue doing this as we are, which is to gather from time to time at conferences like this and make sure that we all push for our own use cases, but in a coordinated, grouped way.

Thank you, Ricardo. Piotr?

So I want to echo what was just said. Running services in a multi-cloud and hybrid environment, and doing this in a secure way, is a tough problem, and it cannot be solved in isolation by one person, one company, one entity, because it's a distributed problem; nobody has full control over it. So I'm super happy that we have the open source community, and the CNCF, facilitating the work in that space. There is already a lot of great stuff in place, like Kubernetes, service meshes, Istio, Gateway, plenty of other things, and more great things to come over the next couple of years, for sure.

Thank you, Piotr. Christian?

Yeah, I think there is a diverse set of solutions. The only thing that you should really do is ask: why am I going hybrid, multi-cloud, multi-cluster, whatever? And is it really required? But if you say yes, and you have a very good reason, then, as was said, the community has you covered. Yep. Thank you.

Thank you. So I hope that you enjoyed the discussion today. And just to close: it's a matter of patterns; we don't have one single magical solution for everything. Always ask why, as Christian mentioned, check and contribute your use cases within the community, and definitely enjoy it to the max. With that, if you have any questions, please, we are open to answering them. And also, please give us your feedback; it always matters to us. Thank you.