So my name is Shannon Cohen. I'm a product manager at Pivotal, responsible for routing and load balancing in Cloud Foundry. I'm joined today by Aaron Hurley, who's our lead engineer on the routing team. We're going to share with you some of the work that we've been doing over the last year integrating Istio and Envoy with Cloud Foundry, what we're currently working on, and where we see it progressing. I'm going to start with a couple of basics about the problems that Istio solves, and then we'll get into the nitty-gritty of the integrations and the specific features that we're rolling out in Cloud Foundry.

So the domain of problems that Istio attempts to address is quite broad, but it is scoped to how microservices are consumed by clients, and how microservices interact with one another and consume their dependencies. This diagram shows some of the concerns in that domain. You've got disparate services written in different frameworks. You have to manage multiple versions of them, manage how they're consumed by clients, and manage how they consume external dependencies. The challenge is that there are many things the developers of each of those microservices have to be concerned with, and they're the same common problems every developer needs to be concerned with: things like load balancing and rate limiting and retries and timeouts, very standard communication patterns. Operators also have security concerns: is data in flight encrypted? How are requests authenticated and authorized? How do I manage security policy across a large portfolio of these applications? And how do I get visibility into traffic patterns, and into whether the security policies being enforced are the appropriate ones? A common way these concerns are addressed today is that each application developer is left to solve them themselves.
And when all application developers solve them over and over, across large portfolios of applications, they're not always solved the same way, and it's difficult to have visibility into whether they're solved in a consistent way. Here the colored boxes represent either custom code or libraries that application developers are leveraging to solve these problems. As I mentioned, a problem with this pattern is consistency. A lot of the folks who are leveraging Cloud Foundry are doing so to enable large teams of developers, many developers and many applications, and so managing that consistency, or lack thereof, in how these problems are solved is a challenge. It's very hard to have visibility into, for example, whether all of these applications have updated their dependencies to address a CVE that's been announced. It's also a burden on application developers to repeat these solutions over and over when they could be focused on value-added business logic.

An alternative approach is to use an out-of-process proxy, or sidecar, and a centralized configuration plane, or control plane, to manage them all. Istio and Envoy are examples of what we call a service mesh; service mesh describes this pattern of sidecars and a control plane. A service mesh helps address some of the troubles with the custom-code pattern in that it provides operators with a way of keeping all of those solutions up to date. The platform operator is responsible for updating the sidecars, and app developers are focused on their business logic. It's a polyglot solution: because the proxies responsible for these patterns are out of process, it can be agnostic of the frameworks the applications are written in. Policies can be applied at scale because you have centralized management of all the sidecars and of the policies that the sidecars are enforcing.
Application developers don't have to be responsible for as much security, because a security admin can be confident that when they apply a security policy across a service mesh, it's enforced at all points. And the user experience for applying these policies is consistent for all the personas involved, whether it's your security admins, your platform operators, or the application developers.

So as I mentioned, Istio is a service mesh, and we like Istio because it has a very vibrant open source community. It is platform agnostic. Like the success we've seen with the Kubernetes project, Istio is gaining wide adoption. There's consistent maintenance and innovation happening from Google, IBM, Lyft, and more recently Pivotal. Envoy has already been proven in production. There's an emphasis in Istio on extensibility and pluggability, and it addresses the concerns I mentioned around traffic management, security, and observability.

Here's another view of the relationship between Istio and the data plane. The three data paths we're concerned with are ingress, represented here by the red arrows; service-to-service, represented by green, otherwise known as east-west; and egress. Istio has three primary components. Pilot is responsible for applying configuration across the sidecars. Mixer is responsible for receiving telemetry and making additional policy checks in a pluggable way, so you can have multiple Mixer backends providing additional policy engines. And Citadel is responsible for distributing TLS certificates. When you look at the data flow diagrams here, there are a number of components involved, and that's the nature of requests being made through many intermediating components. But when you think about how to manage traffic management and security policies in a service mesh, the view becomes much simpler, because you only have to control the sidecars.
So it's like having a remote agent in every application which you can use, as a security admin or platform operator, to enforce global policies, or to enable application developers to apply their own policies within those guardrails. One reason we chose Istio to integrate into Cloud Foundry was so that we could, speaking selfishly from the Cloud Foundry routing team, deliver more quickly on use cases that we've heard from Cloud Foundry operators time and time again. We're one team; our team size has grown from four to eight and fluctuates, and being able to leverage this large community of folks who are innovating on the same use cases we've heard about is a big advantage. So over the last year, as we'll describe, we've integrated, or begun the integration of, Istio with Cloud Foundry and participated in that community. Aaron will talk about some of the contributions we've made.

So, a bit about what we've done already and where we're going. As some of you may be aware, Cloud Foundry already installs an Envoy proxy in every container. But that proxy is currently statically configured and fulfilling one small purpose. It's actually a very valuable one: it terminates TLS from the Gorouters for ingress traffic and, through the identity in the certificates, guards against any misrouting due to a stale routing table in the Gorouters, which can occur during a failure in the control plane. Of course, we'd like to eventually have those sidecars be dynamically configured, and I'll get to that in a moment. What the team is currently working on is an analogous, and eventual, replacement for the Gorouters and TCP routers, leveraging an ingress perimeter Envoy that is dynamically configured by Istio Pilot. And in order to accomplish that, we've had to do some fundamental integrations between Cloud Foundry and Pilot.
Aaron will talk about some technical details, but the gist of it is that we've needed to sync the notion of routes and applications, and the IPs and ports for application instances, from Cloud Foundry to Istio, so that the perimeter, or ingress, Envoys have knowledge of the route mapping between a URL and backend IPs and ports as they change, as the container orchestrator moves containers around. So we've done those fundamental integrations. In addition, we've targeted some initial new capability that we can bring to Cloud Foundry based on Istio and Envoy, and the one we've chosen, based on feedback from many users, has been support for traffic splitting. This will enable an application developer who is interested in rolling out a new version of their application to control what percentage of traffic is sent to the new version. In a demo you'll see later, we'll show how, for example, an app developer can send 10% to version 2, then turn that knob, increasing the percentage and shifting traffic over time, and finally cut over. So we've done that integration, and we'll show you a demo of that feature later.

We're currently working on scaling this integration. Istio itself has been tested at Kubernetes scale, which is not yet Cloud Foundry scale. To give you approximate comparisons: a Kubernetes cluster has been tested at approximately 10,000 containers, while a Cloud Foundry cluster supports 250,000 containers, so there's a significant difference in scale for a single cluster there. We have tested that our integration supports 10,000. But in the interest of, as we've done historically, getting production mileage on Cloud Foundry innovations, we target the scale of a production environment which we have access to. Pivotal manages something called Pivotal Web Services. That production environment serves production customers and currently operates at about 20,000 routes and 20,000 apps. So that's our current scale target.
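The traffic-splitting behavior described here can be illustrated with a small sketch of weighted backend selection. This is only a minimal model of the technique, not Envoy's actual load-balancing code; the names are ours:

```go
package main

import "fmt"

// Backend is one version of an application with an integer routing weight.
type Backend struct {
	Name   string
	Weight int
}

// pick returns the backend for the n-th request using a simple
// deterministic weighted round-robin: out of every sum(weights)
// requests, each backend receives exactly its weight's share.
func pick(backends []Backend, n int) string {
	total := 0
	for _, b := range backends {
		total += b.Weight
	}
	slot := n % total
	for _, b := range backends {
		if slot < b.Weight {
			return b.Name
		}
		slot -= b.Weight
	}
	return "" // unreachable when total > 0
}

func main() {
	// A 90/10 split between v1 and v2, as in the demo later in the talk.
	backends := []Backend{{"v1", 9}, {"v2", 1}}
	counts := map[string]int{}
	for n := 0; n < 20; n++ {
		counts[pick(backends, n)]++
	}
	fmt.Println(counts["v1"], counts["v2"]) // prints: 18 2
}
```

Turning the knob toward v2 is then just changing the weights (for example 5 and 5 for a 50/50 split) and letting the proxy pick against the new totals.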
So we're iteratively working toward that target. We've already accomplished an integration at 10,000, and that's what we're currently working on. Just recently, the CF Container Networking team has begun the initial work to eventually enable east-west and egress traffic from a container to transit the sidecar, and to dynamically configure that sidecar, for the purpose of eventually offering features like client-side load balancing and control of retries and timeouts and all of that. But there's some fundamental work that needs to be done first: capturing egress packets, sending them through to the Envoy via VIPs, and eventually the dynamic configuration of those Envoys.

So once we have these fundamental integrations in place, we're excited to roll out to Cloud Foundry users the capabilities that Istio offers. One of the top asks we hear from users is support for mutual TLS authentication, or mTLS, between services, so that data is encrypted everywhere in the service mesh. This will also enable us to offer token validation for authentication and authorization, support for rate limiting, and support for new protocols like HTTP/2. We hope that Envoy will soon support UDP, which I think will cover many, many use cases. There are also retries, timeouts, redirects and rewrites, and a great deal of flexibility in route matching: folks have asked for regular expressions instead of just the prefix matching that Cloud Foundry supports already. We see the Istio-based system that we're developing as being responsible not just for ingress, but also for service-to-service and egress traffic, as I mentioned. And we're also participating in conversations with the Istio community on how Istio can facilitate applying these traffic management and security policies not just within a Cloud Foundry cluster, but between Cloud Foundry clusters, and between Cloud Foundry and Kubernetes.
The Istio community does currently have recommended patterns for a single Istio governing multiple clusters, but not yet for how these policies can be managed across clusters that span geographic regions or cloud providers. Those discussions are currently in a conceptual phase, but a couple of proposals came out just last week that were a good read, and if you're a member of, or considering joining, the Istio dev mailing list, I'd recommend reviewing those proposals and participating in that conversation.

A point about the rollout plan. The new Istio-based routing control plane is being rolled out in parallel. There is absolutely no deprecation or sunset timeline for the existing Gorouters and TCP routers. We know many folks are getting great outcomes from those production-grade services. The Envoy and Istio control plane is being brought up in parallel, and we recommend that app developers and operators who want to leverage the new capabilities do so by using a different DNS name. Eventually, once we reach parity, we would identify a deprecation timeline for the Gorouters and TCP routers, and we're very much looking forward to having a single proxy that supports those use cases. We intend to do the work necessary to support that scale. Okay, here's a quick diagram of that parallel control plane and data plane. All right, with that I'm going to turn it over to Aaron. Thank you.

All right, thanks Shannon. So Shannon touched on a lot of the product value that we're getting out of our new routing subsystem. I'm going to talk about some of the implementation details and engineering benefits, as well as our experience with the Istio community so far. I'm going to try to get through this quickly. As you can see in the bottom left-hand corner here, we have an Envoy configured in gateway mode at the edge of the cloud. Ingress traffic will go through this Envoy.
This Envoy receives its configuration from Pilot through the Envoy v2 xDS API, which is a bidirectional gRPC stream. Pilot itself receives its config from Copilot, a component that the routing team has created. You can see here that Copilot is responsible for talking to Cloud Controller and Diego: it takes the actual LRP information and maps it to routes, so that we can provide all of the Cloud Foundry route information as Istio config that Pilot can read. The solid lines are event-based transmissions; these are like CRUD actions. The dotted lines are the bulk synchronous actions, which happen every 30 or 60 seconds or so. This guarantees that if we do happen to miss any one-time event-based transmissions, we still have an eventually consistent model, which has served us well so far.

Some of the engineering benefits of the new routing subsystem: it's a simplified routing tier within Cloud Foundry. Instead of having the Gorouter, the TCP router, and the Routing API, we now only have Envoy, plus Pilot and Copilot. We've converged on one control plane and one router, and we expect it to be able to handle everything. Another benefit is the resiliency of each of these components. Both Envoy and Pilot have some built-in resiliency to things like a network partition; if Envoy loses contact with Pilot, for example, there's built-in resiliency governing how long it should keep its routes in place. And these are all configurable, so we can tweak them as needed. Another benefit, really to the entire Cloud Foundry system, is that we're able to clean up the container orchestration layer. We can do this because currently Diego has route emitters co-located on each of the cells, and these are responsible for emitting the route registration messages.
As I showed in the previous slide, we're now polling the BBS, which holds the actual state of the world, so that piece of the Diego cell reaching back out to NATS will no longer be necessary, and it cleans up the abstraction across the board. This provides some of the initial groundwork toward a service mesh, to gain all the capabilities that Shannon mentioned. And maybe most excitingly, we hope to better enable all the other teams for future integrations with and extensions of Istio, and to help other teams accelerate delivery of some of their features.

As Shannon mentioned, our current focus is scaling, with a target of 20,000 apps and routes. We've been tackling this in an incremental process: basically, scale until it breaks, find what broke, fix it, and repeat. It's been a pretty fun track of work so far. Some of the early findings were obvious inefficiencies in our own code, but we also found some inefficiencies in Pilot itself and contributed some bug fixes upstream to improve Pilot's scaling. We've learned, through our CI and the tests that we run, that we're able to identify some of these scaling issues more readily than even the community might, as they don't quite have this built into their CI yet. They do run scale tests daily, but we find a red box in CI a much easier thing to notice. Currently we're at around 11,000, so we're on our way.

So, as Shannon mentioned, we've been working with Istio for about a year now, and we've learned a number of things. Probably the biggest is that it's a fast-moving project. Istio published its 1.0 release just a few months ago, so you can imagine that up until then things were constantly changing, and staying on top of those changes has been tricky. We also learned that a lot of context is needed to understand not just the code base but also the intentions of the platform itself.
So rather than the standard pair rotation strategy that we usually take, we dedicated one pair on our team to work solely on Istio community-focused work. This got us more day-to-day knowledge of how the community works and built up relationships and goodwill with the community, so that we could further our understanding of Istio. One opportunity we saw was that Istio is largely built with Kubernetes in mind, and this is reflected in the code. Since Istio is intended to be platform agnostic, we saw this as an opportunity to provide some pull requests that generalize a lot of that code. Another thing we learned is that communication in this community works slightly differently, and what we've found works best is just to be open and transparent early: creating a GitHub issue, gathering comments, using that as an audit trail, and then putting proposals out for comment. And the last thing to mention here is that we now have a few "blessed" Istio members on the routing team, which comes with some further capabilities within the project itself, but also gives Cloud Foundry a bit more representation in the community.

One big piece of work that we've been involved in is the Mesh Configuration Protocol, or MCP, API. This was really a decomposition of Pilot, an incremental refactor, and the goal was to make Pilot easier to test. It resulted in more modular code with better-defined boundaries. We were able to rip out some of the platform-specific code, since we now follow a general API and protocol. It's a well-defined and accepted API, so anyone who wants to provide a server knows what they need to implement, and clients can expect the same responses.
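The shape of that decomposition can be sketched with a generic config-source interface: Pilot keeps one client, and any platform-specific server satisfies the same contract. The names and types here are purely illustrative stand-ins, not the actual MCP protobuf API:

```go
package main

import "fmt"

// ConfigSource is a generic interface in the spirit of MCP: any server
// that can hand back a snapshot of mesh config satisfies it, so the
// consumer needs only one client implementation.
type ConfigSource interface {
	Snapshot() map[string]string // resource name -> serialized config
}

// copilotSource and galleySource stand in for platform-specific
// servers (Copilot for Cloud Foundry, Galley for Kubernetes).
type copilotSource struct{}

func (copilotSource) Snapshot() map[string]string {
	return map[string]string{"route/app": "backends: [10.0.0.1:8080]"}
}

type galleySource struct{}

func (galleySource) Snapshot() map[string]string {
	return map[string]string{"virtualservice/app": "hosts: [app.example.com]"}
}

// collect is the single generic client loop: it aggregates config from
// every source through the same interface, with no platform knowledge.
func collect(sources ...ConfigSource) map[string]string {
	merged := map[string]string{}
	for _, s := range sources {
		for name, cfg := range s.Snapshot() {
			merged[name] = cfg
		}
	}
	return merged
}

func main() {
	cfg := collect(copilotSource{}, galleySource{})
	fmt.Println(len(cfg)) // prints: 2
}
```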
And then lastly, as a result of the MCP work, it's also easier to extend for future tools. Instead of having this logic live inside of Pilot, it's moved out to the MCP server, and the server just has to communicate all of its config over the API. So it looks something like this. Previously, up top, you can see that there's a client for Kubernetes and a client for Copilot; these each reached out to different servers, and they had different APIs. After the MCP work is completed, it will look more like the bottom, where there's a generic MCP client and a set of servers that follow the MCP API, for example Copilot for Cloud Foundry or Galley for Kubernetes.

Now I'll do a quick demo of our weighted routing functionality so far. This follows a canary rollout, doing a standard v1-to-v2 switch and moving traffic from 0 to 10%, then 50%, then 100%. First, I'm just showing that there are two apps. They're very simple, v1 and v2, and currently only app v1 is mapped to the host app.istio. Here on the bottom is just a little curl loop: I'm curling 20 times, and it's currently highlighting v1, so you can see that 100% of the traffic is being sent to v1. What I just did here was add a weight of 9 to the existing route mapping. Something I should pause on for a quick second is that right now this demo is integer-weight-based: it takes the weight you assign a route mapping, along with the sum of all the others, and that's how the percentages are determined. We actually plan to move to percentage-based routing once we get some time with the CAPI team to work on the v3 routes object, so what's in this demo may differ slightly from what we have in the future. So I did a preemptive increase of the existing route mapping's weight from 1 to 9.
It still receives 100%, but this will make sense once I map v2 to the same route: v2 gets the default weight of 1, so we'd expect the v2 instance to get 10% of the traffic. Right here I'm just taking a quick look at the route mappings object. Even recorded demos have some unnecessary steps, but this is just to show that the route mapping itself has this weight property by default, and there it is. So now when we run the same curl loop, you can see some v2s starting to pop up. This run shows 2 out of the 20, so that's 10%, which worked out as we had hoped. Now I'm just going to update the original v1 mapping and move its weight back down to 1. Now each instance has a weight of 1, meaning the traffic should be split 50/50 between them. It takes a little bit of time to propagate, but you can see it start kicking in about halfway through. And then the last step from here: as you're monitoring your app with 50% of the traffic going to v2 and everything seems great, you finish the cutover and unmap the original mapping to the v1 instance. From there, do one last validation to make sure traffic is going where we expect it to go, and we can see that it's all now going to v2.

All right. So here's a list of resources that we've referenced in the talk. The slides are uploaded, so feel free to take a look. And we want to hear from you. I just showed you one example, and using API endpoints is not the best experience, so we're curious to hear your thoughts on what the provided UX should be. Should it be something similar to the CF CLI, or would you prefer to use some of the Istio-native configs? Additionally, what sort of problems could Istio address for you, or what else are you looking for? We'd love to hear your ideas. And then lastly, we have a routing and networking office hours today at 3:50.
There's also an Istio birds-of-a-feather session at 5:30, and you can always reach us in the routing or Istio channels in the Cloud Foundry Slack. And that is everything.