All right, I'm super excited to be standing up here with Shipt to talk about kind of the next generation of service mesh, and how Shipt has implemented some really amazing features to get the most bang for your buck out of a service mesh. So I'm going to hand it over to Brandon here to talk about Shipt.

Awesome. Yeah, I'm Brandon Barrow, senior engineer at Shipt. I work on our developer platform team, so we expose a self-service platform for developers that we hope they enjoy. I'm pretty passionate about developer experience, so that's kind of what I try to do every day, and I work primarily on Kubernetes, Go, service mesh, and API gateway as of late.

I'm Nick Nellis. I'm a field engineer at Solo. I help companies like Shipt implement service mesh in complex multi-cluster, multi-environment environments, and today we're going to talk about some of the complexities and features that we got out of service mesh in Shipt's environment.

All right, so who is Shipt? We were founded in 2014 in Birmingham, Alabama. We're in multiple cities but still headquartered there. We serve over five thousand cities, partner with about a hundred and fifty retailers and 300,000 shoppers, and cover eighty percent of households nationwide.

Technology-wise, we are primarily a polyglot environment, mostly microservices written in Go. We've got probably four hundred plus running across multiple different environments and a lot of clusters, sixty plus I think across all environments. We process about 3K RPS on average, services communicate mostly over HTTP and some gRPC, and yeah, we have some other languages as well, but mostly everything is written in Go.

So, the beginning. We sort of reached the point where we might need something like a service mesh once we started doing a cloud migration. Initially most of our clusters talked within themselves using cluster.local DNS. We expose a YAML file — tell me if you've heard that one before — for developers in their repo that allows them to define what dependencies they have, which get injected into an environment variable, and it uses the native Kubernetes DNS. However, when we started implementing multi-cluster, this became problematic, and we needed a way to handle that.

Yeah, so initially we settled on sort of a redirect method by implementing ExternalName services on every cluster for every possible destination, so that if somebody requested something that wasn't on their cluster, we could redirect geographically to the nearest cluster where the app actually lived. But that was sort of a stopgap solution, so we had an engineer start POCing Istio in our environments — if he's in the audience, Keith Maddox, now at Microsoft — and he sort of spearheaded our initial foray into service mesh, which inspired a nation.

So yeah, initially just Keith was working on it, and we sort of reached an implementation in staging that worked pretty well. We deployed all of our Istio CRDs via Helm chart; it was pretty much invisible to the developer. We did expose a very loose abstraction over the Istio primitives, but for the most part developers at Shipt just kind of want things to work, so we definitely came up with default optimizations for them. After that, though, we decided to partner with Solo to re-architect the whole solution using Gloo Mesh in order to get a few features.
We really wanted to be able to manage everything from a single control plane across multi-cluster, as well as define our primitives in a higher-level CRD.

Yep. So yeah, one thing we wanted to highlight here is that when you take on implementing a service mesh, or even when you're just getting started, that typically falls on your operations team, even though a service mesh is for developers. It's usually an added responsibility for your ops team, and operations teams are typically one of the smallest teams in your company with a really highly centralized knowledge base. So with Shipt, they had somebody like Keith go out and learn service mesh, but over time they actually had to share that knowledge with the rest of the team. They realized that, now that Keith is at Microsoft, the rest of the developers and other operators had to go out and learn service mesh. One of the key learning points from that experience was that Shipt needed to expand that knowledge base, share the expertise, and insulate themselves from having one person hold all the keys to their service mesh.

There are a lot of ways you can learn about service mesh from the early onset. If you're thinking about using service mesh, I would encourage you to get a team involved rather than just a single person. There are some really great books out there that teach you about service mesh — for example, the Istio in Action one that Christian and Lin wrote is pretty much a full-stop zero-to-a-hundred if you want to learn service mesh or Istio through and through. Having your engineers ramp up early is going to pay off immensely in the long run, because you'll be spreading out that expertise.

Yeah, so why service mesh? Ultimately, a lot of our initial multi-cluster solution depended on hairpinning, so we had multiple hops where we didn't need them. Even the DNS redirection we were using required hitting an ingress on another cluster in order to actually reach the service we wanted to talk to, service A to service B. So it worked, but it was undesirable; there were a lot more optimizations to be made.

We have robust failover at Shipt, but it does require human action right now. Some developers like that, but I know a large percentage of them would really prefer things to just happen automatically, so Envoy in Istio and Gloo give us the sort of automatic failover capability we needed to feel confident. Developers also wanted more options around how their services were routed — closest locality, weighted routing, that sort of thing was desirable from an implementation standpoint for us. And then we have an infosec team. They love zero trust, they love mTLS, they love policy-based enforcement, so we definitely wanted that; that was a plus for them.
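The talk doesn't show Shipt's actual security objects, but the mesh-wide strict mTLS that an infosec team typically asks for is a single Istio resource; a minimal sketch, assuming istio-system is the mesh root namespace:

```yaml
# Minimal sketch, not Shipt's actual config: mesh-wide strict mutual TLS.
# Placed in the Istio root namespace, this requires every workload-to-workload
# call in the mesh to present a mutual-TLS client certificate.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # assumes istio-system is the mesh root namespace
spec:
  mtls:
    mode: STRICT
```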
Yeah, I think one thing is that everything Brandon listed is what service meshes do, but they actually had one extra complexity: their applications are spread out across a number of clusters. This CNCF survey kind of shows that the number of people running just one cluster in production is decreasing, and multiple clusters in production is increasing. Again, when we talk about service mesh it's largely in terms of a single cluster, but they wanted those same features for all their clusters. So yeah, I want to talk about your architecture a little bit, and we'll talk about the hairpinning and stuff.

Yeah, sure. So initially we were pretty closely coupled to clusters: a given cluster handled ingress for the services running on that cluster. Each service had sort of a cluster-local external URL that was available internally for other applications to talk to if they were not on the same cluster as that service. So we needed to decouple the cluster abstraction from the equation when deciding how to route to an application service, from outside or inside the mesh. We wanted developers to really have one URL to use — one URL to rule them all — inside, outside, whatever, and not care about the cluster; completely remove that from the equation.

And we wanted a single management plane, like I said before. In our CI/CD pipelines we were having to deploy Istio CRDs to each namespace in each cluster where things were deployed. Coordinating that worked, but it's much nicer to be able to deploy a single CRD that manages multiple complex Istio primitives to a single management plane per application. We wanted something relatively plug-and-play — Solo's Gloo is that — and driven by GitOps, so we don't do anything manually. Everything is managed through Terraform, Go, and Helm charts.

Yeah, and I think one thing here is that it wasn't just Istio that made this work; they needed to combine it with their cloud DNS as well. So they have essentially a unique DNS name for every service, and when your application needs to reach another one, it resolves that DNS name and gets routed to a gateway. And I don't know if you use HTTPS, but once you leave your cluster you're losing some of the identity of your application, so not only are you hairpinning, but it's not as secure as it could be.

Yeah, whenever you leave the cluster to go to another cluster, at the moment, in the legacy architecture, it is over HTTPS. So you do have that additional computation for decrypting the HTTPS traffic and terminating down to the service level after that, so mTLS directly helps with that. And this is the kind of new mesh architecture.
So you can see, instead of having to go through an ingress, we go directly east-west to wherever a service is, using our singular URL, which Nick will talk a little bit about later. Gloo knows how to intelligently route: if it knows about a service on the mesh, it can route directly, even if the URL is resolvable to an ingress IP in the environment, like a regular Istio ingress. It won't use that if it knows it can reach the service through fewer hops.

Yeah, so I think there's a lot to unpack there in terms of service mesh features. One of the big things most of us start with when using a service mesh is an ingress gateway — we use Istio, so it'd be the Istio ingress gateway — and it has some really great features: service discovery, observability, mTLS, and then failover and outlier detection. That works really well in a single cluster, but the architecture changes quite significantly when you have multiple clusters. If you deploy two gateways, you now have two single-cluster service mesh deployments on two clusters, and the problems get a lot more complicated. Shipt solved those by using cloud DNS to help with the routing, but because the meshes weren't connected to each other, they were only able to use HTTPS instead of mTLS, and they were stuck applying configuration to each cluster individually — so you were treating your clusters more like pets than cattle.

That brings us to this tiered gateway approach, and what's really unique about what Shipt is doing with this is that it's a mesh-integrated API gateway. They have this kind of global and regional ingress, and now you can extend the same service mesh features you have in a single cluster to any number of clusters. That gateway, the regional ingress, is aware of all the applications running in the other individual clusters. So when a request comes into that regional gateway, it knows where the checkout service is, so it can apply best-path routing, and it can use mTLS from the ingress gateway to the checkout service. You now get multi-cluster awareness, you get multi-cluster observability, and then you get full mTLS from cluster to cluster.

One great thing about moving the ingress gateway to a regional thing is that Shipt now has one specific spot where public traffic enters their environment, so they can concentrate on that from an operational side and put all of their security at that one spot. We have one single ingress that you can put all of your attention into, to make it more reliable and better, and put more security at that layer. So they use a number of tools — OIDC and JWTs — for their external traffic.

Yeah, so this is more of a slide around how we did the initial stopgap solution before implementing Istio. Essentially we created services that corresponded to every potential destination a service could live on, on every single cluster. That way, if a service asked for app-a.app-service.cluster.local, it would end up getting handled by a Service with an ExternalName that would redirect to the ingress on the next cluster, geographically, where the app actually lived. So that's sort of how it worked. It was pretty neat, but ultimately it was doing more work for something that Istio solved in a much more graceful way.
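The stopgap Services Brandon describes aren't shown in the talk, but the pattern maps onto stock Kubernetes ExternalName Services; a rough sketch with hypothetical names and hostnames:

```yaml
# Rough sketch of the pre-Istio stopgap (names and hostnames are hypothetical).
# A Service for "app-a" exists on every cluster; on clusters where the app does
# not actually run, it is an ExternalName alias that points callers at the
# ingress of the geographically-nearest cluster that does run it.
apiVersion: v1
kind: Service
metadata:
  name: app-a
  namespace: app-service   # hypothetical namespace, to match the cluster.local lookup
spec:
  type: ExternalName
  # DNS name of the closest cluster's ingress for this app (hypothetical).
  externalName: app-a.ingress.us-east.example.com
```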
So, multi-cluster DNS. We wanted each app to have a single URL, so we have a naming convention that's pretty much a wildcard A record that any service can be routed to via host-based routing, and that's all app developers have to know about in terms of how to reach a service. Whether it's injected through some sort of service discovery or whether they need to hard-code it for some reason, they know exactly how to reach another service, regardless of where it is across different app domains. So, we have machine learning clusters, we have app clusters, we have data science clusters, all those kinds of things — it doesn't really matter now, you just have a singular URL to use, which is fantastic. Originally we started looking at something like DNS rewrite with CoreDNS to solve this problem, but Gloo does it intelligently through Istio, using virtual IPs, as long as it knows about something in the mesh. If it doesn't, it falls back to the Istio ingress.

Yeah, it probably sounds complicated, because it pretty much is, and we're going to show you here how it works, and then how a multi-cluster service mesh actually improved a lot of this architecture. But do you want to talk about how the routing works?

Yeah, so this was a basic example, if a little convoluted, of how it worked — and I mean, it was convoluted because it naturally was. Basically, in the Helm chart we defined a list of locations, and based on whether compute existed in a region, we would tell it how to route based on that geographic list of locations. So it kind of looks something like this, where we do a bunch of redirection using DNS. And Istio is much more efficient at that, so here's Istio doing it: we have app foo talking over mTLS to another cluster, directly to another service, or using the load balancer if it doesn't know about it.

Do you want to go back one slide, actually? I think there's something really interesting in what they had to do with their cloud DNS. All those load balancers that exist — they essentially had those ingress gateways listening on all the hosts, and if they received a request for an application that didn't exist in that cluster, they would forward it to the correct cluster. So using a lot of DNS magic, and then routing in the ingress gateway, they essentially gave all of their applications the property that you could hit any gateway and reach that application. But it also came at the cost of increased complexity and extra hops.
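The "gateways listening on all the hosts" setup isn't spelled out in the talk; in Istio terms it would look roughly like a wildcard Gateway plus per-app VirtualServices that forward non-local apps to the owning cluster's ingress. A hypothetical sketch (hostnames and secret names are made up; the ServiceEntry and TLS-origination pieces it would also need are omitted):

```yaml
# Hypothetical sketch of the pre-Gloo setup: every cluster's ingress gateway
# accepts the whole app wildcard domain over HTTPS.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: apps-wildcard
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: apps-wildcard-cert   # hypothetical TLS secret
    hosts:
    - "*.shipt.com"                        # illustrative wildcard app domain
---
# For an app that does NOT run in this cluster, its host is forwarded to the
# owning cluster's ingress (the extra hop Nick mentions). The remote hostname
# would need a ServiceEntry, and TLS origination a DestinationRule; both omitted.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-forward
  namespace: istio-system
spec:
  hosts:
  - checkout.shipt.com
  gateways:
  - apps-wildcard
  http:
  - route:
    - destination:
        host: ingress.us-east.shipt.com    # hypothetical remote ingress hostname
        port:
          number: 443
```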
So what we're showing in the next slide is that we were able to keep the same functionality they wanted — high availability, and any gateway can respond to any request — but make it a lot more intelligent. Because of global service discovery, we knew where the destination app was, and we could short-circuit those extra hops and significantly reduce the latency between the calls.

Yeah, those are good points. We're going to take a little step back to explain some of those concepts. The first one: we talked about how they were using HTTPS before; with multi-cluster service mesh we were able to retain the identity of the calling application. So in this example we have, I think, a close application calling a checkout application, and because of multi-cluster mTLS with service mesh, we are able to keep the identity of the close application. The checkout application can then determine whether it wants to allow that request, or deny it. So not only did we improve on the HTTPS, we actually retained that identity — we improved the security of the multi-cluster routing they were doing.

The second thing: you're probably wondering how we were able to short-circuit those hops, and that's because of the intelligent DNS that's added in Istio's proxy. There's a DNS server that runs in there that will resolve host names for you before forwarding on to the cloud DNS. So we're hijacking that DNS, looking for those host names that they made for those services, and then we can return the best path for your application to call to get your response.

What I put up here is the flow pattern of the Istio proxy and how it determines where a request should go. At the top there, your foo application is calling bar.shipt.com, so it's asking the Istio proxy, where is bar.shipt.com? Istio hijacks that DNS request and returns a virtual IP — just an internal IP address that's used by Istio and only known by the proxy. Then the foo application makes the request, that request goes back to the proxy, and the proxy sees the virtual IP address and decides the best path to route. In this example, the proxy determined that the bar application in us-east was the best path, but it returned a 500 error, so the proxy can automatically retry and go, oh, there's a next best path — let's try the us-west one. It automatically retries that HTTP request, and that one returned a 200. So to the end application, all it did was make a request to bar.shipt.com, and the Istio proxy did a bunch of intelligent routing and returned a 200 — or a 200 OK.
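The "DNS hijacking" Nick walks through is Istio's sidecar DNS proxying. In Shipt's setup Gloo Mesh generates the real per-service entries, but the underlying Istio knobs look roughly like this sketch (the ServiceEntry is hand-written here only to show the shape; bar.shipt.com is the talk's example hostname, and the real endpoints behind it would be filled in by the management plane):

```yaml
# Turn on the sidecar DNS proxy and automatic virtual-IP allocation
# (standard Istio mesh configuration).
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        ISTIO_META_DNS_CAPTURE: "true"        # sidecar answers DNS for hosts it knows about
        ISTIO_META_DNS_AUTO_ALLOCATE: "true"  # hand out virtual IPs for entries without addresses
---
# A ServiceEntry makes bar.shipt.com resolvable everywhere in the mesh; the
# proxy maps its auto-allocated virtual IP to the best available real endpoint.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: bar
  namespace: istio-system
spec:
  hosts:
  - bar.shipt.com
  location: MESH_INTERNAL
  resolution: DNS
  ports:
  - number: 80
    name: http
    protocol: HTTP
```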
Yeah, so here I'll talk a little bit about our GitOps pipeline and how we actually handle all of this from code to Kubernetes. Like I said earlier in the talk, developers essentially define what we call an infraspec YAML file in their repo, which is a pretty similar pattern to anything else you've probably seen, where we define loose abstractions over Kubernetes and Istio primitives. It also handles things like cloud storage, S3 and whatnot, so it's a one-stop shop for developers to define their application environments and what compute and data stores they want.

There is a service mesh abstraction there too, so they can define all of their weighted routing, and they can opt in — right now we primarily have an opt-in approach, so developers say they want the sidecar, they want the mesh to be enabled. At some point in the future they won't need to do any of that, because of ambient mesh, and we're looking forward to that day.

In terms of how the actual file gets parsed: we listen to webhooks with a few different services — git pushes, creates, issue comments, that kind of thing — and parse that infraspec into a templated Helm chart, then deploy those in parallel to all of our clusters. With the addition of Gloo, we deploy an additional Gloo Mesh Helm chart to our management clusters, which are the control planes for the multi-cluster mesh, and that defines the Gloo CRDs that are needed. Developers pretty much just use that pattern: they define the specific environments and regions they want to deploy in, and we match whatever their strategy is. And it works pretty well.

I think one interesting thing they've done with their GitOps is that they allow their developers to opt in to all kinds of service mesh functionality without essentially even knowing that the service mesh is doing it. They expose things like retries and priority-based load balancing and all this stuff, to give developers the freedom to pick the features they need out of a service mesh while abstracting away the actual implementation.

Yeah, most developers don't care about service mesh. At Shipt, they just want to enable their business application functionality to work in a better, more efficient, and faster way. So luckily they have a great platform team to handle that for them.
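The infraspec schema itself isn't shown in the talk, so the following is purely illustrative — invented field names meant only to convey the level of abstraction Brandon describes, which the platform would translate into Helm-templated Kubernetes, Istio, and Gloo Mesh resources:

```yaml
# Purely hypothetical infraspec snippet; every field name here is invented.
name: checkout
environments:
  - name: production
    regions: [us-east, us-west]
serviceMesh:
  enabled: true            # opt in to sidecar injection / the mesh
  retries:
    attempts: 3
    on: ["5xx"]
  loadBalancing:
    localityPreferred: true   # prefer the closest healthy region
dependencies:
  - name: payments         # injected into the app as an environment variable / URL
```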
So let's see — GitOps. Yeah, this is more of the same: the developer commits, we have a couple of different CI/CD tools that ultimately take their Docker image and push it to a repository, and then we use Concourse to pull that, build the Helm chart, and deploy it to all the different environments.

So that's pretty much our conclusion: we love service mesh. The next step is for us to start moving our edge routing to the left, so API gateway is next, and I'm excited to be on the forefront of that, partnered with Solo.

Yeah, I think one great thing is that the architecture for Shipt didn't change, and those original objectives they had around multi-cluster routing and global host names for their services — we were able to retain all of those features, but we improved upon them. We made them more secure, we made the routing more intelligent, and we allowed them to keep doing their development with all of those same features using service mesh; we just enhanced it. And then with service mesh, we plugged that into the GitOps pipeline as well, so that all of the service mesh worked with Shipt's environment.

Cool. Thank you so much. Thank you.

All right, I think we have a few minutes for questions. I really, really enjoyed this talk, and congratulations, Brandon, to your team that you were able to hide the service mesh complexity from your application developers. That's really cool. So, any questions from the audience?

All right, I'll pass the microphone. First — you are closer, but I'll get to you.

Thanks for the presentation. Can you share what you've done in terms of improving security? You mentioned that you made Istio more secure with Solo, or the routing more intelligent — are there things you can share on what specifically you've done?

Yes, so the two big things. First, by retaining that service identity with multi-cluster mTLS, you have a lot more metadata traveling with the request as it leaves one cluster for another, and so it allows your applications, or your servers, to make more intelligent decisions in terms of security — should they allow or deny. Essentially, with multi-cluster mTLS you can lock down the entire environment and only trust known applications. And then secondly, at the gateway level, because that gateway is integrated with the service mesh, we were able to focus our security specifically at that layer. Adding capabilities like web application firewall, OIDC, JWT, and all that stuff at that single point acts like a DMZ for outside traffic entering Shipt's environment.

And then in terms of the routing decisions, we talked about the decisions the Istio proxy makes. Because there's global observability, it can look at your applications environment-wide, so if there's an application closer to yours in a different cluster, it will prefer to go to that one first, but if it's not available it can fail over to another region.

Great presentation. You talked about multi-cluster deployment — I just wanted to understand, in terms of Istio providing this out of the box versus the implementation that Solo has done, what are the differences or additional features that have been brought in, if any? And secondly, you also talked about automatic failover: was it just related to, say, pods going down and the service mesh taking care of ensuring those pods don't have traffic going to them, or was there something more than that?

Yeah, to your first point: we built everything on top of open-source Istio. Essentially what we were able to do is stitch all of those individual Istio deployments together and make them aware of what they needed to know about in the other clusters. So all of the routing and all of those decisions is just open-source Istio; we just configure it at an extremely fine-grained level — we create all kinds of Istio configuration using the CRDs to make that happen.
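One concrete example of that fine-grained Istio configuration: the allow/deny decisions Nick mentioned are normally expressed as AuthorizationPolicies keyed on the caller's mTLS identity. A minimal sketch with hypothetical names, assuming the clusters share a trust domain:

```yaml
# Minimal sketch (names are hypothetical): only the "close" workload's identity
# may call the checkout service, even when the call originates in another
# cluster, because multi-cluster mTLS carries the caller's identity across.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: checkout-allow-close
  namespace: checkout
spec:
  selector:
    matchLabels:
      app: checkout
  action: ALLOW
  rules:
  - from:
    - source:
        principals:
        - "cluster.local/ns/close/sa/close"   # assumes a shared trust domain across clusters
```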
And then, just this week, we implemented a full PKI infrastructure using Istio, cert-manager, and Vault to give the full security trust chain between all of those clusters. Does that answer your questions?

I think the only other thing on there that Solo adds is at the API gateway level: we built some extra filters into the API gateway, like web application firewall and OIDC, for that specific use case, but that's still built on the Istio ingress gateway.

Yeah, and in addition to what Nick said, developers can basically define whether they're interested in a total network outage in terms of failover, or whether they're concerned about specific 5xx errors, the rate at which they're received, and then a gestation period before the route is actually removed from the configuration. So that's pretty much how the failover is handled: they define multiple regions and then an outlier config, and it's very basic Istio functionality.
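For reference, the "outlier config" Brandon describes maps onto Istio's DestinationRule outlier detection, with locality load balancing for the region preference; a minimal sketch with illustrative thresholds, reusing the talk's bar.shipt.com example host:

```yaml
# Minimal sketch (illustrative thresholds): eject an endpoint after a run of
# 5xx responses, keep it out of rotation for a while, and prefer the local
# region, failing over outward only when local endpoints are unhealthy.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: bar-failover
  namespace: istio-system
spec:
  host: bar.shipt.com
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5    # how many 5xx responses trigger ejection
      interval: 10s              # how often endpoints are evaluated
      baseEjectionTime: 30s      # minimum time an ejected endpoint stays out of the pool
    loadBalancer:
      localityLbSetting:
        enabled: true            # prefer the closest locality, then fail over
```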