I guess let's get started. Hello everyone, welcome, and thanks for coming. Today I'm going to talk about the service mesh journey at DoorDash over the last two years. We will discuss what's been good, what's been bad, and yes, even a little bit of the ugly stuff. My name is Hochuen. I've been a software engineer on DoorDash's Core Infrastructure team since 2020, mainly focusing on compute and traffic infrastructure. Before that, I was working on building distributed systems for machine learning at several startups, until I realized that was probably too hard for me.

All right, so let's get started. Why service mesh for DoorDash? In 2019, DoorDash started an effort to extract microservices from a single monolithic application, and fast forward to the end of Q1 2021, we found ourselves with several hundred microservices. And with this growth came those classic microservice challenges. Observability became challenging, debugging got much harder, our service topology became a maze of complexity, and no one really knew who was talking to whom or who owned what. Microservices could talk to each other in so many different ways: they could use a service IP, they could use a headless service with client-side load balancing, or they could just go through a load balancer. And there was no standard way to do authentication and authorization. Beyond that, developers were implementing many similar things in application code, in many different ways.

So in Q2 2021, we started exploring service mesh, which we believed had the potential to address these challenges. We believed that implementing these features at the platform layer in a standard way would be far more efficient than addressing them individually at the application layer. So we started searching for solutions and explored various open source projects. Before we look at the initial decision, let's take a quick overview of our requirements at the time.

First thing first: scalability. Around that time, our microservices were all deployed on one single big Kubernetes cluster with more than 2,000 nodes. Typically, we felt uncomfortable whenever a cluster had more than 1,000 nodes, because a lot of the open source tools we used were tested with at most 1,000 nodes, and in reality we did observe some random scalability and reliability issues whenever a cluster exceeded 1,000 nodes. And obviously we couldn't afford any outage of that single cluster, because it was the only one we had. So first thing first, we wanted the solution to support our scale at that time.

Flexibility. We didn't want to live with that single Kubernetes cluster forever. To make our lives easier, we had implemented a Consul-based multi-cluster infrastructure and had already started a migration to move some microservices to new clusters, to put less pressure on that old single cluster's control plane. So we needed the solution to be less opinionated and flexible enough to support our unique multi-cluster setup.

And maturity, of course. The tools we used should have a mature community, and the solution needed to be backed by successful user stories from other companies at a similar scale. It was also essential for the solution to be easy enough to configure, with comprehensive documentation as well.
And lastly, we needed the solution to provide features in observability, security, reliability, and traffic management to address those typical microservice challenges we mentioned before.

After spending a lot of time exploring various open source projects, including the most popular ones at the time, we settled on Envoy as our data plane while developing our custom control plane to align with our specific needs. You are probably familiar with this architecture, which is still the most common and typical solution nowadays. Traffic redirection is managed by iptables, and sidecar injection is managed by a Kubernetes mutating webhook. The data plane runs as a sidecar container in each pod, handling all ingress and egress HTTP/1, HTTP/2, and gRPC traffic. At that time we didn't manage any storage-related traffic. The control plane manages the configurations for these Envoy sidecar containers using the xDS API.

I guess it's worth mentioning that building our own control plane was not an easy decision at the time, especially considering we have always been a small team on this project, usually with just one or two engineers until very recently. And while the service mesh landscape has evolved a lot since then, to be honest, our choice might be different today. But given our unique requirements at that time, this path was probably the most logical choice.

So, for adoption: when we began the journey, we knew we wanted many features from the service mesh to address those microservice challenges. However, to be honest, exactly which features we wanted to use was a little bit unclear. The plan was basically to onboard everyone to the service mesh with a minimal set of features and then decide what additional features to support later.

However, a significant turning point occurred on June 19th, 2021. That's when DoorDash experienced a complete outage lasting more than two hours. If you are interested in the details, I linked the RCA in the slides as well, but the key takeaway from that RCA is that the outage was caused by a typical cascading failure, starting with some high latency issues in our payment systems. As clients like the Dasher service attempted retries, and retries, and probably more retries, they created a retry storm that further overloaded the already struggling payment service. Eventually that caused a complete outage, which forced us to shut off traffic from the edge layer twice to put less pressure on the payment service and give it more time to recover.

We realized that this situation could have been prevented, or at least mitigated, with some standard best practices that we always recommend to teams. For instance, the payment service itself could have implemented load shedding to proactively reject some load when it was already in a degraded state, in this case high latency, so that it could at least protect itself from a complete outage. Also, the clients of the payment service could have used circuit breakers to fail fast whenever the payment service returned a high error rate, and then periodically checked whether the payment service was back to normal to decide when it was time to talk to it again.
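To make that circuit-breaker idea concrete, here is a minimal sketch of how the equivalent protection looks in Envoy, using outlier detection on the client-side cluster. To be clear, the names and thresholds below are illustrative assumptions, not the configuration we actually ran:

```yaml
# Minimal sketch: a client-side Envoy cluster for a hypothetical
# payment-service upstream, with outlier detection approximating a
# circuit breaker. All names and values are illustrative.
clusters:
- name: payment-service
  type: STRICT_DNS
  connect_timeout: 1s
  lb_policy: ROUND_ROBIN
  load_assignment:
    cluster_name: payment-service
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: {address: payment-service.internal, port_value: 8080}
  outlier_detection:
    consecutive_5xx: 5          # eject a host after 5 consecutive 5xx responses
    interval: 10s               # how often the ejection analysis sweep runs
    base_ejection_time: 30s     # re-admit ejected hosts after a cooldown, which gives
                                # the "periodically check whether it's back" behavior
    max_ejection_percent: 50    # never eject more than half the hosts at once
```

While a host is ejected, callers fail fast instead of piling more retries onto a struggling service, and the cooldown plus re-admission approximates the periodic health re-check described above.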
So for these two reliability-related features, we did have them implemented in some common Kotlin libraries, since Kotlin is our primary programming language. However, services like the payment service were still written in other programming languages, in this case Python, so it was challenging to implement all these reliability features across the board. And unfortunately we had had several similar incidents before, so this eventually led many teams into a month-long code freeze focusing exclusively on reliability-related tasks.

That's also the point when folks started asking about the status of the service mesh, which had the potential to implement these reliability features in a language-agnostic way. So where did we stand with the project? We basically had nothing. We had just started the initial design, reviewed it, and begun the implementation. We didn't have any operational experience running Envoy in production, and there was basically no control plane at that time. This outage made us realize the urgency of shipping the project sooner. So instead of waiting to build a complete control plane and onboard everyone first, we understood the importance of addressing the most immediate and pressing issues, and eventually we decided to shift our priority.

Let's look at the design after the priority shift, after the outage. Traffic redirection and sidecar injection were unchanged, still using iptables and the mutating webhook. The biggest change was in the configuration management system: instead of the API-based configuration management system, we used file-based dynamic configuration. Users put Envoy configurations in a GitHub repository, a CD pipeline packages and ships the configurations to an S3 bucket, and the configurations then get pulled and mounted into the Envoy sidecar through an existing internal service, which we call the S3 syncer in this diagram. Envoy can still hot restart whenever there is any update to its configuration. Leveraging the existing CD pipeline and the S3 syncer service saved us a lot of time on the configuration management story.

Envoy itself was configured as an HTTP pass-through proxy using Envoy's original destination cluster. And instead of shipping with no features, we added two reliability-related features: adaptive concurrency for the load shedding behavior, and outlier detection for behavior similar to circuit breaking, to basically help us prevent a similar outage from happening again. (I'll show a rough sketch of this setup in a moment.) So this design, as you can see, had a very primitive configuration management approach, and the focus was primarily on the data plane, specifically these two reliability features provided by Envoy. Actually, at that time we started calling the project "Envoy sidecar" rather than "service mesh".

Once we successfully implemented and tested our solution in staging, we onboarded two critical Python services, including the payment service, the one that caused the site-wide outage. A little bit about onboarding: the process required two steps. First, similar to many open source solutions, we needed to add a custom label to the namespace and the deployment spec. Then we also needed to create raw Envoy configurations, which at the time typically meant around 1,000 lines of configuration, which, as you can imagine, was overwhelming for everyone, even though we had just two customers. The good news for developers was that they didn't need to modify any of their application code.
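Here is that rough sketch of the post-outage sidecar design: a pass-through listener using Envoy's original destination cluster, with the adaptive concurrency filter providing the load shedding. Again, the port, values, and structure are illustrative assumptions, heavily trimmed compared to the roughly 1,000-line real configurations:

```yaml
# Illustrative sketch of a pass-through Envoy sidecar with adaptive
# concurrency for load shedding. Values are not DoorDash's settings.
static_resources:
  listeners:
  - name: ingress
    address:
      socket_address: {address: 0.0.0.0, port_value: 15001}  # hypothetical iptables redirect target
    listener_filters:
    - name: envoy.filters.listener.original_dst   # recover the address the client originally dialed
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.listener.original_dst.v3.OriginalDst
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress
          http_filters:
          - name: envoy.filters.http.adaptive_concurrency    # load shedding
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.adaptive_concurrency.v3.AdaptiveConcurrency
              gradient_controller_config:
                sample_aggregate_percentile: {value: 90}
                concurrency_limit_params:
                  concurrency_update_interval: 0.1s
                min_rtt_calc_params:
                  interval: 60s
                  request_count: 50
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          route_config:
            virtual_hosts:
            - name: passthrough
              domains: ["*"]
              routes:
              - match: {prefix: "/"}
                route: {cluster: original_dst}
  clusters:
  - name: original_dst
    type: ORIGINAL_DST            # Envoy's original destination cluster
    lb_policy: CLUSTER_PROVIDED
    connect_timeout: 1s
```

The adaptive concurrency filter continuously estimates a minimum round-trip time and starts rejecting excess requests once measured latency indicates the service is saturated, which is the load shedding behavior described above.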
For the rollout strategy, to gradually introduce the Envoy sidecar, we used a canary deployment approach. Users needed to deploy another independent Kubernetes canary deployment, which used the same application code as the production one but ran with the Envoy sidecar injected. The canary deployment shared the same labels as the production one, which matched the selector defined in the Service object, and that's how some traffic could be sent to those canary pods as well. This allowed users to adjust the canary deployment's replica count to control the amount of traffic routed to the pods running with the sidecar. For these two services, we then baked the traffic for around two weeks. We were doing this super cautiously, because the whole point was to prevent another outage like that one. Once we felt confident about everything, we scaled down the canary deployment and ran the production one with the Envoy sidecar injected. The rollout eventually was smooth, and these two services got their extra protection without any code changes in the end.

This brings me to the first lesson we learned along the way. Looking back, we believe that shifting our priority to build this very simplified solution was the right decision. We initially had a very big dream, but we needed to clarify our immediate goals and start with something small. Looking back on our journey, many big changes were actually driven by the motivation to solve some real-world problem within the org.

In Q1 2022, the Envoy sidecar project reached GA status. We created configuration templates instead of having developers write those raw Envoy configurations, which, as you can tell, was mission impossible. We offered a common dashboard for networking metrics and provided common alerts and runbooks to monitor common issues like high error rate or high latency. We expanded our user base by reaching out to more early adopters and onboarding services written in more programming languages.

With the successful user stories from our initial customers and the announcement of GA status in 2022, we started hearing more feature requests from teams as well. One big ask was to support zone-aware routing. Typically our microservices are deployed across different availability zones in Kubernetes, and previously the default behavior was that the egress traffic from clients, service one in this diagram, was load balanced across all destination server pods of service two. So it's a pattern where basically everyone talks to everyone. The idea of zone-aware routing is to keep the egress traffic of clients within their local availability zone, while ensuring the ingress traffic received by each individual server pod is still balanced.

Staying within the same availability zone has a couple of benefits. First, it saved us some cross-AZ data transfer costs. It also reduced the impact of a single AZ outage, and it made communication more performant, because we are now connecting clients to nearby servers. Given its impact across reliability, efficiency, and performance, and especially given that efficiency was one of our engineering priorities in 2022, we decided to support this feature. And that was the point where we could take the opportunity to evolve our configuration management system as well, leading to the introduction of API-based dynamic configuration for all the EDS resources.
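As a rough sketch of the moving parts in Envoy terms (illustrative names and values, not our exact setup): zone-aware balancing is enabled on the client-side cluster, and the EDS response attaches each endpoint's zone as a locality, as shown below. Envoy also needs to know its own zone, typically from the node locality in its bootstrap, plus a local cluster definition, which I've omitted here:

```yaml
# Illustrative sketch: client-side cluster with zone-aware load
# balancing enabled; endpoints arrive via EDS from the control plane.
clusters:
- name: service-two               # hypothetical egress cluster
  type: EDS
  connect_timeout: 1s
  eds_cluster_config:
    eds_config:
      ads: {}                     # endpoints pushed by the control plane over xDS
  common_lb_config:
    zone_aware_lb_config:
      routing_enabled: {value: 100}  # percent of traffic eligible for zone-aware routing
      min_cluster_size: 6            # with fewer hosts, fall back to cross-zone balancing
---
# Illustrative ClusterLoadAssignment the control plane would build from
# the registry, with each endpoint's availability zone attached:
cluster_name: service-two
endpoints:
- locality: {zone: us-west-2a}
  lb_endpoints:
  - endpoint:
      address:
        socket_address: {address: 10.0.1.15, port_value: 8080}
- locality: {zone: us-west-2b}
  lb_endpoints:
  - endpoint:
      address:
        socket_address: {address: 10.0.2.23, port_value: 8080}
```

Envoy's zone-aware routing also weighs the zone distribution of the calling service's own pods, which is how the ingress traffic on the server side stays balanced even when the zones are unevenly sized.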
And now our xDS server reads the IP addresses from the source of truth, which in our case is Consul, and ships those IP addresses, along with their AZ information, back to the Envoy sidecar. Given this information, the data plane can just perform zone-aware routing.

It turns out zone-aware routing was just the beginning. We co-developed many more features with our initial customers throughout the year. This process gave us a better and deeper understanding of our customers' pain points and helped us prioritize additional features beyond the initial reliability-related ones. In our case, many of the use cases we heard were related to traffic management, with a particular focus on header-based routing, load balancing, and traffic policy. Today, all of these features are in production. At the time, though, with the introduction of all these new features, we quickly realized that the more services adopted the mesh, the more benefit we could get, so we continued to focus on increasing adoption throughout the year. By the end of 2022, we had onboarded around 100 services, which doesn't sound bad, right? Given that we were really a small team, with just one or two engineers helping people onboard.

That's also the point when people started asking when we could onboard all services. Unfortunately, some back-of-the-napkin math quickly showed that it would take several more years to onboard most services to the mesh, so it became evident that we had to speed up the onboarding process. In Q4 2022, we decided to review what could prevent us from moving faster in 2023 and what changes we should make.

The first thing I want to mention here is that in the initial adoption phase, we discovered a lot of unknowns. We found many special client behaviors that had previously gone unnoticed but became apparent with the introduction of the Envoy sidecar. Take our first customer, the payment service, as an example. There was a client that couldn't perform client-side, request-level load balancing for gRPC traffic. So before the Envoy sidecar was injected, to balance the traffic, the payment service would periodically recycle connections so that the client would create new connections to other pods, in this case pod two in the diagram. However, injecting the sidecar into the payment service pod disrupted this balance: now only the connection between the payment service container and its sidecar gets recreated, and the client just keeps talking to payment service pod one, leading to an ingress traffic imbalance over time. Eventually we had to move this recycling behavior into the sidecar by adjusting the connection age configuration in the payment service's Envoy sidecar; a sketch of that follows below.

We also quickly realized this example was just the tip of the iceberg. We were basically in a phase where we didn't know what we didn't know. We continued to uncover more special, unnoticed client behaviors where the introduction of the Envoy sidecar broke whatever had just worked before. This highlighted that making the data plane fully transparent isn't easy, and we had to expect the unexpected in the initial adoption phase. That's why we stayed with the canary deployment approach for a while. Fortunately, as we onboarded more services toward the end of Q4 2022, we found we were not seeing these kinds of unknowns that often, and that's the point where we decided to place some bets.
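Going back to the connection recycling fix for a moment: here is a minimal sketch of what that looks like on the sidecar side, a max connection duration on the sidecar's HTTP connection manager, so long-lived gRPC connections get closed periodically and clients reconnect and rebalance. The duration is illustrative, not our actual setting:

```yaml
# Illustrative fragment of the sidecar's HTTP connection manager:
# periodically recycle long-lived downstream connections so gRPC
# clients reconnect and spread across pods again.
name: envoy.filters.network.http_connection_manager
typed_config:
  "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
  stat_prefix: ingress
  common_http_protocol_options:
    max_connection_duration: 300s   # close downstream connections after 5 minutes
  drain_timeout: 30s                # grace period; HTTP/2 connections receive a GOAWAY first
```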
We believed that we had uncovered most of the unknowns already, and that it was probably okay to roll out faster as long as we could roll back fast. So the first change we made was shifting from the canary deployment approach, where we baked traffic for days, to the native Kubernetes built-in rolling update method. This significantly reduced onboarding time from days to hours.

We also realized that many challenges in developer experience could prevent large-scale adoption. First, we used to ask teams to follow onboarding documentation to onboard their own service. Adding some labels and some Envoy configuration sounds easy, but every team needed to follow that documentation, and we are talking about 400-plus services here. We realized this decentralized onboarding approach doesn't scale for us. Similarly, we had asked every individual team to manage their Envoy sidecar resources, which doesn't scale either. So we decided to have the infra team own the onboarding process and the resource management story, since that's the team most familiar with the process and most motivated to improve it. Eventually we streamlined onboarding by pre-generating all of those Envoy configurations and labels for all services.

Throughout the year, we also noticed that we had put our primary focus on onboarding and on making the Envoy sidecar transparent, but we hadn't spent enough time educating our users. Users lacked a basic understanding of service mesh, and that eventually caused some confusion. So we decided to enhance our documentation and invest more time in educating and enabling our service owners.

The complexity of the observability features was another big one. Networking issues still happened, of course, but the metrics we provided were overwhelming. There were just too many metrics, and it was super hard for users to know which ones to look at. We exposed all the terminology used in Envoy metrics to our users, and they had to learn things like ingress versus egress, downstream versus upstream, local versus remote, connections, requests, responses, messages, gRPC, HTTP/1, HTTP/2, all this stuff. But not all engineers enjoy digging into every single detail of all that data. We should have had a product mindset and been more customer-obsessed here. So we invested time in simplifying the dashboards, making the metrics more user-friendly by giving users the most important high-level metrics.

We also noticed that, ironically, with the introduction of the service mesh, debugging sometimes became more complicated. Since the architecture became more complex, with multiple Envoy sidecars now in the call path, when errors happened it wasn't always clear whether they were triggered by the service mesh or not. The infrastructure team got pulled into many more incidents to assist product teams with debugging, and this sometimes caused frustration on both sides. To provide clarity in identifying issues, we introduced service mesh availability SLIs to surface all errors originating from the Envoy sidecar. We also introduced distributed tracing to give our users a better view of which component in the service graph is actually returning the error.

Speaking of the service graph: we realized many features are actually enabled only when egress dependencies are defined.
Features like zone-aware routing and outlier detection, and even the most basic upstream-level metrics, require users to define their Envoy egress clusters in their configurations. However, users were not sure about their own service graph, and there were basically no tools available for this purpose. So eventually we decided to build an accurate service graph from the tracing data we had just introduced, and then built a tool to generate those egress configurations from it.

Following the execution of this plan, we started mass adoption, and things went okay. Overall, onboarding was much faster. We had a few unexpected issues with some services, which was kind of expected, resulting in around two or three incidents, but fortunately we were able to roll back fast before things went wild. There is some additional maintenance responsibility for the infrastructure team, which I guess is still manageable nowadays. Many features like client-side load balancing, zone-aware routing, and header-based routing are now widely adopted.

So here is a quick overview of our current state. Microservices are deployed on around 10 production Kubernetes clusters. We have more than 500 microservices deployed in five isolated production meshes. Within each mesh there are multiple clusters, and today there are more than 10,000 pods and around 5 million RPS managed by the mesh.

Here's a quick overview of our current and future work. Developer velocity became a prominent concern, so more effort is now directed toward making the experience of configuring Envoy more developer-friendly and improving the user experience of our observability-related features. For efficiency, the way we currently manage the cost of compute and metrics usage is still manual, and we need tools to eliminate the waste here. We're also trying to leverage the mesh to simplify some of our other traffic infrastructure; in this case, we are actively working on a new architecture for a multi-cluster service discovery solution. And of course, we will continue to add more features to support our users.

All right, here is a quick recap of what we have discussed. Before you start the journey, really understanding your requirements and your use cases is important, and you can co-develop features with your initial customers. When working with Envoy, we noticed that it's hard to make the Envoy sidecar always transparent, so you probably have to expect the unexpected in the initial adoption phase. When onboarding services, start testing gradually at the beginning, then make well-informed bets at the right time; decentralized onboarding doesn't scale even when the process requires low effort from users, so try to streamline and automate the onboarding process instead. When delivering a product to the rest of the engineering org, remember that Envoy metrics can be overwhelming, so have a product mindset and be customer-obsessed when you build the product. Invest some time in training and enabling service owners, but increasing velocity through simpler and more automated solutions is even more important.

All right, that's all I have. Thank you. I think we still have some time for questions. You can go to the mic.

Q: Hi, I have to ask: would you still develop a service mesh architecture from scratch today if you started trying to solve this same problem again?
A: So I guess with the current architecture... the first thing is that we'd need some use cases to motivate us to move to a new architecture.

Q: Wait, I'm saying: pretend you have nothing, right? You're starting in 2021 with the tools and solutions that are available today. Would you still build your own?

A: I guess we would have to evaluate all the solutions again. It's been two years already, and honestly a lot of our other traffic-related infrastructure has been improved by the current service mesh architecture. So it's kind of hard to say, because what we have today has the service mesh tightly coupled into the architecture. But we are actually evaluating some other solutions as well, because we do have some use cases. In our case, we want to introduce network policy, and that's why we are evaluating Cilium: introducing it as our CNI first, and then evaluating whether we could leverage some more L7-level features.

Q: Okay, yeah. And do you already know of some gaps, maybe, in something like Istio or Cilium that would make them unsuitable for the same problems you built for?

A: It's been a while since I last checked all the solutions, but I guess the common concern nowadays is still around developer experience, that is, how to make the life of configuring the sidecar easier. That's a common concern that even now, with our custom solution, we are trying to solve. Our current approach is to expose all of this through a common interface; we are trying to develop a developer portal to manage all these configurations.

Q: Okay, thanks.

Q: So I noticed that you talked a lot about how you're using Consul for service discovery. Did you evaluate using Consul Connect as a service mesh?

A: We did. We did reach out to the Consul folks in the community. If I remember correctly, we mainly wanted the most mature solution, because we were really a small team, so we tended to be a little more conservative and go with the most mature option at the time. That's why we went with the Envoy sidecar approach.

Q: That makes sense. Thank you.

Q: Hello. I'm kind of wondering: given the limited engineering resources back in 2019, what was the main motivation for building your own control plane instead of using some other solution like Linkerd or Istio?

A: Right. We actually spent a lot of time trying to make the existing open source solutions work for us, so I can give some examples. We tried Linkerd first. As I mentioned, we had a big cluster with 2,000 nodes, and we were asking around: hey, what's the biggest cluster you can support? The story we heard from other users was around 400 nodes, which made us a little bit scared. Also, the way Linkerd supports the multi-cluster story is a little bit opinionated and couldn't accommodate our own Consul-based solution. Another thing is that our traffic team was already leveraging the Envoy proxy to manage edge traffic, so it made more sense to consolidate the data plane effort there. As for Istio, a couple of my colleagues tried it, but they were initially scared off by the complexity of the configuration needed to set up even the most basic use cases.
And we had some concerns about the Istio control plane at the time; that was even before Istio moved to the more monolithic control plane architecture. So eventually we decided to build our own. But that's a good point: we are actually trying not to be locked into this kind of in-house solution, and we are trying to take opportunities to move to a newer architecture eventually, if that's possible. That's why I mentioned we are evaluating Cilium as our CNI as a starting point.

Q: Thank you. You mentioned availability-zone-aware routing. Could you elaborate on that a little more? Is it preferring pods in the same AZ within a cluster? And how do you prevent things from becoming imbalanced if, say, the source has more pods in one place, or the destination does in a different place?

A: The traffic itself is handled by the data plane. What we do in the control plane is read all the IP addresses, along with their AZ information, from the source of truth, Consul in this case, and then ship those IP addresses in the EDS resources back to the Envoy sidecar. We are just leveraging the data plane to actually perform the zone-aware routing.

Q: I was curious: you mentioned that when you built some of these things, developers didn't know how to debug, so they were asking your team to debug, and you ended up adding some observability. First of all, was that enough? Were there still cases where developers had to debug?

A: No, that's definitely not enough. That action item was basically to unblock us so we could move faster in 2023, and the debugging experience nowadays is still not great. The current direction is to unify all the data from the different sources, metrics, logs, and traces, and give a better interface to our users. We are also evaluating building some automated tools to help summarize what is going on, using AI as well.

Q: One last question. It looks like you made a lot of design and architectural choices. Was it just one team making these decisions? Were there multiple teams? And what kind of process and red tape did you have to go through?

A: It was decided by the traffic and compute orgs within Core Infrastructure. Whenever we make big changes, we do propose an RFC to the engineering org. Honestly, the service mesh team has usually been just one or two engineers for a while. But we have now moved the service mesh project into the traffic team, and the traffic team currently has around 10-ish engineers.