Hi everyone, welcome to this talk. I'm very excited to be here to talk about how we change the oil of a fast-running sidecar quickly and safely at Pinterest. So I'm today's mechanic. My name is Fuyuan, and I'm part of the traffic team in Pinterest Infrastructure Engineering. Today we are going to start with a brief look at Envoy at Pinterest. After that, we'll look at the configuration rollout story at Pinterest, the challenges we faced and the solutions we used. And at the end, we'll share some best practices we applied and lessons we learned during this journey.

At Pinterest, we started using Envoy around four years ago, and we use it very widely and deeply. When user traffic comes into Pinterest, it first hits the edge Envoy. From there, traffic gets routed into our mesh, the mesh Envoys, and from there it gets routed to different services. It does not stop there: we also use Envoy at the storage layer. For example, we put Envoy in front of MySQL so that we get mutual TLS between services and MySQL. And at the edge Envoys, we are processing millions of requests per second.

This is the architecture of our mesh. At the very top is Tower. Tower is the centralized control plane of the service mesh. When a user lands a commit into the Git repo, Jenkins uploads the config into the centralized control plane, and it gets persisted into ZooKeeper. Behind the centralized control plane, we install an agent on every host; it's called Beacon. Between Beacon and Tower is a simple but very robust gRPC streaming protocol. It's a generic config-distribution protocol, and we leave versioning and other complexities to the last mile: Beacon is the agent dealing with xDS, and it distributes the config to Envoy through xDS. Sitting side by side with Envoy, we have the SDS sidecar and the OPA agent to help Envoy do authentication and authorization. We have Envoy deployed on both EC2 VMs and Kubernetes; it's cross-platform, and we have one centralized control plane for both platforms.
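To make that Tower-to-Beacon contract a bit more concrete, here is a rough sketch of what a generic, versioned config-distribution stream can look like, with the xDS specifics handled at the last mile by the host agent. The message shapes and names below are illustrative assumptions, not Pinterest's actual protocol.

```python
# Hypothetical sketch of the Tower -> Beacon contract: a generic, versioned
# config-distribution stream; xDS specifics stay at the last mile.
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class ConfigSnapshot:
    cluster: str      # which service cluster this config is for
    version: str      # opaque version, e.g. a git SHA or a counter from Tower
    payload: bytes    # serialized config blob; Tower does not interpret it


@dataclass
class DeliveryReport:
    cluster: str
    version: str
    accepted: bool    # did the local Envoy end up ACKing this version?
    detail: str = ""  # error detail on rejection, surfaced back to the pipeline


class Beacon:
    """Host agent: receives versioned snapshots from Tower, serves them to
    the local Envoy over xDS, and reports acceptance back up the stream."""

    def __init__(self, push_to_envoy: Callable[[ConfigSnapshot], Tuple[bool, str]]):
        self.push_to_envoy = push_to_envoy          # last-mile xDS handling
        self.current: Dict[str, ConfigSnapshot] = {}

    def on_snapshot(self, snap: ConfigSnapshot) -> DeliveryReport:
        self.current[snap.cluster] = snap
        accepted, detail = self.push_to_envoy(snap)
        return DeliveryReport(snap.cluster, snap.version, accepted, detail)
```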
Compared to binary changes, config changes are more complicated. Why? Because we have thousands of clusters; some are small, some are huge. Pinterest was created ten years ago, and services have been added gradually ever since, so we have more and more services, and every service has one or more clusters. To make it more complicated, we give every developer an EC2 instance as a devapp box, and we treat every devapp box as a cluster. That is how we get to thousands of clusters. Some of these clusters are small: only one listener and one, two, maybe three upstreams. Some are huge. I just looked at the largest cluster we have: it has about 300,000 lines of config, and the config dump is more than seven megabytes. On average, we roll out tens of changes every day, and we need to deploy them fast. Then comes the question: if a problem happens with a config, how do you locate the problem quickly? And once you have located it, what do you do? Should we support rollback, or should we mandate a forward fix?

So think about this scenario. You are driving a sidecar and your customer is enjoying the ride. It's perfect. But deep in your heart, you know you are doing an oil change, and you don't want your customer to be unhappy. How do you do that? How do you do that oil change? Here is our approach. We tackle this problem from three perspectives. The first one is configuration as code: for configurations, we make sure they are typed, tested, versioned, and deployable. The second one is that we emphasize strong infrastructure governance. And the third one is a staged config rollout, safeguarded by a near real-time feedback loop and a holistic health check.

Configuration as code. How do we make the configuration typed? We use Jinja templates. All configurations are based on a set of Jinja templates that we provide, and when users write their configuration, it needs to compile; configurations are materialized at compile time. It's also versioned. The ground truth is in Git, which keeps the version history, and when Jenkins copies the configuration into the centralized control plane, it adds the version there, so the configuration stored in the centralized control plane is versioned too. And it's tested: we run a comprehensive set of unit tests before and after landing. The tests are thorough. If we test listeners, every listener must have a valid route; every route needs to have a valid upstream; if an upstream requires TLS, the downstream must provide the correct configuration, and they must match. Things like port conflicts are also detected at build time. And finally, it's deployable: we have a pipeline that deploys changes after a PR lands, and it does that for every PR, every config change.

Infrastructure governance, from a mesh perspective, is basically two questions. The first one: where should the mesh configurations go? Should they be in a centralized repo? That, of course, has pros and cons. The pro of a centralized repo is consistency, and you get a quick turnaround for horizontal changes; by horizontal changes, I mean a change that applies to multiple clusters. The con is that the user experience may not be ideal: if you are a service owner, you would most likely prefer to have your configurations in your own repo instead of making a change in someone else's centralized repo. Or should the config sit next to each service? The pro: users have better control of their config; they just modify it in their own repo. However, there is a big con: fragmentation. When mesh configuration is scattered across multiple places, think about the build cost. How do you build? How do you make sure that every config that needs to be built is built? And an even more severe question: if a critical problem is spotted, like a security bug that requires an immediate config change, what do you do? You have maybe hundreds of repos. How do you fix the problem? How do you make sure you cover everything and miss nothing? Based on that, the choice we made is a centralized repo: everyone checks their configuration into that centralized repo.

The other question with regard to infrastructure governance is: who owns what? A mesh is big. No matter how many meshes you have, probably one or a few, you always have this problem: who owns what? At Pinterest, we have three personas: the infrastructure engineer, the security engineer, and the service owner. Each of them owns different parts of the mesh, as listed here. So, how do we fit those personas into the service mesh world?
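Before walking through the personas, here is a small sketch of what that "typed and tested" flow can look like: a Jinja template is materialized at build time, and unit-test-style checks run over the result. The template, field names, and rules below are illustrative only, not Pinterest's actual templates or checks.

```python
# Minimal sketch of "typed, tested configuration": render a Jinja template at
# build time, then run unit-test-style validation over the materialized config.
# Assumes jinja2 and PyYAML are available; everything here is illustrative.
import jinja2
import yaml

TEMPLATE = """
listeners:
  - name: ingress_https
    port: {{ ingress_port }}
    route_to: {{ upstream }}
clusters:
  - name: {{ upstream }}
    tls: {{ upstream_tls | lower }}
"""


def materialize(params: dict) -> dict:
    """Render the template and parse it, so malformed configs fail at build time."""
    rendered = jinja2.Template(TEMPLATE).render(**params)
    return yaml.safe_load(rendered)


def validate(config: dict) -> list:
    """Build-time checks: every listener routes to a defined cluster,
    and listener ports do not conflict."""
    errors = []
    cluster_names = {c["name"] for c in config.get("clusters", [])}
    seen_ports = set()
    for listener in config.get("listeners", []):
        if listener["route_to"] not in cluster_names:
            errors.append(f"{listener['name']}: unknown upstream {listener['route_to']}")
        if listener["port"] in seen_ports:
            errors.append(f"{listener['name']}: port {listener['port']} conflicts")
        seen_ports.add(listener["port"])
    return errors


config = materialize({"ingress_port": 8443, "upstream": "service_b", "upstream_tls": True})
assert validate(config) == []
```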
Let's start simple. There is service A and there is service B, and service A talks to service B. In this case, we call service A the downstream and service B the upstream, and instead of treating each service as a monolith, we cut it into three pieces: ingress, egress, and the service itself. Then we fit each persona's responsibilities into each piece. Infrastructure engineering: this persona owns the mesh infrastructure. For example, the infrastructure engineer defines well-known ports, like the rule that the ingress HTTPS port must be 8443. Then security engineering: security engineering owns TLS and authorization, and they control both ingress and egress. Last, the service owner. The service owner, of course, owns the business logic in their service and the routing on the egress side. However, and this is the most important part, think of yourself as the owner of service B. Another service owner wants to call your service, and you want to rate-limit that caller. That rate-limit config is part of service A, but it is provided by service B's owner. This scenario makes it complicated: one service owner contributes to a downstream's configuration. Think about that. It is the right thing to do, but it goes beyond the traditional definition where service owner A and service owner B each own everything about their own service entirely. It's different. This is how mesh configuration should look. With that, if you look at the owner of service B, on the egress side they barely have anything left to configure, probably just a declaration of the upstreams they need to talk to.

Now, the staged rollout. Say we have a config change that gets landed. We start pushing that config through a pipeline. A Jenkins job kicks in and starts running unit tests. Although unit tests already ran before the PR landed, we still run them after landing, because a merge could introduce a configuration failure. Once that finishes, the pipeline uploads the configuration into Tower, the centralized control plane, and then immediately deploys the configuration to the latest stage. Every config change starts a pipeline run, and every pipeline run deploys the configuration to the latest stage automatically. It takes about three minutes after a config lands until it gets activated in the latest stage. After that finishes, the pipeline sends a notification to the on-call engineer and waits for approval. If the on-call engineer approves, clicks the approve button, it goes to the next stage, which is canary and devapp. Those stages activate the new configuration, then watch for several minutes, continuously polling the service status. If there is no problem, the canary stage is considered passed, and another notification goes out for approval. Once that is approved, it goes to the prod stage. In prod, we do regions in parallel; however, within each region, we go availability zone by availability zone, so each time we deploy only one availability zone. And if something unexpected happens, we stop there and investigate before continuing the rollout.
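Here is a rough sketch of that stage progression: latest first, then approval gates, then canary/devapp, then prod availability zone by availability zone. The stage names, regions, approval mechanism, and health check below are placeholders, not the actual pipeline code.

```python
# Rough sketch of the staged rollout: latest -> (approval) -> canary/devapp
# -> (approval) -> prod, where prod proceeds one availability zone at a time.
STAGES = [
    {"name": "latest", "needs_approval": False},
    {"name": "canary+devapp", "needs_approval": True},
    {"name": "prod", "needs_approval": True},
]

PROD_REGIONS = {
    "region-1": ["az-a", "az-b", "az-c"],  # regions run in parallel in reality;
    "region-2": ["az-a", "az-b", "az-c"],  # shown sequentially here for brevity
}


def activate(scope: str, version: str) -> None:
    """Placeholder: push config `version` to the Envoys in `scope`."""


def healthy(scope: str) -> bool:
    """Placeholder for the near real-time feedback check (next section)."""
    return True


def approved(stage: str) -> bool:
    """Placeholder for the on-call engineer's approve button in Slack."""
    return True


def roll_out(version: str) -> bool:
    for stage in STAGES:
        if stage["needs_approval"] and not approved(stage["name"]):
            return False                        # wait for on-call sign-off
        if stage["name"] != "prod":
            activate(stage["name"], version)
            if not healthy(stage["name"]):
                return False                    # stop and investigate
            continue
        for region, zones in PROD_REGIONS.items():
            for az in zones:                    # one availability zone at a time
                activate(f"{region}/{az}", version)
                if not healthy(f"{region}/{az}"):
                    return False                # stop and investigate
    return True
```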
So when we roll out a configuration, how do we know there is a problem? We detect problems through a near real-time feedback loop. This is the definition of a negative feedback loop from Wikipedia. I don't want to repeat it, but basically: you take the output of your amplifier, apply some function to part of the output, and feed it back into the input. Through that, you keep tuning your system to get your ideal output.

How do we apply that to our mesh? The feedback part is used to detect problems. What problems do we detect? We detect rejected configs, basically through the xDS protocol. We also detect health check failures, in both the data plane and the control plane. On the data plane side, the service may be unhealthy. On the control plane side, maybe there is no config available for that specific cluster, or the mesh infrastructure is not ready, or a listener is not up. Another case is when you send out a configuration, it triggers some bug, and Envoy just disappears; then we lose clients. And the last check we do is version mismatch: you send out a configuration saying, hey, Envoys on this cluster, please use version 100, and then you observe that some or all of them are actually running an old version, like 99. Then you know there is a problem, because they are not on the right version.

The first signal is xDS resource acceptance or rejection. The xDS protocol is built around discovery requests and responses. In this flow, the config pipeline sends a new version of the configuration to Tower. Tower sends this new version to Beacon, the host agent. The host agent converts the new version into an xDS response and hands it to Envoy. Envoy looks at the version, does a bunch of checks, and decides: I'm okay with this config, I accept it. Then it uses a discovery request to ACK the new version. That gets aggregated in Beacon, and Beacon reports the aggregate back to Tower. The pipeline keeps querying Tower to see: is there any failure? Another case: for example, the new version requires a new secret, but that secret wasn't granted to the Envoy process. Envoy gets the new configuration, sees it needs a new secret, asks xDS for the new secret, and the request is rejected. Envoy reports an error to Beacon, and Beacon eventually sends the aggregated result to Tower. Tower marks the cluster, noting that this cluster had this problem, and sends the error to ELK. At the same time, the config pipeline sees this error and fails the deployment. We do this cluster by cluster, so any cluster with a configuration problem will be detected and reflected in the configuration pipeline.

Another part is the holistic health check. We health-check listeners, and we health-check the local control plane components: the host agent, the SDS sidecar, and the OPA agent. We also check listeners at both L7 and L4. We implement this as a script so that each platform can invoke it easily, since it does not depend on any platform.

The last piece is confidence. How does the pipeline build confidence? It's built on top of the reporting cadence. Envoy reports to Beacon every 30 seconds, and Beacon reports to the centralized control plane, Tower, every 30 seconds. So after about one minute, you should be able to see the latest results after sending out the new configuration. Starting from T plus one minute, the pipeline starts building confidence. After another minute, at T plus two minutes, it looks at the results: if, say, the configuration touches 1,000 clusters and all of them show more than a 90% acceptance rate, then we have enough confidence, so let's exit early; we don't have to wait, just move to the next stage. If it does not get enough confidence, it keeps waiting and keeps looking. After five minutes, it gives up, fails the deployment, and pages the on-call engineer to take a look at what's going on.
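As a toy illustration of that confidence logic, the reporting lag, the early exit on high acceptance, and the five-minute timeout, here is a sketch. The thresholds, report format, and function names are assumptions, not the real pipeline.

```python
# Toy sketch of the confidence loop: reports lag roughly a minute because of
# the Envoy -> Beacon -> Tower reporting cadence, the pipeline exits early on
# high acceptance from about T+2 minutes, and gives up after five minutes.
import time

ACCEPT_THRESHOLD = 0.90   # per-cluster acceptance rate needed to exit early
EARLY_EXIT_AFTER = 120    # seconds: reports are meaningful after ~2 minutes
DEADLINE = 300            # seconds: give up and page on-call after 5 minutes


def poll_acceptance(version: str) -> dict:
    """Placeholder for querying Tower: cluster name -> fraction of Envoys
    in that cluster that ACKed this config version."""
    return {}


def wait_for_confidence(version: str, poll=poll_acceptance,
                        clock=time.monotonic, sleep=time.sleep) -> bool:
    start = clock()
    while clock() - start < DEADLINE:
        rates = poll(version)
        if (clock() - start >= EARLY_EXIT_AFTER and rates
                and all(r >= ACCEPT_THRESHOLD for r in rates.values())):
            return True   # enough confidence: move on to the next stage early
        sleep(30)         # wait for the next reporting cycle
    return False          # timed out: fail the deployment and page on-call
```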
With regard to visibility: since everyone uses Slack, we send a Slack message for every deployment activity to our channel. On the right side, you can see that a mesh configuration was rolled out and a notification was sent. After it finished the latest stage, it sent another notification: please approve this change for devapp and canary. You can see someone clicked approve, and it moved on to prod. It takes several minutes, maybe ten, to actually finish prod, and then it sent a congratulations: everything was good. On the left side, you can see a configuration that actually had a problem. The pipeline detected that problem and sent a notification: there is an error, please investigate. It includes an ELK link so the on-call engineer can click it and see the detailed error.

So in summary: we have thousands of clusters, serving millions of RPS at the edge. From the time a change lands to its full activation in production, if everything goes well, it's less than 30 minutes. We just finished the xDS v2 to v3 API and configuration migration, and we had zero incidents, thanks to the configuration pipeline and the near real-time feedback loop. And because we built the control plane as a generic resource distribution service, instead of just for xDS, other teams and other organizations are seeing value in this system and are moving their existing services' configurations into this mesh control plane, because they also want this kind of per-change validation and protection.

Lastly, some best practices and lessons we learned. First, infrastructure governance is crucial; this has been very important for our success. Let's admit it: today, service mesh is still a quickly evolving world, so changes are expected most of the time. Two years ago, we were talking about xDS v2. Last year, we talked about v3 and the Universal Data Plane API. And this year, if you look at xDS, UDPA is already everywhere. Because of that, you need to define clear boundaries between each role and each team: who owns what. Otherwise, you will run into fragmentation, and that is going to make your mesh unmanageable someday. I cannot emphasize this enough.

The next one: mTLS is your good friend, if not your best. Once, we had a problem with our control plane which delayed EDS updates and made certain endpoint data stale. Because of that, if we did not have mTLS, we would have had a SEV, because requests were being routed to the incorrect services. However, because we had mutual TLS, the connections were blocked at layer 4, so layer 7 never even got a chance to talk to the wrong endpoints. We did not have a SEV 0 because of that bug; the only thing we saw was slightly increased latency at the API service. Other than that, no problem. So mTLS actually saved us from a SEV 0. It's not just for authentication.
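To illustrate that last point, here is a tiny, hypothetical sketch of why a misrouted connection dies at layer 4 when mutual TLS is enforced: certificate verification fails during the handshake, so no HTTP request is ever sent. The hostnames, ports, and certificate paths here are made up.

```python
# Minimal illustration: with (m)TLS enforced, a request routed to the wrong
# endpoint fails during the TLS handshake, before any HTTP ever goes out.
import socket
import ssl

ctx = ssl.create_default_context(cafile="/etc/certs/internal-ca.pem")
ctx.load_cert_chain("/etc/certs/service-a.pem", "/etc/certs/service-a.key")  # client cert for mTLS


def call_upstream(host: str, port: int) -> None:
    with socket.create_connection((host, port), timeout=2.0) as raw:
        # If `host` is actually the wrong service, certificate verification
        # fails here (ssl.SSLCertVerificationError) -- layer 7 never runs.
        with ctx.wrap_socket(raw, server_hostname="service-b.internal") as tls:
            tls.sendall(b"GET /healthz HTTP/1.1\r\nHost: service-b.internal\r\n\r\n")
```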
Another one is RTDS. RTDS is very powerful, and sometimes very dangerous. Here is a sad story. We had added a global runtime layer to our control plane for our Envoys; we thought it would be a good idea to be able to roll something out globally. One day, one of our developers wanted to change something in that global layer and typed it on the command line, but he forgot to wrap the JSON string in quotes. That sent a malformed JSON value to Envoy, and within two seconds, every Envoy at Pinterest got this wrong value. Sadly, there was a bug in Envoy, a catastrophic backtracking bug, which caused Envoy to run out of memory within one minute. So Envoy crashed because of OOM, restarted, worked for maybe 30 seconds, then crashed again. It crashed every minute. It took us a lot of effort and almost 30 minutes to recover every Envoy. So this is a sad story. The lesson learned here: RTDS is powerful, but if it's not necessary, don't use a global layer. In fact, that's what we did after this incident: we disabled the global layer.
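One way to read that lesson as code: validate the runtime value before anything is pushed, and keep the blast radius to a single named cluster rather than a global layer. This is a hypothetical guard to illustrate the idea, not the tooling Pinterest actually built.

```python
# Illustrative guard for runtime (RTDS) overrides: refuse malformed JSON and
# refuse the global layer, so a typo cannot reach every Envoy in seconds.
import json


def push_runtime_override(key: str, raw_value: str, scope: str, publish) -> None:
    if scope == "global":
        raise ValueError("global runtime layer is disabled; target one cluster instead")
    try:
        value = json.loads(raw_value)   # catches the forgot-to-quote-it case
    except json.JSONDecodeError as err:
        raise ValueError(f"runtime value for {key!r} is not valid JSON: {err}")
    publish(scope, key, value)          # hand off to the per-cluster runtime layer
```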