Hello everyone. Today I'm here to talk to you about how we improved the performance of service-to-service interactions using Envoy at the Wikimedia Foundation. I'll start by introducing our infrastructure and explaining why Envoy made sense in that context; then I'll briefly cover how we went through the transition of introducing Envoy in production, and finally I'll focus on what we gained from it.

Let's start with who we are. The Wikimedia Foundation is a non-profit organization that runs the infrastructure supporting Wikipedia and its sister projects like Wiktionary or Wikidata. We handle quite a lot of traffic: 21 billion page views per month, as of last August.

Now, how is our infrastructure structured? We have five data centers, three of which are just caching PoPs, points of presence that only host the caching layer. The two main data centers, here in blue and green, are both located in the United States, one in Virginia and the other in Dallas, and they host our whole application stack. What happens when somebody makes a request? Let's say somebody in Africa connects to Wikipedia: GeoDNS will direct them to the nearest point of presence, which is esams. If the visitor is logged in, or if the page is not present in the cache, their traffic will be sent on to the application-layer data centers, and the response will be computed and fetched back from there. If the user is browsing anonymously and the page has been seen in the last day, they will get a response directly from the caching data center.

In the main data centers that run the application stack, our system is a mix: some applications and microservices run on Kubernetes, and some are still to be moved to Kubernetes from our legacy environment, which is basically physical hosts. One of these applications is our de facto monolith, MediaWiki. A peculiarity of MediaWiki is that while most of our other applications can serve traffic from both main data centers, MediaWiki is active/passive, meaning it can only serve traffic from one data center at a time. This means that services in one data center might fetch data from the MediaWiki API in the other data center, but also that MediaWiki might need to connect to data stores or other services in both data centers. As an example, when we get an edit, we need to notify the Elasticsearch clusters that power the search box on Wikipedia that the article has been modified. We have independent clusters in the two data centers, so we need to send the update to both, which means some of the traffic goes across data centers (the red arrows here). And while we encrypted the traffic between the edge caching PoPs and the main data centers a long time ago, we had not encrypted this traffic between applications, so this cross-data-center traffic was going in the clear. If the last ten years have taught us anything, it is that if you run more than one data center, you really want the communications between them to be encrypted, because that makes life harder for nation-state actors who want to snoop on your users' data. So we needed to introduce TLS in front of all of our applications.
Now, we could have asked every development team to add TLS termination to their application, but that would have meant asking all of those teams to become somewhat expert in configuring a TLS stack, and at the same time it would have meant tracking security issues across multiple application stacks. So we decided pretty early on to install a TLS-terminating sidecar.

We chose Envoy for that function for a series of reasons. The first is that Envoy is not open-core, unlike some other TLS-terminating proxies. Then there was performance: we had seen reports that Envoy is blazingly fast, and I told you before that our logged-in users always get sent back to the main data centers. Logged-in users are typically the editors, the people who add content to the wikis, and they are in some ways our most valued users, because they are the ones who build the projects that make them successful. These users already pay a price for their loyalty by always being routed to the main data centers; the reason is that when you're logged in you can change the interface and appearance of the site, so we can't just send you a cached copy. We didn't want our introduction of encryption between services to add yet another latency penalty for these users. Beyond that, Envoy has been designed to be the perfect service-to-service middleware, so it has a series of characteristics we really wanted: its observability features, which add telemetry to all services; the ability to emit tracing data; and the additional things you want when you build a true microservices infrastructure, like rate limiting and circuit breaking built into the proxy, so you can have a common implementation across services. And finally, it's very easy to configure. I'm joking, and this is kind of a cheap shot, but there is a reason why I bring up configuration, and it will become clear in a couple of slides.

So we understood the why and the which; let's move on to how we did the transition. Again, the Wikipedia projects are very high traffic: we run a very large website with a lot of edge cases and a ton of traffic. If working at the Foundation for six years has taught me anything, it's that there is always an edge case, so we wanted to proceed with a certain level of caution. That meant dividing the transition into phases. First, we introduced TLS termination in front of all services and made the other services call them over HTTPS. Only in a second phase did we start adding configuration to Envoy so it could act as a service-to-service proxy, and reconfigure the applications to use it progressively.

The biggest problem we faced during the transition is that the configuration of Envoy is very complex. It's well documented, but there is a steep learning curve, and we didn't feel that everybody on the team needed to become an Envoy expert. We also had the problem of two different templating engines: Helm for Kubernetes, and Puppet for the legacy environment. So we needed a common template, where we sadly had to implement the templating primitives twice, but we wanted the same data structure defining listeners and clusters for Envoy to be usable both by Kubernetes and by the legacy environment.
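To give a sense of the verbosity we wanted to hide behind that common data structure, here is a minimal sketch of what a bare-bones TLS-terminating sidecar looks like in raw Envoy (v3 API) configuration. This is an illustration, not our production config; the ports, names, and certificate paths are made up:

```yaml
static_resources:
  listeners:
    - name: tls_termination
      address:
        socket_address: { address: 0.0.0.0, port_value: 443 }
      filter_chains:
        - transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              common_tls_context:
                tls_certificates:
                  # Hypothetical certificate paths.
                  - certificate_chain: { filename: /etc/envoy/ssl/service.crt }
                    private_key: { filename: /etc/envoy/ssl/service.key }
          filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_tls
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: local_service
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: local_app }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    # Decrypted traffic is forwarded to the application on localhost.
    - name: local_app
      type: STATIC
      connect_timeout: 1s
      load_assignment:
        cluster_name: local_app
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: 127.0.0.1, port_value: 8080 }
```

That is roughly forty lines for one listener doing one job, which is why we did not want every SRE writing this by hand.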
And we wanted this to be simple and easy for any SRE to understand, even with no prior experience with Envoy. So we came up with a simple YAML data structure; here is an example taken from our configuration. This example doesn't have all the keys you can define, but it's a good portion of them, and basically our design goal was that any SRE, with 15 lines of documentation, is able to add a basic listener to the Envoy configuration. The other goal was for this to be boring. By boring I mean we didn't want surprises: once you have Envoy mediating all your HTTP traffic between microservices, changing anything in its configuration becomes really scary, because the blast radius is huge. I want to give a shout-out to the Envoy developers for adding the validate mode to the server, which allowed us to easily catch any errors we were introducing in the configuration directly in the continuous integration environment. Making your CI check that the configuration does what you expect is really something you should invest in.

Now, all transitions in real life come with some struggles, and I want to name a few of ours. The first and foremost: we use level-4 load balancing, which means we balance TCP connections, not requests. Envoy tries to funnel as many requests as possible through the same connection; that's one of its strengths, and one of the reasons we made big gains later, but it also makes Envoy behave very badly behind a connection-based load balancer, because it can send one million requests across one connection and three requests across another. You've balanced connections across the backends, but you haven't balanced requests, which is what you really want to balance. What we did was simply limit the number of requests that can be sent over one single connection, to 1,000 by default, and that was enough to make these problems basically go away.

We had another problem: especially for the applications running at high scale, thousands and thousands of requests per second, we saw some mysterious connection failures happening from time to time. We traced it back to the fact that application servers typically define a keep-alive timeout for HTTP connections, so they can kill a connection that is being kept alive by the client if the client doesn't send any data over it for more than, say, 10 seconds. It turned out we had to account for that in Envoy as well. Going back to the example from before: there we set the keep-alive, which becomes an idle timeout in Envoy speak, to 4.5 seconds, because that application, eventgate-analytics, is a Node.js application, and Node.js by default has a keep-alive timeout of 5 seconds. Just keep that value a bit smaller on the Envoy side than on the server side, and all those errors go away.

And finally, we chose not to go the Istio way, where you make all routing through Envoy transparent to the application by using iptables rules. We decided to actually change the configuration of each application to direct its requests to Envoy, also because this way we could switch one backend at a time if we needed to.
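In raw Envoy terms, the two connection-handling fixes described above (the per-connection request cap and the idle timeout) both land on the cluster definition. Here is a hedged sketch; the field names are real Envoy v3 API options (newer Envoy versions prefer moving some of them into typed per-protocol options), but the service name, address, and port are hypothetical, not our generated config:

```yaml
clusters:
  - name: eventgate-analytics          # hypothetical upstream name
    type: STRICT_DNS
    connect_timeout: 1s
    lb_policy: ROUND_ROBIN
    # Cap requests per upstream connection so that an L4,
    # connection-based load balancer still sees a roughly even
    # spread of requests across backends.
    max_requests_per_connection: 1000
    # Close idle upstream connections before the server's own
    # keep-alive timeout fires (Node.js defaults to 5s), so Envoy
    # never reuses a connection the server is already tearing down.
    common_http_protocol_options:
      idle_timeout: 4.5s
    load_assignment:
      cluster_name: eventgate-analytics
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: eventgate-analytics.example.internal  # hypothetical
                    port_value: 8192                               # hypothetical
```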
The problem with that approach is that sometimes the same configuration key is used both for finding the upstream server to call and for emitting some data to the user, like the URL for a CSS file. And so when you put localhost and the local Envoy port into the configuration of our mobile application service, you might break mobile Wikipedia, like I did. The point is simply that it's not always cost-free to change the configuration of a service to point to localhost.

Let's go back to the good news. As I said before, one of the reasons we chose Envoy was performance. We knew Envoy's performance is great, but what we didn't expect is that we would actually improve the performance of our stack by introducing it. The reason for that is PHP. Now, it's easy to dunk on PHP, but it's also undeniable that for all its flaws it's very successful; it runs some of the largest websites in the world. One of the creators of HHVM, Keith Adams, has argued that the reason for PHP's success is its scoping model: the scope of any execution of a PHP script in a web server is the web request. At the start of a web request you begin with a basically empty scope; you have nothing besides some globals and the request variables. Then you have to build everything: you have to allocate memory, you have to make all the connections you need, and at the end of the request everything, the memory you allocated, the file descriptors you opened, is thrown away. You can see how this makes it very, very easy to write a web application in PHP without having to worry about memory leaks and such. At the same time it gives you another unique advantage: serving requests concurrently in PHP is incredibly easy. You have to do nothing to get it, because every request is isolated by default. It's a shared-nothing architecture, where you can run things in parallel as much as you want. This is an approximation; forgive me, if you know PHP better you're probably saying "well, actually..." at this point, but for the sake of the argument let's assume nothing is shared between requests.

This also means that you cannot share things like connection pools, which means that whenever your PHP application has to call other services, it needs to create a new connection for every request it makes. That is a cost, and the cost is even bigger if the connection uses TLS, because TLS adds at least two additional round trips. I say at least because it depends on a series of factors; it's one round trip if you get to TLS 1.3, but good luck using TLS 1.3 from PHP. Then there's the computational cost of actually establishing the TLS connection: you have to exchange certificates, you have to exchange secrets, and you have to recreate the session tickets on the server side for every request because, as we said, the client side shares nothing. So what we expected was that this would introduce higher CPU usage, higher network usage, and, if there is latency on your network, added latency in the applications.

Now, I said before that it's a bit more complex than pure shared-nothing. That's because PHP extensions are written in C and can bypass the shared-nothing behavior of PHP; in fact the curl extension, which is what everybody uses to make remote HTTP calls, allows you in HHVM to predefine shared connection pools for specific remote host names.
In Zend PHP, which is what we are using, this is not possible at all; there's no way to do it. So we knew that introducing TLS, especially for requests that went across data centers, could make us pay a big price, and we wanted to test how much. To do that, we ran a very small benchmark of cross-DC performance in our production environment. We wrote a small script that would fetch, using the PHP curl extension, a page from the Elasticsearch cluster in the other data center, which is more or less 35 milliseconds away round-trip; at two round trips per TLS handshake, that's roughly 70 milliseconds of extra latency on every new connection, before any cryptography. Then we called this script with a concurrency of 100 under three different conditions. In the first case we pointed the script directly at Elasticsearch over plain HTTP, so no encryption, and got a throughput of 720 requests per second. Then we configured it to connect to Elasticsearch directly but over TLS, and the throughput was severely reduced. Finally, we configured Envoy on the same machine to manage the connections to Elasticsearch and made the PHP script connect to Envoy on localhost, so the connections were encrypted but mediated by Envoy; in this case we obtained 1,050 requests per second, which is more than double what we got with direct TLS calls, and much, much better even than direct calls with no encryption. Take these numbers with some discretion, there are large error bars across them, but that is still roughly a 45% throughput gain over unencrypted direct calls when we use Envoy for TLS calls to a remote data center.

OK, so this would solve our problem: we wanted encrypted calls across data centers without losing performance. So we started phase two of the migration, introducing Envoy as a middleware, with our biggest application, the one written in PHP, which is MediaWiki, instead of starting, as would be customary for an SRE, from a smallish service we don't worry too much about. The gain we saw in front of us was too large to ignore, so we started from there. Let's see what happened.

As I said before, we expected latency gains mainly across data centers, and a reduction in CPU and network usage. Let's see what happened when we transitioned MediaWiki to use Envoy for calling session store. Session store, as you can guess from the name, is a small Golang application used to manage user sessions and provide that data to other services, first of all MediaWiki. At the time of the transition, about one quarter of MediaWiki traffic was using session store, so MediaWiki was doing 4,500 requests per second to it. At the moment of the transition you can see here that the CPU usage across all the session store pods went from 2.5 CPU-seconds per second, that is 2.5 CPUs, to about 0.7 CPUs. This is amazing, but let's see what happened with the network. There the effect was even more unexpected. We saw a big drop in the number of bytes exchanged, which we expected; but we also saw that the difference between transmitted and received data basically disappeared, because we were no longer transmitting the TLS certificate 4,500 times per second, but just a few times per minute. This all looks very good on paper, but what about latency, which is what we really care about, because it's what the users see? The latency of both the service and MediaWiki was reduced significantly. Let's see how.
So this is the latency bucket graph for the session store service. You can see here below that the green line, requests served in less than 1 millisecond, basically doubled at the time of the transition, while the number of requests that took over 10 milliseconds almost disappeared. So much for the latency of session store; let's see if it had any effect on MediaWiki, always remembering that only about one fourth of the traffic was affected by the transition, because only one fourth of the traffic was using session store at the time. At the time of the transition you can see here that the percentage of requests that MediaWiki answered in less than 100 milliseconds went from about 21%, fluctuating between 20 and 22, to between 25 and 27: roughly a 5 percentage-point increase in responses taking less than 100 milliseconds. Not a super shocking result, but still pretty impressive if you keep in mind that we did no optimization at the application layer.

Now, this gain depends on volume, and I want to show that with a couple more graphs. These are two different deployments of the same application, basically a REST gateway to Kafka. One receives 2,000 requests per second from MediaWiki, and there the CPU usage reduction is about 25%; the other receives 13,000 requests per second, and there the CPU usage reduction was 40%. So keep in mind that how much you gain by introducing something that maintains persistent connections on behalf of PHP depends on the volume of calls you make; if you're in the range of hundreds of calls per second, this is probably not very important for you.

Still, there are other things we gained that apply even to small-scale infrastructures. One: as I told you before, MediaWiki runs active/passive, and at the start of September we switched the data center we were serving MediaWiki from, because from time to time we want to verify that the infrastructure is sound and healthy in both data centers. Everything went well, but overnight we noticed that the save times reported by end users had doubled for saving edits. Normally you would have no idea where to look, because basically everything is starting fresh, MediaWiki and all of its stack, and you have to figure out which part is not behaving as expected. Thanks to Envoy we were able to pinpoint within minutes where the problem was, in an upstream service called through Envoy, and to find a resolution. This was a huge advantage compared to anything we had before.

In addition to that, circuit breaking in Envoy is amazing, even in its default configuration. We have some services that MediaWiki calls on basically every request, and if one of those services becomes too slow, traditionally we would have had all the requests piling up in PHP waiting for a response from the remote service, because we couldn't be too aggressive with timeouts. With Envoy, even in its default configuration, Envoy would detect that too many timeouts were coming from the remote service, start considering it unhealthy, and just return a fast error to MediaWiki. That means MediaWiki was still able to operate, even if in a degraded state: maybe it wasn't reporting statistics about visits, but it was still serving pages to the users, which is what we care most about.
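As the Q&A below clarifies, what I'm describing here is what Envoy calls outlier detection. A minimal sketch of turning it on explicitly for a cluster might look like this; the fields are real Envoy v3 options, but the service name and values are illustrative, not our configuration:

```yaml
clusters:
  - name: statistics-service     # hypothetical slow upstream
    # ... type, connect_timeout, load_assignment as usual ...
    outlier_detection:
      # Eject a host after 5 consecutive gateway-class failures;
      # locally originated connect failures and timeouts surface
      # here as 503s.
      consecutive_gateway_failure: 5
      interval: 10s              # how often hosts are evaluated
      base_ejection_time: 30s    # first ejection lasts this long
      max_ejection_percent: 100  # allow ejecting every host if needed
```

While a host is ejected, callers get a fast error instead of a worker-consuming slow timeout, which is exactly the degraded-but-alive behavior described above.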
Finally, just one note: we still have some rough edges. That's normal and expected when you transition to something completely new, and it will take time to iron them all out. But overall I think this was an unmitigated success: by using Envoy we were able to encrypt all the communications between our services, to add a lot of observability to them, and to dramatically improve the performance of our PHP applications. Really, anyone running a PHP application at scale should think about doing something like this. Well, thank you for your time, I hope I didn't bore you too much. I just want to point out that you can see all of our dashboards, from which all of the graphs I showed you are taken, at grafana.wikimedia.org, you can talk to us on IRC on Freenode, and my team is hiring. With this final shameless plug, I thank you all for listening. Goodbye!

Hello everyone, I hope you can hear me. So first of all, yes: Jake is asking about "circuit breaking". I'm talking about the general concept of circuit breaking; in Envoy's terminology, what I described is outlier detection. Sorry, I should have been clearer; I realized it while I was listening just now. I think I already answered in the chat, because I knew I was running a bit late on time.

I see that Derek was asking whether the hot restarter works differently when swapping config files. We wanted to use the hot restarter anyway for the things that are not running on Kubernetes, because we don't want to abruptly sever all the in-flight requests whenever we have to restart Envoy. So at that point it was natural to just change the files: we have a procedure to build the configuration files, deploy them, and then send a signal to the hot restarter, which starts a new Envoy running the new configuration. Also, from what I remember, if the configuration is wrong the old version should keep working, but maybe that changed over the last year. OK, I didn't see other questions; if I missed any, please just ask them again.

Someone mentions loading failures during hot restart? No, interesting, we didn't experience that, but probably because we're not running Envoy at the edge, only between the services internally, no single Envoy is handling the volume of requests that could trigger that; every Envoy is running at pretty low scale, maybe a thousand requests per second overall at peak. OK, thank you very much for listening, and have a nice rest of the conference.