All right, welcome back everyone. Hope that abbreviated break was sufficient; I'll try to keep this interesting heading into the home stretch. I wanted to talk today about the evolution of Twitter's edge, in particular how Twitter came to be using Envoy at scale at the edge. My hope is this is interesting from a couple of angles. First, as a case study of a large user of Envoy as an edge proxy, as opposed to the service mesh use case. And second, as an open source story: I think it's pretty cool how ideas born in one context can resurface in a different context, spawn a popular open source project, and then return full circle to the organization where some of those ideas were born. I'll go a little more into that. But if you have any questions about why open source is eating the world, this is an awesome example.

My name's Ryland. I'm the manager of the TFE team at Twitter, which is the Twitter front end team. Formerly I was on the edge team at Netflix, so I've been thinking about problems at scale at the edge for a while. The TFE team is now responsible for handling hundreds of billions of requests a day, filtered through Envoy, and everything I'm going to talk about is essentially the result of two years of work by the TFE team, so I'm not going to take credit for any of it. I also want to mention the CSL team, the core systems libraries team at Twitter, who the TFE team collaborated with to adopt Envoy at scale within Twitter.

We pride ourselves at Twitter on our availability and reliability. If you remember a couple of weeks ago when Facebook and Instagram went down and half the internet seemed to be burning, Twitter stayed up, and we were pretty proud of that, given the unique challenges that make it hard to accomplish. One is the sheer scale: 200 million daily active users coming in, wanting to use Twitter, sending millions of requests a second. To make matters worse, those users are geographically distributed. Despite our data centers being located primarily in the US, our user base is growing much more rapidly in other parts of the world with less reliable internet connections, and yet we have to make a service that's both responsive and reliable for all of our users globally.

We also have to deal with unpredictable load spikes, some more predictable than others. Every New Year's at exactly midnight there'll be a huge load spike in Japan, just because people tend to queue up tweets and send them all at once; that has caused us grief in the past, and we've learned how to adapt to that one. But once in a while there'll be a television show, say Big Brother Brazil, whose popularity nobody could have anticipated, and that almost took out Twitter in Brazil except for some last-minute heroics. So there are always interesting things cropping up. And some load spikes aren't organic; some are malicious. Twitter has a large unauthenticated API: for much of Twitter's API surface area you don't have to log in at all, so you can view a tweet, embed a tweet, or visit a user's timeline without any kind of authentication.
And that makes it even more difficult to protect against malicious users, or vigilante internet archivists, which are something I didn't realize existed until I came to Twitter: people who are fundamentally against link shorteners and want to hit every link shortener you have in order to figure out what's behind it. Beyond that, we don't have much control over the clients communicating with us. If you go to Twitter on your phone, on Android or iOS, you may be using a Twitter owned-and-operated client; otherwise you might be using a web browser, or a Go script or a Python script talking to our APIs. We just don't have a lot of control over that. And yet we've managed to stay up. But the story wasn't always that good.

So I want to rewind a little to Twitter pre-2014, before we were really doing much of anything at the edge. We had some points of presence (PoPs) globally, networking PoPs, but we weren't doing anything like terminating TLS or anything fancy at the edge at all. And this started to become more and more of a concern. In many locations around the world, and I'll take Brazil as an example, the latency was simply too high for users to reliably connect and use the Twitter service, for a bunch of reasons. We did a lot of research, broke down what was taking so long, and came to the conclusion that most of the time was spent in the communication part: establishing connections to Twitter, doing TLS negotiation, multiple round trips, and transmitting and receiving data over unreliable network connections. We concluded we could make the backend services as fast as we wanted and it really wouldn't make much of a dent in what users in less well-connected areas were experiencing. By and large it was the time between the user and the data center that was taking up the vast majority of a request, not anything we were actually doing inside the data center.

This issue came to a head prior to the 2014 World Cup in Brazil, where we realized it would be a massive missed opportunity if the service went down in Brazil or was unusable during that time, because we wanted to be part of the global conversation and allow people to tweet from the World Cup. So, enter the Twitter Streaming Aggregator, created by a Twitter employee called Matt Klein in 2013. The name comes from its original use case: consuming the Twitter firehose, everything that's happening within Twitter, where we offer a service that gives you a stream of all those aggregated events. Around early 2014, before the World Cup, we decided to repurpose the Twitter Streaming Aggregator as a generic edge proxy. And if you look at the description of what it's intended to do, it might sound a bit familiar: an L7 reverse proxy that speaks HTTP/1 and HTTP/2, with the goal of terminating TLS, handling a ton of TLS connections from our users, and multiplexing them over a small number of high-quality connections back to our data centers. The first time we ran this at scale was in strategic locations prior to the 2014 World Cup. It was a huge success, we stayed up during the World Cup, and TSA was born.
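To put some rough numbers behind that "the round trips are the problem" analysis, here's a back-of-the-envelope sketch in Scala. The RTT values are illustrative assumptions, not Twitter's measurements, but they show why terminating TLS at a nearby PoP moves the needle far more than speeding up the backend ever could.

```scala
// Back-of-the-envelope model of connection setup for a far-away user.
// All RTT values are assumptions for illustration only.
object ConnectionSetupSketch extends App {
  val rttUserToDcMs  = 180.0 // e.g. a user in Brazil talking straight to a US data center (assumed)
  val rttUserToPopMs = 20.0  // the same user talking to an in-country PoP (assumed)

  // Going direct: TCP handshake (1 RTT) + TLS 1.2 handshake (2 RTTs) + the request itself (1 RTT),
  // every one of them paid at the long-haul RTT.
  def directMs(rttDc: Double): Double = rttDc * 4

  // Via a PoP: the TCP and TLS handshakes happen against the nearby PoP (3 short RTTs),
  // then the request rides an already-warm HTTP/2 connection from PoP to data center,
  // so it pays one short RTT plus one long RTT.
  def viaPopMs(rttPop: Double, rttDc: Double): Double = rttPop * 3 + (rttPop + rttDc)

  println(f"direct to data center: ${directMs(rttUserToDcMs)}%.0f ms before any backend work")
  println(f"via a local PoP:       ${viaPopMs(rttUserToPopMs, rttUserToDcMs)}%.0f ms before any backend work")
}
```

Under those assumptions the first request drops from roughly 720 ms to roughly 260 ms of pure network time, a gap that no amount of backend optimization could close.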
And it was so successful that I could characterize the next several years of edge development at Twitter as basically expanding on this idea. It went from a proof of concept to being installed in many edge locations around the world, Twitter-owned PoPs globally. We do research, figure out where customers are getting a less-than-optimal experience, and figure out the best places to install PoPs where we terminate TLS close to users, so all those round trips only have to happen locally, and then you get a nice Twitter-owned connection back to our data centers. There are a couple of other nice things about this too. It separates the protocols used between user and PoP from those used between PoP and data center, so you can connect over HTTP/1 or whatever protocol to your local PoP, and we multiplex over nice HTTP/2 connections back to our data center. And it gives us the ability to shift traffic back and forth: we don't have to home a single PoP to a single data center. If a data center is having an issue, or an individual service is having an issue, we can redirect some or all of the traffic from each PoP to other data centers where the service is running fine. That's given us a lot of flexibility in dynamically shifting global traffic.

So if everything was so good, where did Envoy come in? Well, TSA had been around at Twitter for quite some time, seven years, which is a lifetime in the technology world, and it had started to get a little crufty. In particular, it's pretty bespoke C++ code, it's not always RFC compliant, and the main author left for Lyft, which tends to put a damper on feature development. In some ways, modern hardware simply evolved past it: the software was designed for machines with a maximum of, say, 20 or 24 cores, and we have much larger numbers of cores available to us now, but we found it just wasn't able to scale up to them; the threading model isn't as nice as Envoy's is now. There were some mistakes made along the way; this is the one you build to throw away before you get it right. It didn't support a bunch of modern protocols that we really wanted to use: we wanted to experiment with, say, HTTP/3, or gRPC, and since TSA was lacking proper trailer support, that was difficult. Things like TLS 1.3 would have been a huge effort to retrofit into our existing proxy. And from a security perspective, to get TSA up and running in the first place we had ended up forking OpenSSL and making our own modifications, and it took lots of work to port those changes to more up-to-date versions of OpenSSL, which was an ongoing nightmare. There's no concept of a control plane, it's all statically configured, and there's no real ability to do hot restart, so every time we wanted to make a configuration change we found ourselves having to drain a server, start up a new version, and route traffic back to it, which has an impact on users, since you'd like to keep those connections open as long as possible. So around 2017 we started looking at replacing TSA with an open source alternative, and around the same time we started hearing a lot about this project called Envoy coming out of Lyft.
We kicked the tires a little on various options, but I'd say it was in 2019 that we really committed to Envoy as the solution, in part because it was so similar to TSA that it was pretty straightforward to integrate back into the Twitter ecosystem, and in part, fortuitously, because the CSL team wanted to use Envoy in the context of a service mesh. That gave us a good opportunity to combine our resources on things like setting up a build pipeline and writing the bones of a control plane that could be configured both for a service mesh and for an edge proxy. So we started development of a project we call T3 in 2019, and starting late last year and into this year we've been rolling it out in production at Twitter.

I'll talk a little about how T3 is set up. It's an L7 proxy based on Envoy. What we love about it is that, instead of our small team having to implement every feature ourselves, Envoy is backed by a very mature OSS community that's active and supportive. We've been able to learn from the community and give back, in the form of fixes we implemented for ourselves that are also generically useful. We managed to make it a drop-in replacement for our existing proxy, which is no mean feat considering the seven years of cruft built into it, but Envoy's extensibility made it easy to find the places we needed to extend in order to match the features we had in production. And it's a lot faster than what we had. As I said, we were unable to scale our old proxy beyond a certain number of cores, but even in a like-for-like comparison we found Envoy to have about twice the throughput with similar or better latency, and it also unlocks the vast amount of CPU power we were essentially leaving on the table with TSA. So we've been very happy from a performance perspective. Even more so, the fact that Envoy has growing support for modern protocols is a huge deal for us; it lets us experiment with things that would otherwise have taken lots and lots of development effort to build into our own internal proxy. From a security perspective, we can now integrate a security patch, build, and deploy within a matter of hours, versus a months-long project. We get hot restart for free, and we get dynamic configuration via the control plane.

If I zoom in a little on how this looks: at Twitter we've built our own internal control plane in Scala. The reason is that we have a bunch of existing libraries; the core infrastructure of Twitter is mostly built around Scala libraries, if you have any experience with Finagle or anything like that. We have fat clients that interact with a whole bunch of systems out there, service discovery, feature switches, all of which is built into Finagle and the core set of libraries maintained by the CSL team. So it was pretty straightforward to implement our own control plane that serves Envoy's xDS APIs but is backed by all these existing Scala libraries. In terms of the data plane, we took Envoy, plus a few of our own modifications to core Envoy and a bunch of filters that we've implemented, and we build it all together.
They're all C++ filters, and we distribute our own Envoy binary. Some examples of things we've implemented, and I'll go into a few more later: Twitter-specific metrics extensions; a globally unique connection hash, so we can correlate things later in our data pipeline and see exactly what happened during the lifetime of an individual connection; the Mux protocol, which is the RPC multiplexing protocol we use internally at Twitter; and StartTLS, which we worked on a little with the Envoy maintainers, and which lets you start with an insecure connection and then upgrade it to a secure Mux connection. And more.

But the stuff we love about Envoy is the things we didn't have to build ourselves that would have taken us a lot of time. IPv6 was a huge one. It's nobody's idea of fun to retrofit IPv6 support into a seven-year-old C++ proxy that wasn't designed for it, but we got that more or less for free with Envoy. And like many people, Twitter is running out of IPv4 address space, so we're in the middle of a big migration now; that was a huge win. Things like HTTP/3: there's a lot of interest in starting to play around with and test newer protocol versions like H3. There's also a lot of internal demand for gRPC between services inside the data center that talk to one another; we run a different version of this proxy internally as well, so that communication also goes through a proxy, and up until Envoy we weren't able to offer gRPC support because of the lack of trailer support. Then things like SNI, Server Name Indication, apart from being kind of a prerequisite for HTTP/3, let us simplify our configuration a lot. And TLS 1.3 is super interesting at the edge because it's a huge performance win: TLS 1.3 reduces the number of round trips required for the handshake from two to one, essentially. If you remember, our analysis showed that almost all the latency you see when you make a request to Twitter is actually the process of establishing that connection to the data center, so this turns out to be a huge win. When we turned it on we looked pretty closely; we gathered a bunch of client metrics and saw roughly a 38% drop in handshake time, which translates into something like an almost 30% drop in home timeline pull-to-refresh request duration. So when you're scrolling Twitter and trying to refresh the screen, it translates directly into an improvement in the user experience.

And then, as the last speaker mentioned, bot mitigation is a huge deal for us at Twitter. We're constantly flooded with malicious requests, either trying to take down Twitter, trying to find interesting unused usernames, or just trying to gather data for whatever purpose. Prior to Envoy we had a pretty brain-dead way of dealing with this in TSA: if you made a connection, we would simply drop the connection after some number of requests. That was designed to prevent users from establishing a connection and then flooding as many requests as they wanted through that single connection and overwhelming upstream services. But of course it has a negative effect on legitimate users as well.
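As a rough illustration of how blunt that was, here's a minimal sketch of such a per-connection hard cap. This is an assumed reconstruction for illustration, not TSA's actual code: once a connection has carried some number of requests, it simply gets closed, no matter how well behaved the client is.

```scala
// A hard per-connection request cap: crude bot mitigation that also punishes
// legitimate long-lived connections. (Assumed sketch for illustration only.)
final class ConnectionRequestCap(maxRequests: Int) {
  private var served = 0

  /** Returns true if the request may proceed; false means "drop the connection". */
  def onRequest(): Boolean = {
    served += 1
    served <= maxRequests
  }
}

object ConnectionRequestCapDemo extends App {
  val cap = new ConnectionRequestCap(maxRequests = 3)
  // The fourth request on the same connection is refused, whether it came from a
  // flooding bot or from someone slowly scrolling their timeline.
  (1 to 5).foreach(i => println(s"request $i allowed: ${cap.onRequest()}"))
}
```

A patient, legitimate client gets cut off just as surely as a flood does, which is exactly the problem the per-connection rate limiting described next was meant to solve.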
So we looked at the rate limiting options that were available in Envoy, and initially we determined that none of them, neither local rate limiting nor global rate limiting, was really a good fit for what we wanted. The reason is that they both assume a fair playing field, or a lack of malicious users. With local rate limiting and a token bucket, one user could connect and instantly drain that bucket of all its tokens, thereby denying access to every legitimate user who lands on that instance. And global rate limiting would be even worse at the edge. What we wanted was a way to rate limit users per connection, without denying access to other legitimate users who happen to be connected through that same Envoy instance. So we implemented, and contributed back to Envoy (it's available now), a per-connection rate limiting filter. This allows us to enforce rate limit quotas on a per-connection basis, and we can set it per virtual host: you get X tokens per minute, and if you exhaust them we start returning 429s or redirect you somewhere else. That ended up being a huge win for us. As soon as it went out we saw a 20% improvement in TFE tail latency, TFE being essentially the service Envoy talks to, in our case our edge API server, just because of the much lower number of requests it had to process.

That was great, but there's still an easy way to work around it, which is to simply disconnect and reconnect to a different edge instance, thereby getting a fresh batch of tokens so you can keep spamming Twitter. So the second measure we implemented in Envoy was TLS fingerprinting. Again, Envoy's TLS inspector filter is very extensible, and we found it quite easy to implement this. It's based on some research out of Salesforce. The notion is that when you connect and negotiate TLS, the spec is sufficiently loose and there are enough options that different clients will send parameters or elliptic curves in a slightly different order. We can take all that information and generate a single hash representing the way you've negotiated TLS (there's a small sketch of that construction below). That hash turns out to be consistent: a particular version of a browser will have a particular hash, or a particular collection of Python libraries will have a particular hash. All of that is fed into a live pipeline where we can look at the connection hashes flowing through the system and flag pretty quickly a botnet that's suddenly attacking Twitter with a TLS fingerprint we've never seen before. So that's been another huge win.

That takes us to today, and the next question is: what's next? Well, Twitter just announced a deal with AWS, so one of our plans is to open another two data centers, but cloud data centers rather than Twitter-owned ones. One of the things we're doing is exploring ideas around running Envoy inside cloud PoPs that we can spin up on demand in geographically advantageous areas, where we know some large local event is going to generate a huge spike in traffic. Then we don't have the overhead of maintaining all those data centers and planning for them in advance; we can spin them up on demand.
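Circling back to the TLS fingerprinting for a moment, here's a minimal JA3-style sketch of that hash construction, following the Salesforce JA3 research. This is a conceptual illustration rather than our actual extension of Envoy's C++ TLS inspector filter, and the ClientHello values below are made up; the point is simply that the fields a client offers, in the order it offers them, are distinctive enough to hash into a stable fingerprint.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// JA3-style TLS client fingerprinting: join the ClientHello fields into a string
// and hash it. All values here are illustrative, not captured from a real client.
object Ja3Sketch extends App {
  // Fields observed in the ClientHello, as decimal code points, in the order sent.
  final case class ClientHello(
      tlsVersion: Int,
      cipherSuites: Seq[Int],
      extensions: Seq[Int],
      ellipticCurves: Seq[Int],
      pointFormats: Seq[Int])

  // JA3 string: fields joined by commas, values within a field joined by dashes.
  def ja3String(h: ClientHello): String =
    Seq(
      h.tlsVersion.toString,
      h.cipherSuites.mkString("-"),
      h.extensions.mkString("-"),
      h.ellipticCurves.mkString("-"),
      h.pointFormats.mkString("-")
    ).mkString(",")

  // MD5 of the JA3 string, rendered as lowercase hex, is the fingerprint.
  def ja3Hash(h: ClientHello): String = {
    val md5 = MessageDigest.getInstance("MD5")
    md5.digest(ja3String(h).getBytes(StandardCharsets.UTF_8))
      .map("%02x".format(_)).mkString
  }

  // Two clients that differ only in extension order hash differently, which is what
  // lets a live pipeline flag a fingerprint it has never seen before.
  val browserLike = ClientHello(771, Seq(4865, 4866, 49195), Seq(0, 23, 65281, 10, 11), Seq(29, 23, 24), Seq(0))
  val scriptLike  = browserLike.copy(extensions = Seq(10, 11, 0, 23, 65281))

  println(s"browser-like client: ${ja3Hash(browserLike)}")
  println(s"script-like client:  ${ja3Hash(scriptLike)}")
}
```

Each resulting hash is what gets fed into the live pipeline mentioned above.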
We're also broadly working on pushing more fine-grained routing logic to the edge of Twitter's infrastructure. Things like user stickiness: essentially routing based on properties of the user who's connected. I know companies like Facebook have explored this area, routing based on social connection graphs, and you can actually achieve much more efficient use of resources by consistently directing certain users to a certain data center. That's something we're building now, based on some consistent hashing utilities that are already in Envoy. And beyond that, routing based on a finer-grained notion of the resource you're looking for: not just blindly routing between data centers, but looking at which particular backend service you want and the instantaneous health of that service, and then routing to a data center where that service is present and healthy.

In the longer term, we have a lot of ideas we want to explore with Envoy. Twitter does a lot of caching in the data center but not a lot of caching at the edge, so that's a huge opportunity. We'd like to explore Envoy's HTTP caching filter so we can cache things at the edge, both materialized views of API objects and things like auth tokens and pre-computed API responses. If we know you always log in at a particular edge location, we can compute your timeline, push it out to that location, and never have to go back to a data center to retrieve that information. That's something we've already prototyped and are looking to expand on. The other interesting conversations we're having are around Envoy Mobile. It's really attractive to have your edge be Envoy and also have the other side of that connection be Envoy: you can deploy protocol changes and the like in lockstep, and it unifies observability. Typically we've had totally different observability stacks for mobile and for backend services; this would unify all that and replace some of the stuff we've built in-house. We have a mechanism for beacons to be sent out to different potential connection points into Twitter's network to allow clients to choose intelligently, but there's some really cool stuff around multi-dimensional connection management built into Envoy Mobile. I'll skip over this given that I'm out of time and have one minute to take questions, but thank you.

Thank you. Any questions? Raise your hand. Anyone? Okay, any remote questions? Nope, thank you. All right, well, I'll be on the Slack, and I would be remiss if I didn't say we're hiring, so if this sort of thing is interesting to you, feel free to reach out to me on Slack or Twitter. Thank you. Thank you.