Hi, everyone. Thanks so much for coming to this talk. I'm going to be talking a little about how and why Yelp built a smart HTTP ingress proxy. Before we start, I just wanted to introduce myself real quickly. My name is Chris Keel. I'm a software engineer at Yelp. My main focus is on using infrastructure to make the lives of our full-stack engineers easier, and today I'm going to talk about one of our most recent projects to do just that.

Before we get into it, a quick introduction to what Yelp is in case you're not familiar with it. Yelp is all about connecting users with great local businesses, whether that's restaurants, delivery, home services, or more. This is what the Yelp homepage looks like right now. We're probably most widely known for our business listings, where you can rate and review restaurants and other businesses. But I'm not really here to talk about Yelp the product, but instead how we use our infrastructure to support the people building the product.

The title of this talk is How and Why Yelp Built a Smart HTTP Proxy, and that's ultimately what I'm going to focus on. But before we jump into that, a quick overview of what you can expect from this talk. To start, I'm going to briefly discuss what Yelp's backend looks like and how we've adopted a service-oriented architecture over the past few years. As part of that discussion, I'll cover some of the challenges we've had with scaling to handle the large number of services. And finally, we'll go over how those challenges led us to implement a new HTTP proxy for all of our incoming web requests, which we call the Routing Service. We'll talk a lot about how we came up with the architecture for this proxy and a little bit about its implementation.

To start, let's go through the exercise of what happens when you type yelp.com into your browser and hit Enter. We're going to start on the left here with one of our users at their laptop, typing yelp.com and hitting Enter. The first piece of infrastructure that will see this request is our third-party CDNs, which have points of presence around the world. After that, the CDNs will route the traffic to Yelp's load balancers. This is the first piece of infrastructure that's actually controlled by Yelp. And finally, those load balancers will route the request to our Python web application, which generates the response we send back to the user. Pretty simple, right?

This is actually what it looked like about five years ago. It turned out that as we grew, this one monolithic Python web application was getting much too large. With a large number of engineers working in it and the code size growing, it became very difficult to work on and deploy this service: it was too difficult to upgrade things, we had flaky tests that were difficult to work around, and there were lots of different scaling challenges. Instead, what we do now is have many different web applications, each responsible for a different part of the site. So for example, we might have the homepage service, which is responsible for serving the home page; the biz page service that serves the business pages; the search service for search; et cetera. The way this works is that our load balancer looks at each request it sees, decides which service is the appropriate one to serve that page, and sends it to the correct one based on that.
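To make that concrete, here's a sketch of the kind of path-based routing a load balancer layer might do. This is a hypothetical nginx-style config with made-up upstream names, not Yelp's actual setup:

```nginx
# Hypothetical path-based routing at the load balancer layer;
# upstream names are illustrative.
location /search {
    proxy_pass http://search_service;
}
location /biz/ {
    proxy_pass http://biz_page_service;
}
location / {
    proxy_pass http://homepage_service;  # default: everything else
}
```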
Because each of these services is serving traffic directly from the internet now, they have a lot of new responsibilities. For example: bot detection and rate limiting; user authentication, where we figure out which user a request belongs to; reporting metrics, both our business metrics and operational metrics like request timings and error rates; and lots more.

This leads to some problems, though. The first is that our microservices aren't very micro anymore, and that's starting to become a big maintenance burden on our feature developers. Ideally, we want our feature developers focusing on building the product, not spending their time on this boilerplate code that we need everywhere. Additionally, keeping those behaviors up to date across services is becoming really hard due to code drift. If we have 20 different services, it's very difficult to upgrade one of these features, because you now need to go update 20 services, roll out those changes, and keep everything in sync.

So we started looking at what we could do about all these shared behaviors and whether there was a way we could simplify things. We had two main strategies for the shared features. The first was to see whether we could move them to an earlier layer. From the diagram before, that would be the CDNs or the load balancers, and that works for some things. For example, our request blocking, where we just have a list of IP addresses we don't want to serve traffic to, is easy to move up a layer. But some things don't really fit in those layers, so the second strategy was to package those up as libraries and use them in every service instead. That is an improvement: it makes the code easier to share, and you don't have to copy-paste it around. But it still means you have the code-drift problem, and it's difficult to keep things in sync.

And this still isn't great, because these strategies don't really work for all of our features. A lot of the time you really want exactly one implementation of something. A common example is our business metrics. We really want to be able to trust that those metrics are accurate, and if we have 20 different services writing business metrics, it's really hard to know that not one of those 20 is making some kind of mistake and messing up our metrics for everyone. Sometimes you have things you want to do at the edge, like bot detection and rate limiting. Ideally, we want to block requests as early as possible if we know we're going to end up blocking them anyway; we don't want to spend a lot of time processing those requests. Additionally, sometimes it's difficult and error-prone to implement things in every service. If you think of something like a CAPTCHA, which we use when we're rate limiting someone we think might be a bot, we don't really want to do that in every service. It requires rendering the CAPTCHA page, looking at the response, validating it, and finally recording the success or failure. That's pretty heavyweight to ask every service to do. And on top of that, we still have these very heavyweight services that we don't like.

So we want one place to implement these things, but it has to be fairly early in the request flow, and it has to be fairly smart. That's where we started thinking about whether we could introduce a new service, which we called the routing service, in the middle of the request flow.
So we'd have the routing service in the middle, doing some kind of magic for all those features we wanted before proxying a request on to the service. Essentially, you can think of it as one more layer of HTTP load balancers, but a smarter one. We brainstormed a little about the features that could fit, and there were a lot of ideas: bot detection and rate limiting, metrics and logging, advanced request routing. We had all these great ideas for what could potentially go in there, but we didn't really know whether this was a good idea yet. It was just something worth exploring.

I want to stop the talk right there, because I imagine a lot of people are thinking: why would a load balancer be doing this? Isn't that a big violation of responsibilities? These are application concerns, business concerns; they don't really have anything to do with balancing traffic at all. I think that's a really important question, and it's one I struggled with a lot myself to come to terms with, both on my own and in talking with others and hearing their concerns. Ultimately, I had to step back and think about what the purpose of infrastructure actually is in the context of a large company trying to ship a product. And the truth is that no one really cares how cool or how great Yelp's infrastructure is if the company's losing money or if our users hate our product. The infrastructure is really just a means to an end. I don't say that to downplay the importance of infrastructure; it's something I'm very passionate about and I know is super critical. I say it to emphasize that sometimes we can blur the layers a little in the interest of the big picture. We're not trying to build the next big popular generic HTTP proxy and then convince you to use it; we're essentially trying to solve our own problems here. And this isn't an excuse for poor architecture or sloppy design. It's more of a guiding principle for what kinds of things we can build. In particular, something we believe pretty strongly is that we can move some of the smarts and some of the complexity into the infrastructure, so that we make the jobs of the people actually building the product easier by handling that complexity for them.

So let's talk about architecture. At this point, we were thinking the routing service might be worth doing, but we weren't really convinced; we were still debating it. We still thought of it as this magic box in the middle of our request flow, with no idea how it would actually work. This is where we started thinking about how we would architect it, and we did that by looking at a few of the main problems we anticipated needing to solve and coming up with designs to address them.

The first challenge we looked at is that we have a lot of different types of web requests: our desktop main site at yelp.com, the mobile site, a bunch of different APIs, both internal ones for our mobile apps and public APIs, and a bunch of different microsites like company blogs that are sort of one-offs. All of these requests need different combinations of features and behaviors. To give you a concrete example, we couldn't show CAPTCHAs on our API site, because the API clients are expecting to get a JSON response with their data.
If we start sending them a full HTML response with a CAPTCHA in it, they're not going to have any idea what to do with that response. To add to the challenge, for legacy reasons (this stuff grew organically over many years), there isn't always a clean divide between these different types of requests. They can share the same domain names, they can have the same path prefixes, and the logic for differentiating them can be pretty complex.

To address this, the first abstraction we came up with was that each request would be categorized into exactly one site. In the updated diagram here, you can see a request coming in on the left, and the very first thing we do with it is run this determine-site function. After that, we split off into parallel chains, one per site, and each site chain is where the real processing happens. The key here is that each site is handled independently from this point on, as entirely parallel chains. A site is kind of like a virtual host in something like Apache or nginx, but it's not restricted to a specific set of domains. Instead, the selection logic can be written in code, and it can be more complicated for our legacy sites if it needs to be.

The second challenge we looked at is: now that we know our requests are going to be separated into sites, how do we choose which behaviors to apply on a per-site basis? We know that different sites will need different features, and we want a way to declare which features apply to a given site in a sustainable manner. It's easy to imagine that if we just jumped in and started writing code, we might end up with a giant function that's something like: if the request is on site A or site D, then do this; else, if it's on site C, then do that. Something like that would make it very hard to trace the request flow through the service and understand what's actually happening to a request. Ideally, we want our behaviors to have a clear, modular separation and to be configured in a way that's easy to understand and easy to follow.

The abstraction we adopted here was the middleware pattern for our features. What this means is that each site consists of a chain of middlewares declared in a specific, defined order, and the request goes from one feature to the next with that explicit ordering. This makes it easy to change which features are enabled on each site, to reorder the features, and to reuse them without spaghetti code. Each middleware sees the request both at ingress and then again, in reverse order, at egress. That allows you to do things like block the request at ingress (the way that's implemented is you just don't forward it down the chain), or, at egress, change the response: you can manipulate the body or add a response header. I'll show a small sketch of how these pieces fit together once we've covered all three challenges.

I want to talk a little about why this middleware pattern works so well for us. The main reason I like it is that it helps avoid the spaghetti code we talked about: we don't have complicated logic to apply features, and it's very easy to understand what's happening as a request flows through the service. Another big thing is that it's easy to reuse these features across sites; each site just declares an ordered list of middlewares, along with any special config needed for each feature. And then there's the fact that we get clear code boundaries.
We have a very explicit entry point and exit point for each feature, and that makes it easy to isolate things like timings and errors: we can see exactly how long a feature is taking and whether any errors are coming from it. It also means we can assign clear code ownership for a feature. Because we have lots of different responsibilities, we have experts spread across our teams for the different features in the routing service, and being able to assign clear code ownership is a big benefit for us. To give you an example, at the bottom here, I went to our metrics dashboard a few days ago and pulled the timing for one of our features in the routing service. You can see that this pattern makes it really easy to see, here, our Zipkin feature and how much time it's taking on average.

So at this point, we've got the majority of our service architecture, and we just have this one bit of magic left at the end, where we choose which service to route to. That's our third challenge: traffic routing. Traffic routing is a really complex problem, because there are a lot of different requirements. In our case at Yelp, our service mesh takes care of the actual proxying of a request from one service to another, but it's the job of our new routing service to determine which service we even want to talk to. We have logical requirements here: if you think of the example I gave before, where we might have a search service serving the search page and a biz service serving the business page, we need to look at the request and figure out which service it belongs to based on the path. But we also have a lot of operational requirements. For example, we might want to route to different versions during code deploys; we might want to route to a different data center based on things like whether this is a read-only or read-write request; and we might want to send a certain percentage of traffic one place and the rest somewhere else.

To be honest with you, we kind of cheated on this abstraction. I don't really have a great solution here, because there are just too many potential requirements and constraints to have a very generic, reusable piece of infrastructure. Instead, we simply settled on a function interface. Each site is responsible for declaring this simple interface: service-for-request, as a function. It takes in a request object and returns the name of the service it wants to talk to. And while this is a bit of a cop-out, we have pre-built functionality that handles the vast majority of our common use cases. If you're doing something like path-based routing, routing to services based on the version, or data center routing, that's something we've already built in. So it's really only the sites that need complex behavior that have to go out of their way to implement this themselves. And the nice thing is that even though the complicated sites need to implement their own logic, at least it slots into our architecture in a way that's consistent and easy to understand, and we don't really have any snowflake sites.

So at this point, we were pretty happy with the architecture we'd come up with, and we started thinking about how we would actually implement it. This was the next big point in deciding whether this was really something worth doing.
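Before getting into implementation options, here's the promised sketch of how the three abstractions fit together: the determine-site step, a site's middleware chain, and the service-for-request interface. This is a minimal illustration in plain Lua; all of the names, sites, and middlewares here are assumptions on my part, not Yelp's actual code.

```lua
-- Minimal sketch of the routing service's core flow; every name here
-- is illustrative.

-- Challenge 1: categorize each request into exactly one site. Unlike a
-- virtual host, this can be arbitrary code, not just a domain match.
local function determine_site(request)
    if request.host == "api.example.com" then
        return "public_api"
    end
    return "main_site"
end

-- An example middleware: it sees the request at ingress and the
-- response again at egress. Blocking a request means not calling
-- `continue`.
local function debug_header_middleware(request, continue)
    local response = continue(request)       -- ingress: pass down the chain
    response.headers["X-Debug"] = "routed"   -- egress: mutate the response
    return response
end

-- Challenge 3: each site declares service_for_request, taking a
-- request and returning the name of the backend service to proxy to.
local function main_site_service_for_request(request)
    if request.path:find("^/search") then return "search_service" end
    if request.path:find("^/biz/") then return "biz_page_service" end
    return "homepage_service"
end

-- Challenge 2: each site is an ordered chain of middlewares plus its
-- routing function.
local sites = {
    main_site  = {middlewares = {debug_header_middleware},
                  service_for_request = main_site_service_for_request},
    public_api = {middlewares = {},  -- e.g. no CAPTCHA middleware here
                  service_for_request = function() return "api_service" end},
}

local function proxy(request, service)
    -- Stand-in for the real proxying through the service mesh.
    return {status = 200, headers = {}, body = "response from " .. service}
end

local function handle(request)
    local site = sites[determine_site(request)]
    local function run(i, req)
        local mw = site.middlewares[i]
        if mw == nil then
            return proxy(req, site.service_for_request(req))
        end
        return mw(req, function(r) return run(i + 1, r) end)
    end
    return run(1, request)
end

-- Example: a request for a business page on the main site.
local resp = handle({host = "www.example.com", path = "/biz/some-cafe"})
print(resp.status, resp.body, resp.headers["X-Debug"])
```

The nice property is that blocking, response manipulation, and routing all slot into the same chain structure, which is what keeps the per-site configuration declarative rather than a pile of conditionals.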
So the first option we considered was writing a new HTTP proxy from scratch in Python or Java. I think the reason this was so appealing to us is that we already have a lot of expertise in these languages, and we also have a lot of code written in them, so it would be nice if we could reuse our existing code and libraries and not have to reimplement them in some new language. The problem here is the performance of these languages. Even with Java, which is a very fast language, it's going to be really hard to match the performance of the top HTTP proxies, something like nginx, which have years and years of optimizations built into them that we would have to compete with. Additionally, we were worried about all the edge cases involved in handling what's essentially raw traffic from the internet. We get a lot of what we call junk traffic from poorly written user agents or bots; they don't necessarily follow all the RFCs. Something like nginx already knows how to deal with this random, not-quite-following-the-spec traffic, and we didn't really want to spend our time reinventing that.

The second option we considered was Envoy. Envoy is a newer service proxy that handles both service mesh and edge proxy use cases. The way it works is that you implement your features as either C++ filters or, potentially, Lua filters. This is a really cool option because of the performance, and because it's a very modern proxy with a lot of flexibility. The main problems for us: first, the limited language choice. It's pretty much just C++, or Lua with a lot of restrictions on it, which means we wouldn't be able to reuse our existing code. But the bigger thing for us, especially at the time, was that it wasn't a super mature project. It hadn't been adopted very widely, and we were pretty concerned about that for something we were going to build into such an important piece of our infrastructure. I will say Envoy is much more mature today; in fact, we're using it internally for our service mesh now. And it actually follows some similar design principles; for example, its filters follow the middleware pattern. So if we were redoing this today, I don't think the maturity would be a concern, and Envoy would definitely be a top candidate to consider.

Another interesting option is what's called edge computing. Edge computing is a feature offered by a lot of modern CDNs: Cloudflare Workers is one example, and Fastly and CloudFront have their own versions. The way it works is that you upload a code bundle to them, and they run that code on their own machines as they're serving your traffic. You can do arbitrary manipulations on the request, you can interact with routing, and you could implement a lot of the features we talked about. One of the big problems here was, again, especially at the time, these offerings were relatively immature. There wasn't a lot of prior art, we weren't doing anything like this internally, and so there were a lot of unknowns about how dependable and reliable this would be. Another issue is that this ties us pretty strongly to a single CDN provider, and that's something we're not crazy about: we actually run our site with two independent CDNs that we can swap between during incidents if we have to. I think the portability of edge computing is still being developed.
It's not really there yet, so that was another concern for us.

The final option we considered was a project called OpenResty. OpenResty is a distribution of nginx plus Lua plus some other modules. Essentially, you deploy nginx itself, which is a dependable, well-understood proxy, but you embed custom Lua code that you've written. You get the performance and the dependability of nginx, but you get to run your own code. Again, the main downside here is the restricted language: Lua is a fine language, but we don't have our code in it, so we'd have to rewrite a lot of stuff.

To give you a more concrete idea of what OpenResty actually looks like, I put a simple hello world up here (there's a reconstruction of it just after this part of the talk). At the top is a pretty standard nginx config file, except in the location block, where you would normally specify what to proxy to or a directory to serve files from, we have this new directive called content_by_lua_file. That just points to the path of a Lua file. At the bottom, I've got an example of that file; it's the script that actually generates the response sent to the user. This one just sends a 200 OK and then "hello world". It's a very simple example, but it can be much more complex: you can import other Lua files, you can make network calls, you can do pretty much anything here that you could do from any other language.

So we ended up making a table comparing our options that looked something like this. We implemented prototypes for our most promising options to get some idea of how they would work and look in practice and what the performance would be like. No option really satisfied everything we wanted, but we decided we really wanted to prioritize the maturity and battle-testedness of the solution, especially because this was going to be such a critical part of our infrastructure; that was our top priority. For that reason, we ended up going with OpenResty. We really liked that it's based on nginx, a proxy that we have so much experience with and that we really trust, and we knew this would give us a lot of confidence going forward.

So we did it. We put a few people on the project to develop a first version of our routing service, and we wrote that first version in just a few weeks. Initially, it was just a couple hundred lines of Lua, and all it did was serve as a pretty much transparent proxy: it would just add a response header as a request flowed through it. We then rolled it out in front of all of our traffic slowly. This was really nice, because it meant that if we discovered any issues, we could easily take it out of the request flow and roll back that way. That's something we can't do anymore: now that we've put all these critical features into it, we can't actually serve our site without it. Since then, we've moved a lot of features into it, and it's proven to be super stable. It's also incredibly efficient: we spend almost no money hosting the routing service, even though it's serving all of our traffic. And you can actually check it out yourself. If you go to pretty much any Yelp page and look at the response headers, you'll still see this X-Routing-Service response header, with a little bit of debug information in it, that dates all the way back to when we were just testing it out as a transparent proxy.
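Here's a reconstruction of the hello-world example from the slide described earlier. The Lua file path is my own illustrative choice; content_by_lua_file and the ngx API are real OpenResty features.

```nginx
# Roughly the nginx config from the hello-world slide; instead of
# proxying or serving files, the location block runs a Lua script.
location / {
    content_by_lua_file /etc/nginx/lua/hello.lua;
}
```

```lua
-- /etc/nginx/lua/hello.lua: generates the response sent to the user.
ngx.status = ngx.HTTP_OK   -- 200 OK
ngx.say("hello world")     -- write the response body
```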
Just some specific reflections on what we learned from implementing this new critical infrastructure on a new technology stack. One thing is that we definitely spent more time during initial development. We really wanted to make sure we were treating this like a first-class piece of infrastructure, and that meant spending time on things like metrics, logging, and experimentation. These are things we already had good implementations for outside the Lua ecosystem, but because we were using a new language, we had to spend some time reimplementing them. The nice thing was that a lot of the time, even when we had to reimplement something, we were able to consolidate it into the routing service. We didn't necessarily end up maintaining two versions of things forever; once we moved something into the routing service, we could often retire the old implementation and not worry about it going forward.

And a few thoughts on Lua specifically as a language. First of all, it's a very simple, easy-to-pick-up language. None of us had really worked with it in a serious way before, and within a couple of days we were writing it very comfortably.

You do need to watch out for some sharp edges with Lua, though. For example, everything is global by default if you don't declare it otherwise. A lot of operations fail silently. It doesn't really have good error or exception handling. And a lot of functions in the standard library have caveats that you won't expect unless you've read about them in the documentation. One way you can work around this is by investing in code quality tooling. Linters in particular (luacheck is something we used a lot) go a long way toward protecting against some of those sharp edges; they'll catch things like accidental globals very reliably.

I also strongly recommend that you treat it as serious code. Lua is often used as an embedded language where you're only writing a few lines at a time, but for us, we spent a lot of time on the architecture here. We didn't just treat it as a scripting language where we could wing it as we went. We designed it with being able to write effective tests in mind, for example, and that really went a long way toward making sure this service is designed and implemented in a way that's sustainable going forward.

And finally, one thing we definitely learned the hard way is that Lua is not a batteries-included language. Its standard library is very minimal; it doesn't have a lot of things you'd expect from a language like Python or Java. You'll find yourself copying stuff from Stack Overflow constantly, for things like merging two tables or splitting a string, and you'll end up with five different implementations copied from five different Stack Overflow answers across your code base, all with different interfaces and different caveats. I definitely recommend not doing that. Instead, there are libraries for Lua that are essentially supplemental standard libraries; they provide these common functions for things like working with strings and tables. I recommend picking one of those up at the start of your project, adopting it as your one standard way to do these things, and using it consistently across your code base.
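To illustrate a couple of those sharp edges, here's a small Lua sketch. Penlight is shown as one example of a supplemental standard library; the talk doesn't name the library Yelp chose, so treat that part as illustrative.

```lua
-- Sharp edge 1: variables are global unless declared 'local'.
function bump()
    counter = (counter or 0) + 1  -- oops: no 'local', so 'counter' leaks
    return counter                -- into the global table; luacheck flags this
end

-- Sharp edge 2: no exceptions; errors are caught explicitly with pcall.
local ok, err = pcall(function() return nil + 1 end)
print(ok, err)  -- false, "attempt to perform arithmetic on a nil value"

-- The standard library is minimal: there's no built-in string split or
-- table merge. A supplemental library (Penlight here, as one example)
-- gives you a single consistent implementation of each:
local stringx = require("pl.stringx")
local tablex = require("pl.tablex")
print(stringx.split("a,b,c", ","))                  -- prints {a,b,c}
local merged = tablex.merge({a = 1}, {b = 2}, true) -- {a = 1, b = 2}
```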
A couple of final reflections on the development and rollout. Starting with the architecture, rather than just jumping in and writing code, was really helpful for us, because it guided our planning and prototyping; we didn't need to reinvent the project halfway through as things got thorny. Betting on OpenResty paid off: I think it's a really great project with a great future, and it's worked super reliably for us. Launching as a transparent proxy first really enabled us to solve all the problems with just the proxy itself. In particular, we deployed it as a regular service on our service mesh and discovered a couple of edge cases that hadn't shown up with other services, because this service was so high in requests per second that the others had never hit those problems, and it took us a little extra time to fix them. And finally, spending extra time on experimentation was a big deal for us. We have the ability to launch new features very carefully: we can launch first to internal traffic, or launch to consistent percentages of traffic. We can also target specific countries, which makes for a very nice way to compare the timings and errors from country A and country B, for example, and make sure we're launching new features safely.

That's all I've got. I'm recording this about two weeks before you're seeing it, but if everything's gone right, we should have about five minutes left for questions. Before I answer those: I know we're short on time and we have limited opportunities for talking at this virtual conference, but I'd still really love to hear from you. Please reach out via email at cqo@yelp.com or on Twitter @imcqo. I'd really love to discuss any of this with you. Thank you so much for attending.

All right. Hi, everyone. Thank you so much for attending and watching the talk. I was super excited to be able to give it, and I'm looking forward to hearing from you all. If you do have questions, please feel free to enter those. I don't see any yet, but I'm happy to answer them and also to chat over Slack. I had a couple of people say, not questions, but thanks for the talk. Thanks for the comments; I appreciate that.

Pamela asked: you mentioned architecture before code; do you have any tool recommendations for doing so? I'm not totally sure I understand the question, but I think the thing for us was that this was a big unknown. We weren't entirely sure we wanted to commit to building this, especially because the backend infrastructure people were very nervous when I started talking about it, so we really needed to make sure we were convincing them that this was a good idea. In particular, what we did was a lot of talking with them: understanding their concerns, figuring out how we were going to build this, planning that out in a document, getting sign-off from all the stakeholders, and then thinking through the challenges we anticipated, how we would address them, and how we would actually design this service, all before we started writing any code. So I don't have a specific tool recommendation, because it was more of a process that we followed.
But yeah, we were just very thorough with the planning on this project, maybe more so than was necessary, but that enabled us to sell it to everyone, because it was kind of a controversial idea at the time. Thank you for the question.

Yeah, Pamela says she's faced some similar challenges with getting a good design first. Yeah, I can definitely commiserate with that. That's something we've had to build up in our organization. And I think just going deep on the planning, especially for something like this that's going to be such a critical piece of infrastructure, is super important. I don't think you'll regret spending time on planning before you jump into implementation.

Awesome. I believe we're coming up on time here. But yeah, they're going to post, I believe, a message in the conference letting you know which Slack channel we're moving to. So please feel free to jump over there and ask any questions, or just ping me on Slack directly. I'd love to talk with you all there. Thanks so much for attending.