Hello, everyone. Today I'm excited to share with you an intriguing success story: the migration of our web enrichment platform to Envoy. This is the story of how we at Spotify revamped our web proxy layer to use Envoy. We'll dive into some technical details, share the lessons we learned, and give you the bigger picture of how this looked for us. We took on this talk with the purpose of presenting to our peers a practical application of Envoy. Our intention is to share our experiences, the lessons we learned, and the challenges we encountered while adopting Envoy. We hope that by doing so, we can start some fruitful discussions about the future and the potential improvements we can all make together.

Today we have a concise yet impactful agenda. We're going to start by presenting the Spotify perimeter and giving an overview of our operational scope. Then we'll zoom in on the web enrichment platform and tell you how this key component enriches our ecosystem. We'll then explore the motivation behind our migration and share some technical insights into the actual migration work. To conclude, we'll wrap up the migration process and look ahead at our vision for our evolving infrastructure. So with no further ado, let's embark on this journey through Spotify's technological evolution.

First and foremost, allow me to introduce ourselves. My name is Sabrina Zotti. I'm a software engineer at Spotify, based in Milan, Italy. Today I have the privilege to present alongside Oliver Sol, a staff engineer at Spotify, based in Stockholm, Sweden. Oliver and I both work within the computer networking product area at Spotify, and specifically we are part of the ATC team, with a primary focus on the design and management of Spotify's infrastructure perimeter.
I will now leave the spotlight to Oliver to introduce our work.

Thanks, Sabrina. We'll talk today about our web enrichment platform, but before we do that, let's zoom out and talk a little bit about Spotify's overall perimeter, which provides some context for our presentation. Spotify runs backend services on Google Cloud Platform, and traffic from Spotify clients (mobile apps, TVs, cars, refrigerators, what have you) is routed through a Google Cloud load balancer. Just behind that load balancer, we have our Envoy-based edge proxy layer in all five of our GCP regions. We presented our edge proxy migration story here at EnvoyCon in 2019, and we'd be happy to take questions about that later or at any time during the day. Just a side note, since many of you may be curious: as an audio service, our clients of course need the audio files themselves. Those are streamed from a CDN, and we're not going to talk about them today. Edge proxy, however, handles between 6 and 10 million RPS on any given day. Here's a recent graph of a week's traffic by GCP region. However, the vast majority of that traffic is service traffic, like when your Spotify client needs to load a new playlist or you add a track to your Liked Songs. Only about 1% of our traffic is web traffic, and you can imagine that web traffic has quite different patterns and more specific needs in the proxy layer than our regular service traffic. We route the web traffic through edge proxy and then to our web enrichment platform, which has many web-specific capabilities that our web service owners rely on. That's what we'll dive into today, and Sabrina's going to tell you a little more about it.

Thank you, Oliver. Before zooming in on the web enrichment platform, I'd like to start by mentioning an interesting fact. Three years ago, when the web enrichment platform was created, it was built by Team Nibbler.
The primary goal of that project was to dismantle the monolithic system supporting spotify.com into separate web services, each developed by a different team. To make a long story short, the project resulted in the development of the platform that we are now going to present, which allows the segregation of client-facing applications from common business-logic concerns, as well as breaking up the monolith into multiple leaner, simpler applications. This aspect of the platform caught a lot of attention and became quite popular within Spotify. Fast forward to today, and the web enrichment platform is now the standard for web traffic.

So, about the web enrichment platform: you might wonder what this is exactly. It's an infrastructure designed to handle the routing and the enrichment of web requests. From an architectural perspective, it consists of three layers: a configuration layer, a routing layer, and an enrichment layer. The takeaway so far is that all web requests directed to Spotify pass through the web enrichment platform.

But let's backtrack for a moment. Oliver mentioned the concept of edge proxy, right? If you tuned in to our previous presentation in 2019, edge proxy serves as the initial point of entry for all requests that hit Spotify, and that still stands. However, in the case of web services, edge proxy directs the request on to the web enrichment platform. Within the routing layer, a request can be configured to either directly access the web service, meaning it will only use the routing component of the platform, or to be enhanced with enrichments; more on that on the next slide. The routing layer is now based on Envoy and directs each request to the appropriate web service.
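To make the routing layer concrete, here is a minimal, hypothetical sketch of what an Envoy route for a web service might look like. The domain, path, and cluster names are invented for illustration; they are not Spotify's actual configuration.

```yaml
# Hypothetical routing-layer route: match a path prefix and
# forward to the owning web service's cluster. All names invented.
route_config:
  name: web_routes
  virtual_hosts:
    - name: web
      domains: ["www.example.com"]
      routes:
        - match:
            prefix: "/concerts"
          route:
            cluster: concerts-web-service
```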
The enrichment layer is a feature that allows developers to incorporate business logic into their requests, such as cookie handling, market analysis, or even compliance with standards like CSRF or CSP. But what does the enrichment layer actually look like in practice? Once a request reaches the web enrichment platform, during the routing phase the platform verifies whether the request should be enhanced with some business logic or not. If yes, the router initiates a sub-request to Concierge. Concierge is another component of the platform and serves as the enrichment orchestrator.

To ensure a flexible infrastructure, the enrichment layer operates on a plugin strategy through Concierge. Any member of our organization can develop an enricher, which is essentially a library that performs a specific function. This approach allows developers to extract business logic that may not really be related to their application's scope, which makes their work more efficient and their applications leaner, while also providing the rest of the organization with reusable functionality. We have an actual directory of enrichers, kind of like a marketplace, where people can browse and search for an enricher that fits their requirements. If such an enricher doesn't exist, they can develop it and enable it with the assistance of Concierge and the web enrichment platform. It's pretty convenient, right? If developers at Spotify are already using the platform for routing, adding this kind of logic here is not a bad deal.

Currently, we support four types of enrichment within our system. The first one is content by enricher. In this case, the response will include a custom status code and optionally a body; the request will not pass through any web service. The second one is content by redirect.
With this option, the response will be a redirect instead of hitting the web service. Then we have content by proxy: this applies the instructions at the routing layer and directs the request to the appropriate web service; the web service processes the request, and the response is returned onwards to the client. And then we have the final one, client response. If Concierge responds with a client response, its instructions are applied on the response path and are only visible to the client, not to the web service. One important thing to note is that this process involves a mutual exchange of information; for instance, when using content by redirect, the request will pass through the web enrichment platform multiple times. Spoiler: if this sounds familiar to you, you might be onto something. Oliver will now tell you more about the behind-the-scenes of the configuration layer.

Thanks, Sabrina. OK, let's talk about the configuration layer, and this is where we have to let you in on a little secret: the web enrichment platform was originally built around Nginx. So this configuration layer may not smell very much like an Envoy control plane, but bear with us. How does it work? Logically, the configuration layer has two responsibilities: transforming our configuration API into Nginx configuration, and performing service discovery of the upstream web services. Architecturally, we need to manage a fleet of proxies that autoscales, and since we don't have a control plane, we need that configuration logic to run alongside each proxy replica. The configuration layer is therefore co-located with each proxy replica, as another process in the Nginx container, and runs service discovery every 30 seconds, dynamically updating the Nginx configuration.
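To make the four enrichment outcomes concrete, here is a rough Python sketch of how a router might act on a Concierge decision. All class, field, and function names here are invented for illustration; this is not the platform's actual code.

```python
# Hypothetical sketch of routing-layer handling of a Concierge decision.
from dataclasses import dataclass, field

@dataclass
class Response:
    status: int
    body: str = ""
    headers: dict = field(default_factory=dict)

@dataclass
class Decision:
    kind: str
    status: int = 200
    body: str = ""
    location: str = ""
    client_headers: dict = field(default_factory=dict)

def forward(request, upstream):
    # Stand-in for proxying the request to the upstream web service.
    return Response(200, body=f"served by {upstream}")

def handle(request, decision):
    if decision.kind == "content_by_enricher":
        # Respond directly with a custom status and optional body;
        # no web service is hit.
        return Response(decision.status, decision.body)
    if decision.kind == "content_by_redirect":
        # Redirect instead of hitting the web service; the follow-up
        # request passes through the platform again.
        return Response(302, headers={"location": decision.location})
    if decision.kind == "content_by_proxy":
        # Apply the routing instructions and forward to the web service.
        return forward(request, "web-service")
    if decision.kind == "client_response":
        # Apply instructions on the response path, visible only to the
        # client, never to the web service.
        resp = forward(request, "web-service")
        resp.headers.update(decision.client_headers)
        return resp
    raise ValueError(f"unknown decision kind: {decision.kind}")
```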
On the other hand, when there's an update to the static configuration, like when a new web service is added, we make a new deployment of the router layer and the configuration is bundled along with it. As we move to Envoy, we use the same approach, supported by Envoy's dynamic configuration mechanism: the configuration layer simply produces Envoy YAML rather than Nginx configuration. During the migration, the configuration layer played a key role in abstracting away the proxy implementation from our users. Ideally, they would have no idea that we were changing out the engine, so to speak.

OK, so we just covered the overall structure of the web enrichment platform, and now I'd like to cover some of our rationale for rebasing the platform on top of Envoy. There are many reasons we chose to migrate to Envoy, but the primary reason was organizational. The team who originally created the web enrichment platform, called Nibbler as Sabrina mentioned, had seen that Envoy was gaining a foothold at Spotify and that edge proxy had benefited greatly from using Envoy. They also saw the potential of merging the web enrichment platform directly into edge proxy, which of course would never happen unless it moved to Envoy. From the technical perspective, Envoy's feature set was roughly aligned with the features we needed to support in the web enrichment platform. The team had done some discovery into Envoy and hadn't found any major feature blockers. Additionally, they'd noted that Envoy has a vibrant open-source community (all of you), which made the team feel confident that we could contribute to Envoy to fill any gaps we might find. The part of the Envoy feature set that was most important to the web enrichment platform was the set of features supporting extensibility: ext_authz, ext_proc, support for Lua, Wasm, the native plugin model, and, as we heard earlier, Go as well.
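A minimal sketch of that co-located configuration process might look like the following: a sidecar loop that runs service discovery and rewrites the file the proxy reads. The discovery source, function names, and file path are invented; only the overall shape (discover upstreams, render Envoy config, write it out every 30 seconds) reflects what the talk describes.

```python
# Hypothetical sketch of the per-replica configuration process.
import json

def discover_upstreams():
    # Stand-in for the 30-second service-discovery query.
    return {"checkout-web": ["10.0.0.5:8080", "10.0.0.6:8080"]}

def render_clusters(upstreams):
    # Turn discovery results into Envoy cluster resources. JSON is
    # valid YAML, so this can be written straight to a watched file.
    resources = []
    for name, endpoints in upstreams.items():
        resources.append({
            "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
            "name": name,
            "type": "STATIC",
            "load_assignment": {
                "cluster_name": name,
                "endpoints": [{
                    "lb_endpoints": [
                        {"endpoint": {"address": {"socket_address": {
                            "address": ep.split(":")[0],
                            "port_value": int(ep.split(":")[1])}}}}
                        for ep in endpoints
                    ],
                }],
            },
        })
    return {"resources": resources}

def refresh(path):
    # One iteration of the loop the sidecar runs every 30 seconds.
    with open(path, "w") as f:
        json.dump(render_clusters(discover_upstreams()), f)
```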
However, it's important to note that although Envoy supports high performance, performance wasn't terribly important to us for this application; extensibility was really the key feature. Anyway, the team decided to go ahead with the decision to use Envoy and started engineering work early last year. Although we don't currently have concrete plans to merge the web enrichment platform into edge proxy, we have joined the teams: Nibbler merged with ATC earlier this year. And now, Sabrina's going to tell you a little bit more about how the engineering work actually went.

Thank you, Oliver. Let's take a look now at how we re-engineered the platform on Envoy. Luckily for us, the transition was quite intuitive, and Envoy's flexibility was very helpful in the process. Many of the changes primarily concerned the configuration and routing layers, with varying degrees of difficulty. We were lucky to have a lot of one-to-one matching between features, but I want to mention a few of the things we needed to change. Starting with upstream handling: first, we needed to slightly change how service discovery works. Envoy's dynamic configuration played a crucial role in this process, as our services need to find each other through Engine Room, the configuration layer. Then we had health checks: for now, we rely on passive health checking through outlier detection to ensure that services are healthy. Another thing was timeouts: we can now configure request timeouts per upstream, and this ensures responsiveness. Then regexes. Regexes are very important for us, because we use them for request processing, so we had to work a bit around limitations of the regex engine in order to keep regex patterns efficient. But thanks to Envoy being open source, we were able to improve this feature to fit our requirements, and this resulted in a PR that we opened, allowing us, and potentially others, to expand its capabilities.
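As a rough illustration of the upstream-handling changes just described (passive health checking via outlier detection, and per-upstream request timeouts), an Envoy sketch might look like this. The cluster name, thresholds, and timeout values are invented examples, not Spotify's settings.

```yaml
# Illustrative only: eject an upstream host after repeated 5xx
# responses, and give this upstream's routes a 5-second timeout.
clusters:
  - name: checkout-web-service
    connect_timeout: 1s
    outlier_detection:
      consecutive_5xx: 5
      base_ejection_time: 30s
      max_ejection_percent: 50

routes:
  - match:
      prefix: "/checkout"
    route:
      cluster: checkout-web-service
      timeout: 5s
```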
These changes addressed an issue we encountered with the regex rewrite as part of a redirect, when the matched path does not cover the entire path. The problem was that only the matched section of the original path, for instance the prefix, was substituted by the regex rewrite, whereas the expected behavior is to replace the entirety of the path. Another thing is that customers now have the ability to determine the appropriate course of action when errors are returned by a vertical. This is a feature we offered in the platform: it can involve intercepting the error and invoking another vertical that displays the error page. As a result, all the verticals within a website can use the same error page without needing to implement it independently. It appears that such functionality is not yet production-ready in Envoy, so we had to depend on a custom Lua filter to fulfill this requirement.

Moving on to integration tasks: we're incorporating Envoy to validate the configuration and prevent unsupported or overly complex regex patterns. We also managed to simplify our system by using built-in Envoy features, reducing the amount of custom code we had to maintain. Features like the service auth filter, the dynamic forward proxy, and the external authorization filter played a significant role in reducing our custom Lua plugins. Additionally, we benefited from some new features as well; for instance, we made great use of the Envoy admin endpoint to support our operational tasks. On another note, as Oliver mentioned before, throughout this process we were pleasantly surprised by the supportive community and the assistance we received. While this project is still in progress, we are very pleased with the progress we've made and enthusiastic about our upcoming plans.

Sabrina just covered many of the successes we had as we implemented the Envoy layer, but of course it wasn't all perfect.
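The redirect regex-rewrite behavior discussed above can be illustrated with a hypothetical route like this one, where a capture group from the matched path is substituted into the redirect target. The paths and patterns are invented for illustration.

```yaml
# Illustrative only: redirect /old/<rest> to /new/<rest> using a
# regex rewrite on the redirect action.
routes:
  - match:
      safe_regex:
        regex: "^/old/(.+)$"
    redirect:
      regex_rewrite:
        pattern:
          regex: "^/old/(.+)$"
        substitution: "/new/\\1"
```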
There were a few places where Envoy wasn't quite so aligned with the features we already offered in the platform. Let's talk about a couple of specific issues we had. Rate limiting is the first example. The web enrichment platform supports rate limiting, and quite a few of our users had enabled the feature. But the core issue here is that rate limiting in Nginx and rate limiting in Envoy differ in some key ways. Specifically, Nginx supports using the client IP address as the rate-limiting key, whereas Envoy does not, unless we use the global rate limiting filter and a separate distributed rate-limiting infrastructure. We didn't want to build out that separate rate-limiting infrastructure, so we decided to look at the problem a bit more holistically. We looked at our rate-limiting users and tried to understand their various use cases, and it turned out that most of them use rate limiting to protect their services from unwanted traffic spikes. But supporting that use case has a few problems. Users don't have a good way to determine a specific rate over which they want to limit their traffic, especially when their traffic varies over the course of a day. And from an operational perspective, we would have to do capacity planning for a rate-limiting infrastructure, which is especially hard when we don't really know what traffic to expect. We'd been down this path before with edge proxy, and we didn't feel like blindly reusing the same approach. So the team decided to pause and look at more long-term solutions. Unfortunately, I must say we haven't settled on a solution quite yet; rather, we're investing in our broader traffic protection strategy.

The other example I want to talk about is file serving. I should start by saying that this was not really an Nginx-versus-Envoy issue. We had a feature in our web enrichment platform that allowed users to define raw Nginx config. Yes, you heard that right. It's the definition of a leaky abstraction.
Whatever good reasons the team had for including this feature in the first place, clearly we couldn't port that functionality forward to be supported under Envoy. So again, we took a look at what our users were doing with this mechanism, and luckily we found that everyone was using it simply to serve small files, such as a robots.txt file. I think we really dodged a bullet here; this could have been abused much more widely, with many more use cases. Given our findings, we settled on building a proper file-serving feature into the API, and currently the implementation simply proxies the request to a GCS bucket. In talking to our users about this feature, we also heard that they want the ability to use a downstream cache, so we're considering if and how we might support HTTP caching in the platform. It's not commonly used for backend services, but it would be useful in the web space. Now Sabrina's going to tell us a bit about the actual migration.

Thank you. With this slide, I just want to provide a quick snapshot of how the migration process looked. The good news is that everything went smoothly, and we are currently somewhere near 90% complete in terms of the migration. So far, we have made the switch with zero downtime and almost zero issues. However, we did take some precautions and systematically moved our traffic following this procedure. We set up migration cohorts, dividing the user base in order to facilitate a controlled transition. Then we contacted the owners of the web services and maintained clear communication and collaboration with them. Next, we scheduled the migration window, meaning an optimal timeframe for minimal disruption. One thing that was very helpful for us was that we were able to temporarily operate traffic splitting with the help of Envoy. This helped us spot potential issues in time and avoid disruptions. Monitoring was key.
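The temporary traffic splitting mentioned above can be expressed in Envoy with weighted clusters. This is an illustrative sketch; the cluster names and weights are made up.

```yaml
# Illustrative only: send 90% of traffic to the old platform and
# 10% to the new Envoy-based one while validating the migration.
routes:
  - match:
      prefix: "/"
    route:
      weighted_clusters:
        clusters:
          - name: web-platform-nginx
            weight: 90
          - name: web-platform-envoy
            weight: 10
```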
After all, we prepared a set of dashboards with an overview of the traffic, what was happening with the requests and the enrichers, and resource usage. When we assessed that everything was fine, we did a full switch of traffic: 100% of traffic going to the web enrichment platform with Envoy. And then some more monitoring to ensure peak performance, as well as collaborating with the service owners for the long term. Next, Oliver will share a snippet of our future plans.

So now that we're almost done with the migration, where do we go from here? Well, we're not sure exactly, but a main focus area in the next planning cycle is to evolve the web enrichment platform. The first couple of items on this list are almost certain. A control plane: let's face it, the current configuration layer is a bit of a hot mess, and we'd love to reuse our expertise with the edge proxy control plane and apply that to the web enrichment platform. Also, the tap filter: we already have a few use cases for the tap filter, or at least use cases where we think we might start with the tap filter and iterate towards custom filters if and when we think that makes sense. One specific effort we're actually currently working on is creating a granular dataset of the traffic going through edge proxy. We've already run a prototype that uses the tap filter and some sampling, and I think it's pretty likely that we'll use the same approach, specifically targeting our web traffic, in the web enrichment platform. The other items here are a bit more exploratory. As Sabrina hinted at before, careful viewers may have noticed that the API we offer in Concierge and the enrichment layer looks very similar to Envoy's ext_proc API. So perhaps we rebuild Concierge as an Envoy filter and call external enrichers with ext_proc. Another idea is to have some coordination between the configuration APIs of edge proxy and the web enrichment platform.
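As a rough idea of what a tap-filter-based capture could look like, here is a hypothetical configuration that taps every request to per-tap files; in practice you would combine this with sampling, as the prototype described above does. The filter placement, match predicate, and output path are all illustrative.

```yaml
# Illustrative only: statically configured tap filter writing one
# file per tapped request.
http_filters:
  - name: envoy.filters.http.tap
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.tap.v3.Tap
      common_config:
        static_config:
          match:
            any_match: true
          output_config:
            sinks:
              - file_per_tap:
                  path_prefix: /var/log/taps/tap
```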
Maybe we could automatically provision endpoints in edge proxy if we know they're web enrichment platform users. And lastly, the mythical idea of actually merging the web enrichment platform into edge proxy itself. There are so many ways we could actually do this, and I'm sure there'll be plenty of ideas from the team over the coming planning cycles. Of course, we'll have to wait and see what the future will bring, but I can speak for the team and say that the future is bright with Envoy. Lastly, we'd like to say a heartfelt thank you to the current and past members of both teams, Nibbler and ATC. Thank you to the conference organizers for inviting us to speak, and to the countless others who helped make this migration a success. I extend the thank you as well. And with that, we'd love to take any questions you might have, or give you a short break otherwise. Also, please use the QR code here to submit any feedback you might have for us. Thank you.

Thanks for the presentation. I noticed one of the things you listed as a challenge during the migration was something to do with redirect policy for client response. We're also dealing with some interesting quirks of internal redirects in a migration at the moment, so I'm curious to hear more about that if you can elaborate.

That's something we might have to look up; I don't remember the details. That was something where I believe one of our past team members submitted an upstream patch to Envoy. I think there was something where you could redirect but it didn't maintain the host header. It was something like that, do you recall? Yeah, I'm not sure. I'll tell you what, we'll chat later, look it up, and get you a better answer.

Curious about your experience with Envoy filters. I've typically heard that it's preferable to use C++, but what about some of the other languages? What was your experience, and the pros and cons?

I can take a stab at this one. This is an interesting question.
I think specifically for the web enrichment platform, we already kind of had the choice made for us, because we had so much custom code in Lua already to support Nginx. We basically just moved everything over and didn't try to redesign things at that point. From a more strategic standpoint, taking into consideration both the web enrichment platform and edge proxy, which our team also maintains, there are differing opinions on the team, I think. But maybe we can say that for edge proxy, where we're really running all of Spotify's traffic, we would prefer something that is very maintainable and very high performance. But perhaps there are cases where we might want to prototype something in a language that team members are more practiced with; we don't have that many people who are actually good at C++. So I think we're interested in looking at some of the other interfaces. I'm also personally interested in Wasm, because some of the logic that we might want in these various proxy layers, we might want to run outside of Envoy as well, and then having a portable format would be useful.

Thank you. I have two questions for you. First, are you using Envoy for APIs also? Is that your team or a different team? I'm just curious what that looks like, web versus API.

So it used to be multiple teams, but those teams have joined, so now it's all one happy family.

Gotcha. OK, but are you using Envoy for your APIs as well?

Yes, the edge proxy that we talked about before; that's where all the APIs go.

Got it. And then you mentioned a control plane. Have you started your journey into that yet, or is it just an idea right now? I mean, there are lots of control planes for Envoy out there, right? I'm curious if you've found anything that looks interesting for web specifically or not.

Yeah, that's interesting. Do you want to answer or fill in?
Yes, well, as Oliver mentioned, many of the decisions we took came from the previous architecture of the web enrichment platform, but a control plane is definitely something that we are looking into and plan to invest in during the upcoming planning cycle.

And I can say as well, part of your question is interesting. Anything specifically about web: we haven't really looked at anything web-specific, but edge proxy does have a custom control plane written in Java, and I believe we're also the maintainers of the Java control plane in the Envoy community. So that's pretty much the no-brainer direction we'll go. How we actually support specific web features, I think, will just be an engineering exercise for the team.

Got it. OK, thanks.

Sure.

Great talk, thank you. A bunch of questions come to mind, but I'll narrow it down to two quick ones. First, in general this seems like a pretty high-throughput application for Envoy. I'm wondering if you've looked at your usage of features related to the performance and scalability of your proxy, in particular your reliance on regexes for doing redirects. We've been shy about having lots of regexes in URL matching, thinking they'll be a drag on performance. Have you done flame graphs, et cetera, to look at that? I'll also just shoot out my other question, which is that you talked about writing some custom filters, and I'm wondering if you're following along with what Alyssa referenced about ext_proc as a way to offload some of that to other services you might have running, to serve those content modifications. Thanks.

Yes, OK, so those questions. Around performance, as I mentioned before, performance isn't very high on our list for this application, but I imagine we will get to performance at some point. In my mind, it very much depends on how we consider merging the web enrichment platform with edge proxy.
Edge proxy is very performance sensitive, and we're doing a lot of engineering work to increase the performance of edge proxy itself. Around performance for regexes: yes, that could conceivably be a problem in the web enrichment platform, but we're not so worried about it; we can always just throw more hardware at it for the time being, I think. And around ext_proc: yes, we're definitely looking into ext_proc, and I think ext_proc can really help us evolve the enrichment part of this platform.

Thank you for the presentation. At my company, we were actually going through a similar migration from another reverse proxy to Envoy. First question: were you able to achieve timeouts and retries to upstreams within Envoy itself, or only via Envoy filters?

That is a question I don't think I can answer. Do you have an answer?

I think we don't extensively use retries; I think we set them to one only.

I see, OK. Cool, yeah. The second question: you mentioned the global control plane, right? At your scale, I assume you probably have lots of edge proxies. How are you handling, let's say, updates to Envoy, updates to the filters and plugin source code, across this kind of distributed edge proxy fleet?

Do you want to come answer? Nina is part of ATC with us.

Yeah, so the scalability of the control plane for the edge proxy layer has been a known problem for us. First of all, we switched to delta xDS. And recently we had a problem with the Java control plane when it got hammered by Envoys in a single region. Let's say we had some issue with a DDoS attack, and then, OK, let's scale up the edge layer. After that, when the control plane restarted and all the Envoys tried to connect roughly simultaneously and request the first state-of-the-world snapshot, our control plane just went crazy and stopped serving requests, and the Envoys basically got stale EDS assignments.
What we implemented for that, as a short-term, relatively low-cost remediation, is what we call xDS rate limiting, and we also reached out to the community to ask whether others would be interested in it. Basically, how it works: every time a new xDS client connects to the control plane, internal logic in the Java control plane checks whether it has seen that client before. If it's a new client, there is a configuration for how many new concurrent xDS streams the control plane allows per second, and once that limit is reached, it sends an error code, something like resource exhausted, try again later, and we rely on the retry mechanism in Envoy. In the largest region, with 10 new xDS streams per second, it takes us roughly 30 seconds for all the Envoys to get their first snapshot, but in that way we stabilize the control plane and we can scale up. We tried a very large setup, around 500 Envoy machines in a single region with a single control plane, did a control plane restart, and xDS rate limiting has proven to be a battle-tested mechanism for that.

That's great. Just one quick follow-up question: do you have any mechanism that allows you to do gradual rollouts? Let's say you're rolling out a new configuration; how do you make sure you control the blast radius to start with?

That is a feature we are very much looking forward to. Currently we are running on GCE, where this is not supported, and we are on our way to moving to Kubernetes, where this is a feature we're looking for. Basically, every Envoy upgrade for us is a stress event: we create risk-event calendar entries, we notify the company, and then we sit with our hands shaking, looking at the graphs for early detection. We've had a bunch of bad incidents because we don't have that capability of phasing our rollouts.
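The xDS admission control just described can be sketched in a few lines of Python. This is a hypothetical illustration of the idea (admit up to N new clients per second, never throttle known clients, tell the rest to retry), not the actual Java control plane logic.

```python
# Hypothetical sketch of xDS rate limiting in a control plane.
import time

class XdsRateLimiter:
    def __init__(self, max_new_streams_per_second):
        self.max = max_new_streams_per_second
        self.window = None      # current one-second window
        self.admitted = 0       # new clients admitted in this window
        self.seen = set()       # clients the control plane has seen before

    def admit(self, node_id, now=None):
        now = int(now if now is not None else time.time())
        if node_id in self.seen:
            return True          # known client: reconnects are not throttled
        if now != self.window:   # a new one-second window has started
            self.window, self.admitted = now, 0
        if self.admitted < self.max:
            self.admitted += 1
            self.seen.add(node_id)
            return True
        # Caller would respond with a resource-exhausted error and let
        # Envoy's own retry behavior spread reconnects over time.
        return False
```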
But we do canary deployments, and then roll out globally.

Cool, yeah, thank you. This is good input for our deployments team. We're definitely looking at globally progressive rollouts, but we don't have that at the moment.

Hi, I'm wondering what you've considered on the simpler side of rate limiting, so that you don't have to roll out the whole infrastructure that Envoy can provide, given that your users are used to something else that's simpler and different. You said you had some things in mind, and I want to know what those are.

Yes, this is complicated. We can talk more about this afterwards, but from a technology perspective we're looking into RLQS. That doesn't necessarily address the product side of the problem, really. One idea we had, because we're interested in per-second rate limits, is that durability of the rate-limiting infrastructure isn't so important. So we were considering using Maglev on the rate-limiting cluster, so that you're basically distributing the client IP addresses along the cluster hash. If one of those rate-limiting instances goes down, it doesn't matter so much; it doesn't take down the whole infrastructure. That way you could also autoscale the rate-limiting infrastructure, taking care of some of the capacity-planning concerns. But that's just an idea. I have an open tab in my code editor with something kind of working in Envoy, but it's not supported at the moment.

I think we're about out of time, but if there are any more questions, we'd be happy to take them offline.

Thanks, guys. Fascinating talk.