All right, hello everyone. It's not that big of a room, so I'm going to tone down — the mic is super loud, too. We've been hearing so far, and you're probably going to hear quite a bit more, about how networking is moving closer and closer to the edge — get closer to your clients, push it as far as you can. Well, today I want to talk about pushing it just a little further than that, and actually running Envoy on your end users' client devices themselves, with a little project called Envoy Mobile.

My name is JP. I work at Lyft on this thing called Envoy Mobile. The idea here is that we can take everything that's great about Envoy — that you know and love, that you run on the server side — and embed it to run in mobile apps, on iOS and Android, to power the networking capabilities of those apps.

A few years ago the team started taking a look at what it would take to bring Envoy and embed it into a mobile client. Envoy Mobile is built around Envoy. It's designed to run on iOS and Android specifically, although it does build and run on various other platforms as well — they're just not target platforms. One of the main differences is that we have a number of constraints by nature of running in these highly sandboxed, highly locked-down mobile environments, where as an application developer you ship a binary to end users, and that ends up running as a single binary, a single process, most of the time. So you can't just run your networking client as a sidecar, spawning a process alongside your app and handing your requests off to it.
You have to run it as part of your application's process, sharing memory, sharing CPU cycles, all of that.

We built Envoy Mobile as a polyglot library, and what I mean by this is that there are APIs for the various languages and platforms you might want to use. We have APIs for C++, Swift, Kotlin, Objective-C, Java, Python — although you don't really run that last one in apps. The idea here is that we want to make the experience of making network requests as idiomatic and native-feeling as possible for application developers on those platforms. What this allows the library to do is both feel right at home for someone who, for example, writes Kotlin all day long building Android apps, or Swift on iOS — but at the same time offer some amount of flexibility for larger teams who may want to invest in building a single interface to their networking APIs, write that in C++, and ship it in a cross-platform way.

Like Envoy, Envoy Mobile is open source, licensed under Apache 2.0, and you can find out more about it at envoymobile.io, which is where you see the nice screenshot of the website there.

So why did we go out and build this? And by "we" I mean that Envoy Mobile was first started around 2019 by a few Lyft folks, and over time more and more folks — especially from Google, but also from other companies — started contributing. Why is this useful to build? Well, there's a lot of what you probably know and love about Envoy that would be hugely beneficial to bring into a mobile application context, and that was one of the main ideas. One of the things Envoy is quite known for is its rich set of observability capabilities — there's a rich set of stats you emit just by nature of using the library. What's nice about this is that especially on mobile, where you have very limited resources, you have to be very careful in how you instrument, monitor, and observe what's happening on your end users' devices.
They might be in a constrained environment from a CPU or memory perspective. They might have intermittent connectivity. You don't want to end up competing for the same resources that you're trying to observe, thereby degrading the whole experience. So that rich set of observability — and especially how lightweight, streamlined, and optimized it is — is extremely helpful to bring to a mobile context.

Another reason is that Envoy has all of this support for newer networking technologies, and I think we're actually going to hear some talks later today about what those are and some of the recent developments in upstream Envoy there. What's nice about this is that on certain platforms — like Apple's platforms — the networking library, the networking engine, ships with the operating system. So if you go back and support end users still on an OS that's a few years old, they're not going to have access to everything that's happened since then. What's nice about being able to ship our own networking library with our apps is that we can backport those capabilities and support all of our users, not just the ones on the bleeding edge.

Another thing that bringing Envoy to mobile gets us is a certain amount of consistency. Instead of building against different APIs — different engines on different platforms and different OS versions, all with their own quirks, their own ways of thinking about them, their own ideas for how to architect the networking logic around them — we can share that code, not just across the mobile codebases but also with the server. So if you're compressing something with Envoy proxy on the server side using Brotli, the code that's decompressing on the client is that same code, right?
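One way to picture the value of that code sharing: a toy round-trip with a stand-in codec. This is purely illustrative — the real case in the talk is Brotli inside Envoy — but the point is the same: when the server-side encoder and the client-side decoder are compiled from the same source, there's no cross-implementation skew to worry about.

```cpp
// Toy run-length codec standing in for a real one like Brotli. The server
// links rle_encode, the client links rle_decode, and both come from this
// same translation unit — so a round-trip can't disagree with itself.
#include <cassert>
#include <string>

// Encode "aaab" -> "3a1b". Runs are capped at 9 to keep decoding trivial.
std::string rle_encode(const std::string& in) {
  std::string out;
  for (size_t i = 0; i < in.size();) {
    size_t j = i;
    while (j < in.size() && in[j] == in[i] && j - i < 9) ++j;
    out += std::to_string(j - i);  // run length, single digit
    out += in[i];                  // the repeated character
    i = j;
  }
  return out;
}

// Decode consumes (digit, char) pairs produced by rle_encode above.
std::string rle_decode(const std::string& in) {
  std::string out;
  for (size_t i = 0; i + 1 < in.size(); i += 2) {
    out.append(in[i] - '0', in[i + 1]);
  }
  return out;
}
```

A real deployment would of course use Envoy's actual compression filters; the sketch only illustrates the "same code on both ends" property.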
So there's a high degree of compatibility and consistency.

Another thing that bringing Envoy to mobile gets us is a lot of control. These network engines that ship with the operating system are extremely powerful and fine-tuned, but ultimately they target the vast majority of use cases. They want to be good at everything, which means they don't necessarily expose many ways to fine-tune, optimize, and configure the networking engine for your specific needs. If you're building a product where you have a pretty well-known usage pattern and every byte matters, you probably want the ability to fine-tune and say: well, these endpoints I know speak H2, for example, so skip the whole ALPN process.

Debuggability is another big one that bringing Envoy to mobile gets us. The alternate network engines are closed source; they ship with the OS. You can't necessarily step through the code to understand, deep down, what's happening when you have a request that's timing out, or something you're just trying to observe. By having full access to the source code — building it, loading it into the IDE so that you can step through the code, set breakpoints, really introspect — you can gain a whole new level of understanding of what's going on under the hood.

So those are some of the reasons why it's been helpful for us to continue to invest in this effort to bring Envoy to mobile.

A brief timeline — Matt touched on this a little earlier. Envoy was announced to the world in 2016. Three years later, some folks at Lyft started working on bringing it to a mobile context. For two years after that, a lot of work was done to bring it up to speed — to parity, or as close as possible, with the existing, well-established, well-polished network engines already out there in the industry. Then from 2021 to 2022, Lyft ran a number of production experiments to really understand where we still had performance or other regressions compared to the existing stacks, and we worked hard to both understand them and bridge them. This was only possible because of Envoy's open nature, where we were able to dig into the source code, talk to some of the maintainers and contributors, and understand what might be going on and what might help.

Then, finally, just a few months ago we completed rolling out Envoy Mobile to all of our user-facing mobile apps. So if you've taken a Lyft in the last few months, or rented a scooter or a bike using any of the Lyft apps, odds are you were using Envoy on your device to actually make those requests. We only did this once we were fully comfortable and confident that we weren't regressing the user experience for anyone, so we made sure to be at par with, or exceeding, any of the performance measurements we could make against the pre-existing networking engines. We published a blog post on Lyft's engineering blog, eng.lyft.com — you can read more about it there.

To give you a sense of what we have now that Envoy powers the vast majority of network requests in the Lyft apps: we get these real-time dashboards and statistics on what's happening at scale. This might seem very simple to you — table stakes when you're building any sort of service or product — but what you have to understand is that the mobile landscape is years behind when it comes to this level of observability. If you were to build this with any of the existing networking engines, you'd probably either be losing out on a lot of data or degrading the user experience. So the fact that Envoy has such a strong focus on rich, optimized observability out of the box means we get a lot of this more or less for free, and as stats get added upstream in the main Envoy project, we can take advantage of those and add
them to our observability. So this is really huge, and probably one of the main advantages we now have by powering the Lyft platform with Envoy Mobile on the client side.

I want to paint a picture here and talk about the road to bringing Envoy to mobile, and the obstacles and challenges we had along the way. Envoy is this highly deployed, widely used product, right? It's already been quite heavily optimized and works quite well. So you would think you'd just plug it into an app, start using it to make requests, and everything would be beautiful. Well, it turns out that the environments in which mobile applications run tend to be quite different from a server application's.

To paint in very broad strokes: in a server environment you tend to have a very stable internet connection. You're directly wired into the backbone; you have a wired connection to the network; it tends to be quite fast, and if it's not, you'll probably change vendors. You have control over who is providing this service. If you don't like it, if there's an outage, you open a ticket, you work with your provider, you work through it — it's not the status quo, not the normal level of operation. You tend to have a certain level of reliability that you expect out of your infrastructure, and you have control: if you're on a provider, or on premises, and you don't like the quality of service you're getting, you have some amount of agency to move. I'm not saying it's simple, but you do have that level of control.

Contrast that with mobile. Mobile networks are inherently unreliable. Not only do you not control the carrier and provider side of it, you also don't control where the device is physically located. You don't control the health of the device your application is running on, or how constrained it is.
You have a wide amount of variability in the quality and performance envelopes of the devices your users are running. You have a lot of carriers — and this is even more true if you operate outside of North America, which is primarily where Lyft operates. Your users can configure their devices in many ways, and you don't really have the option to say: oh, your configuration isn't one we like, so we're not going to support you. Everything has to work.

We made some measurements — some of the numbers are a little small in the chart, but just look at the variability in the bars here. The bottom is around an 80 to 85 percent success rate; the top is around 98 percent. What this is measuring is the success rate by carrier and by networking library, measured with one of the Lyft apps over the span of 30 days. What this helps illustrate is that not only is there a wide amount of variability across carriers, but the top-performing device or operating system is not consistent, and neither is the top-performing networking library. This helps make the point that we don't control what carrier our users are running their devices on; we need to offer them the best possible service regardless of where they are. And this is another reason why using Envoy Mobile has been helpful to us: if you see this and you're using the networking engine that ships with the operating system, you have very little control over how much you can influence it to help fix or bridge some of these gaps.

All right, so we've talked a little bit about mobile constraints. Here are some more. Devices have a finite amount of memory that you're competing for with the rest of the application. They have unreliable connectivity — we touched on that. There are many carriers.
You have to support them all. Then there are user configurations: you can install an HTTP proxy to run on your phone, or a VPN; you can configure your DNS settings to point to something that returns garbage data half the time. Users are frequently moving between networks, and between interfaces and physical radios on the device as well. This is especially true of Lyft's use cases, where you might be requesting a ride as you're exiting your condo building, entering the elevator, leaving your Wi-Fi — and those requests still have to perform as best they possibly can. Or you might be trying to unlock a scooter between large buildings.

This is a fun one: the operating system can kill your app effectively at any moment, but typically when it's misbehaving — and generally you don't want to be the reason the operating system is killing you, right? So we need to make sure the networking layer is not responsible, that it's not overly hungry for these resources.

Another tricky thing with mobile deployments is that you can't SSH into the production environment and troubleshoot why something might not be working.
Well, you really need to rely on your observability, and more often than not in the mobile space, that means you have very little insight into what's happening. That's yet another reason why the observability angle of Envoy is hugely helpful to us.

From the developer-experience point of view, one of the things we had to tackle in bringing Envoy to the mobile side was getting it working and compiling in the IDEs mobile application developers use. So now you can build and run Envoy in Xcode or in Android Studio, and still have full access to the profilers that ship with those, or be able to set breakpoints — which is pretty nice, but that was certainly some amount of work we had to do.

Another concern on mobile is binary size. Because we ship a single binary to all of our users, we need to be really careful about how much heft we're adding to that binary, especially if users are updating weekly. You don't want to balloon the application size just because you wanted to add some observability to your networking layer.

We learned some things by rolling this out.
First of all, we learned that both iOS and Android tend to perform in some pretty surprising ways. One example: we discovered the hard way that Android can block some DNS resolutions when the app is backgrounded and the device is in low-power mode. So we had to experiment with various different ways of performing these DNS resolutions, prod at the Android source code, and try to understand how we could work around it.

Another one that took us quite a while to discover is that Android uses dual-stack sockets, and for some carrier configurations this meant that users trying to connect to an IPv4 address just couldn't connect at all. What we figured out we needed to do is force IPv6 addresses on Android: if it's an IPv4 address, we map it to an equivalent IPv6 address so that the connection can be made — otherwise it can't even connect. This was certainly not something we intuitively knew, but we always force IPv6 on Android for that reason.

iOS also has its own quirks. What we found is that Envoy — both Envoy and Envoy Mobile — uses quite a bit of C++ static state, with static destructors. What iOS in particular does when an application is shutting down, when it's being terminated, is that it might go and free some of those static resources — which means that if something is still accessing them, it'll crash. This was a little surprising to us, because we found that when we had Envoy Mobile enabled in our experiments, those users were experiencing way more crashes — but those crashes weren't happening while they were using the application; they were happening while they were shutting it down, while they were force-quitting the app. Even though this wasn't necessarily degrading the user experience, it added quite a bit of noise to how our mobile developers look at our overall observability for crash rates — they'd say: oh, this thing isn't stable at all, it's crashing. Well, it turned out it's fine — it just happens to crash when you're shutting it down. We've been working through these one by one, guarding against some of the state that might disappear, but ultimately it's still something that can happen.

And this is just a cost you have to pay if you're working in any sort of JVM and interacting with native code: you do have to cross the JNI boundary. That can be costly, but it's also just cumbersome more often than not, so we try to minimize how often we cross that barrier.

So, some of the things we did to address these concerns. For one, trimming binary size: we ran some analyses using tools like Bloaty, among other means, to try to identify where some of the heft was coming from. At this point we've gotten the overhead of linking Envoy Mobile into apps down to less than five megs, generally speaking — which is not very light, but for applications that tend to be 50-plus megs in size, it's a good enough trade-off, in our case at least, for Lyft.

IDE integration is something we worked on with some open-source projects, like the Bazel rules_xcodeproj rule set, so that you can build and run Envoy with full access to all of the sources — C++, in this case.
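As an aside, to make the Android dual-stack workaround from a moment ago concrete: the mapping in question is the standard IPv4-mapped IPv6 form, `::ffff:a.b.c.d`. Here's a minimal sketch of that mapping — illustrative only, not Envoy Mobile's actual code:

```cpp
// Sketch: convert a dotted-quad IPv4 address into its IPv4-mapped IPv6
// textual form (RFC 4291 section 2.5.5.2), so a dual-stack AF_INET6 socket
// can be used to reach an IPv4 destination.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <cassert>
#include <cstring>
#include <string>

// e.g. "93.184.216.34" -> "::ffff:93.184.216.34"; returns "" on parse error.
std::string map_ipv4_to_ipv6(const std::string& ipv4) {
  in_addr v4{};
  if (inet_pton(AF_INET, ipv4.c_str(), &v4) != 1) return "";
  in6_addr v6{};            // starts all-zero
  v6.s6_addr[10] = 0xff;    // bytes 10-11 are 0xffff in the mapped form
  v6.s6_addr[11] = 0xff;
  std::memcpy(&v6.s6_addr[12], &v4, 4);  // last 4 bytes carry the IPv4 address
  char buf[INET6_ADDRSTRLEN];
  inet_ntop(AF_INET6, &v6, buf, sizeof(buf));
  return buf;
}
```

The real client does considerably more (resolver integration, happy-eyeballs-style fallbacks); this only shows the address-form transformation itself.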
There's some Objective-C++, Objective-C, and Swift. You can step through those, set breakpoints, use LLDB — and this has been extremely helpful for understanding how things work under the hood. If you run into an edge case you're trying to understand, you have access to everything. And you have the same thing for Android Studio.

So, some of the things the teams have been working on recently. Earlier I mentioned that Envoy Mobile powers 98% of Lyft's mobile network requests. There's a 2% holdout we're keeping. Part of that is to continue running comparison experiments against the previous networking stacks we were shipping with, but another reason is that we do have a number of gaps where we still don't offer full support for certain configurations. Those generally fall into two buckets: one is VPNs, and one is the HTTP proxies that users can install and run on their phones. On the VPN front, we recently added support for VPNs on Android. This required moving from one DNS resolver to another, backed by a different system call, which allows us to operate in the background, in low-power mode, and supports VPNs. So that's now done. We have work in progress — actually, the code here is mostly shipped, and we're running production experiments — to see how we perform when the user has an HTTP proxy installed on Android. We also recently made a change so that Envoy can delegate some of the SSL certificate-chain evaluation to the operating system, so that you're consistent with whatever the OS provides and whatever the OS trusts.

One of the main remaining gaps we still have is HTTP proxies on iOS. A lot of the code written for proxy support on Android will be portable here and will work, but there are a few things that still need to be done, such as parsing and interpreting PAC files. Within the next few months we should have that up and running. This is really the last big piece of core functionality — the last main rollout gap where we currently have to delegate to the alternate, legacy stacks.

Another thing we want to experiment with is QUIC and HTTP/3. This is something Envoy Mobile does support at the moment, but we haven't run any production experiments, so there's probably some stuff we'll uncover once we start turning it on and running experiments in production. But the fact that Envoy has rich QUIC and HTTP/3 support does mean we get a lot of that more or less for free. And actually, that's something I want to reiterate: the Envoy Mobile project lives in a separate git repo, but ultimately it vendors Envoy as a git submodule, and we tend to keep that up to date on a weekly basis. What that means is that as more stats, bug fixes, or features are added upstream in Envoy, we get to benefit from them — and so do all the users we deploy to.

Another thing some contributors from Google are working on is improving the xDS integration and support from a mobile perspective — caching some of the xDS values, and generally just making it work well on mobile.

I had some slides on how the architecture works. I'll brush through this pretty briefly because I'm running long on time, but at a very high level, what's going on here is that you have different frameworks and products that we produce — frameworks for iOS, frameworks for Android — a common C and C++ layer that these bindings delegate to, and then ultimately that calls out into Envoy. Recently we've also been putting more love into not just the Swift and Kotlin APIs but first-class C++ APIs that also call into this common core. So you can think of it in three pieces: the user-facing APIs, the middle core bridge layer, and the C++ native code in Envoy that it calls into.

I'll skip through this, but I'm happy to dive into more details. I'll just stop on this slide real quick, because the way we interoperate between Kotlin, Java, Swift, and Objective-C is that we effectively allow the platform layer — those higher-level language layers — to define callbacks, closures, and functions that are invoked either from Envoy filters or at various parts of the pipeline. We carry some state for that step in the filter chain — such as StreamIntel — back up to the platform layer, where we deserialize it. That's how we communicate across languages. This slide illustrates it a little: you have the Envoy engine at the upper left, configured with those higher-level languages; that gets translated into a set of Envoy concepts, which effectively calls back into that middle layer, which then invokes the C functions that were passed in, converted into top-level language lambdas at the very top.

So, just briefly: over the last few years, there have been over a dozen folks contributing to the project.
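Going back to that cross-language callback flow for a second: the core only ever sees a plain C function pointer plus an opaque context, and an adapter trampolines back into the platform-level closure. A simplified sketch of the pattern — the names here are made up; this is not Envoy Mobile's real API:

```cpp
// Sketch of the C-bridge trampoline pattern: the "core" speaks only in
// C function pointers with a void* context, while the platform layer works
// with high-level closures.
#include <cassert>
#include <functional>
#include <string>

// What the C core sees: a plain function pointer plus user context.
using c_callback = void (*)(const char* data, void* context);

struct Core {
  c_callback cb = nullptr;
  void* context = nullptr;
  // The core invokes the registered callback as events occur.
  void dispatch(const std::string& data) {
    if (cb) cb(data.c_str(), context);
  }
};

// Platform-facing adapter: stores the high-level closure and installs a
// static trampoline that casts the opaque context back to this adapter.
struct Bridge {
  std::function<void(std::string)> on_data;
  static void trampoline(const char* data, void* context) {
    static_cast<Bridge*>(context)->on_data(data);
  }
  void install(Core& core) {
    core.cb = &trampoline;
    core.context = this;
  }
};
```

In the real library the Swift/Kotlin closures sit where `on_data` does here, and the dispatched payloads are filter-chain state rather than raw strings.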
I've been working on it for the last year, but some folks have been working on it since long before that, and I just wanted to call that out. We have contributors from within Lyft, from within Google, but also from elsewhere in the community. And of course, every time you've contributed something to upstream Envoy, we've been able to benefit from that in our project in many ways as well.

So that's it — I can open it up for questions.

Yeah, that's a great question. To repeat it for context: from the perspective of a room full of server engineers, what sort of opportunities does this unlock on the server side? Well, for one, if you have any expertise within your server team on how Envoy works — how to configure filters, or stats — that domain knowledge now applies on the mobile side as well. Your server networking team can collaborate more closely with your client networking team if you adopt Envoy Mobile and you've already adopted Envoy on the server. That's one. Another is that, like I mentioned, any optimizations or bug fixes in code that happens to be shared between Envoy Mobile and Envoy proxy on the server side end up benefiting mobile clients as well. So in terms of sharing the codebase, that's helpful. I mentioned compatibility and consistency earlier: if you have a filter that operates bilaterally, you can run the same filter on your server as on the client, and you have a high degree of confidence they're going to be compatible, because they're using the exact same code. A common example there is compression and decompression, where you know you're unlikely to run into an edge case where one Brotli compressor compresses things one way but is incompatible with an alternate implementation — it's the same implementation.
So there are a number of ways, and there are even things beyond that: by knowing you're using the same networking engine on both ends of the connection, there are some optimizations you can do as well. It extends beyond developer productivity — there are also aspects of end-user performance involved.

Yes — Matt wants to ask something, which is kind of scary. Do you want to answer it as well?

I just wanted to say real quick that as server engineers, we tend to be very excited that we're at a 99.99% success rate, but the reality is that the client might be a hundred percent broken, right? That's a pervasive problem. So I just wanted to add that as server engineers, we need to care about what's happening on the client, because we can be serving a hundred percent success, and there's broken JSON or something, and the client is a hundred percent broken. We should care more about the end-to-end problem here. And I think, as JP was saying, with the shared library we can participate more — there's more shared knowledge. Because historically, at most companies, there are these silos: there are the client people and the server people, and I think that's not super productive. So I just wanted to add that color.

No, that's some great color, Matt. Just briefly on that: when we looked at server-side success rate versus success rate from the perspective of the client, we saw some surprises — the numbers didn't match up — and so we had additional insight into where to invest to bridge that gap. I don't know how we're doing on time; I'm happy to take more questions.

Okay, all right.
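Matt's point about the two vantage points disagreeing can be made concrete with a toy model: requests that die before reaching the server — or that arrive, succeed server-side, and then turn out to be unusable on the client — inflate server-side success relative to what users actually experience. A sketch with hypothetical data shapes, not Lyft's actual metrics:

```cpp
// Toy model of why server-side and client-side success rates disagree:
// the server only counts requests that arrived, while the client counts
// every attempt, including ones that never left the device's network.
#include <cassert>
#include <vector>

struct RequestOutcome {
  bool reached_server;  // false: DNS/connect/timeout failure on the way out
  bool server_ok;       // server's view: handler returned success
  bool client_ok;       // client's view: response arrived AND was usable
};

// Server-side rate: successes over requests the server actually saw.
double server_success_rate(const std::vector<RequestOutcome>& reqs) {
  int seen = 0, ok = 0;
  for (const auto& r : reqs) {
    if (!r.reached_server) continue;
    ++seen;
    if (r.server_ok) ++ok;
  }
  return seen ? static_cast<double>(ok) / seen : 0.0;
}

// Client-side rate: usable responses over every attempt made.
double client_success_rate(const std::vector<RequestOutcome>& reqs) {
  int ok = 0;
  for (const auto& r : reqs) {
    if (r.client_ok) ++ok;
  }
  return reqs.empty() ? 0.0 : static_cast<double>(ok) / reqs.size();
}
```

With one request lost before the server and one that succeeded server-side but returned an unusable payload, the server can report 100% while the client sees 50% — exactly the kind of mismatch described above.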
Yes. Okay, so the question is that the audience member noticed there were different success rates across different networking libraries, and across different carriers and operating systems, and that in some cases Envoy Mobile is underperforming — or rather, URLSession or OkHttp is outperforming Envoy Mobile. Because of the observability we now have, we're able to discover these gaps. Previously, maybe we were just on a single networking engine per operating system, but we didn't have as much visibility into the disparity — whether within operating systems, within carriers, or sometimes even geographically, where carriers sometimes end up A/B testing networking configurations by geography, so someone in Detroit has a very different networking experience than someone in Manhattan, for example.

So yes, there are still some specific areas where the alternate networking engines outperform what we can do with Envoy Mobile. But because we have that added observability, we know which areas we can dig into, optimize, and potentially fix. And our bar for fully rolling out — at least to 98% — was to match or outperform globally. We looked at a number of top-level business metrics to make sure we weren't taking any step back there. Even though in some specific areas we might be underperforming, in other areas we were outperforming; the idea is that overall we didn't take a step back.

One more? Okay, one more. Yes. Right, so the question is: how can other companies go and perform some of the same measurements and experiments that we did? Ultimately, this is one of the reasons we've invested in Envoy Mobile — because of the observability we can get out of it. So in many ways what you're asking is: how can you get the same level of observability with the alternate stacks?
And the truth is, it's very hard to accomplish that. What we ended up doing is knowingly degrading the experience for some users on the alternate networking stacks — on URLSession or OkHttp — in order to add additional instrumentation. That instrumentation ends up competing for resources, so we know there's a hit there, and we sample a small, randomized percentage of user traffic in order to compare against the same type of observability overhead on Envoy Mobile. You can't fairly compare Envoy Mobile's optimized stats against other stats that would otherwise be adding overhead — that's not comparing apples to apples. So we run the same sort of heavyweight instrumentation on both ends, run experiments at scale, and sometimes take multiple weeks of gathering data in order to understand how both perform.

Okay — thank you very much for your time.