Hello there. Unfortunately, Matt cannot make it today. He is recovering from COVID, but we have Alyssa, senior maintainer of Envoy, Jan, Greg, and Joshua here to answer any questions you might have. Generally, this ends up mostly Q&A, but if you don't ask questions, we'll have to figure out something to blather about, so please come on up to the mic and ask questions as you have them. Hello, Kevin from Akamai, burgeoning Envoy community at Akamai. A couple of questions from our engineering folks that they wanted me to bring forward today, so I apologize if some of these are public domain and we didn't do our due diligence to find the answers. One of them has to do with the external processor filter; we're pretty jazzed about that and we're experimenting with it. The question was around whether or not it's considered production ready. Do you want to take that one? Yes, or almost there. I think if it's not there, it will be there within days or weeks, because we have a GA product that uses that, that has it as a requirement, so it will be production ready very, very soon if it's not already; I don't know the exact status. Is there a particular version we should be looking out for? We did a number of hardening measures specifically to harden it against untrusted ext_proc servers, and that all went into the most recent release, which is 1.28, I think. Yeah, that's right. So yes, the most recent version has the most robust and hardened version of ext_proc. Okay, perfect. The other question they had was around connection handoff after handshake. Is there any work that's been done, either for TCP or UDP, to be able to hand off a connection? I wouldn't think that would be an Envoy thing; it would be more of a downstream Cilium thing, but that was a question they had. Well, for TCP you'd need kernel involvement somehow, and that has not been investigated at all to my knowledge. Okay. In theory you could do it with QUIC, but I don't know that anyone has looked into such a thing. I believe Raven is actually looking into that to support HTTP/3 hot restart, but I am not holding my breath; it's a really hard problem. Okay. So yeah, I would say that's not currently a supported feature. For hot restart, though, we're just looking at routing to the correct Envoy during the hot restart, not actually moving the connection. Yeah, we're not doing a connection handoff; we send new connections to the new Envoy and then eventually close connections to the old Envoy. It's something we wouldn't be averse to. I know that's something Akamai tends to do, so if you want to add support, we would review it. Okay. But I don't think anyone else has looked into that yet. Okay, cool. The other question was around scaling for configuration: multi-tenancy, certificates, clusters, routing information, just general metrics or gotchas that we should be aware of. Thanks; I could not find much on that. Yeah, scaling is a big concern for a number of the products that we run at Google. Let's see, you mentioned scaling for certificates, but I think you handle those outside of Envoy, if I remember from the last time I talked to you. Certificates, yes. We offload them. In fact, the way we did that is that the open source C++ class hierarchy makes it possible to inject out-of-band handling, and that's how we do that. Okay.
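For reference, the in-tree hook touched on just below, which offloads only the private key signing operation rather than the whole handshake, is the private_key_provider field on a TLS certificate. A minimal sketch, assuming the contrib CryptoMB provider; the file paths, poll delay, and exact proto path are illustrative assumptions and should be checked against the docs for your Envoy version:

```yaml
# Listener transport socket fragment (illustrative, not from the discussion).
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
    common_tls_context:
      tls_certificates:
      - certificate_chain: { filename: "/etc/envoy/certs/server.crt" }
        # Instead of an inline private_key, delegate signing to a provider.
        # Provider name and config proto here are assumptions based on the
        # contrib CryptoMB extension; other providers plug into the same field.
        private_key_provider:
          provider_name: cryptomb
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.private_key_providers.cryptomb.v3alpha.CryptoMbPrivateKeyMethodConfig
            private_key: { filename: "/etc/envoy/certs/server.key" }
            poll_delay: 0.02s
```

The out-of-band certificate handling described in the answer above goes further than this hook: it injects the certificate handling itself via the C++ class hierarchy, so the proxy does not need the certificate at all.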
We're using the straight upstream Envoy, but we have linked in, with a replacement main, the ability to inject certificate handling, so we handle that in code that we have not open sourced. It is possible to follow the pattern that we have done and replicate that yourself to offload certificates also. So that's different than the hook that allows you to just use the private key signing operation. So this is a different hook, lower in the stack, that does that so you don't even need the certificate itself. Exactly. TLS offload. TLS offload? Yeah. Okay. Yeah, I mean the request is handled in process, but the negotiation on connection is offloaded. Okay. You also mentioned just large numbers of clusters. Code got merged in, I would say maybe a quarter or two ago, to make the whole cluster initialization lazy-init. I don't remember the exact scope of it, but it had a huge impact on memory usage on our multi-cluster systems, when we were loading a lot of clusters that never actually got used. Okay. And that's at two levels: there's just the stats in clusters, and then the whole cluster information itself can be lazily loaded. And that had a huge impact on memory footprint. We're always looking at memory footprint for large-scale multi-tenant systems. Okay. Is that going to be extended into other areas like routes or endpoints as well, or are you not sure? Could be. I haven't looked at the detailed plans; I wasn't involved in that myself. Traditionally, as we hit scaling pain points, we fix them. If you hit different scaling pain points, you can fix them. Sure. The clusters had been a distinct pain point, so that's one that we addressed recently. And we can point to the PRs where that got done also, if that is helpful. Okay. I think just knowing that it's there is probably the key, and then if that's a pattern that would be reusable, we can certainly look at it if we need to for other areas. Okay. And I think the last one I had here was around the cache filter plugin. We understand that's not something that should be used for production or anything, but they were wondering if you had any example code or anything around remote HTTP calls for that. So I think the idea is that the cache is remote, or we use an existing cache we might happen to have. You know, are there any examples of that out there that you can point us to? Yeah, the cache filter was built years ago, but it's getting some attention now from one of the maintainers, Raven, who has built a file cache based on that interface. Oh. And so that is in the code base now and could maybe be used as a basis to extend it — I mean, it was obviously intended for a remote cache; the whole system was designed to be asynchronous in that way, but nobody has fleshed one out yet or had the priority to do so. But Raven has built a file cache on it, which is at least a good start. Okay. Yeah, I'll say unfortunately we ended up branching the cache filter. The team that does the most caching at Google wanted to move a little faster than they could get reviews. They are interested in open sourcing it; there's kind of an ongoing negotiation for headcount, so I can't promise it'll happen this year. There's a separate project, which I'm hoping will end up in open source Envoy, that basically uses some of the upstream APIs we have to do a better job of handling connection persistence around live streaming, which may or may not be interesting.
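As a concrete reference for the cache filter interface discussed above, a rough sketch of how it is wired up with the in-tree simple in-memory cache follows; the file-system cache mentioned here, or a future remote cache, would plug into the same typed_config slot. Proto names are from memory and should be checked against the current extension docs:

```yaml
# HTTP filter chain fragment (illustrative).
http_filters:
- name: envoy.filters.http.cache
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.cache.v3.CacheConfig
    # The storage backend is itself an extension; swapping this inner message
    # is how a file-system or remote cache would be selected instead of the
    # in-memory one.
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.http.cache.simple_http_cache.v3.SimpleHttpCacheConfig
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```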
But again, as with all things on the roadmap, it's kind of unclear what will land in open source in any given year. But it's definitely an area we would love to see more investment in, so if there are features that you want there, we would absolutely love to have a more functional cache filter. Okay. All right, that sounds good. Yeah, just trying to avoid working on something that might already be almost ready to come out or something like that. So, okay, good. I think those were the questions that I had, so I appreciate it, and we'll probably follow up a little bit more offline as well. Sounds great. All right, thanks. Thanks. And for folks who came in late, again, this is just open Q&A, so if we don't have questions, we can blather. But if you have questions, please come up to the mic. Hi, this is Sadeesh with Tesla. So yesterday there was a point mentioned about the parsing engine, the HTTP parsing engine that's used in Envoy, right? Like http-parser and the Balsa parser. So from a security roadmap perspective, let's say a year from now: for the CVEs that show up from security folks mucking around with headers and impacting the data plane, is the expectation that the Balsa parser will help reduce the frequency of these data-plane-related CVEs? Absolutely. So one of the things that our Envoy platform team has done over time is, every year we look at the CVEs we have, and we try to tackle areas which have had kind of low-hanging fruit for security improvements. So both switching over the codecs — again, we've had multiple zero-days due to our HTTP/2 parser, and we've had at least one CVE due to the HTTP/1 parser, I think two or three. But the really big one, as far as I'm concerned, is the unified header validation. There's a bunch of gotchas that we continue to find in Envoy code related to header validation, character checking and whatnot, and I think once we finalize this transition, I would sincerely hope we have fewer CVEs. Now, that's not going to be zero; again, we had this emergency release in Q4, the HTTP/2 rapid reset attack. I've been in edge proxying for 16 years now; that was the most effective attack against major companies that I have seen in my career. That sort of thing, I mean, that's a one-in-16-years sort of attack, but that sort of thing is still probably going to happen, where we find something the industry has never considered before. But the really low-hanging, "this was written overnight at a startup by Matt Klein" type header validation bugs — those should go away and go away permanently. Gotcha. So the future Envoy releases next year will incorporate the Balsa parser. Yep. Yeah, Balsa is on by default. We're hoping to turn oghttp2 on by default this quarter and then UHV hopefully early next year. Gotcha. My second question is a question that I asked Matt last year: how big can an Envoy configuration get before it breaks? And his response was, there's no free lunch, find out and let me know. Yeah, and I would say that's still kind of the case. I mean, I believe we have cases on the order of hundreds of thousands of configs and it works. But again, we'll hit something where we have memory pressure due to clusters, and we have to go fix that, or, you know, lazy-init clusters and go fix that. So for us — and again, I think Google tends to be at the forefront of scaling, though maybe Akamai will be soon.
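For anyone who wants to opt in to (or out of) the codec and header-validation transitions mentioned above before the defaults change, they are driven by runtime feature flags, which can be pinned in the bootstrap. The flag names below are from memory and vary by release, so treat this purely as a sketch and confirm the exact guards in the release notes:

```yaml
# Bootstrap fragment (illustrative; flag names are assumptions from memory,
# check the release notes for your version).
layered_runtime:
  layers:
  - name: static_layer
    static_layer:
      envoy.reloadable_features.http1_use_balsa_parser: true
      envoy.reloadable_features.http2_use_oghttp2: true
```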
You know, when we hit something that simply doesn't scale well, we refactor Envoy until it does. So it's basically a function of the resources you allocate to your xDS instance and the resources allocated for the data plane proxy, and as long as the admin thread does not die for whatever reason, it should be fine. Yeah. Okay. And again, it's possible that you'll hit an area — you know, one thing we found at Google is that if you're doing a lot of regex matching, it can get really expensive to walk through ordered rules. So it is entirely possible that you can configure Envoys that hit a scaling point that we haven't hit. Right. And then it's a question of: do you change how you surface those configs, or do you try to optimize those code paths? And again, that's a decision you'll have to make when you hit those scaling points. Yeah. So one of our Envoy shards is around 150,000 lines of config and counting, and hasn't broken in over a year yet. And the xDS stream takes three minutes, four minutes, which is really impressive, because what we're essentially doing is: you have a bunch of Swaggers, you parse the Swaggers out, build the routes, populate it, and set up authentication or whatever. And it works, and it's scary that it works, because I don't know when it's going to break. So, okay. I mean, one suggestion I would have as a longtime proxy expert is make sure you canary your configs; do not roll out a config of death at wide scale. That is something that sadly still happens even at major companies. Actually, I have a question about that. You're using xDS with a control plane server, and state of the world every so many seconds, and it's that big and it works? The good part is — I know the point Matt made last year is that if you keep pushing a lot of updates to xDS, that is not recommended, and I don't recall the exact reason why — but fortunately we are not there; we don't need frequent updates. So so far it lines up. We just have to tell a developer, hey, just wait for like five minutes and you'll get your configuration reloaded, be happy with it. I guess it is stable. Yeah, that's good. I am very interested in what happens with all of the Envoy stats and the admin thread while xDS is loading; it might be that that backs up, and that might not be good. So that could be a possible pain point in the future to look for. Which was it again? Well, the xDS parsing and stats and all the admin handling is all on the main thread. If we do have a large blocking, you know, compute-bound operation on the main thread, yes, that could hold up those other things from occurring and might back them up, and that might have an effect. I think there's also health checks and outlier detection; there's some other stuff running on the main thread. So that is in fact one scaling point that we have not hit internally with our Envoy deployments, but that we have hit in other deployments, where we've had to break operations like the config reload and stats and health checks into their own thread pools because they were blocking each other. So as you do these very large monolithic updates, that may be a pain point that you hit, where you need that feature before we do. The admin thread is what we are watching with eagle eyes, because again, I don't know when it's going to break.
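One concrete way to keep that eagle eye on the main thread (and the workers) is the per-thread dispatcher statistics that come up again a little later in this discussion; they are off by default because they are histograms, but they can be enabled from the bootstrap. A minimal sketch; the stat names in the comment are from memory:

```yaml
# Bootstrap fragment (illustrative). With this on, Envoy emits per-thread
# event-loop histograms, roughly:
#   server.dispatcher.loop_duration_us                     (main thread)
#   listener_manager.worker_0.dispatcher.loop_duration_us  (worker threads)
# Rising loop durations on the main thread are an early warning that config
# reloads, stats flushes, or health checking are backing it up.
enable_dispatcher_stats: true
```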
But yeah, I think the whole concept of configuration sharding is very difficult; it depends on the business, right, like it depends on how your business decides to use the proxy. I think we'll have to find out the answer. So another question I have is on the egress proxy. I know that the egress proxy is still not GA and it's in beta, or I don't know whether it's in 1.28. When you say egress proxy, do you mean like the dynamic forward proxy, or do you mean...? Yeah, the dynamic forward proxy. So we have several interesting security use cases for that; it basically surrounds ZTA. And what we're interested in is: if you somehow force a client to tunnel their traffic through the egress proxy using DNS, it's very simple. But if DNS is not in play and an attacker is using an IP address to connect, then you need to have some form of an eBPF mechanism to basically muck around with the destination IP in their packet and force them to the egress proxy. So are there any thoughts or ideas on some solution? Because eBPF is really easy to run on Kubernetes and all, but if you're looking at bare metal, or stateful applications that are hosted on VMs, or even Windows applications, how do we design a holistic solution? We don't want to use iptables for sure, but we definitely want to use eBPF to force the requests from any client to go to the egress proxy. Is there any work on that planned for the future? Not that I think any of us are familiar with, sorry. Okay. I have one more question, but it's got to do with rate limiting, and also rate limiting and regex. I know that in one of the releases this year the regex engine was changed, or there's a Hyperscan regex engine or something along those lines, I don't recall. Yeah, that's an Intel contribution, I believe, to take advantage of hardware available on x86. And I thought it was on the order of a 20 to 30% speed improvement over RE2, but I might not be remembering right. Okay. But it's a different syntax, so you have to... So we'll have to have compatible hardware to effectively get that 20% bump. Right, but I think it's not just Intel hardware; it was Intel that contributed, but they claim that this would also be working on AMD also. Okay, but maybe not ARM. Got it. Okay, one more question, sorry, it's just coming to me now. Yesterday in one of the talks they mentioned — okay, when I saw the source code, you have the admin thread and the worker threads. There is a trace object call that goes to the admin thread that essentially reports the latency for every request; I don't know exactly which C++ method that is. But in the talk they said that one way to choke your admin thread is, in the event there's a thundering herd of requests and your worker threads are doing a lot of work, they're reporting back to the main thread the request ID, the request start time, and the request end time. Your admin thread can get consumed, and that will obviously impact the time it takes for you to reload your configurations and whatnot. Again, I'm not able to remember that exact method; it's a trace-something method. How do we get visibility for that, or basically just set up a small counter that keeps track of these things? Because if you really want to be protective of your admin thread, you want to know all the methods that are used to make calls back to the admin thread.
I don't know if I have an exact answer to your question, but I believe we have stats and histograms on epoll behavior for every thread, so you should be able to look at the epoll stats. But they're off by default, I think. If you're sensitive to latency, you can turn them on. They may be, but we turn them on. Yeah, no, I turn them on also, but they are off by default given that they're pretty expensive. I think the issue is that some of them are histograms, and histograms are very bad for statsd. Got it. And so if you don't use statsd, it's probably fine to turn them on, and if you do use statsd, then it's expensive. Fair enough. So that tends to be a really good indicator of whether any of your threads are getting bogged down; you can basically always have those on your monitoring pages. And if, again, you've got these config reloads, or you're doing expensive regexes or tracing, you should be able to see that going up and keep an eye on it. And if it gets too high, you get to investigate, see what's causing the problem, and then break it out into worker pools. The other thing I wanted to note about regexes is that even the fastest regexes are not as fast as trie matchers or prefix matchers, and so to the extent that you can express your rate limiting rules in terms of those, it's probably going to benefit you, 100%. Especially when you're doing a lot of rewrites of paths or even query params. I think with the regex changes that have come out this year, we easily saw a 5% improvement without doing anything — same VM, new Envoy version, it's faster. And yeah, I think those are my questions. Thanks. Yeah, there's also a path matcher, a special matcher which was introduced recently, that you can investigate as a replacement for regexes. Violeta — I'm new to the space, but I'm a very nosy person, so I was wondering, do you have any advice for me to learn how the community thinks, what the vision is, basically where do the discussions happen? I think unfortunately a lot of the discussions that I'm aware of happen on the Envoy maintainer channel, which is not particularly useful to someone who's not an Envoy maintainer. I think in general, the more you get involved with the project, the more you'll get a feel for what's going on. You'll get review advice, and sometimes other people get pulled in. As you get more involved, you may end up an extension owner, at which point you'll get pulled in on code reviews and you'll kind of see different maintainers' review styles, or have people ping you and ask questions. So I think just get involved, find an area that you're interested in working on. And again, as mentioned, I'd love to sync up with you after the con and figure out some opportunities for that. But the more involved you get, the more people you get to know, and then the more you ping them offline or get hooked into channels when we're brainstorming how to do file-based caching or something; it just kind of evolves organically. Early on in the project we had very regular meetings, but we haven't had them lately because people haven't been putting anything on the agenda. So that used to be a way to at least kind of be in the water-cooler conversations, and those just haven't happened lately. I was thinking of trying to revive them, at least having them like once a quarter, just for new people to get names and faces. Alyssa, how well known is that? I think the only reason I know about it is that you told me. But there must be some way to discover it.
To discover what? The bi-weekly meeting that we don't have anything on the agenda for. I have a to-do to update the docs on that, so I will get that done in the next month. Sounds good. I'll also say that on Slack there are also the dev and users channels, Envoy Dev and Envoy Users, which I look at. I don't have time to answer everything, but you can probably engage people there also, and you don't need to be a maintainer to be on those. Yeah, sounds good. I've definitely seen some conversations there, but it seems like they're very short and they die very quickly. Sounds good. Would you recommend going to the community meeting, or was that the meeting you were referencing? That was what I was just saying: it's not regularly held anymore. But again, for people who have a larger design proposal, they have the option of putting it on the agenda, and then we will hold the meeting if there's anything on the agenda. So I will do a better job of updating the docs to make that clear. But other than that, yeah, I think we mostly just hang out on Slack. And again, people are welcome to file issues to try to start discussions; it generally doesn't work. We generally throw a pull request out there and start in on something, or just ping someone and talk to them, because we really do love having people doing contributions, and the more you help out with the project, the more people will help answer your questions and prioritize your reviews and everything else. Sounds good. I have a second question. One of my teammates — sorry, I don't have details — one of my teammates would like to be a maintainer, and I believe he has been trying this year. Do you have any advice that I can relay to him about anything he can do to be more strategic? I think the thing that we recommend is that people email the maintainers list early and often. We often have areas of the code that are not well owned and not well maintained, and if people are willing to pick those up, it also means that all of the maintainers will pay more attention to that person's code and reviews. So often we'll have people who try to get involved with the community by doing work that other companies don't particularly want, and then there's a whole bunch of back and forth and people aren't looking at their code very carefully, and then they say, oh, I want to be a maintainer, and then no one is willing to go back and, you know, comb over their code reviews. So I do encourage people to reach out early and often, even if it's obvious they're not going to be a maintainer in the next week or two, but to say, hey, I'm interested, how do I get more involved? And then we're happy to help out there. You guys have any other suggestions on that? Yeah, I mean, I don't know if this works particularly well, but we do have this mechanism called first-pass reviewers, so that when a PR comes in, there's a maintainer on call who will assign it to one of the maintainers to do a review. But if there are first-pass reviewers that are active, then we can randomly assign it to a first-pass reviewer. And that will get you known for how deeply you look at code and how well you understand the system. I don't know if I would start with that, but you could start by trying to write some PRs, getting some feedback, and then get involved and learn the code that way. Thank you. Hello, I'm Sabrina from Spotify.
I wanted to ask you, what is the status of the rate limit quota service feature? Is this something that is close to release, or...? No. It's been close to release for many months, and I don't have an ETA for when it's going to be delivered. It is going through, like, a staging qualification right now, so I don't exactly know the GA, like general availability, on it, or even alpha. But I know that right now it's in the integration stage; it's live enough to start doing developer testing. Unfortunately, I don't have much more information than that. My guess is that it should happen sometime early next year, but again, that's a guess. Thank you for your answer. Is this something that someone could potentially contribute to, to move it forward? I think the biggest gap in the open source is really the server implementation, and that's been asked about a couple of times. I think for the rate limit quota server, the implementation is more complicated than for the existing global rate limiter. So that's, I think, where the biggest gap is and where probably most of the effort would need to go to make it available to the open source community, because right now we have the client for it, we have the extension in Envoy, but there is nothing open source to hook it up to. Yeah, that's where I see the next effort going. How that's going to unfold, I unfortunately don't know. Okay, got it. Thank you. I have a question for us. That kind of reminded me: generally, if somebody wants to contribute more to Envoy, already has a lot in the past, and is looking for a new area, it seems like maybe we have a good view that others don't on what needs help, and we should have some place where we list shovel-ready projects. I mean, actually, we have that internally, like at Google for the platform team, but we should have one that is open source. Yeah, so externally we have issue tags, like the beginner and help-wanted tags, for areas that we are more interested in. I think we don't really have a great way of publicizing — like, again, when we have a maintainer that steps down, we have our kind of internal process for, okay, we need to find people to own this extension, but we don't really have an external way to say, we really need someone to own tracing, or to own stats, or to own whatever. So yeah, that is an area we could look into. Yeah, we have that tagging system; it's good, but I bet it's not curated, and it might be kind of noisy to just look through, and maybe that would be a good thing for us to do while we're on call or something. We've been working with Cilium lately, and with Cilium network policies, any layer-three through layer-four policies get handled by a TPROXY implementation in Cilium that is implemented in eBPF; any of the layer-seven policies are punted to Envoy. I was just wondering, have the Envoy community or the maintainers considered at all using eBPF for any of Envoy's functionality? Obviously with eBPF there's a lot of interest just because of how performant it is, and I know Envoy is kind of an extensible system, so I was just curious if there were any considerations to taking some of those capabilities of Envoy and kind of integrating them at the eBPF level. I could be misremembering, but I thought we used eBPF in our H3 implementation. Yeah, we do. So I think in general we are open to it. As with all things in Envoy, we end up picking up technology as it is needed.
So again, you know, Google's a heavy user of H3; we found that not using eBPF is a performance non-starter, so we added that directly into Envoy. We haven't hit other areas in our production network where eBPF is a huge performance win, but maybe as Akamai gets more involved, they're going to have an area. If there are areas where it would speed up the data plane for you in your environments, I think we'd welcome contributions there. Anything that makes Envoy more performant, we are enthusiastic about; it's just a question of someone caring enough to contribute it upstream. Is that just for H3 support? H3 support, yeah. We use it for routing packets to the correct worker thread based on the connection ID. But the other thing to consider is that Envoy is cross-platform, and so there always has to be a fallback that isn't eBPF for everything important. For the HTTP/3 support, was that just a decision that the maintainers made, from your knowledge, or was there some kind of performance baseline that you did and saw that, wow, with the eBPF approach H3 was more performant than...? Bit of both. Again, having done these deployments internally at Google, before we used eBPF you had to basically do in-userspace packet passing, which sucks and is obviously incredibly non-performant. So we know what the performance gains were, and for our own use of Envoy, we were not willing to have the packet passing be the default instead of eBPF. So again, if there are areas where you, in your own deployment, want to play around and try to write rules and find it to be performant, we would be enthusiastic about improvements to Envoy's performance. With H3 — because H3 is actually one of the very few areas where we don't have cross-company review, because there wasn't anyone else who wanted to do the code reviews, and Google had been the lead deployment — I don't think we published numbers, because it had been so many years since we did anything other than eBPF that we couldn't say, oh, this is an exact percentage improvement. But we would encourage people, when they add new code that hasn't been production battle-tested, to say, yes, this has this improvement; it's worth the review time and the loss of readability. Can I assume just being able to search up an issue or something like that could find the PR or issue for adding that, just to see kind of the details of how that was done? Yeah. All right, thank you. That one may have been glommed in with some other larger PR. Might have been. You have two minutes left, so I'd say we could take one more question if anyone has one. Oh, you were quick. Hey guys, Mark from HashiCorp. So I do have a lot of customers that go through Envoy logs — a lot of rich data in there — and sifting through the logs can, depending on what the component level is, take some time. So I just wanted to know from the maintainers, how do you usually sift through logs? From what perspective do you look at it to try to troubleshoot from that end? Or do you look in the code base? Do you go, like, from trace to debug to info? I almost always try to use metrics instead of logs as long as I possibly can. Okay. But then, yeah, sometimes you have to look at logs. And one more quick question. You know, sometimes the logs are helpful; sometimes you have the specific error, but then sometimes it can be blank.
Like, if it's a certificate or some error, say something denied or something, it doesn't always have a specific reason. Is that an opportunity for someone wanting to contribute? Absolutely. There's a ton of work in trying to surface errors so that you can, in access logs, know the exact reason any single stream failed. We did it for every single L7 reason, we did it for a bunch of various HTTP issues, but we haven't gone through everything. So absolutely, please, please, please, please — better error reporting is the best. Cool. Thank you. Appreciate it. Cool. And I think that's it. Again, I can hang around outside if people just want to say hi or have other follow-up questions. Otherwise, see you, hopefully, at next year's EnvoyCon.
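On that last point about logs that just say something was denied with no reason: the per-stream failure reason is surfaced through the access log command operators %RESPONSE_FLAGS% and %RESPONSE_CODE_DETAILS%. A minimal file access log sketch; the path and format string are just an example, not a recommended format:

```yaml
# Access log fragment (illustrative) for an HTTP connection manager.
access_log:
- name: envoy.access_loggers.file
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
    path: /var/log/envoy/access.log
    log_format:
      text_format_source:
        inline_string: >
          [%START_TIME%] "%REQ(:METHOD)% %REQ(:PATH)%" %RESPONSE_CODE%
          flags=%RESPONSE_FLAGS% details=%RESPONSE_CODE_DETAILS%
          upstream=%UPSTREAM_HOST%
```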