Oh, okay. Cool. Looks like we don't have too much agenda. Should we just talk about CI now, Chris, since you're here? Yeah, I gotta leave soon, got 10, 15 minutes. Could you just give us an update of what happened with Circle and then we can... Yeah, we just had a negotiation to bump the container limit from 14 to 20, I believe. And we don't really have insight into whether that's enough for the Envoy project or not. So for me, I'm really looking for the Envoy maintainers to tell me, we'd be happy with N containers. Yeah. So that's kind of... I think in parallel to this discussion is Bazel remote caching, which would give us a huge win, I think, or even just getting CircleCI to cache build objects so it can actually cache and restore. It just needs someone to go in and investigate, and we'd probably save, I don't know, 80% of build time if we did that. How does that work, Harvey? Does it cache the objects? It works like the local tool, I forget the name, but the one that looks at the hash of the file. What's that, like ccache? Yeah, does it work like that? Yeah, it's content-addressable storage. Internally, that's kind of how Bazel represents its own cache, although you'll be forgiven for thinking otherwise, given how it deals with changes to build flags. But yes, in principle, that's exactly how it works. It takes the command line and the hashes of all of an action's inputs and builds a hash out of that, basically. So concretely, would we just rent a cloud instance with an SSD, run the Bazel cache on it, and point at it or something? You don't even need to do that. You can use GCS. Bazel has explicit first-class support now for GCS.
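The content-addressable scheme described above, a hash over the command line plus the hashes of all inputs, can be sketched in a few lines. This is only an illustration of the idea, not Bazel's actual key format:

```python
import hashlib

def action_key(command_line, input_digests):
    """Sketch of a content-addressable action key: hash the command line
    plus the digest of every input file. Any change to a flag or an input
    produces a different key, so the cache entry is simply missed."""
    h = hashlib.sha256()
    h.update("\0".join(command_line).encode())
    # Sort so the key is independent of dict iteration order.
    for name, digest in sorted(input_digests.items()):
        h.update(name.encode())
        h.update(digest.encode())
    return h.hexdigest()
```

A cache keyed this way never needs invalidation: changing `-O2` to `-O0`, or editing a source file, just produces a new key.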
You give it the credentials for a bucket on GCS. Really? Yeah. And it will speak to it directly. Okay, that sounds amazing. So can we... Yeah, Lizan said he's actually had some experience doing his personal builds this way, and it's pretty good. We have an internal meeting, well, not internal, John will be there as well, with the folks interested in Bazel, comparing that with the actual remote build execution and caching, which is going into alpha from Google. And we could potentially use that, but I think the performance of using GCS would still probably be way better than what we have today. So we could consider at least trying that out. We just need someone to own it; right now I probably don't have the cycles for that. Right, right. But we could even theoretically set that up so that individual devs could also use it, right? Like if you're internet connected, or if you're not in the cloud, would it be too slow? Well, yeah, you have those considerations. And also, we'd currently share a bucket with everyone else, because we have to share those credentials, and that obviously has billing implications and trust implications. So everyone's gonna have to maintain their own GCS bucket for that. Got it. Yeah, it does seem like we need someone, whether one of the current maintainers or some volunteer, to step up and help own the general CI problem. It seems like we would even benefit from some basic graphs or metrics: how many lines of code are we building over time? How long is CI taking? Because I suspect that we've increased the number of lines of code by a ton, and there may have been a couple of regressions here and there around build performance, but I think the combination of them is essentially making the build probably 2x or 2.5x slower than when we first switched to Circle.
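For reference, pointing Bazel at a GCS bucket is a couple of flags in a `.bazelrc`; the bucket name and credentials path here are placeholders, not real values:

```
# .bazelrc sketch -- bucket and key path are hypothetical
build --remote_cache=https://storage.googleapis.com/envoy-bazel-cache
build --google_credentials=/etc/bazel/gcs-service-account.json
```

With this in place, every developer or CI worker with the credentials reads from and writes to the same shared object cache.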
I mean, one of the problems you'll find, and I think Greg did some work on this, or Steven, I forget which, is that there's a lot of noise in those numbers. Yeah, but I feel like if we had some basic graphs or data, there would be some trends. If you were to look at it monthly, I'm pretty sure you would see that lines of code is X and build time is Y, and you might be able to extrapolate. So it just feels like something we need longer term, unless we can get on a more powerful CI system. I mean, my other longer-term concern, or maybe it's a now concern, is that Circle's instance sizes basically max out at what we have, and those are effectively eight cores and, I think, eight or 16 gigs of RAM. And obviously, just the way that we do the builds, we could saturate a 32-core machine easily. My concern is that in the future, as the lines of code go up, so there's caching for sure, but beyond that I think we have two options. Option one is we keep getting beefier boxes to do individual builds, or we actually need to think about splitting the build somehow so that it's more parallel. And I don't know exactly how that would work, but we'd have to split it into pieces. I mean, I think what Google's doing right now with Bazel is trying to externalize much of what makes a monorepo work very well, and that is not having to decompose the build in this way. So one thing that's coming into alpha right now is the remote build farm, which will essentially be something that we could supply, probably, to the Envoy project. And we'd be able to use remote build workers running off in a cloud environment outside of CircleCI to do the work, and they hand objects and things like that back to Bazel.
I see, so we would still be able to use the entire Circle workflow and all of that, but the build would call out to a remote build farm, which would actually do the build and do the caching and all of those things. That's my understanding. That meeting with those folks is still pending, but I feel that would be a better way to scale the build than needing to burn human cycles on this. No, I mean, that sounds amazing. My only concern is, last time we did this, I think we talked to them and there were all these security concerns where we had to opt people in to the builds, and it sounded really horrible. So I just didn't know if that's gonna be fixed or not. I think, yeah. Well, maybe we can loop you in on that meeting, but I think it's a different model from the way we were going last time. So we'll see how that goes. Okay. Well, maybe then, I feel like we could try 20 containers, but I'm wondering, could we convince Lizan to maybe look into setting up the simple GCS build caching? That sounds great. I think if he had the CircleCI credentials necessary to go in and poke at it, he could probably do that. I can work with him directly to help him set up whatever he needs; that part is super easy. So he's not on the call, but we can talk to him. If he's willing to do that and he already kind of knows what to do, that sounds really great. Because I also tend to agree that in most of our builds, the actual object files don't change, so I'm guessing caching would be very effective. Well, specifically because every time someone submits a PR or a small change, they're rebuilding, and there's just a huge amount of redundancy across those builds; it's really just nearly identical builds over and over. I actually have one other question that's somewhat related, and you might not be able to answer this on a public call.
But do you know, within Google, do you use whole-program linking, or do you build and release like we do currently with Envoy? I'm not familiar enough, actually; maybe Alyssa can talk about it. Like, you're talking about... I'm talking about -flto, where you basically... Okay, yeah. Oh, both are used, I can say. Yeah, both are used. No, no, right. So the reason that I bring that up is I'm curious whether that would change build performance at all, because we'd be generating intermediate code only for the individual compiles and then optimizing and linking basically at the final link time. I guess I'm curious if, for example, even in the release build, if we were compiling with -flto, for the tests you could potentially compile without optimizations, but for the binary you could link with full optimization or something like that. And I'm just wondering if that would take less time. I'm not sure, but I think the kinds of optimizations are different. When you're compiling individual compilation units, a lot of that is intra-procedural optimization, and then at link time it's the inter-procedural stuff that's going on, right? Yeah, I don't know, but that's something that occurred to me; it just seems like we should turn on -flto, and we don't build with that currently, but that's a separate thing. So, okay. All right, well, I guess the next steps here are to see how we do with the 20 containers in terms of queuing, then maybe follow up with Lizan to see if he'd be interested in helping us out with the GCS setup, and then there's the call with the Bazel team. Okay, sounds good. Does anyone else have anything to say about that? No. Great. Let's see. Before you leave, Chris, did you wanna talk about EnvoyCon at all? No, I shamelessly sent an email to envoy-dev.
Yeah, I was actually thinking about that; I think you should send that email to envoy-users and to envoy-announce also. Okay, will do. envoy-announce, I think, is still broken; only a couple of people can send email to it, so I'll fix that. No worries, I'm sure more people are on dev and users anyway. That's probably true. Yeah, so maybe just forward it to users. Okay, cool. All right, looks like Ben wanted to discuss, oh, the light hot restart. Are you there, Ben? Yeah, I'm here. Hi everybody, I'm Ben Plotnick from Yelp. We're in development with Envoy; we're a SmartStack/HAProxy shop looking to swap out the HAProxy portion of Synapse with Envoy. Yeah, so I opened this issue talking about the problem of hot restarting across hot restart versions. We've written a few blog posts about how we do this with Synapse and HAProxy. Basically, originally we did SO_REUSEPORT; there are some race conditions potentially, so we have some hacks around it. But it seemed like other people are not running into this problem, which was surprising, I guess, but yeah. I can tell you what we're doing at Lyft, which is not optimal, but it works, and then we can discuss other options. At Lyft, and I'm happy to share this script, though it's pretty obvious, effectively when we go to deploy Envoy, it compares the hot restart version of the currently running binary with the new binary and a couple of other things, and it determines whether it can hot restart or not. If it can hot restart, it does that. If it can't, we have a rolling deploy system, so it goes into a mode where it locks itself using etcd, and then it does a rolling drain: it'll drain for X period of time, restart, and then that roll will go through. For what it's worth, given that, I would say on average lately we probably have an incompatible hot restart version every six to nine months.
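The deploy check described above can be sketched roughly as follows. `--hot-restart-version` is a real Envoy flag that prints the hot restart compatibility version; the function names and decision strings here are illustrative:

```python
import subprocess

def hot_restart_version(binary):
    """Ask an Envoy binary which hot restart protocol version it speaks."""
    out = subprocess.run([binary, "--hot-restart-version"],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def decide(current_version, new_version):
    """Hot restart only when both binaries speak the same protocol;
    otherwise fall back to a lock-and-drain rolling restart."""
    return "hot-restart" if current_version == new_version else "rolling-drain"
```

A deploy script would call `hot_restart_version` on both the running and the new binary and branch on `decide`; the rolling-drain path (take the etcd lock, drain for X seconds, restart) happens outside this snippet.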
It just hasn't mattered that much, and that system works okay, you know? Yeah, frankly, we have a similar system for rolling deploys, and we could definitely do this. Operationally it makes things a little complex, because we can do this hot restart version compare, but we use systemd to run Envoy, and it's a little hard to tell it, okay, now actually call back and do a hot... the coordination of all these moving parts is a little difficult. What I was thinking was to do effectively the same thing and compare the hot restart versions, but then have a lighter hot restart, a mode where it drops the stats but still passes the sockets. Yeah, well, sorry, there are two parts; just coming back to systemd. The one thing I would say is we have similar problems, and the way we get around that is that we effectively run Envoy through the hot restart wrapper. In the hot restart case, systemd, or for us runit, is only aware of that wrapper, so it doesn't really know what's going on under the hood. Only in the case where it's doing that full restart do you do the drain, and then we effectively just say runit restart, or you'd do systemd restart, and it restarts the entire stack. It's probably out of scope, but I'd be curious to know what the problems are with systemd there, because it seems like it would work. The only reason I bring that up is that what we're about to talk about with the light hot restart makes sense, but it's still gonna be non-trivial, there will be bugs, and for something that happens every six to nine months, it's not clear to me that it's a good use of time. But I'm not really opposed to it, right? So that's kind of why I'm asking those questions. Yeah, totally.
It is definitely a lower-priority concern for us, but it's kind of a time bomb sitting out there, and if we don't have the operational scripts or whatever in place, then somebody who's not me is inevitably gonna pull the trigger at some point. Yeah. Yeah, so the problem, I guess, is coordinating the monitoring and the different processes, because during the restart it would mean that the Envoy supervisor process, or wrapper script or whatever, would have to call out to another system to restart, and that seems... Not necessarily, because what we do with this script is we do everything through the local Envoy admin port. We'll basically tell Envoy through the admin port to start draining, then we'll sleep, and then we'll tell Envoy to quit, and then the supervisor process realizes that it quit and just restarts it. Oh, do you drain, actually, do you drain the tasks on that host? No, sorry, we call health check fail on the local Envoy to have Envoy fail incoming health checks so that all of the callers stop sending requests. We have that time-boxed based on our timeouts and various other things. So for internal mesh traffic we do that for, I don't know what the current value is, 30 or 60 seconds or something like that. And then when that time elapses, we basically just kill the Envoy, so we don't even involve runit or systemd. When Envoy exits, it's just like, oh, it's gone, I'll just restart it, and then it's the new version, and it just goes, yeah. Okay, yeah, I see the difference. So for us, we can't necessarily do that, because we have batch processes that make outgoing connections that are not triggered by incoming connections. We have the same problem also.
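The admin-port sequence just described, fail health checks, sleep out the drain window, then quit, can be sketched as below. `/healthcheck/fail` and `/quitquitquit` are real Envoy admin endpoints; the admin address and drain time are placeholders:

```python
import time
import urllib.request

ADMIN = "http://127.0.0.1:9901"  # assumed local Envoy admin address

def drain_and_quit(admin=ADMIN, drain_seconds=60, post=None):
    """Drain the local Envoy via its admin port, then ask it to exit.

    `post` is injectable for testing; by default it issues real POSTs
    to the admin interface.
    """
    if post is None:
        def post(path):
            req = urllib.request.Request(admin + path, data=b"", method="POST")
            urllib.request.urlopen(req)
    post("/healthcheck/fail")   # callers' health checks start failing
    time.sleep(drain_seconds)   # wait for traffic to move off this host
    post("/quitquitquit")       # exit; the supervisor restarts the new binary
```

The supervisor (runit, systemd, or similar) never has to know about any of this: it just sees the process exit and restarts whatever binary is on disk.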
And I think our answer there is basically tough luck, have a sane retry policy for the six-to-nine-month window when this happens, but yes. Yeah, we do have a system to drain a lot. We run on Mesos, and when we wanna upgrade Mesos, we need to drain the tasks on that host, so we do have a system to do this. It takes a really long time to operate over the fleet, and it's kind of prone to failure. So even though this is a rare thing, it's definitely something that could take out the entire fleet if it doesn't work well. And even more importantly, it's more of a sales and marketing thing: we have the system working right now, so we have to be able to sell to the rest of the operational community at Yelp that this works. Yeah. So that's probably the more important factor. Yeah, okay. From the light restart perspective, I think a couple of people had comments on that, including Harvey and Greg, but I guess we can open it up for other people to chat. It makes sense to me to have a simpler version where we don't do any of the shared memory stuff; we can have a simpler protocol. If we're gonna do this one more time, I kind of agree with Harvey that we should switch the protocol over to a proto-based API. And then you could just have a variant which says, we'll initiate socket passing and that's it, you know. You know, I don't think it would actually be that hard to implement based on exactly the code we have now; just don't do the stats part. Yep. And you don't even need to change the protocol, really. That's probably true too, yeah. You could probably just have a command line flag like hot-restart-basic or something, where it only does socket passing and that's it. I still agree with Harvey, though, that if we're gonna say this is the protocol that is stable, we should probably switch it to proto.
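For illustration, a proto-based restart protocol of the kind being floated might look roughly like this; the package, message, and field names are purely hypothetical, not an actual Envoy API:

```
syntax = "proto3";

package envoy.hot_restart;

// One request per message over the Unix domain socket. Versioning and
// extension come for free from proto's wire format: old binaries ignore
// fields they don't know about.
message HotRestartRequest {
  oneof request {
    PassListenSocket pass_listen_socket = 1;  // "basic" mode: sockets only
    DrainListeners drain_listeners = 2;
    Shutdown shutdown = 3;
  }
}

message PassListenSocket {
  string address = 1;  // listener address whose fd should be passed
}
message DrainListeners {}
message Shutdown {}
```

A "basic" hot restart would then just be the subset of the protocol that passes listen sockets, skipping the shared-memory stats handoff entirely.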
I think it would be trivial, and that way we could extend it later and we wouldn't break anything; that would be nice. But, you know, I guess it depends. From Ben's perspective, we had talked in that ticket about supporting something, quote, forever, and that's the part that concerns me. I don't know that I would want to guarantee that anything is forever, but... Right, but it's essentially a network protocol, even though it's going over a Unix domain socket. That's true. It's easy enough to support both the old and the new version. Whereas supporting an old and a new version of shared memory would be terrible; you'd have to convert it in place. But a network protocol, that's easy. I agree. Relatively. Yep. Yeah. So, no objection from me there. And I tend to agree that if you look at the way hot restart works currently, it just goes through this dance of doing shared memory stuff; I feel like you could just tell it to not do parts of that and it would probably just work. Cool. So should I do a Google doc or something for the design? Okay. Yeah. And it doesn't have to be long, one or two pages, just kind of what the goals are and what you would change. Then you can send it to that ticket and we can all comment on it. Cool. Yeah. Yeah. And then in order to get it approved, you have to add your logo to the website. I'm just kidding. We're still in development. We have some dev services going through Envoy right now, but we're hoping to have our first production service this quarter. Awesome. Great. So then we can slap our logo on. All right. Cool. Well, that's exciting. Greg, Harvey, Alyssa, anyone else have anything? Nope. All right. Have a good week, everyone. Bye.