Hey everyone, my name is Alec, and I'm really excited to talk to you today about incrementally building the incremental implementation in Envoy's Go Control Plane upstream repo. I'm a core engineer at Grey Matter, and I've worked there since its inception. At Grey Matter we have large customers; in fact, we operate in production in a large global enterprise. To meet its scale requirements, we found that incremental xDS is a necessary feature for Go Control Plane, so we set about adding it.

This is my first major open source contribution. I've been contributing to Go Control Plane here and there with small PRs that fixed minor issues and that I thought would mature the repo a little more. But incremental, being a whole protocol implementation, is a large feature set, and it's my biggest contribution so far in my career to an open source upstream repo this large.

This is a high-level timeline that lays out the Go Control Plane implementation path while we were adding incremental. In March 2018 the initial snapshot cache of Go Control Plane was released. This was the first tagged revision of Go Control Plane, and it contained, among other things, the simple snapshot cache, which I'm sure many of you are familiar with.
In October of 2019 the incremental protocol was released by the community. This was an upstream change to Envoy itself. The protocol was defined as a spec, but it wasn't implemented anywhere; I believe Envoy only had CDS functioning when that was released. In December of 2019 I swooped in to begin the implementation and write a proposal for implementing incremental inside of Go Control Plane, as I had seen some traction in the Java control plane but there was nothing in Go Control Plane. Then in July 2020 the MUX and linear caches came out. Those were targeted at things like better opaque resource handling and other conveniences to help the state-of-the-world protocol, which is a step in the right direction, but we still believed that the incremental protocol was the right way forward for performance at scale. As of this month, the PR for incremental is open, working, and ready for review.

I have linked the initial incremental xDS implementation plan. This was our upfront planning document, in case anyone wants to read it. I'd like to thank the team at Lyft and Go Control Plane for the feedback they provided and the help they gave me in working through the design, as well as in thinking about edge cases, failure scenarios, and things like that.

The main features here: we really set out to achieve performance at scale, so we wanted to minimize data over the wire. We needed the management server to be a little smarter and do things like state management. And of course we wanted to maintain backwards compatibility, so as not to break the code of users who have pulled in Go Control Plane as an upstream dependency.

The implementation itself consisted of a few things. I had to get my hands in the server and the cache for Go Control Plane, the two main pillars of the code. There were completely different delta discovery request and response objects. Previously, in state-of-the-world,
we used discovery request and response objects. With these new objects, I couldn't reuse a lot of the existing code, as it was specifically targeted at state-of-the-world. That was a valid assumption, because state-of-the-world was the only thing defined as a spec at the time.

Now, with incremental, a little more logic has been offloaded into the management server. The server needs to create a diff and track state so it can intelligently broadcast resource changes to clients as it detects changes within its snapshots. Again, the cache is just a list of snapshots per client, and when things are updated it's the job of the server to understand who has subscribed to these resources, when they should receive changes, and also when clients unsubscribe. So that whole subscription functionality has also been enabled.

I had to come up with a clever versioning system that targets the individual resources themselves. Before, Go Control Plane just used the global version carried in the discovery request and response objects. Delta doesn't really have that anymore; it just has a system version info field that is only for debugging, and that's not a valid way of detecting change at a granular level. Because we needed that granular level of detection, we had to develop an algorithm that would efficiently hash those resources and create accurate versions to compare against a previous state.

The implementation itself was fairly straightforward. The only difficult part was the diffing, and creating a fast way to do it. Again, we're targeting performance at scale, so we don't want to hold back the server with a slow diffing algorithm; we need it to be quick. The map implementation we chose let us keep changes to the existing external API pretty minimal. All you have to do to pick up this change is implement these callbacks, and you're pretty
much good to go. They can be implemented in a similar manner to what you've done with state-of-the-world. With this new implementation you don't have to change the way you set snapshots or create watches. There is a new create delta watch function defined in the snapshot cache interface, but it isn't needed unless you're implementing your own version of the server. If you're using the Go Control Plane implementation that we provide, that's all taken care of for you.

Again, these callbacks are defined simply because we couldn't reuse the pre-existing state-of-the-world discovery request and response objects. We had to come up with something similar and compartmentalized, because you could have scenarios where certain clients are in state-of-the-world mode but others are in delta mode. They're sharing the same resource pool but receiving items differently. With these callbacks you can have your state-of-the-world callbacks as well as your delta callbacks and treat the functionality differently.

I wanted to talk about some challenges in implementing this code and working in the repo. I spent quite a lot of time familiarizing myself with the code base. I had to reverse engineer a lot of the relationships between the cache and the server because, as I said before, I had only been making minor contributions and didn't fully understand what the code was doing. In doing so, I went back and contributed a lot of documentation and some resources for newcomers to read and hopefully better understand the code itself. That way they don't have to share the same pain that I did when implementing this large feature set.

I'm not going to go over this again, but the versioning at the resource level was another challenge, because we had to develop a whole new algorithm just to do that, and again we couldn't reuse much of the pre-existing code because of the fact that
the discovery objects differ.

The last thing I want to talk about is the upstream changes while building incremental. This is a fast-growing repo. It's maturing quickly, and I'm really happy about that, but because I was working in isolation on my own for so long, the code did change quite a lot, and there were a lot of PRs preparing for incremental and things like that. As I was developing, the upstream idea of incremental was also changing, so I had to quickly adapt my code. But it all worked out in the end, and I'm glad with how it turned out.

So here is the PR. Everything's passing, it's working, and it is ready for review. I just want to thank all those who have already reviewed it and provided some feedback. I know it's large, but I really do appreciate your efforts; they're really welcome, and thank you again. Go check the PR out if you're interested. I would love to have y'all's feedback, and feel free to comment or reach out to me specifically if you have any questions on the code.

Here is the integration test running. You'll notice that it has a lot of log statements with the hashed versions. If you want to check this out more, I provide instructions to run it. Feel free to go look at it and let me know what you guys think.

So what's next? I'm currently working on implementing ADS for incremental. All of the xDS services are complete, but ADS does need to be completed. I know there are some more features that I need to build for that to be done, and I'm pretty sure it's going to be the most used implementation of incremental. The MUX and linear cache implementations need to be done. I need to go back and do those, and not just the simple cache, because they came out as I was building this and I didn't have time to implement them as well. I need to think about failure scenarios. I also want to test this and see how it does in production, or at least in a real deployment. I haven't done that
yet. I also want to benchmark this to see how it compares with state-of-the-world and what kind of performance gains we're looking at. I really want to put the protocol through the wringer in this repo.

Again, thank you all for tuning in to my talk. Go check out the PR. I have a list of resources for the talk on my GitHub; feel free to check those out. That should include the slides, the screenshots, and things like that. Thank you again; I appreciate all of you who've helped out.

Oh, and I'd also like to mention that I am in the Envoy Slack. Feel free to message me personally or reach out to me in the xDS or control plane dev channel. I'm usually pretty responsive there, so if you have any questions on the PR or the code itself, feel free to hit me up online. Thank you guys for tuning in.
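
For readers following along in text, the per-resource versioning and diffing described in the talk can be illustrated with a small Go sketch. This is not the code from the PR. The names `versionOf`, `delta`, and `diff`, the choice of SHA-256, and the plain-map layout are all assumptions made for this example. It only shows the general idea: hash each resource to get its version, compare against the versions a client already holds, and report what changed and what was removed.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// versionOf derives a version string from a resource's serialized bytes.
// Hypothetical helper: the talk only says that resources are hashed.
func versionOf(serialized []byte) string {
	sum := sha256.Sum256(serialized)
	return hex.EncodeToString(sum[:])
}

// delta is what a server could send to one client after a snapshot change.
// Hypothetical type, not part of Go Control Plane.
type delta struct {
	Updated []string // resource names that are new or whose content changed
	Removed []string // resource names the client holds that no longer exist
}

// diff compares the versions a client already holds (name -> version) with
// the current snapshot (name -> serialized resource). It returns the changes
// and the version map to remember for the next comparison.
func diff(known map[string]string, snapshot map[string][]byte) (delta, map[string]string) {
	var d delta
	next := make(map[string]string, len(snapshot))
	for name, body := range snapshot {
		v := versionOf(body)
		next[name] = v
		// A missing name yields "", which never equals a hex digest,
		// so new resources are reported as updated too.
		if known[name] != v {
			d.Updated = append(d.Updated, name)
		}
	}
	for name := range known {
		if _, ok := snapshot[name]; !ok {
			d.Removed = append(d.Removed, name)
		}
	}
	// Map iteration order is random; sort for stable output.
	sort.Strings(d.Updated)
	sort.Strings(d.Removed)
	return d, next
}

func main() {
	known := map[string]string{
		"cluster-a": versionOf([]byte("a-v1")),
		"cluster-b": versionOf([]byte("b-v1")),
	}
	snapshot := map[string][]byte{
		"cluster-a": []byte("a-v1"), // unchanged, so not sent again
		"cluster-c": []byte("c-v1"), // new, so sent
	}
	d, _ := diff(known, snapshot)
	fmt.Println(d.Updated, d.Removed) // prints: [cluster-c] [cluster-b]
}
```

Keeping the versions in a map keyed by resource name makes each comparison a single lookup, which fits the talk's point that the diff has to be fast so it does not hold back the server.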