Good afternoon, everyone, and good morning to our US listeners. I know this was supposed to be the last talk before the break, so please bear with me. I'm Alexander Chernikov, and today I'll talk about the routing changes that landed in FreeBSD 13. Let's start with the agenda. I'll cover the motivation behind these changes, then the important implementation details, namely next hops, next hop groups, and the resulting kernel programming interface. Then I'll provide an overview of the new lookup algorithm framework; specifically, I'll describe the algorithms and their performance. Finally, I'll provide details on the overall performance changes and the next steps.

So let's start with the motivation: why are all these changes needed? There are a number of driving factors. First, we finally wanted to have working multipath, especially with wide multipath groups. Second was the need to improve control plane performance under traffic load. Finally, there was a dire need to simplify hacking on the routing subsystem. The primary problem with the routing subsystem was the lack of isolation. There was tight coupling between routing and nearly all other networking parts: struct rtentry, which is tightly coupled with the radix tree, was used in a hundred different places. So doing any change there was extremely hard, as it would require checking and altering the code for all of those consumers. How do we address this? First, design the kernel programming interface (KPI) with the isolation goal in mind. Second, the actual design change comes from the realization that the vast majority of lookup consumers don't require the actual prefix data; they just care about the interface, gateway, route flags, and MTU, which is so-called next-hop-based information. The core idea was to construct the KPI around these next hops.
So let's look at what next hops actually are. They are objects, collections of the information necessary to push packets to the wire: interface, gateway, MTU, and flags. Next hops are immutable and use epoch-based reclamation. Internally, we use an auto-resizable hash table to store them. Before the change, the rtentry held most of this information; now it's just barely a prefix and a pointer to a next hop, nothing more. We use reference counting to track the number of times a next hop is used, both in the control plane and in route caching, which is now next-hop caching. The next hop structure is explicitly split into two parts, public and private. The public part is, well, public, and is returned by the public kernel programming interface. It mostly contains the data path fields: flags, MTU, gateway, interface pointers, and packet counters. Note that there is no per-rtentry packet counter anymore; it has moved to the next hop. The private part is used internally for housekeeping: it has the original routing flags, the reference counter, hash membership, and backlinks to all of the relevant objects. Next hops are better for the data plane, since the public part has a smaller, cache-friendly footprint. They're better for the control plane, as all entities are pre-calculated and the control plane can just operate with pointers to these next hops. Peering routers and routers with a full view will have just hundreds of next hops, not millions. As there is a relatively small number of next hops, it is easy to iterate over them and store additional information, and the memory implications are much smaller. Below you can see an example: netstat has been extended to dump all of the next hops for a particular routing table. The interface address here is the source address that will be used.
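To make the public/private split concrete, here is a minimal C sketch. All struct and field names here are invented for illustration and are not the real kernel definitions; the point is only that the lookup KPI hands out a small, cache-friendly public view, while reference counting and hash membership live in a private part that consumers never see.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative public part: what the lookup KPI returns to the data path. */
struct nhop_public {
    uint16_t flags;    /* route flags relevant to forwarding */
    uint16_t mtu;
    uint32_t ifindex;  /* transmit interface */
    /* gateway address, packet counters, ... */
};

/* Illustrative private part: control-plane housekeeping, never exposed. */
struct nhop_private {
    uint32_t refcnt;   /* uses from rtentries and next-hop caches */
    uint32_t hash_idx; /* membership in the next-hop hash table */
};

/* Drop one reference; returns 1 when the next hop can be reclaimed. */
int nhop_unref(struct nhop_private *priv) {
    return --priv->refcnt == 0;
}
```

The reclamation itself would be deferred via epochs, as described above; the counter only decides when an object becomes eligible.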
The gateway shows the interface itself for directly reachable routes. The flags are pretty much the same as they were before. Type and prepend are fields that will be used in future changes and are not used right now.

Let's talk about next hop groups. Next hop groups are used to store multipath route data, so we can point to a single object that has all of the paths. These groups are indeed just groups of next hops and their accompanying weights. The weight here defines the proportion of traffic distributed toward each next hop. Groups are internal to the routing subsystem and are not exposed externally. They are fully immutable and also use epoch-based reclamation. Similar to next hops, they are stored in an automatically resized hash table, and each next hop group has its own index, so it can be referenced either by pointer or by index. The concept of a next hop group is really simple, but we need an efficient data plane implementation for it. Of the multiple approaches typically used, FreeBSD currently just uses a linear array of next hop pointers with dynamic size. For example, in the scenario you can see on the right, there are two next hops with weights 200 and 300, so we need to balance the traffic two to three, and we do this by creating a group of size five, which fits perfectly here. There are corner cases where it's way less straightforward, like weight 1 and weight 100. In this case, the logic will simply use an array of maximum size, which is currently 64, and pack the next hops according to their weights. How do you use it from userland? For manually added routes, you can just use the existing route binary functionality with the weight keyword. Again, next hop groups are not exposed to consumers directly: the KPI selects a next hop from the next hop group based on the provided flow ID.
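The sizing rule for that linear array can be sketched as follows. This is an illustrative reconstruction for the two-path case, not the kernel's actual packing code: reduce the weight ratio, and fall back to the maximum array size (64 slots, per the talk) when the reduced ratio doesn't fit.

```c
#include <assert.h>

/* Hypothetical cap on the linear next-hop array, per the talk. */
#define GROUP_MAX_SLOTS 64

/* Greatest common divisor, used to reduce the weight ratio. */
static unsigned gcd(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of array slots needed to represent two weighted paths. */
unsigned group_array_size(unsigned w1, unsigned w2) {
    unsigned g = gcd(w1, w2);
    unsigned total = w1 / g + w2 / g;
    return total <= GROUP_MAX_SLOTS ? total : GROUP_MAX_SLOTS;
}
```

With weights 200 and 300 the reduced ratio is 2:3, giving the five-slot array from the slide; with weights 1 and 100 the exact ratio (101 slots) exceeds the cap, so the array is clamped to 64 and the next hops are packed approximately.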
The actual flow ID semantics are agnostic: it can be a hardware-originated value read from the NIC RX ring, an IPv6 flow ID, a software RSS-calculated value, or literally anything else. The data plane part of a next hop group has the flags field at the same offset as the next hop. Each next hop group has a single flag set, namely NHF_MULTIPATH. When the lookup result is returned, it is internally checked for the presence of this flag; if set, the result is cast to a next hop group, and the actual next hop is selected by taking the flow ID modulo the size of the next hop group. For the data plane programming interface, instead of family-agnostic functions as before, we use per-family functions. They don't require sockaddrs, so there is no requirement to construct sockaddrs in the caller. As stated previously, they transparently handle multipath, always returning a specific next hop. So it's just a function from address to next hop pointer. To recap: instead of always returning a routing entry, which was a combination of prefix, next hop data, and some private housekeeping information, we now return just the public version of the next-hop-specific data and do not expose any internal implementation details: no next hop groups, no private next hop data. For the control plane interface, we still use a single function, and as before it uses the same struct rt_addrinfo to pass all the data in. The primary difference here is that entering an epoch is required when calling the function. Next hop and next hop group creation and updates all happen within the routing subsystem; no external caller is required to construct next hops or next hop groups, it is done transparently. Previously it was stated that the vast majority of callers don't require knowing the prefix of a matched entry. That's indeed the case; however, we still need to address the remaining callers.
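The flag-check-and-modulo selection described above can be sketched in a few lines of C. The structure layouts, the flag value, and the function name here are simplified stand-ins, not the real kernel KPI; what matters is that the group shares the flags-field offset with a plain next hop, so one flag test decides how to interpret the lookup result.

```c
#include <assert.h>
#include <stdint.h>

#define NHF_MULTIPATH 0x0008  /* hypothetical flag value */

struct nhop {
    uint16_t flags;
    uint32_t ifindex;         /* transmit interface */
};

struct nhgrp {
    uint16_t     flags;       /* same offset as in struct nhop */
    uint32_t     nh_count;    /* size of the linear weighted array */
    struct nhop *nhops[8];    /* weighted next-hop array (sketch size) */
};

/* Resolve a lookup result to a concrete next hop using the flow ID. */
struct nhop *select_nhop(void *lookup_result, uint32_t flowid) {
    struct nhop *nh = lookup_result;
    if (nh->flags & NHF_MULTIPATH) {
        struct nhgrp *grp = lookup_result;
        return grp->nhops[flowid % grp->nh_count];
    }
    return nh;
}
```

Because the modulo is taken over the weighted array, paths with larger weights occupy more slots and therefore attract proportionally more flows.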
For example, route get and NetFlow are the ones that explicitly require knowing the prefix. There are special lookup functions for those, returning the rtentry together with the next hop and weight data as of the time of the lookup. As there is no direct rtentry access anymore, there are special accessor functions used to get the attributes out. There are not a lot of attributes remaining there, other than the prefix and the address family, if that simplifies the consumer code.

What about the locking model? Again, next hops and next hop groups are mostly immutable: a change in a route path attribute or multipath group forces creation of a new next hop or next hop group. With that approach, no per-object locks are necessary, and both next hops and next hop groups share the same read-write lock, which is separately instantiated per routing table. The data plane now uses next hop refcounts for route caching, which again is now next-hop caching, and the rtentry doesn't hold any reference counts or locks anymore.

For the userland part, the changes are mostly transparent, so all binaries such as the route binary and the routing daemons should work on stable/13 without any changes or recompilation. Multipath works with Quagga just out of the box; additional work is required to support multipath in bird. Due to the way bird reads the routing table, it may be implemented later as either netlink support or a specific extension to the routing socket that would provide the additional details allowing bird to keep its simple implementation of reading the routing table.

Next hops decouple the routing specifics from the longest prefix match lookup algorithms. Now an algorithm just needs to be a pure function that takes an IP address on the input and provides a pointer or index on the output. This is actually a requirement for the high-performance lookup algorithms,
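The immutability rule in the locking model above can be illustrated with a small copy-on-modify sketch. The names and layout are invented for illustration, not the kernel KPI: changing an attribute allocates a fresh next hop instead of mutating the old one, which is exactly what lets lockless readers keep using the old object safely until epoch reclamation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct nhop {
    uint32_t mtu;
    uint32_t ifindex;
};

/* Return a newly allocated next hop with an updated MTU;
 * the original object is never modified. */
struct nhop *nhop_set_mtu(const struct nhop *old, uint32_t mtu) {
    struct nhop *nh = malloc(sizeof(*nh));
    if (nh == NULL)
        return NULL;
    memcpy(nh, old, sizeof(*nh));  /* copy all existing attributes */
    nh->mtu = mtu;                 /* apply the single change */
    return nh;
}
```

In the real subsystem the new object would then be interned in the next-hop hash table and swapped in under the per-table rwlock, while readers in the epoch section continue with the old pointer.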
as it's not always possible to store more than 32 bits worth of data as their result.

Let's take a look at the lookup algorithm framework. This is a new construct in the kernel and it serves multiple goals. The first is to optimize performance for particular use cases. For example, IPv6 lookups are different from IPv4 ones: the addresses are longer and the distribution is more sparse, yet currently we still have a single algorithm used for both. Similarly, full-view lookups are pretty different from lookups on a device with just ten routes. Each such scenario requires a special algorithm, or at least an algorithm well suited to it. That's why multiple algorithms are needed, and that's why the lookup framework was created. Second, lockless lookups: this further reduces data plane contention and provides better control plane performance during convergence, as a routing daemon doesn't compete with the data plane for the routing table lock. I guess people who run routers on FreeBSD with a full view have always seen really slow convergence when one of the BGP sessions goes down: it is even hard to read the routing table from the kernel, and the programming rate is really slow, because the lock is always contested by the data plane. Third, it's the foundation for other address families. For example, in MPLS you don't need longest prefix match at all; you can just use an index table or a hash table. And finally, it overall lowers the bar for implementing and testing new algorithms, as you can do it in a fairly straightforward way: around 200 lines of code inside a kernel module should be enough for an algorithm that fits.

Let's take a look at the features of the lookup framework.
Algorithms can be loaded on the fly, regardless of the current routing table state. They can be loaded, unloaded, and selected manually, but by default automatic algorithm selection is used, based on the number of routes and next hops currently present in the table; this automatic selection is run periodically. Again, there are no data plane locks: for most of the algorithms the lookup is lockless, and the control plane is fully decoupled from the data plane. We have a separate control plane state inside the RIB, the system radix, and a separate data plane state, the active lookup algorithm, which allows more efficient lookups. Update procedures in some algorithms are pretty costly to apply, so there is a desire to have some batching, and the lookup framework provides this. Internally the framework has a number of subsystems. Reliable subscription to routing changes has been implemented. The framework handles every failure by just spinning up a new algorithm instance, which adds simplicity and avoids handling too many corner cases, simplifying the overall algorithm implementation. It is able to keep multiple instances of the same or different algorithms for a table in sync, allowing a gradual switch between the old and new algorithms. And again, it provides delayed and batched updates.

Let's talk about automatic algorithm selection. Each algorithm has to implement a specific preference callback which, based on the number of routes and next hops, returns a preference value. The framework goes through all of the active algorithms and compares the values to select the best one. It does a periodic evaluation of all algorithms, every 30 seconds or every 100 route changes, and to avoid flipping between old and new, the new one has to be at least five percent better to trigger a switch.
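The 5% hysteresis in the selection policy can be sketched like this. The function names and the integer scaling are illustrative, not the kernel's code; the point is that a challenger algorithm only displaces the active one when its preference value clears the threshold, which prevents flip-flopping between two near-equal algorithms.

```c
#include <assert.h>

/* Returns 1 if the challenger's preference justifies a switch
 * (integer form of: challenger >= active * 1.05). */
int should_switch(unsigned active_pref, unsigned challenger_pref) {
    return challenger_pref * 100 >= active_pref * 105;
}

/* Evaluate n candidate preference values against the active one,
 * honoring the hysteresis; returns the winning preference value. */
unsigned pick_best_pref(unsigned active_pref, const unsigned *prefs, int n) {
    unsigned best = active_pref;
    for (int i = 0; i < n; i++)
        if (should_switch(best, prefs[i]))
            best = prefs[i];
    return best;
}
```

An algorithm scoring 104 against an active 100 stays out; one scoring 105 or more gets in.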
As stated before, some of the algorithms are really costly with regard to incremental updates, so batching had to be implemented. There are conflicting requirements: we certainly want to batch more to amortize the cost of updates, but on the other hand we need to minimize the update delay, so we need to find the sweet spot. How do we do this? First of all, for directly connected routes and for static routes we don't do batching at all; we force an immediate change in the algorithm. For all the rest, we bucket updates into 50-millisecond chunks and commit the update if it's below the threshold, which is currently 500 routes. If it's larger than that, it means we're in some convergence situation, so we delay and wait for the next bucket's status. If in the following 50 milliseconds we're still above those 500 routes, we delay more; the maximum delay is 1000 milliseconds, and each value is configurable.

For control plane performance, again, the biggest thing here is that the data path does not contest the routing table lock anymore, and this greatly improves convergence time. In the test cases I've been running, with one of the BGP full views going down under heavy data plane load, the time required to complete the update, to program all of the updated routes, went down from five minutes to basically 40 seconds.

Let's talk about the actual lookup algorithms. There are seven algorithms currently available in FreeBSD. The first two are ports of the DPDK rte_lpm library, for IPv4 and IPv6. Another one is DXR, a high-performance algorithm contributed by Marko Zec; it's IPv4-only and well suited for both large and small numbers of routes. Then the binary search one, which is currently the default and is useful for systems with a small number of routes. And finally there is a lockless version of the system radix, which can just work as a fallback if nothing
else works. The relative performance can be seen in the table on the right. Again, this is single-thread performance, and since all of the lookups are lockless, it should scale linearly on the same NUMA node.

Let's take a look at each algorithm in more detail. The DPDK ones are wrappers for the DPDK rte_lpm and rte_lpm6 libraries. For IPv4 it's a two-stage lookup scheme, which is a variation of DIR-24-8; for IPv6 it's actually a chain of tables. Worth noting is that DPDK is able to provide immediate updates, as in most cases an update is just updating a single pointer inside its data structures. The second thing to note is that for IPv6, DPDK only handles global unicast, so for link-local traffic the algorithm falls back to the system radix.

Then DXR: originally it was targeted at large-scale FIBs, but it works really well for small-scale FIBs too; as you can see in the previous picture, it is actually the fastest currently available for both small and large FIB scenarios. The implementation has been contributed by Marko Zec. It is not able to provide immediate updates, so it uses update batching.

The binary search one is really pretty simple: it's just an array of sorted IP addresses. It's simple and really cache-effective for small route scale. The array is immutable and is rebuilt on every route change. The thing to note here is that it is actually scheduled for rebuild on every route change, rather than rebuilt immediately. Generally, the framework works the following way: if an algorithm requests a rebuild, it is not rebuilt immediately but is scheduled to be rebuilt within 50 milliseconds. So if there is a spike of incoming updates, the updates will still be amortized within this 50-millisecond chunk. Even though the data structure is immutable and requires a rebuild on every change, it still doesn't perform that badly, even for a full view.

Then the radix lockless: it's effectively the same variation of the system default
trie, just built on a contiguous memory chunk, with some optimizations on the layout of the lookup keys. It is also rebuilt on every route change, but as stated on the previous slide, the framework provides some amortization, so in case of a spike it doesn't perform that badly. Control plane wise, it's good for anything up to maybe 1000 routes.

Then let's take a look at the actual performance. Despite the fact that the micro-benchmarks do show great improvements in the routing lookup itself, in overall forwarding only about 10 to 15 percent of CPU cycles are spent in forwarding, and maybe up to 30 percent of that in the lookup; the rest of the cycles are spent in the network drivers. So the actual performance differences are not as impressive as in the previous graph. When we look at full-view lookups, it's about a 21 percent performance increase for IPv4 and about a 30 percent increase for IPv6. That's the performance test done by Olivier on the original version, with DXR for the IPv4 and DPDK lpm6 for the IPv6 use case.

So what are the next steps? First of all, to support multipath in bird, we need to add functionality to create next hops and next hop groups via the routing socket, so that routing daemons such as bird or Quagga can refer to a next hop by index and not require complex parsing of the multipath routes. Internally, both bird and Quagga already use next hops, so it would be a native interface between userland and the kernel, where routing daemons and the kernel don't exchange the full version of the next hops, providing the full detail with each route update, but provide lightweight updates instead. For example, if we change a multipath route, it would be sufficient just to provide the new multipath
group index to the routing daemon, without the need to list the entire next hop group or the routes with next hop groups over the routing socket. Similarly, netlink brings even more functionality by having native support for these next hops and next hop groups; a couple of years ago similar functionality was added to the Linux kernel, and such support has been added to Quagga. And those are the primary changes I wanted to talk about.

Okay, with the five minutes remaining, let me try to read the questions and provide some answers.

Which routing daemons have been tested with the new routing API? Mainly Quagga, FRR, and bird, though the userland-facing routing socket interface itself hasn't changed, so the old versions still work.

The rtentry next hop field, what is it used for? It's just a pointer to the actual next hop.

Will we lose routes with custom MTU from pre-13 FreeBSD? No, that remains as is. Despite the fact that we have moved the MTU into the next hop underneath, it will still work: internally the routing subsystem will just create a new next hop with the new MTU.

For MPLS, I don't have any estimate, but hopefully yes.

Can a next hop contain interface-specific parameters? There is no explicit framework to support this right now, but ideally yes; the design idea was that tunneling interfaces should be able to create their custom next hops with additional data.

Can Quagga be used with the 13.0 release? Well, sadly there were a number of issues with routing-socket-related changes in 13.0, and indeed that largely breaks the interface. Quagga should work with stable/13, but given the way the community has really moved to FRR, my suggestion would be to just switch to FRR; I haven't seen a lot of updates to Quagga
nowadays.

How long does it take FreeBSD 13 to load a full table? I think it was less than 30 seconds for IPv4 the last time I checked, but I don't measure this on a regular basis.

Are anycast routes allowed now? You can have multiple /32s pointing to different interfaces, but for local versus remote routes something different kicks in: it's like an administrative distance, similar to what we see in real routers, where the local route always wins. So it's not possible to have multiple such routes; currently, if you attach a local IPv4 route, it will just always win and kick out the remote one for forwarding.

Have I measured forwarding latencies? No, I haven't.

On porting netlink: yes, there was a GSoC which delivered the basic netlink functionality, and this is potentially something that will be merged.

Okay, I'm out of time. Thank you all for listening, thank you for the questions, thank you.