Yeah, thank you for this wonderful opportunity to speak about some of the extensions we have been working on at Ericsson. We are really thankful to the community and its contributions, and that's why we intended to upstream this particular extension, which we developed based on some existing implementations that were already there. This talk is a mixture of two different concepts: one is the implementation of the tap filter with a gRPC tap sink, and the other is hardware-accelerated memory copy, which was done by my colleague Yuzhu from Intel. Just as background on what we do: we are a small group of Envoy developers within Ericsson who modify and extend Envoy HTTP filters to fit the use cases and requirements of a 5G core network function. In this particular endeavor I was assisted by my dear colleague Sven Steinacker, who couldn't be here for some reason, and also by Yuzhu from Intel, who couldn't be here because of some issues. The agenda for today: we will give some background on the main use case this was trying to solve and why tapping in particular, then an overview of the existing tap sinks and why we decided to go for a new gRPC-based tap sink, then some additional customizations we added on top to get statistics, as well as the caching and replay mechanism introduced within that tap filter. Lastly, there is the hardware acceleration for memory copies, which was done by my colleague Yuzhu. Starting with some background on why we launched into this whole endeavor: this is roughly how the 5G core network looks, and you can basically consider it a service mesh where all the interfaces are connected via HTTP/2. The network functions that my colleagues and I work on are the ones marked in green, which are called the Service Communication Proxy (SCP) and the Security Edge Protection Proxy (SEPP). Both of them use Envoy internally for the specific routing, observability, and other network traffic requirements that we have to fulfill within a 5G core network. So let's start with the question we got: why trace traffic at all? There are already some good existing mechanisms to observe what is happening within a network, for example metrics, but metrics can only show limited information about a certain failure situation, as labels cannot just go on forever, and they also cannot include complete header, body, or query information. This particular diagram shows a pain point that we have in our network, which is that we have a multi-vendor ecosystem, and each vendor comes up with its own implementation of a network function. You can assume that the orange one is made by one particular vendor, the blue one by a different one, the green one by a third vendor, and the gray one by a fourth. Let's say there was some sort of fault in the orange vendor's node and it sent a wrong subscriber request to the target upstream. You want to somehow isolate that this orange node was the cause of the problem, and you start looking at the metrics, but you cannot figure it out, because the metrics only show that you get a 5xx from the PCF; they don't give you any more information than that.
Another thing you could do is raise the log levels, but another beautiful thing about 5G core networks in live operation is that you are pretty much never allowed by the operator to raise the log levels. So there are very few means by which we can get any meaningful information from logs. Two other methods we could have used are tcpdump or some eBPF-based tracing. tcpdump on the interface could work, but that would involve elevating privileges, a big no-no, and eBPF has the same problem of elevated privileges plus other security issues. All of that makes the Envoy that sits in this blue SCP a prime candidate for observing the traffic.

Going through the decision flow we had to make on how the architecture should look: Envoy has both ingress and egress custom filters, and we have our own specific set of HTTP filters that we use for supporting all of our use cases. One of the easiest ways to expose this HTTP/2 information is by simply logging. However, if there are too many logs, then the blue processing unit you see, which is consuming the traffic information, would need to process multiple log lines, and that would add additional stress to that particular process. So eventually we decided to start using the tap filters that were already there, experiment with them, and try a few things out, and we learned a few things.

There are two basic methods by which you can get a tap with Envoy. One is TCP tapping and the other is HTTP tapping. TCP tapping creates traffic trace representations on a TCP event, that is, on a connection open or close or on a connection read or write. HTTP tapping, on the other hand, creates traffic trace representations on an HTTP event, which is a header, data, or trailer encoding or decoding. These basically operate as callbacks, more or less, and the main function of tapping is to just send out these internal representations whenever the callbacks are invoked.

The way I like to think about filter chains in Envoy is to look at them like pancakes: each type of filter functionality is a different pancake, and I like to layer them in a particular sequence. On the ingress side I layer them as TLS, then a buffer, then the tapping, and then all of our custom HTTP filter chains sit in the middle. On the egress side we again have our tap transport socket, the buffer transport socket, and the TLS transport socket.

Now let's look at adding more functionality, which is to only trace selectively on particular traffic paths. For that we had to extend some parts of the tap config and input classes in Envoy so as to provide, on the listener side, some filtering based on source IP or port. On the cluster side we used existing machinery, namely the transport socket matches, which have been very handy to selectively tap just the endpoints that we need. The way Envoy works is that it has a representation called an endpoint for each upstream host you want to connect to, and each of those endpoints can be selectively matched to a type of transport socket based on some metadata that you attach to the endpoint.

For example, you have three hosts H1, H2, and H3 that you configure in Envoy with certain metadata called tap, which you set to true for H1 and H3 and to false for H2, and then you set up two transport sockets: one of them enables the tap transport socket when the tap metadata is present, and the other doesn't enable a tap at all, it just has a raw buffer socket. In this way you can selectively trace only the requests that go to H1 and H3 and not to H2. This has been critical for us, because the networks we work with have 400 to 500 upstreams and an equal number of downstreams, all of them processing 5G core network signalling requests at several million requests per second across multiple Envoy nodes, so we don't want to overload our system with needless tracing; selectivity has been critical.
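As a rough illustration of the selection idea described above, here is a toy sketch; the types and the selectSocket() function are made up for illustration and are not Envoy's actual API or configuration, but the logic mirrors how transport socket matches pick a socket per endpoint based on its metadata.

```cpp
// Toy sketch of metadata-driven transport socket selection (hypothetical types,
// not the actual Envoy API). Endpoints marked tap=true get the tapping
// transport socket; the rest keep a plain raw-buffer socket.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct TransportSocket {
  std::string name;  // e.g. "envoy.transport_sockets.tap" or "raw_buffer"
};

struct Endpoint {
  std::string host;
  std::map<std::string, bool> metadata;  // e.g. {"tap": true}
};

// Return the tapping socket only when the endpoint's metadata sets tap=true.
const TransportSocket& selectSocket(const Endpoint& ep, const TransportSocket& tap,
                                    const TransportSocket& raw) {
  auto it = ep.metadata.find("tap");
  return (it != ep.metadata.end() && it->second) ? tap : raw;
}

int main() {
  const TransportSocket tap{"envoy.transport_sockets.tap"};
  const TransportSocket raw{"raw_buffer"};
  const std::vector<Endpoint> hosts = {{"H1", {{"tap", true}}},
                                       {"H2", {{"tap", false}}},
                                       {"H3", {{"tap", true}}}};
  for (const auto& h : hosts) {
    // H1 and H3 are traced, H2 is not.
    std::cout << h.host << " -> " << selectSocket(h, tap, raw).name << "\n";
  }
  return 0;
}
```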
Then finally the question is which tap sink to use. Envoy already offers two tap sinks: one is file-based and the other is admin-based. File-based tap sinks are unacceptable for critical use cases with security and privacy concerns, because they dump all the information about the request to a file, which would then put additional requirements on storage and so on, so we didn't want to go in that direction. The admin sink seemed like the prime candidate, but on closer inspection we found it had a bit of an issue: a bottleneck on the main thread. Envoy has this siloed threading approach, which is quite fantastic when you want to write custom filters; they are quite easy to write and quite easy to understand with regards to connection and request lifetimes. The problem with the tap filter, however, is that the admin sink basically lives on the main thread, because that's where the admin interface largely lives, so the worker threads have to send their traces back to the admin interface, and that creates a bottleneck on the main thread, which you can see in the red curve. This was with an example static configuration: without tapping the main thread is hardly active, it is just the stats flushing that happens, whereas with admin-interface tapping the CPU utilization of the main thread spikes a lot. We decided to avoid that, mainly because we have a custom control plane and we never wanted our xDS procedures to be blocked or our stats to be interrupted. So we decided to go with a new tap sink, with the main objective of interfering with the main thread as little as possible while streaming the traces, and we decided to build on the interfaces for a gRPC tap sink. We took inspiration from an earlier discussion around this topic on GitHub, made a prototype of it with a custom traffic sink, initially with multiple worker threads, and then fine-tuned it, among a lot of other things.
The fundamental aspect is that the tapping information comes in two parts. One is the connection-related information, which is basically source and destination IP and port plus a connection reference ID associated with it. Then there is the actual trace information, which is just the base64-encoded header and body information present in your HTTP/2 request, and each trace frame carries a reference to the connection it belongs to: each piece of connection information has a connection ID, and that connection ID is repeated in every trace frame. You can then do post-processing on it, pretty much live, and that is the cool part: turn it into a more suitable representation, for example pcap or pcapng, and hand it to a post-processing sink or analytics unit, which gives you live bulk tracing information for your entire network. One shortcoming with this thought experiment, however, was the following: say your sink had some sort of disturbance and had to restart. Since the connection information is only sent once, and later on the frames just carry a reference to it, that connection information is essentially lost by the sink on a restart. So we decided to add a custom handshake procedure and a connection replay mechanism, which is an additional cache that we introduced within each worker thread. It is basically a map from a connection reference ID to the full connection information; it is populated on connection init and entries are removed on connection close, both of which are given via very friendly callbacks within the tap config. Based on that, you can detect a termination or disruption of the connection to the sink via a custom handshake sequence or via TCP socket properties attached to the client that initiates the gRPC push, and then you can replay the connection information to the remote sink, so that your connection information is always represented for all your trace information.
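To make the caching and replay idea concrete, here is a minimal sketch of such a per-worker cache; the class and method names are hypothetical and not the actual Ericsson extension code, but they follow the behaviour described above: store connection info on connection init, drop it on close, and replay everything still live when a sink restart is detected, so trace frames that only carry a connection reference ID can always be resolved.

```cpp
// Minimal sketch of a per-worker connection cache with replay (hypothetical names).
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

struct ConnectionInfo {
  std::string source;       // source IP and port
  std::string destination;  // destination IP and port
};

class ConnectionCache {
public:
  // Called from the connection-init tap callback: remember the connection
  // details under the reference ID that later trace frames will carry.
  void onConnectionInit(uint64_t connection_id, ConnectionInfo info) {
    cache_[connection_id] = std::move(info);
  }

  // Called from the connection-close tap callback: the connection is gone,
  // so its information no longer needs to be replayed.
  void onConnectionClose(uint64_t connection_id) { cache_.erase(connection_id); }

  // Called when a sink disturbance or restart is detected (for example via the
  // custom handshake or a broken gRPC stream): push every live connection's
  // information to the sink again before streaming further trace frames.
  void replayTo(const std::function<void(uint64_t, const ConnectionInfo&)>& send) const {
    for (const auto& [id, info] : cache_) {
      send(id, info);
    }
  }

private:
  std::unordered_map<uint64_t, ConnectionInfo> cache_;
};
```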
There are some additional options as well: for example, there is a buffering option, and there is also tapping at the HTTP level. However, the way siloed threading works, an object is created at each particular layer, and since we are dealing with traffic that is mostly low in number of connections but high in number of requests, we didn't want to create too many objects by spawning an HTTP-level tap filter, so we went with the connection-level tap filter instead. Also, live streaming rather than buffering meant that we could see the traffic as it was being processed on the ingress and egress of the Envoy side. Another thing we added was the capability to observe the traces we were producing and generate some statistics for them; we had a similar counterpart on our tracing sink counting the number of trace events and so on, and that is how we ensured on live networks that what we were actually tapping was consistent with what we intended to tap.

Overall we were quite happy with the tracing solution we developed. The overhead introduced by the tapping system was roughly less than 5% of our maximum RPS capacity for one Envoy node, and the added end-to-end latency with tapping enabled on multiple listeners and clusters was on the order of 1 to 2 milliseconds. This has to be taken with big conditions applied, as it depends on the load conditions, the network setup, and so on, but it was quite good for telecom-grade applications. It has been working quite well and has been extremely stable in production networks, with some of the nodes deployed since day one, at least three or four quarters back, and with several hundred thousand to a few million requests per second being traced in live 5G core networks.

For the hardware acceleration part I would like to present a recorded video by my colleague, who will explain the details.

Hi, this is Yuzhu from Intel. I'm glad to share this topic with you at EnvoyCon. Unfortunately, for some reason I can't join you today in Chicago, so I have to share my part by video. Now let me continue our session with hardware acceleration. As you may know, in most cases memory copy is not a problem for Envoy: requests and responses are limited to a few kilobytes in size and won't take much time in the overall processing. But for some special scenarios and requirements, things are different. Take traffic mirroring, a feature that allows users to shadow traffic from one cluster to another. This is a very useful feature that allows feature teams to bring changes to production with as little risk as possible, and Envoy will make a copy of the live request data for the mirror service; with increasing request size, the data copy can be quite substantial. Another example is TLS memory linearization. This is an operation defined in the buffer system that occurs before TLS encryption is applied: linearization copies and recombines multiple small buffers into a large one to reduce the frequency of encryption and the associated overhead. For large responses, linearization also comes with a significant amount of copying, which can be up to 10% of overall processing in our tests. And for traffic tapping, our topic today, memory copy is also a crucial issue that cannot be ignored. When it comes to a large request or response, the tap filter makes a copy of all traffic to generate a trace, which is then saved locally or sent to a remote service for later analysis. This scenario performs more intensive copying compared to the previous scenarios; based on our observation, the
proportion of copying can be up to 20% of the entire processing in some cases, so we believe it can be a scenario suitable for hardware acceleration.

Now let's have a brief introduction to the hardware we're using for acceleration. DSA, short for Data Streaming Accelerator, is a PCIe device integrated in 4th-generation Xeon processors, which, along with the accelerator, already hit the market this year. DSA supports a series of memory operations like memory move, memory copy, dualcast (copying data from one address to two addresses at once), and so on. One thing we should know about the factors that affect memory-operation acceleration is that the copy size is the key value: when the copy size exceeds a certain range, around 100 kilobytes, we can expect a benefit from DSA; otherwise, using the CPU is a better idea. That is also an issue we took into consideration in our acceleration test.

Then let's consider the issue of integration. DSA provides two kinds of libraries that allow us to offload at two levels: DML, a library that works at the application level, or DTO, short for DSA Transparent Offloading, which works at the library level. The advantage of using DML is that we can precisely control every single copy, but the drawback is that you need very good knowledge of how acceleration can help; as we said before, in many cases it is not a good idea to offload small copies. Another downside is that the modification to the code increases complexity and maintenance costs, which may be out of our control. So what we want is a transparent, non-intrusive approach: we hope that a library working at a low level determines whether each copy operation is suitable for acceleration, and if the copy size is above the right threshold it is offloaded, otherwise it is given back to the CPU. That is exactly how DTO works. DTO is preloaded with Envoy via an environment variable and intercepts every memory copy call to glibc; all copies are classified into two categories based on size, small ones going to the CPU and large ones to DSA. In the whole process we don't have to mess with the code or recompile it; all the offloading is transparent to Envoy.

So that's the plan. Next, let's see how we performed our tests and what we got from the acceleration. We designed our tests using TCP tapping with the HTTP/1.1 protocol: 1000 clients connect concurrently, and the files requested by the clients, which form the body of the response from Envoy, range from 64 kilobytes to 1 megabyte in size. We also used direct response, which means Envoy responds directly with a prepared file instead of communicating with an upstream cluster; that significantly increases the proportion of memory copying in the overall processing. We had two groups, one using only the CPU and another using DTO with DSA; the DTO threshold was set to 256 kilobytes, which means it will only offload copies above 256 kilobytes. For the results, as shown in the diagram, the CPU has an advantage initially, and as the copy size increases the two run neck and neck at 256 kilobytes; after that, DSA gets better latency than the CPU, which aligns with our prediction and understanding of DSA, namely that we get more performance benefit from bigger copy sizes. You might have questions about the performance difference below 256 kilobytes: since both are CPU-based operations, why is such a difference observed? I think that may come from the interception overhead; intercepting and determining which path to use takes some CPU resources.
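To make the transparent offload idea above concrete, here is a conceptual sketch of an LD_PRELOAD shim that dispatches memcpy calls by size. This is not Intel's DTO library: dsaSubmitCopy() is a hypothetical stand-in for the real DSA offload call (here it just falls back to the CPU so the sketch stays self-contained), and a production shim would also handle re-entrancy, alignment, and completion handling.

```cpp
// Conceptual LD_PRELOAD shim sketch (not the actual DTO library). Build roughly as:
//   g++ -shared -fPIC -fno-builtin -o libcopy_shim.so copy_shim.cpp -ldl
// and activate without touching or recompiling Envoy:
//   LD_PRELOAD=./libcopy_shim.so envoy -c envoy.yaml
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for RTLD_NEXT
#endif
#include <cstddef>
#include <dlfcn.h>

namespace {
constexpr size_t kOffloadThreshold = 256 * 1024;  // e.g. the 256 KiB threshold used in the test

using MemcpyFn = void* (*)(void*, const void*, size_t);

// Look up the real glibc memcpy once, so small copies still run on the CPU.
MemcpyFn realMemcpy() {
  static MemcpyFn fn = reinterpret_cast<MemcpyFn>(dlsym(RTLD_NEXT, "memcpy"));
  return fn;
}

// Hypothetical stand-in for building and submitting a DSA copy descriptor.
// In this sketch it simply delegates to the CPU copy.
void dsaSubmitCopy(void* dst, const void* src, size_t n) { realMemcpy()(dst, src, n); }
}  // namespace

// Interposed memcpy: every copy is classified by size; large ones are offloaded.
extern "C" void* memcpy(void* dst, const void* src, size_t n) noexcept {
  if (n >= kOffloadThreshold) {
    dsaSubmitCopy(dst, src, n);  // large copy: candidate for hardware offload
  } else {
    realMemcpy()(dst, src, n);   // small copy: cheaper to do on the CPU
  }
  return dst;
}
```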
Finally, let's look at what will happen with hardware memory acceleration and what we can expect from it. First, from a hardware perspective, DSA is set to remain a feature in future generations of Xeon products, so it may become faster and support more offload operations that we can introduce into Envoy for acceleration. Second, for Envoy, on the software level, the community is still exploring the potential of Envoy; many new scenarios and projects such as Envoy Gateway are emerging, so it is reasonable to expect more cases, like CDN, that are suitable for Envoy to use acceleration. That's all from my side, thank you for listening.

Thank you for the kind listening. Regarding the status: we had introduced an RFC a while back to upstream this tap filter, and unfortunately, due to other commitments, I couldn't continue with it at that time. However, we are planning to continue working on it soon and submit the patches sometime in 2024, hopefully. And yeah, that's about it. If there are any questions or comments...

Okay, should we take one? We're kind of up against time; after that you might have to do the rest offline.

Have you considered more traditional tracing? You mentioned tracing quite a lot here, so that's why I'm asking: there are OpenTelemetry tracers and other HTTP tracing plugins in Envoy. Have you considered them, and if so, why couldn't you use them? Why did you need to go down the route of actually tapping the full request?

Yeah, one reason is that the distributed tracing requirements were slightly different from what we had in mind. This particular tracing requirement was to be as low overhead as possible and to consistently give out a representation of the HTTP request in a pcap or pcapng format, which meant we had to do quite a lot of post-processing outside of Envoy, so we needed to get some sort of representation for that, and the OpenTelemetry-based solutions didn't exactly fit. What really fits that kind of use case is the tap sinks, and so we had to use those.

Cool, thank you. If there are any more questions, I guess you'd probably have to catch them offline afterwards. Thank you so much, and thanks for the great talk.