So, we are a small group of Envoy developers within Ericsson who customize Envoy filters for our use cases in 5G telecom applications. I did this project along with my colleague, Svensternagra at Ericsson, and also Yuzhizhu from Intel, who contributed the hardware acceleration part of this talk. For today's agenda, we are going to look at some background on the problem statement, why we had to go for something like a tap filter, an overview of the existing tap sinks in Envoy, why we went for a new gRPC tap sink and some robustness machinery that we added to our tracing mechanism, and finally we will discuss a bit about the hardware acceleration we needed for the memory copies in this tracing machinery.

Just to give you a brief background of what we work on and where this use case fits in: the latest 5G core network looks something like this, with a lot of distributed network functions, each having its own lifecycle and independent policies and management interfaces. They all communicate with each other via HTTP/2 on a service bus, and my colleagues and I work on two of these network functions, the Service Communication Proxy and the Security Edge Protection Proxy. These two proxies sit in the middle of all the traffic that goes through a 5G core network and essentially act as the window into what the traffic within the 5G core domain looks like.

Naturally, in a multi-vendor ecosystem like telecommunications, you can expect multiple of these network functions, each coming from a different vendor. The diagram I show here is basically one of the first problems that we faced, when one of the network functions from one of the vendors was not functioning properly on the ingress side. How do we isolate that fault? How do we troubleshoot it? That was the basic question.

There are some means that a distributed, orchestrated deployment provides you, such as metrics and logs. However, metrics can only show limited information about certain failure situations, such as labels, which cannot be fully comprehensive; they cannot give you the actual header, query or body parameters involved in the failure. And raising the log levels is not an option for capacity-sensitive use cases. The two main alternatives we came across were tcpdump and eBPF. tcpdump basically works by sniffing the packets on the eth0 interface, and if you have encrypted traffic, you won't get any useful information out of it. eBPF, on the other hand, is more interesting because it can extract the TCP-level info from the kernel itself. However, the problem again is the elevated privileges you have to give to the eBPF program to tap that traffic, and TLS again means you cannot make any meaningful assessment of the traffic that goes through your system.

So Envoy is a useful candidate for extracting complete request information and providing it in real time for post-processing or troubleshooting. How would such an architecture look? The easiest way to visualize it is that we have an Envoy sitting in the ingress traffic path, and we provide some form of representation of the traffic that Envoy is currently proxying to an external processing unit.
That external processing unit can then do further post-processing and emit a generic representation like PCAP or PCAP-NG, which customers can use for their analytics or their troubleshooting guidelines. One could go with the easiest approach here, which is to just log whatever traffic you have. Unfortunately, you know the pitfall with this, namely that you create too many logs and end up killing the capacity of the Envoy container. Eventually, we decided to try out the tap filter within Envoy. Another side note: distributed tracing is not the same thing as the tap filter, mainly because distributed tracing traces the traffic flow by attaching metadata headers, which is not how tapping works. Tapping just gives you the raw socket binary information, and hence it is protocol agnostic as well. Although we talk about HTTP/2 here, you can easily extend it to other layer 7 protocols, and we will look at the machinery Envoy provides for that.

Envoy offers quite a few options for configuring this tap filter. It basically offers two main pieces of machinery, you could say: the tap level and the tap sink. The tap level can be TCP or HTTP. At the HTTP level you introduce additional overhead because you are tapping each individual HTTP/2 request or response frame. That overhead can easily be avoided by the simple choice of TCP-level tapping, where you tap the request and response traffic flow on each TCP event, that is, on connection init, on a connection read or write, or on connection close, and use that information for the post-processing. The other part is the tap sink, which is the entity you stream your traffic representations to.

We will first have a brief look at the tap levels and the filter chains that Envoy provides, to see how we can be creative with TCP traffic tapping. The filter chain within Envoy is layered and also customizable; you can reorganize it quite flexibly. Therefore, even if you have TLS, you can still get a cleartext representation of your traffic, which was the major problem tcpdump couldn't solve by simply sniffing on eth0 of the pod. What we do is lay out our filter chain so that the TLS traffic is first decrypted on the ingress path, then we buffer it and submit it to the tap filter, and then we have our own custom HTTP/2 filter chain which does the post-processing on the filters. On the egress path we have the same path as ingress, basically in reverse: first you tap, then you buffer and hand it off to the encrypted traffic path out of Envoy towards your target pod. A configuration sketch of this kind of layering follows below.

Now, here as well you may have more granular constraints, such that on the ingress side you only tap traffic coming from certain sources with a specific IP, or on egress you only trace traffic towards a specific IP or a specific host. How do we deal with that? This is also a good way of constraining the filter in case you have too many workloads and you just don't need to tap everywhere; rather, you want to tap selectively at your pain points within the cluster.
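To make that layering more concrete, here is a minimal sketch of one way to express TCP-level, cleartext tapping with stock Envoy: the tap transport socket wrapping the TLS transport socket on an ingress listener. This is illustrative only and not our exact configuration (our deployment uses its own filter arrangement, and the sink choice is discussed later); the match, paths, certificates and the file sink are placeholders.

```yaml
# Ingress listener fragment: the tap transport socket wraps the TLS socket,
# so the bytes it records are the decrypted (cleartext) TCP reads and writes.
filter_chains:
- transport_socket:
    name: envoy.transport_sockets.tap
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.transport_sockets.tap.v3.Tap
      common_config:
        static_config:
          match:
            any_match: true                  # tap every connection on this listener
          output_config:
            sinks:
            - format: PROTO_BINARY_LENGTH_DELIMITED
              file_per_tap:                  # placeholder sink for the sketch
                path_prefix: /var/log/taps/ingress
      transport_socket:                      # the wrapped socket that actually terminates TLS
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
          common_tls_context:
            tls_certificates:
            - certificate_chain: { filename: /etc/envoy/certs/tls.crt }
              private_key: { filename: /etc/envoy/certs/tls.key }
  # filters: the HTTP/2 connection manager and the custom HTTP filters sit on top of this.
```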
Coming back to selective tapping: on the ingress side, the way you can do it is with an extension to the tap filter which lets you assign tapping only to a given source IP address. Keep in mind that this source IP address is subject to the external traffic policy you have: if it is Local it would be IP address one, if it is Cluster it would be IP address two, so it is sensitive to that. On the egress side, we use transport socket matching to ensure that we restrict the tapping to only those endpoints we actually need to tap. In this example, we only need to trace the traffic going from Envoy to the endpoints H1 and H3, and I don't need to trace the traffic to H2, which reduces the overhead associated with it. So I simply don't assign a tapping transport socket to H2, using endpoint metadata and transport socket matching. Here you can see the blue boxes have tapping metadata set to true, which is picked up selectively by endpoints H1 and H3 because they also have the tapping metadata set to true, while endpoint H2 doesn't pick up the tap at all, so you don't get any traffic tracing out of the H2 endpoint.

Then comes the question of which tap sink to use. As I said previously, there are three basic tap sinks here: the file-based one, the admin one, and the custom gRPC one that we introduced. The file-based tap sink is mostly unacceptable for critical use cases with security or privacy concerns, because Envoy writes the whole request body and headers to a file. The admin-based one seemed very promising until we had a closer look at its performance. In this diagram you can see how the admin sink works. Envoy has a siloed approach to routing traffic: each worker thread has its own context of filters and filter chains, and they hand the traces over to the main thread, which is where the tap admin interface lives. As a result, you can see in the bottom screenshot of top that the main thread is at almost 71% usage simply because we are collecting the traffic traces through it. This would be a big problem if you had xDS or other things running critically on your network, mainly because xDS, DNS and stats collection all reside on the main thread. If they ever get bottlenecked because of your tapping, you may not be able to change the configuration, so it impacts all sorts of use cases for your service mesh; even if you are just running Envoy as your gateway, you cannot modify your configuration, which is a big problem.

So we tried to overcome it by extending the tap interfaces to accommodate a gRPC tap sink. The main objective was to keep the main thread out of the picture as far as possible while streaming the traces. We took inspiration from an earlier discussion on GitHub about this issue and decided to implement a version of it ourselves. This diagram shows how traces are captured within each worker thread and then handed to a traffic trace sink, which can post-process them into PCAP-NG or PCAP frames. The tap trace thus created has two parts: one is the connection information, which comes at the beginning of every tapped connection, as we are tapping at the TCP level, and the other is the trace information, which contains all the details about the headers, the body, the query parameters, and so on.
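Going back to the egress example for a moment, the endpoint-metadata-based selection can be expressed with stock Envoy's transport socket matches on the cluster. The sketch below is illustrative only (addresses, names and the file sink are placeholders, and our actual extension differs): an H1-style endpoint carries the tapping metadata and gets the tap-wrapped socket, while an H2-style endpoint falls through to plain TLS.

```yaml
# Cluster sketch: only endpoints whose metadata requests tapping get the tap transport socket.
name: upstream_cluster
transport_socket_matches:
- name: tapped-endpoints
  match:
    tapping: "enabled"               # matched against endpoint metadata below
  transport_socket:
    name: envoy.transport_sockets.tap
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.transport_sockets.tap.v3.Tap
      common_config:
        static_config:
          match: { any_match: true }
          output_config:
            sinks:
            - format: PROTO_BINARY_LENGTH_DELIMITED
              file_per_tap: { path_prefix: /var/log/taps/egress }   # placeholder sink
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
- name: untapped-endpoints
  match: {}                          # catch-all: no tapping, plain TLS
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
load_assignment:
  cluster_name: upstream_cluster
  endpoints:
  - lb_endpoints:
    - endpoint: { address: { socket_address: { address: 10.20.1.1, port_value: 8443 } } }   # H1: tapped
      metadata:
        filter_metadata:
          envoy.transport_socket_match: { tapping: "enabled" }
    - endpoint: { address: { socket_address: { address: 10.20.1.2, port_value: 8443 } } }   # H2: not tapped
```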
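And to picture the two-part trace just described: in the stock tap data model a streamed socket trace is split into one connection segment followed by per-event segments (our extension's exact wire format may differ). The rendering below is purely illustrative, with made-up addresses and the payload shown as text only for readability.

```yaml
# Segment 1: sent once per connection -- the connection information the sink must remember.
- trace_id: 42
  connection:
    local_address:  { socket_address: { address: 10.20.0.5, port_value: 443 } }
    remote_address: { socket_address: { address: 10.20.0.9, port_value: 51234 } }
# Segments 2..n: one per TCP event (read/write/close), carrying the actual tapped bytes.
- trace_id: 42
  event:
    timestamp: "2023-11-06T10:15:00Z"
    read:
      data: { as_string: "...decrypted request bytes (headers, body, query)..." }
- trace_id: 42
  event:
    timestamp: "2023-11-06T10:15:01Z"
    write:
      data: { as_string: "...decrypted response bytes..." }
```

Everything after the first segment only makes sense if the sink still remembers that first segment, which is exactly the problem discussed next.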
The connection information is only sent once, and that is a slight problem with the tap implementation: only at the beginning of a connection does it send out this connection information. Your trace sink has to cache it and use it as a reference to post-process the subsequent trace information segments. You can imagine a very simple problem where your trace sink happens to restart because of some network disturbance or a process issue; that means all the information about the connections you were holding so far is lost.

As a means to overcome that particular problem, we decided to add a thread-local cache within each worker thread for this tapping. What it does is provide reference-based indexing of all the connection information that the worker thread is currently processing and add it into a cache. Whenever the trace sink restarts, we detect this via the TCP lifetime and other handshake parameters, and we replay the connection information held in the cache to the trace sink all over again. In a practical situation you might have a fairly small number of connections that are long-lived, so this cache is not too big, just a few hundred connections, which is not too intensive, and you can replay all of those to your sink so that any subsequent trace information has the adequate connection information to recreate the whole traffic frame for post-processing.

One of the ways we verified this for consistency in the field is via counters. We added counters to our extension and to our sinks and verified whether the tap sink works as expected. This wasn't already available in standard Envoy, so we had to make some extensions for that, and it ensures consistency.

Moving on to how the experience has been with deployments on live traffic: overall, we were quite happy with the tracing solution we developed. We saw an overhead of about 5% on our maximum requests-per-second capacity for Envoy, and the added end-to-end Envoy latency from enabling this traffic tapping on multiple listeners and clusters was in the order of 1 to 2 milliseconds. These numbers are subject to load conditions and networking setup, but it is still a very low overhead compared to any other machinery you might use. It has been extremely stable in production networks, with no elevation of privileges required for the containers, and it has worked reliably since day one on several live 5G networks running several hundred thousand to a few million requests per second.

With that, I would like to move on to the part about memory copying and memory acceleration, which was done by my colleague Yiju. Tapping involves a lot of memory copies to mirror the traffic to the tap sinks, and Intel came up with an approach for faster memory copies with hardware acceleration. I would like to play the presentation that Yiju prepared; unfortunately, he couldn't be here due to a personal issue.

This is Yiju from Intel. We are glad to share this topic, together with Davin, with you at KubeCon '23. Unfortunately, for some reason I can't join you today in Chicago, so I have to share my part by video. Now let me continue our session on hardware acceleration.
As you may know, in most cases memory copy is not a problem for Envoy. Requests and responses are limited to a few kilobytes in size and won't take much time in the overall process. But for some special scenarios and requirements, things are different.

Take traffic mirroring, a feature that allows users to shadow traffic from one cluster to another. This is a very useful feature that lets feature teams bring changes to production with as little risk as possible. Envoy will make a copy of the live request data for the mirror service, and with increasing request sizes, the data copied can be quite substantial. Another example is TLS memory linearization. This is an operation defined in the buffer system that occurs before TLS encryption is applied. Linearization copies and recombines multiple small buffers into a large one to reduce the frequency of encryption and the associated overhead. For large responses, linearization also comes with a significant amount of copying, which in our tests can be up to 10% of the overall processing. As for traffic tapping, our topic today, memory copy is also a crucial issue that cannot be ignored when it comes to large requests or responses. The tap filter makes a copy of all traffic to generate the trace buffers that are stored locally or sent to a remote service for later analysis. This scenario performs more intensive copying compared to the previous scenarios; based on our observation, the proportion of copying gets up to 20% of the entire processing in some cases. So we believe it can be a scenario suitable for hardware acceleration.

Now let's have a brief introduction to the hardware we're using for acceleration. DSA, short for Data Streaming Accelerator, is a PCIe device integrated in 4th generation Xeon processors as one of the built-in accelerators; it already hit the market this year. DSA supports a series of memory copy operations, like memory move (the plain memory copy), dualcast (copying data from one address to two addresses at once), and so on. One thing we should know about the factors that affect memory operation acceleration is that the copy size is a key value: when the copy size exceeds a certain range we can expect a benefit from DSA, otherwise using the CPU is a better idea. That is also an issue we took into consideration in our acceleration test.

Then let's consider the issue of integration. DSA provides two kinds of libraries that allow us to offload at two levels: DML, a library that works at the application level, or DTO, short for DSA Transparent Offloading, which works at the library level. The advantage of using DML is that we can precisely control every single copy, but the drawback is that you need very good knowledge of how acceleration can help; as we said before, in many cases it is not a good idea to offload small copies. Another downside is that the modifications to the code increase complexity and maintenance costs that may be out of our control. So what we want is a transparent, non-intrusive approach. We hope that a library working at a low level determines whether each copy operation is suitable for acceleration: if the copy size is above the threshold we set, it is offloaded; otherwise, it is given back to the CPU. And that is exactly how DTO works. DTO is preloaded with Envoy via an environment variable and intercepts every memory copy function call to glibc. All the copies are classified into two categories based on size: small ones go to the CPU, large ones to DSA. In the whole process we don't have to mess with the code or recompile it; all the offloading is transparent to Envoy. So that's the plan.
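As an illustration of that preloading, the DTO shared library is injected into the Envoy process with LD_PRELOAD, so glibc's memcpy/memmove calls are intercepted without touching Envoy's code. The snippet below is a rough sketch of what that can look like in a Kubernetes container spec; the image tag, the library path and the name of the threshold variable are placeholders rather than exact values, so check the DTO documentation for the real ones.

```yaml
# Sketch: preloading DTO into an Envoy container (pod spec fragment, values illustrative).
containers:
- name: envoy
  image: envoyproxy/envoy:<version>
  env:
  - name: LD_PRELOAD                 # loader puts DTO ahead of glibc, so memcpy/memmove are intercepted
    value: /usr/lib64/libdto.so.1    # placeholder path to the DTO shared object
  - name: DTO_MIN_BYTES              # placeholder name for DTO's offload threshold
    value: "262144"                  # 256 KiB: copies below this stay on the CPU, larger ones go to DSA
```

The design point is the one made above: because the decision happens inside the preloaded library on every call, Envoy itself never has to be modified or recompiled.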
Next, let's see how we performed our test and what we got from the acceleration. We designed our test using TCP tapping with the HTTP/1.1 protocol and 1,000 clients connecting concurrently. The files requested by the clients, which form the body of the response from Envoy, range from 64 kilobytes to 1 megabyte in size. We also use direct response, which means Envoy responds directly with a prepared file instead of communicating with an upstream cluster; that significantly increases the proportion of memory copying in the overall processing. We have two groups, one using the CPU and another using DTO with DSA. The DTO threshold is set to 256 kilobytes, which means it will only offload copies above that size.

As for the result shown in the diagram: the CPU has the advantage initially, and as the copy size increases they run neck and neck at 256 kilobytes. After that, DSA gets better latency than the CPU, which aligns with our prediction and understanding of DSA, namely that we get more performance benefit from bigger copy sizes. You might have questions about the performance difference below 256 kilobytes: since both are CPU-based operations, why is such a difference observed? I think that may come from the interception overhead; intercepting and determining which path to use takes some CPU resources.

And finally, let's look at what may happen with hardware memory acceleration and what we can expect from it. First, from a hardware perspective, DSA is expected to be a persistent feature in future generations of Xeon products, so it may get faster or support more offload operations that we can introduce into Envoy acceleration. Second, at the Envoy or software level, the community is still exploring the potential of Envoy, and many new scenarios and projects such as Envoy Gateway are emerging, so it would be reasonable to expect more cases suitable for acceleration in Envoy, like CDN. That's all for my sharing; thank you for listening.

Yeah, that's what we wanted to present to you. Regarding further comments about the status: we haven't been able to upstream it to Envoy so far, but we are planning to upstream the solution soon. Hopefully you would be interested in providing reviews; comments and everything would be appreciated. Yeah, thank you.

I was just wondering, during the design process, did you take into consideration memory concerns, like if you had a really large request and perhaps a slow connection to the gRPC server that could have caused those sorts of bottlenecks?

Yes. So at the moment, the way it works is basically by moves, so there are no implicit copies within it; I think we have tried to be as optimized as possible. But yes, we do have some upper thresholds on what traffic we can trace, and I think we cap it at something like 64k. And that is the part where the hardware acceleration really comes into play, which would enable faster copies, or faster moves if they are required.

Okay, so are you basically saying that you truncate the messages at 64k?

Yes.

Okay. I was also in your EnvoyCon talk. I think you mentioned during that talk, possibly, PCAP support. Is that something that you implemented as well?

Yes. So the solution that we came up with was quite broad, and what we intend to contribute upstream is just the Envoy extension.

Yeah, I was just interested to know more about sort of the infrastructure for the PCAP playback.
If that's something that you're able to talk about, but happy to take that offline somewhere.

No, I don't think I'm allowed to.

Yeah, that's fine. Thank you so much.

Yeah, that's it. Thanks.