Hi everyone, I'm Sunku Ranganath. I'm a network software engineer at Intel working on enabling service mesh for 5G and edge environments. The topic I'm presenting today, along with Mrittika and Otto, is that when it comes to assuring your mesh performance, details matter. Here's the legal disclaimer.

High-level agenda: we start off looking at some of the important aspects to keep in mind while measuring your mesh deployment. We then share some of the results and observations from the performance studies we've done on bare metal and in virtualized environments. Otto will then talk about some of the benchmarking pitfalls around tooling and the test environment, and Mrittika will end the talk sharing some follow-on work.

So if you're someone looking to measure your mesh deployment, it sort of looks like this: set up your cluster, service mesh, and application under test; pick a benchmarking tool; configure your transactions per second, the protocol of interest, the destination and connections; then run the test and capture the results. However, if you look closely at this picture, what's missing? It turns out, quite a lot.

For the purposes of this talk, let's assume east-west traffic is traffic between microservices, either in the same pod or across pods, either on the same virtual machine or across different virtual machines, or across different hosts; and north-south traffic is traffic coming in and out of a specific host. Depending on how you configure the number of hops between your source and destination, your mesh performance sees a huge impact. For example, your load balancers, API gateways, ingress controllers, and firewalls all add a good amount of latency, impacting the overall performance of your mesh. Not just that: your hardware settings, related to BIOS settings, power management, NUMA awareness, and the type of accelerators; your operating system settings, such as the type of networking stack you'd use, kernel versus a user-space stack, or the type of tuning you would do across your L2/L3 layers; and your load generator settings. All of these have a good amount of impact on your mesh performance. Ultimately, what's important is to have a method that gives consistent results across a repeated number of test cycles. That's a crucial part.

So we started off looking at the performance of Envoy, leveraging the front proxy sandbox available as part of the Envoy source code. If you're someone new looking to understand service mesh with Envoy at the front, the sandboxes provide a really great way to understand it, which is where I was just a few short months ago. In this example, the front proxy sandbox provides a simple Envoy acting as a load balancer at the front, serving two services, service1 and service2, each of them having a simple Flask app and Envoy as a sidecar process in a Docker container. The type of test we do is scale the number of service1 endpoints anywhere from 1 to 100 and look at the number of transactions resolved and the tail latency as we change the traffic and other settings. Ultimately, what we're trying to understand is the performance impact of scaling the number of cores or the number of connections, along with scaling the number of Flask-plus-Envoy instances.
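To make that kind of sweep concrete, here is a minimal sketch of how such a test loop could be scripted around Fortio. This is not the exact harness used in the study; the target URL, sweep values, and the JSON field names are assumptions and may need adjusting for your sandbox setup and Fortio version.

```python
#!/usr/bin/env python3
"""Sketch: sweep QPS and connection counts against the Envoy front proxy
sandbox with Fortio and report achieved throughput and P99 latency."""
import json
import subprocess

TARGET = "http://localhost:8080/service/1"  # assumed sandbox endpoint

def run_fortio(qps: int, connections: int, duration: str = "60s") -> dict:
    """Run one Fortio load test and return its parsed JSON report."""
    out = f"fortio_{qps}qps_{connections}c.json"
    subprocess.run(
        ["fortio", "load", "-qps", str(qps), "-c", str(connections),
         "-t", duration, "-json", out, TARGET],
        check=True,
    )
    with open(out) as f:
        return json.load(f)

def p99_ms(report: dict) -> float:
    """Pull the P99 latency (in ms) out of Fortio's duration histogram."""
    for p in report["DurationHistogram"]["Percentiles"]:
        if p["Percentile"] == 99:
            return p["Value"] * 1000.0
    raise KeyError("P99 not found in Fortio output")

if __name__ == "__main__":
    for qps in (200, 1000, 10000):
        for conns in (16, 64):
            report = run_fortio(qps, conns)
            print(f"qps={qps} conns={conns} "
                  f"achieved={report['ActualQPS']:.0f} "
                  f"p99={p99_ms(report):.2f} ms")
```

Repeating such a sweep for each core count and container count is what produces the kind of bar graphs discussed next.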
So to give you an idea of the results: on the x-axis you have the number of app-plus-Envoy microservices, anywhere from 2 to 100, and on the y-axis you have the transactions resolved on one side and the latency on the other. You can see from the bar graphs, comparing one core versus two versus four cores, especially when you have a higher container count of about a hundred: at a 1,000 TPS input, with four cores all thousand transactions are successfully resolved, versus the one core case, which doesn't even reach 200 TPS, and your P99 tail latency sees a huge impact of about 4.5 times when you have four cores available, keeping everything else steady. And when you scale across the entire socket with 48 cores, at a 10,000 TPS input you can achieve well over 7,000 TPS with the whole socket, versus just about 200 with four cores in this case, at the same 10,000 TPS input and with 64 connections. The tail latencies stay steady across the 10,000 TPS input, versus the four core case, where your latency goes well over one second. And if you look at the CPU utilization across those four cores, once you cross a 1,000 TPS input rate, the CPU utilization for your Flask app and sidecar proxy gets close to 100%; at anything over 1,000 it reaches 100%, which is a lot.

So essentially we found that the number of cores has a significant impact on overall mesh performance; that isolating cores and pinning the microservices to them, which is what some of the telco deployments do, isn't necessarily helpful; and that we could do quite a lot of optimizations to achieve better performance, even compared to what's shown here.

Here's an example of a telco deployment, where you would have Kubernetes in virtual machines, either on the same host or on different hosts. In this example, we have the Kubernetes master in a VM on one host and the Kubernetes worker on a different host, and we look at Calico and Istio with their defaults. On the worker host, for the data plane, we use OVS-DPDK; some of the telco deployments use either OVS-DPDK or SR-IOV type scenarios. In this example, you have two poll mode drivers servicing the data plane traffic, and the idea is to understand the impact of Istio in this type of deployment. The Fortio client talks to an NGINX web server running in a pod along with the Envoy sidecar proxy. We have two configurations here for the Fortio client, running as a process outside of the Kubernetes cluster: either within the Kubernetes master VM or on the bare metal host. The idea is to understand the north-south traffic going in and out of this host.

To give you a brief view of the results: for the VM-to-VM case, where Fortio is in a VM and the NGINX pod is in another VM on a different host, you can see that adding Istio adds about three times the latency across the different TPS rates, and at about 10,000 TPS, where without Istio it could resolve the full 10,000 TPS, Istio adds about 46% performance degradation. For the case where Fortio is on the master host, reaching the NGINX pod in a VM, Istio adds about four times the latency and about 11% performance degradation at the 10,000 TPS rate; across the input traffic rates you can see anywhere from 32 to 60% impact, depending on the case. So essentially, what we found is that there's a good amount of performance impact from adding Istio in a virtualized environment.
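As a side note on how to read those percentages, here is the simple arithmetic behind the degradation and latency-multiplier numbers quoted above; the inputs in the example are round illustrative values, not measured data.

```python
def degradation_pct(baseline_tps: float, mesh_tps: float) -> float:
    """Relative throughput loss introduced by the mesh, in percent."""
    return (baseline_tps - mesh_tps) / baseline_tps * 100.0

def latency_factor(mesh_p99_ms: float, baseline_p99_ms: float) -> float:
    """How many times higher the mesh tail latency is versus baseline."""
    return mesh_p99_ms / baseline_p99_ms

# Illustrative round numbers: resolving 5,400 TPS against a 10,000 TPS
# baseline is a 46% degradation; 12 ms P99 versus 4 ms baseline is a 3x factor.
print(degradation_pct(10_000, 5_400))  # 46.0
print(latency_factor(12.0, 4.0))       # 3.0
```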
We also saw that we could tune across the stack: starting from the CPUs (in this case we had 10 cores made available to the VM), the Istio CPU settings (number of cores, type of cores), the Calico config with respect to MTU or connection rate, or your Envoy config with respect to the number of cores versus the number of Envoy worker threads. So there are a lot of variables across the stack that we could look into and tune.

A few observations. On hardware tuning, a lot of things can be done. For example, power management policies: the simple thing to do is turn off all power management options; however, we found that depending on the P-state and C-state config, they have a good impact on tail latency. Enabling hyper-threading improves performance. Core isolation isn't necessarily helpful, for example in some of the cases we described earlier with the bare metal tests, but things like tuning a VM for NUMA locality or pinning QEMU threads on the isolated cores have a good impact on tail latencies. You can also save CPU cycles when your CPU is idling, or offload crypto operations, for example by leveraging Intel QuickAssist Technology, or add vectorized code as another solution.

On the other side, from a load generator perspective, there are quite a few things to it, as the load generator adds its own latency by leveraging the kernel networking stack, and that latency has a huge impact depending on the backoff algorithm used. We found that some optimizations could be done with dedicated resources, CPU or hardware tuning, micro-architectural analysis, or a metrics-based feedback loop, which is how some of the hardware traffic generators do it; Otto will touch upon some of these in his part. Also, we found that some of the L2/L3 traffic generators leverage RFC 2544, but for L7 benchmarking we haven't found a load generator doing this in a standardized way. So overall, we could see optimizations to be done from a load generator perspective, across the stack, and so on.

In fact, we've been discussing a lot of this work as part of the CNCF service mesh working group, which is part of SIG Network, and the project where we've been talking about a lot of these aspects is Service Mesh Performance. This project has been submitted to the CNCF to be considered as a sandbox project. Essentially, SMP provides a standardized way of running these mesh performance measurements: a vendor-neutral way of specifying mesh deployment patterns, your environment and infrastructure, and capturing the test results, essentially in an automated way, for example leveraging Meshery.
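Capturing the environment alongside the results matters for exactly the reasons above: governor, P-state, and C-state settings move tail latency. As a small illustration (not part of SMP itself), here is one way to snapshot the power-management settings of a Linux host so they can be stored next to each benchmark run; the sysfs paths are standard but may be absent on some platforms or inside VMs.

```python
"""Sketch: record CPU frequency governor and C-state availability so the
power-management configuration can be archived alongside benchmark results."""
import glob
import json
import pathlib

def read(path: str) -> str:
    try:
        return pathlib.Path(path).read_text().strip()
    except OSError:
        return "unavailable"

def snapshot() -> dict:
    # Per-CPU frequency scaling governor (e.g. "performance", "powersave").
    governors = {
        p.split("/")[5]: read(p)
        for p in sorted(glob.glob(
            "/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor"))
    }
    # C-states exposed for cpu0 and whether each one is disabled.
    cstates = {
        read(f"{d}/name"): ("disabled" if read(f"{d}/disable") == "1"
                            else "enabled")
        for d in sorted(glob.glob(
            "/sys/devices/system/cpu/cpu0/cpuidle/state*"))
    }
    return {"scaling_governor": governors, "cpu0_cstates": cstates}

if __name__ == "__main__":
    print(json.dumps(snapshot(), indent=2))
```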
This Service Mesh Performance work is actually where I met Otto for the first time, and we had quite a few fruitful conversations after that. So with that, I'll pass it on to Otto.

Hello, my name is Otto van der Schaaf, and I work as a principal Envoy engineer at Red Hat with a focus on performance. I am an active contributor to Envoy proxy and to Nighthawk, a layer 7 performance characterization tool. In this presentation, I will share some of the pitfalls and observations I have run into while measuring the performance of proxies and meshes.

To start out, I will dive into load generator tuning for a bit and dig into some of the divergences between load generators. It is illustrative to revisit a performance issue that was at some point reported in the open source Envoy repository, which compared Envoy proxy to HAProxy. In that test, wrk2 was used to compare latencies across the two proxies, and Envoy seemed to add two or three times as much latency. At some point I figured that sanity checking these numbers was warranted, so I set up a reproduction, and next to wrk I used Fortio and Nighthawk to measure latencies. While doing so, wrk reproduced mean latencies in the right ballpark, but Fortio reported latencies about twice as high, and Nighthawk would in turn report latencies an order of magnitude lower. They couldn't all possibly be right. In the discussion that followed, we learned that the reported measurements had been obtained by executing on a virtual machine, whereas my reproduction was on a tuned bare metal machine. Still, that did not explain the divergence between Nighthawk and the other tools. To understand what happened there, we need to dig into the subtleties around request release timings and connection reuse.

Let's take a look at different strategies that a load generator could use when releasing requests, in terms of timings and connections. In the displayed diagram, the four rows on the vertical axis represent four connections, and the x-axis represents time. We're shooting for four queries per second. As you can see, this diagram shows a perfectly timed request release each quarter of a second, balancing over all the connections that are available. The same connection and queries-per-second parameterization can be expressed in alternative ways. For example, a load generator might depend on internal or external HTTP libraries which will not prefetch connections. This could in turn mean that only a single connection is actually involved, where one might easily assume that four would have been involved; the specified number of connections serves as a maximum in this case, but not as a minimum. A load generator may also rely on a pool with prefetching capabilities. When testing low-latency replies, four connections will be created in this scenario, but only one will be used for sending requests, since connection pools tend to use a most-recently-used strategy when picking a free connection from the available ones. Depending on the specifics of what is being tested, a single pool or a few large pools behaving like this may or may not be desirable. When realistically trying to simulate browsers, it is probably better to have lots of small pool instances that behave like this; when simulating a downstream proxy, one or only a few large pools may reflect reality.

This diagram visualizes yet another way of releasing requests, as observed in the wild: timing-wise, request releases occur in a somewhat synchronized fashion across connections. Note that this still reflects four queries per second, but it may not be what someone has in mind. Now, in the previous diagrams the request-reply pairs were fast enough to never have the connections saturated when it is time to release a new request. This diagram visualizes a situation where all connections are busy and it is time to issue a new request. Load generators diverge in how they handle this. Closed-loop load generators may just block and release when a connection frees up; some track the time spent being blocked, while others will try to use math to correct for the missed requests in the histogram output.
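To make the closed-loop versus open-loop distinction (which comes up next) tangible, here is a toy simulation of what happens when the target pacing exceeds what the available connections can carry. It is not how any particular tool is implemented, and the service time and parameters are made up for the example.

```python
"""Toy model: release requests at a fixed pace over a few connections and
compare closed-loop blocking with open-loop overflow reporting."""
from dataclasses import dataclass

QPS = 4             # target queries per second
CONNECTIONS = 2     # available connections
SERVICE_TIME = 0.6  # seconds a request occupies a connection (assumed)
DURATION = 3.0      # simulated seconds

@dataclass
class Conn:
    busy_until: float = 0.0

def simulate(open_loop: bool):
    conns = [Conn() for _ in range(CONNECTIONS)]
    sent, overflows, blocked = 0, 0, 0.0
    t = 0.0
    while t < DURATION:
        free = [c for c in conns if c.busy_until <= t]
        if free:
            free[0].busy_until = t + SERVICE_TIME
            sent += 1
            t += 1.0 / QPS              # keep the intended pacing
        elif open_loop:
            overflows += 1              # report the overflow, keep pacing
            t += 1.0 / QPS
        else:
            wake = min(c.busy_until for c in conns)
            blocked += wake - t         # block until a connection frees up
            t = wake
    return {"sent": sent, "overflows": overflows, "blocked_s": round(blocked, 2)}

print("open-loop:  ", simulate(True))
print("closed-loop:", simulate(False))
```

In this toy run, the closed-loop mode accumulates blocked time and its release pacing slips, which silently distorts latency histograms, whereas the open-loop mode keeps its pacing and surfaces the saturation as overflow counts.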
Open-loop load generators may instead report connection and stream overflows. Regardless of closed- or open-loop methodology, some may allow queuing. What happens with the request release pacing varies too: a load generator may or may not try to make up for missed request releases by increasing the pacing when possible, depending on the implementation. Here's an example of Nighthawk warning about time spent being blocked on resources, connections in this case, when it ran in closed-loop mode. We advise considering results as invalid when significant blocking is observed, as the latency histograms will be useless for most purposes. The next slide shows it reporting pool overflows in open-loop mode. Likewise, when significant connection or stream overflows are detected, those numbers can be checked as part of verifying test expectations through structured output.

So let's dig into two load generators for a bit. Let me try to explain this plot. On the vertical axis, from bottom to top, we have Fortio and wrk ramping up from 50 to 4,000 QPS. This execution used a single connection and was assigned a single CPU. We plot a series of P50 latency measurements on the horizontal axis to get a sense of the numbers and the spread. What stood out here is that wrk generally reports an order of magnitude lower latencies compared to Fortio, and that its stability across test executions was much tighter. The next slide is similar, but uses 200 connections and four CPUs, and it tracks P99.9, which is a little bit more ambitious. What seems to be a quirk in wrk emerges: the 50 and 250 QPS runs yield no output, and the 500 QPS one looks pretty weird. Here we show the same plots, but now derived from Envoy's access logs, which served as a control measurement in the experiment. So wrk was sending traffic where we were missing output, and that traffic doesn't look out of the ordinary. Manual investigation of request arrival timings at the test subject showed that Fortio would yield traffic that was bursty in comparison to wrk. This turned out to be a dominant factor in the observed divergence in measured latencies. Later on, a jitter flag was added to Fortio to desynchronize request release timings, which reduced the divergence. This may suffice until one wants to define tests in terms of random distributions around request release timings. Anyway, the learning here is to know your tools and have well-defined tests; different tools come with different weaknesses, strengths, and implied characteristics.

Some other gotchas observed in the wild. Latency measurement tooling may diverge when it comes to handling requests still in flight after test execution; some tools may include their status and/or latency in the reporting. Also, the matter of tracking rolling standard deviations and histograms isn't trivial: floating point math is used here, and it may be subject to a phenomenon called catastrophic cancellation when a naive approach is used, and through that, tooling can produce distorted numbers. And when comparing baselines of different servers and meshes, or even across load generators, it is good to be aware that diverging TLS ciphers and/or session reuse characteristics in the setups may translate into significant latency divergences.
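On the catastrophic cancellation point, here is a small self-contained illustration of how a naive "sum of squares" standard deviation falls apart on large, closely spaced values, while an online method such as Welford's algorithm does not; the numbers are synthetic and only meant to show the effect.

```python
"""Naive running standard deviation versus Welford's online algorithm."""
import math

def naive_std(xs):
    n = len(xs)
    s, sq = sum(xs), sum(x * x for x in xs)
    var = (sq - s * s / n) / n  # difference of two huge, nearly equal terms
    return math.sqrt(max(var, 0.0))

def welford_std(xs):
    mean, m2 = 0.0, 0.0
    for i, x in enumerate(xs, 1):
        delta = x - mean
        mean += delta / i
        m2 += delta * (x - mean)
    return math.sqrt(m2 / len(xs))

# Large offset, tiny spread: think timestamps in nanoseconds.
xs = [1e9 + v for v in (0.1, 0.2, 0.3, 0.4)]
print(naive_std(xs))    # distorted: 0.0 or wildly off, depending on rounding
print(welford_std(xs))  # ~0.112, the actual spread
```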
When it comes to reporting results, it is common for tooling to produce a single histogram. Sometimes, however, being able to dive a bit deeper is useful. This example shows a single globally aggregated histogram on the left. Its underlying data comes from measurements across multiple distinct workers, which run on different threads and all use their own connection pool. The plot shows a broken knee, which is kind of odd. On the right we visualize the output once more, but now we also include the per-worker plots. Those all look like regular latency plots, and now it's much easier to see how the aggregated view obtained its odd shape.

The plots displayed in the previous slide often point to a phenomenon called backend hotspotting. This diagram tries to visualize that phenomenon: the connections from the load generator didn't end up distributed in a balanced way across the test subject's processing capacity; one listener has three connections while the other one has just a single one. In perfect keep-alive tests this may not automatically rebalance, and in setups with chained proxies, like meshes, this may even cascade through the system. Fortunately, Envoy has some mitigations around this today, but your mileage may vary depending on the technology you are running tests against. Unless you are specifically testing this aspect of a system, it might be good to explicitly limit connection reuse to allow for balancing opportunities when running into this.

Here's a list of things that may complicate reasoning when measuring latency. Noisy neighbors in virtualized setups may sometimes cause instability between the results of distinct test executions, and competing processes may introduce noise too. If you are after getting things as easy to reason about as possible and obtaining the most predictable results, then CPU frequency changes and hyper-threading may not be your best friends, especially when testing service meshes. In my experience it is good to start out small and simple and build from there when trying to load test large and complex meshes. Load generators aside, following the earlier observations it might be good to define load tests a bit more narrowly than just queries per second over a number of connections; there seem to be many implicit choices across load generators, and from the outside it's not always obvious what those are.

Talking about test definitions: Google recently contributed an adaptive load controller feature to Nighthawk. That feature makes it easy to perform SLA-based testing, to get answers to questions like: how many queries per second can my system sustain, given these resource constraints, under a set latency, while observing at most X errors? The adaptive load controller will then repeatedly re-parameterize test execution to seek convergence on the optimum QPS that fits this.
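To give a feel for what such a controller does, here is a stripped-down sketch of the idea: search for the highest QPS whose measured P99 and error count stay within the SLA. This is not Nighthawk's actual implementation, and run_test stands in for whatever load generator invocation you use.

```python
"""Sketch: binary search for the highest QPS that still meets an SLA."""
from typing import Callable, Tuple

def find_max_qps(run_test: Callable[[int], Tuple[float, int]],
                 p99_budget_ms: float, max_errors: int,
                 lo: int = 100, hi: int = 100_000, steps: int = 12) -> int:
    """run_test(qps) returns (p99_ms, error_count) for a short fixed-length run."""
    best = 0
    for _ in range(steps):
        mid = (lo + hi) // 2
        p99_ms, errors = run_test(mid)       # measure at the candidate QPS
        if p99_ms <= p99_budget_ms and errors <= max_errors:
            best, lo = mid, mid + 1          # SLA met: try a higher rate
        else:
            hi = mid - 1                     # SLA missed: back off
        if lo > hi:
            break
    return best

# Toy usage with a synthetic system that degrades past 20,000 QPS:
print(find_max_qps(lambda q: (2.0 if q <= 20_000 else 50.0, 0),
                   p99_budget_ms=5.0, max_errors=0))  # converges near 20,000
```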
So this concludes my part of the presentation. Thank you for listening. Mrittika, over to you.

Thanks, Otto. As a continuation of the bottleneck analysis, we would like to propose some next steps. Let's look at the ingress connections on the left-hand side of the diagram here, which can be distributed among multiple worker threads (labeled 1). There are worker threads that are waiting for traffic versus threads that have traffic to microservices and are active but imbalanced. The idea is to bypass some of the Linux scheduling here and instead use core-based balancing, so multiple cores are used to balance out the multiple worker threads being scheduled, maybe with a priority among them. The other bottleneck to analyze is the one associated with memory copies between proxies and from proxies to microservices (labeled 2), and how we accelerate some of those bottlenecks. For example, small memory copies could be done using vectorized instructions like AVX-512, while larger copies, ongoing frame processing, or TCP stack flows could be offloaded using DMA offload engines in the CPU. Another idea is directly receiving the traffic packet from the NIC into a CPU core at the proxy, as shown on the right-hand side of the slide here, bypassing the kernel, and then applying acceleration: once you receive the packet, you use a DMA engine to offload some of the copies that happen from the proxy to the microservices.

So in summary, achieving a QPS of more than 50k and a latency of less than 5 milliseconds in the current setup is one of our goals. This performance within a CNI plus proxy in our virtualized environment is a formidable goal worth targeting. If you want to join some of our analysis, let us know and connect with us; we should be able to achieve some of these goals and present them in a future session. Thank you.