So I'm giving this talk on behalf of myself and Dr. Atanas Atanasov, my lead engineer on our team. Initially we had been looking at in-node work placement, which has a big impact for many workloads, including cloud-native applications. In this talk we'll look at some examples from the domains of cloud-native benchmarking, HPC, and telco, and discuss what questions, tools, and techniques can help with this problem.

So first I'm going to flash up this picture. Dr. Atanasov and I, because we started off in the HPC world, realized that the current set of kubelet default managers is potentially leaving a lot of performance on the table. They do not allow the level of fine-grained control we are accustomed to in HPC. We are currently two to three weeks away from releasing the CPU control plane, which you can talk to me about later, which allows users to mix shared and pinned cores and to give new affinities to workloads. We will later add isolated cores as an option as well. These examples come from our experience in attempting to benchmark new hardware platforms at Intel using this control plane.

Now that we had this fancy new tool, we needed to prove that it was useful. Benchmarking server platforms is the custom in HPC; we want numbers. But we had a lot of confusion about how to do these benchmarks, so we started looking for toolkits. First, do we want to execute manually? No. We want reproduction to be simple. Engineers get things wrong, including us, and we wanted to automate. Batch tools play a big role in increasing the ROI of benchmarking. We wanted to enable users to study application regressions and save everyone time.

So we then had to go hunt down tools. We could have used batch scripts, but that seemed hard. We also wanted to be able to schedule benchmark execution, similar to Slurm; cloud-native queuing frameworks can be a valid alternative. We used Ansible to provision underneath. In benchmarking, we also used Ansible to handle workload deployment, to validate that the workload executed properly, to detect errors automatically in the logs, and to handle post-processing of the benchmark results.

Now we had the tools; next we needed to figure out which benchmarks to use. We could use synthetic benchmarks to evaluate system performance, but these do not represent well the workloads we run in the cloud today. Cloud users have complex applications consisting of multiple microservices connected over the network. There may also be availability requirements, such as a P95 latency bound on access to services. So we did some research on workloads that could be more realistic. The examples are still rather simple cloud applications; at the end, let's discuss what other benchmarks people are using to evaluate cloud systems. This will be a question I actually pose to you, about what you are doing.

The first workload we looked at is the microservice-based Google microservices demo. This application is the classic three-tier app: a front end to receive requests from clients, business-logic services, and a database store where we store the transactions for the customers. In our benchmarking, we evaluated the throughput of such a system on a distributed setup with four machines, and only four machines. We placed a load generator on a separate machine, and we had three worker machines for the front end, the business logic, and the transactions. Our goal was to optimize the throughput of such systems under the given latency constraints.
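To make "throughput under a latency constraint" concrete, here is a minimal sketch of the kind of load sweep such a harness can perform, assuming a hypothetical run_load hook into the load generator that returns per-request latencies in milliseconds. All names and the 200 ms budget are illustrative assumptions, not our actual setup:

```python
# Minimal sketch: find the highest offered load whose P95 latency
# stays under the SLO. `run_load` is a hypothetical hook into the
# load generator; replace it with your own harness.
import statistics

SLO_P95_MS = 200.0  # illustrative latency budget

def p95(latencies_ms):
    """95th-percentile latency from raw per-request samples."""
    return statistics.quantiles(latencies_ms, n=100)[94]

def max_throughput_under_slo(run_load, rps_levels):
    """Sweep offered load (requests/s) and keep the highest level
    whose P95 stays within the budget."""
    best = 0
    for rps in sorted(rps_levels):
        latencies = run_load(rps)  # per-request latencies in ms
        if p95(latencies) <= SLO_P95_MS:
            best = rps
        else:
            break  # latency knee reached; higher load will only be worse
    return best
```

The early break encodes the usual assumption that once the latency knee is reached, pushing more load only makes the tail worse.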
The second benchmark we used was the DeathStarBench hotel reservation system. Again, this is a microservice-based software platform, providing search and recommendation capabilities for hotels. We see a clear split into three tiers here as well. The difference in this case was that we had separate databases for different parts of the application's data model, plus a caching layer.

It turned out that these two applications had very different scaling behavior and reacted very differently to pod placement strategies on the cluster. Hotel reservation had a clear bottleneck in the database components: if you run just one instance of the database, it turns out to be the bottleneck, while the business logic, running with best-effort QoS, was able to handle the increasing number of front-end requests. To fix the bottleneck, we executed two instances of the workload with two database layers. Still, this was not the final optimization. We also had dual NICs, so we further isolated each workload instance on its own socket with careful network configuration. We were using Multus, but it was very hard-coded, so this isn't easily available today. This is hard.

For the Google microservices demo, best-effort quality of service did not give us any benefit. The workload suffered from the noisy-neighbor problem, which we managed to fix by pinning the services and, again, isolating them on a separate socket. On the unused cores of that socket we placed the remaining group of services, which were not sensitive to the cache-related issues.

This particular piece is a summary of why extending these primitives is so important: we went from 40% utilization to 78% utilization. This was true for both workloads; both had similar performance, and one was not better than the other. And we really want to be closer to 90%, so there's probably more we could do regarding the database bottleneck. We're not sure what other optimizations we can apply to get that core utilization up.

These two workloads are still very far from actual applications and what users deploy on these systems. If we look at some examples (genomics, AI, HPC apps, and more), users are starting to use cloud platforms to run those workloads. These applications are performance-critical, and usually they are optimized for a certain placement model: HPC and AI applications apply pinning and affinity configuration mechanisms to extract the maximum available compute from the hardware. Currently a lot of these applications still use Slurm, including internally at Intel, for fine-grained performance.

Last but not least, we are also looking to measure other things. It's insufficient to analyze these workloads on one or two nodes; we do it on four, but that's still very far from reality, where customers run on hundreds or thousands of machines. So what can we measure? We're starting to look at throughput, latency, and throughput under latency constraints, but we'd also like to understand how these metrics behave at scale. Does latency go down if we add pods, or do we increase throughput? What happens with the system? Are we using all the available resources? Similar to before, we're not using all the compute available; if not, we can pack multiple workload instances onto the given VMs. What about the memory? Are we accessing memory in an optimal way, or are we wasting cycles due to wrong placement? That's about where we are.
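As a concrete aside on the pinning and socket isolation described above: under the hood this comes down to CPU affinity, which the kubelet's static CPU manager sets up via cgroup cpusets. A minimal sketch of the raw Linux mechanism, where the core IDs are illustrative assumptions rather than our actual topology:

```python
# Minimal sketch of pinning a process to one socket, the raw Linux
# mechanism underneath the kubelet's static CPU manager (which gets
# the same effect via cgroup cpusets). Core IDs are illustrative;
# read the real topology from `lscpu` or /sys on your machine.
import os

SOCKET0_CORES = set(range(0, 16))  # assumption: cores 0-15 live on socket 0

# Pin this process (pid 0 means "self") to socket 0 only, so it never
# migrates across the socket interconnect and thrashes remote caches.
os.sched_setaffinity(0, SOCKET0_CORES)

print("now restricted to cores:", sorted(os.sched_getaffinity(0)))
```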
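And on the utilization question (are we using all the available resources?), a minimal sketch of one way to watch per-core utilization, assuming the third-party psutil package is available; the threshold and interval are illustrative:

```python
# Minimal sketch: sample per-core utilization to see whether a pinned
# workload is actually keeping its cores busy. Requires the psutil
# package (pip install psutil).
import psutil

IDLE_THRESHOLD = 20.0  # percent; below this we call a core "underused"

# cpu_percent with percpu=True returns one utilization figure per
# logical core, measured over the given interval.
per_core = psutil.cpu_percent(interval=1.0, percpu=True)

for core, pct in enumerate(per_core):
    marker = "  <- underused" if pct < IDLE_THRESHOLD else ""
    print(f"core {core:3d}: {pct:5.1f}%{marker}")

print(f"average: {sum(per_core) / len(per_core):.1f}%")
```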
I have some questions about what people are using for benchmarks, whether they have better ones than what we were using, but this is the best we could find. Is anyone using better benchmarks? That's a no. That's the talk; do we have any questions? Thank you.

Hi, very nice talk. A multi-part question: when you're running your benchmarks, are the nodes you're running on virtual instances, or are you running on bare metal?

We're running on bare metal.

Okay. Do you have plans for running on virtualized nodes? And if so, how do you plan to manage the core placement with the hypervisor?

Those are different questions. It depends on the hypervisor, so we need to know more about what our users are doing. If people are using something like Mesos with Virtual Kubelets, that's going to look a little different than if they're doing it on a VMware-type platform. So we have to know more about what we're going to be running on, but it is in the plan; we just haven't gotten there yet.

Thank you. Any other questions?

This goes along with your benchmarking questions. I'm putting this, I guess, to the Kubernetes community, but I was wondering what kind of infrastructure is in place for performance testing for Kubernetes. You're shaking your head, but...

That was the problem, right? There isn't a lot in place. There's the Google microservices demo and there's DeathStarBench, and that's about the extent as far as that sort of workload goes. There are people who've run LINPACK across Kubernetes, but those are very targeted benchmarks; they're not meant to reflect real performance for your application.

I can add to that, because I used to work on scalability of Kubernetes. There are also benchmarks run for evaluating the supported scale of Kubernetes, but that's not running real applications, which is what Marlowe showed here. It's more like you create a large cluster of 5,000 nodes (this is being done regularly) and then try to run a fully synthetic load: throw lots of containers at it, that sort of stuff. There are also some benchmarks that test other dimensions on a 100-node configuration, where you try, for example, to push pod density or network traffic to the limit, but these are very synthetic benchmarks.

Any other questions?

I was wondering if you were also thinking about multi-tenancy, like running multiple applications of a different nature.

That has to do with your scheduling components. We have telemetry for scheduling, where we can actually monitor behavior on the node. If you have a second tenant, there's no way today to guarantee that tenant is going to behave with regard to bandwidth, as an example. So you either have to have something working externally that's checking to make sure things are behaving, and then either throttling the offender or throwing it off the node, or you have to have some assumption that if your latency starts to degrade, or your throughput isn't what you expect, you're going to reschedule that particular service. So that's really on your scheduling side; it's less on your node-resources side, but you still need something on the node doing the monitoring. Does that answer it?

Any other questions?

Just following up on that last piece: network is a compressible resource like CPU, right? Right. You can give it to a pod and you can take it back. Yep.
And there are mechanisms in the Linux stack to set these things up, basically to set up sharing between workloads. So you don't necessarily have to have an external component to monitor traffic; you can just set it up in a similar way to how you set up sharing for CPU cores.

Maybe, but you still have jitter. When you're looking at HPC applications, you're assuming no jitter, because you don't have other workloads. But you do end up with jitter switching back and forth between your two processes. It mostly controls throughput, but it doesn't help jitter, so you still end up with the processing issue.

Thank you so much.
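As a footnote to that last exchange: one example of such a Linux mechanism is traffic control (tc), which can cap egress bandwidth on an interface roughly the way CPU shares cap compute. A minimal sketch, assuming an interface named eth0 and illustrative rates; this is not the configuration discussed in the talk and needs root:

```python
# Minimal sketch: cap egress bandwidth on one interface with a
# token-bucket filter (tbf), the network analogue of a CPU share.
# The interface name and rates are assumptions for illustration.
import subprocess

IFACE = "eth0"  # assumption: adjust to your NIC

def cap_egress(rate="1gbit", burst="32kbit", latency="400ms"):
    """Attach a tbf qdisc; packets beyond the rate are delayed,
    which is exactly the jitter concern raised above."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root",
         "tbf", "rate", rate, "burst", burst, "latency", latency],
        check=True,
    )

def remove_cap():
    """Tear the qdisc back down, returning the NIC to defaults."""
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)
```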