Hello. Yeah. I'm Abed. I work on accelerator solutions and architecture at Marvell Semiconductor. We provide accelerated compute, networking, storage, and security solutions for data infrastructure, and we've been doing this for quite some time. The major market segments our solutions are used in are 5G carrier, enterprise networking and security, automotive, and data center. A recent trend is that all of our data-infrastructure-related market segments have started transitioning and converging with the cloud. The enterprise has become borderless. Automotive, 5G, and autonomous driving all need a cloud strategy. Carrier infrastructure is moving towards virtual RAN. And the data center, of course, is leading this transition. We have been addressing this convergence, and we are trying to enable what we at Marvell call the cloud-optimized silicon era.

I'll start with the DPU. We are coming from the infrastructure side, while most of the talks here are targeted towards applications, so I'll first introduce what we call the DPU. Modern DPUs, or data processing units, are a successor to what we call the SmartNIC. They consist of general-purpose CPUs, usually Arm in most of the DPUs available in the market today, as well as a set of workload accelerators. The accelerators target data infrastructure workloads: packet transformation functions, flow processing, connection tracking. And there are other accelerators for compute-intensive tasks like cryptography and AI/ML inferencing, and these continue to grow. Initially we had just cryptographic accelerators; now we have AI accelerators, and we have specialized accelerators for other workloads, such as carrier workloads.

The general-purpose Arm cores are also becoming very powerful. The OCTEON has around 6 to 36 Neoverse N2 cores on the chip, on the card itself, and you can use these cores to run general workloads. They all boot Linux, so you can run Linux workloads. To give an example, we are already able to run a CNI and Istio on the DPU itself; that's something we have been trying out. But that's not the intention. The intention is to use these specialized accelerators. To give an example of what these accelerators can do today: before a packet arriving on the wire ever reaches the Arm cores, it can be decrypted, detunneled, and inner-classified, and only then is it handed to the workloads. So a lot of work is already done inline, and it is very power efficient. Just to give an example, we are able to do 50 Gbps of IPsec traffic in less than 15 watts, irrespective of packet size. We also do L2 to L4 processing inline, and we already have a lot of TLS-related offloads. OCTEON has been used in a lot of enterprise devices for SSL gateways, intrusion detection systems, routing solutions, and 5G solutions, so it's already used in those segments.

So why service mesh? One of the definitions on the internet is that a service mesh is a dedicated infrastructure layer for making service-to-service communication safe, fast, and reliable. That aligns with Marvell's goal, which is to move, store, process, and secure the world's data faster and more reliably. And ambient mesh is a very interesting development for us.
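Before moving on to the service mesh part, here is a minimal, purely conceptual Go sketch of what that inline processing means for the software running on the Arm cores. Every name in it (the Packet struct, the handle function, the flag fields) is hypothetical and not a Marvell API; it only illustrates that once the hardware has decrypted, detunneled, and classified a frame, the general-purpose cores are left with a cheap lookup-and-dispatch step.

```go
// Conceptual sketch only: the types below are hypothetical, not a Marvell API.
package main

import "fmt"

// Packet models the metadata a hypothetical driver could attach to a received
// frame after the hardware has done decryption, detunneling, and classification.
type Packet struct {
	Decrypted  bool   // IPsec/TLS payload already decrypted inline
	Detunneled bool   // outer tunnel header already stripped
	FlowID     uint32 // result of inner classification
	Payload    []byte
}

// handle shows the work left for the general-purpose cores: no crypto,
// no tunnel parsing, just dispatch keyed by the hardware-supplied flow ID.
func handle(p Packet) {
	if !p.Decrypted || !p.Detunneled {
		// Slow path: hardware could not classify this flow, so software
		// would have to do the expensive work itself (omitted here).
		fmt.Println("slow path")
		return
	}
	fmt.Printf("fast path: flow %d, %d bytes\n", p.FlowID, len(p.Payload))
}

func main() {
	handle(Packet{Decrypted: true, Detunneled: true, FlowID: 42, Payload: []byte("hello")})
}
```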
So we have been looking at cloud native for some time. What the sidecar was able to do was isolate the infrastructure workloads as part of the sidecar processing. Ambient mesh has gone one step further and isolates them per node. Because this processing is already isolated into a specific compute element, it becomes very easy to transition it to a separate compute element as well; in this case, we are talking about the DPU. To give an example, I picked up this diagram from one of the ambient mesh blog posts and used it to make this one. What we envision is that all the common infrastructure workloads can move or transition onto the DPU, which is more specialized for that kind of processing. It can be policy lookups, or it can be cryptographic workloads; everything the DPU is geared to handle more efficiently and in a more power-efficient manner can be taken up by the DPU. We have done some experiments in this regard. Shatakshi here, as well as Garwit and Akhil in India, have run some experiments with ambient mesh, and we have come up with a prototype. Shatakshi will go over that prototype for the rest of the talk.

Hi, everyone. Good evening. My name is Shatakshi. I'm part of Marvell's accelerator solutions team, and I work on P4 programmable data planes, Kubernetes-related solutions, CNIs, load balancers, et cetera. Coming to the slide: we know that ambient mesh is a layered architecture where Istio's functionality has been distributed across the layers. All the data plane functionality that was previously handled by the sidecar deployed per pod is now shifted to a per-node and per-namespace basis. The secure overlay layer, ztunnel, is present on every node, and layer 7 is handled by the waypoint proxy, which can be one or more per namespace and can scale according to traffic demand. As seen in the study published on the solo.io blog, this architecture has eliminated almost 90% of the service mesh overhead.

We also ran some tests. We executed an ambient mesh setup with some basic functionality and gathered CPU utilization metrics. These are our profiling and synthetic test details: a virtual cluster, HTTP traffic, two packet sizes of 1 MB and 1 KB, all the default features enabled with mTLS, and a pod-to-pod packet transmission scenario.

This first graph is the case where only layer 4 is enabled: a server pod on one node and a client pod on another node, both in the same cluster. Here we can see that with the 1 KB packets, the server node's ztunnel used up to 0.4 CPUs and the client node's ztunnel about 0.35 CPUs. In the second case, with 1 MB packets, we saw 0.85 CPUs used by the server node's ztunnel and 0.75 by the client node's ztunnel. The next case is with both layer 7 and layer 4 enabled: the same two pods on two nodes in one cluster, plus one waypoint proxy in that cluster. We saw that the waypoint proxy's resource utilization was up to 0.74 CPUs in the first case, with 1 KB packets, and 0.69 in the second case, with 1 MB packets. So these are the numbers with only the basic functionality enabled. In future, as the functionality and features of ztunnel and the waypoint grow, these numbers may grow further.
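As a rough idea of how per-pod CPU numbers like these can be gathered, here is a minimal Go sketch against the Kubernetes metrics API. It assumes metrics-server is installed in the cluster; the app=ztunnel label and the istio-system namespace are assumptions about a default ambient install, and this is not necessarily the exact tooling used for the graphs above.

```go
// Minimal sketch: list CPU usage of ztunnel pods via the Kubernetes metrics API.
// Assumes metrics-server is installed and ztunnel pods carry the label app=ztunnel
// in the istio-system namespace (assumptions about a default ambient install).
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	// Build a client from the local kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	mc, err := metricsclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// ztunnel runs as a per-node DaemonSet; list its per-pod, per-container CPU usage.
	pods, err := mc.MetricsV1beta1().PodMetricses("istio-system").
		List(context.TODO(), metav1.ListOptions{LabelSelector: "app=ztunnel"})
	if err != nil {
		panic(err)
	}
	for _, pm := range pods.Items {
		for _, c := range pm.Containers {
			fmt.Printf("%s/%s: %s CPU\n", pm.Name, c.Name, c.Usage.Cpu())
		}
	}
}
```

The same listing with a waypoint selector would give the waypoint proxy numbers for the layer 7 case.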
So we are looking at and exploring multiple architectures, models, and ways to migrate these layers and their functionality to the DPU, because modern DPUs support functionality that directly overlaps with the features supported by ztunnel, and DPUs have the capability to accelerate those functions. Our target is to eliminate that resource demand on the server nodes, freeing them up so that they can serve additional application workloads.

We started a similar kind of study earlier with Cilium, and we are using the learnings from that model to come up with a similar design for the Istio ambient service mesh. We created a similar offload architecture for Cilium where we were able to shift the CNI, the data path, and the proxy entirely onto the DPU. So first we'll go through that architecture in detail; we were able to achieve this with Cilium. This is the detailed diagram of the full primary network offload on the DPU. We have introduced some plugins to transparently offload the CNI layer and its components onto the DPU, with no changes to the Kubernetes or application pod spec when the pod is deployed.

Let's go through the flow of this diagram. As soon as you deploy a user application pod, the CRI calls the CNI in the same fashion as it does today to create an interface for the pod. Our CNI offload layer intercepts that call and allocates an interface to your application pod. It then sends the details of that interface to the plugin deployed on the DPU. That plugin passes the data to the CNI plugin in exactly the way the CRI does, and the CNI plugin allocates the other side of the connection and attaches it to the eBPF data path. This connection is via a virtual function pair. This model is POC-ready.

We came up with a similar model for ambient mesh. This model is still in progress; we were able to run it on virtual clusters by doing everything manually, and we are going to automate it. Similar to the previous diagram, we have introduced a plugin to offload the architecture. Here too, as soon as you deploy the user pod, the transmission of data between the CRI and the CNI is intercepted by the CNI offload layer. This layer allocates one interface to your application pod and sends the data to the DPU over a gRPC connection. The plugin receives that data and gives it to the CNI plugin, and accordingly the CNI attaches the other side of the interface to the data path. Everything then works in the same fashion.

The last diagram I showed shifts ztunnel onto the DPU. So why leave out the waypoint? The waypoint does all the layer 7 processing, as we know, so let's see how the DPU can help there. DPUs have hardware blocks that can do policy enforcement: layer 7 policies can be accelerated using the specialized lookup blocks present on the DPU, and similarly, DPUs provide crypto accelerators for security and authorization protocols. So this is the final ambient mesh offload architecture, where the whole data path and its functionality are offloaded to the data processing unit. We can run not only the CNI but also ztunnel and the waypoint proxies on the DPU.
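A minimal sketch of the host-side interception step described above, written as a CNI shim using the containernetworking skel helper, might look like the following. The DPU agent address, the request fields, and the JSON-over-HTTP transport are placeholders (the prototype described here uses gRPC), and the skel entry point varies slightly across CNI library versions; this is an illustration of the flow, not the actual plugin.

```go
// Hypothetical CNI "offload shim": it receives the normal CNI ADD/CHECK/DEL calls
// from the container runtime and forwards them to an agent on the DPU, which runs
// the real CNI plugin there and attaches the other end of the VF pair to the data path.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"

	"github.com/containernetworking/cni/pkg/skel"
	"github.com/containernetworking/cni/pkg/version"
)

// Placeholder address of the DPU-side agent; the prototype uses gRPC instead.
const dpuAgentURL = "http://192.168.100.2:9000/cni"

type offloadRequest struct {
	Command     string `json:"command"`
	ContainerID string `json:"containerID"`
	Netns       string `json:"netns"`
	IfName      string `json:"ifName"`
	NetConf     []byte `json:"netConf"` // original CNI network config from stdin
}

func forward(cmd string, args *skel.CmdArgs) error {
	body, err := json.Marshal(offloadRequest{
		Command:     cmd,
		ContainerID: args.ContainerID,
		Netns:       args.Netns,
		IfName:      args.IfName,
		NetConf:     args.StdinData,
	})
	if err != nil {
		return err
	}
	resp, err := http.Post(dpuAgentURL, "application/json", bytes.NewReader(body))
	if err != nil {
		return fmt.Errorf("forwarding %s to DPU agent: %w", cmd, err)
	}
	defer resp.Body.Close()
	// The agent replies with a standard CNI result (interfaces, IPs) produced by
	// the plugin running on the DPU; relay it unchanged to the runtime.
	_, err = io.Copy(os.Stdout, resp.Body)
	return err
}

func main() {
	skel.PluginMain(
		func(a *skel.CmdArgs) error { return forward("ADD", a) },
		func(a *skel.CmdArgs) error { return forward("CHECK", a) },
		func(a *skel.CmdArgs) error { return forward("DEL", a) },
		version.All,
		"cni-offload-shim (illustrative sketch)",
	)
}
```

On the DPU side, a counterpart agent would invoke the real CNI plugin with the forwarded arguments, attach the other end of the virtual function pair to the eBPF data path, and return the resulting interface and IP details back over the same channel.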
So instead of having ztunnel on the servers and dedicated nodes for the waypoints, we can offload all of these components onto the DPU, which makes it easier to scale them as traffic demand increases or decreases. Abed will come up for the key takeaways.

Key takeaways. One, increased capacity. As of today, you can run a CNI on the DPU, and if you attach a DPU card you can even run application workloads on it if you want. But if you transition the infrastructure workloads onto DPUs, you have more capacity on the servers to run applications. There will be higher performance, because DPUs are tuned for these workloads, and lower power and cost. Initially, we can use the same eBPF data path, as well as the iptables data path, on the DPU to have this up and running on day 0. Then we can slowly transition to using the accelerators: protocol acceleration for TLS, IPsec, and other crypto workloads, and policy lookups, one by one. The target is to be ready for upcoming compute requirements. There will be developments like post-quantum cryptographic algorithms, which will throw a much higher compute requirement at us. The new Ethernet standards being defined require a lot of hardware acceleration that would take a big toll on server resources if not offloaded. And of course AI/ML is seeing a lot of new traction, so in the next few years a lot of AI/ML-related workloads may show up on the networking side as well. Having specialized accelerators take care of those will free up our servers and give us a more optimized cluster. Thank you. If there are any questions, we can take them.

Hi, thank you. This was very exciting. I was wondering what the current state is; it sounded like some of these things are planned for the future. Where are you at with the current state? And is this something that's going to be publicly available? Can I get one and play with it, or is it going to be private work that Marvell is doing?

No, we intend to open source all of this work. In fact, these are models we are developing, and we would love to hear the community's opinion on them. We already have a POC for Cilium, and hopefully for Istio as well; in the next few months we'll try to come up with a POC that we'll present and solicit comments on. The DPU is also available, and there will be hardware available that can be tried out.

All right, if there are no questions, thank you.

Hi, did you compare the networking performance of Istio against Cilium?

We have not reached the stage where we can do those studies. It's at a very early POC stage where we are first targeting the functionality move, and our initial aim is to get a framework where we can actually utilize the DPU; that's what we are working towards. Once we have that, we will definitely work on more power and performance comparisons. From what we have seen standalone, because we have been running all these workloads, whether it's routing or IPsec or TLS, we see a huge performance difference when we utilize the DPU. Of course, there are challenges in the models. The software models are not mature enough to utilize all the accelerators as of now, and that's the target we continuously work on with the community.
We've been trying to open source this. Initially, DPDK and VPP were the frameworks able to utilize the accelerators to their full potential; Linux and the cloud native stack are not there yet. I think some software frameworks are still missing, and we will work to fill that gap so that all the accelerations can be utilized, and then we'll start seeing those benefits.

Awesome. Any other questions, folks?

If you don't have a DPU lying around at your workplace and you want to simulate this and try it out, what is the easiest path? Is there virtualization for testing, or do we have to have the hardware at hand?

Currently, when we are doing development, we do it using virtual clusters, so I think the model can be tried out without the hardware as well. But to utilize the accelerations, simulation is not available; you'll need the hardware. There are in-house simulators, but they're not publicly available.

All right, awesome. Thank you, everybody. Thank you all. Thank you.