Hello everyone! I'm Ismo Puustinen, a software engineer at Intel Finland. It's pretty late in the afternoon already and I have some jet lag, so if I start speaking Finnish or something, please interrupt me as soon as possible. This talk is about hardware-accelerated service mesh. I have a talk outline here: first we are going to talk about hardware enabling in general and how it works in the cloud world. Then there are some specific technologies already present in Envoy as contrib extensions. And finally some thoughts about the Envoy contrib process, because I have been doing some work in that space and I might have some ideas there.

So what do I mean when I speak about special hardware enabling? Briefly, there are many industries which already make heavy use of special accelerator devices. For example gaming, where you have a GPU in your desktop; scientific computing, which more or less lives on different accelerators; data analysis, where if you want to, say, train a neural network, or even do very fast inference, you need some sort of acceleration; and security, where people use HSMs and various similar technologies.

The good things about all these extra devices: first I put better use of resources, by which I mean that if you are accelerating something on special hardware, you have more resources left to do whatever you need to do on your CPU. Then faster speed, which is usually the reason why acceleration is done in the first place. But this is actually a bit tricky, because the faster speed doesn't always apply to a single operation in isolation; it shows up under certain circumstances where you can actually utilize the acceleration properly. Then security features: things like HSMs, or running a trusted execution environment, simply can't be done without suitable hardware.

Then the bad sides of special hardware. The first is added complexity: if you have ever tried to use a GPU from a cloud instance, you know that it's not always a very simple thing to do. Then there is the maintenance burden: if you start adding code to, say, Envoy which utilizes this special hardware, suddenly you find yourself having to test that code, and if you don't have the hardware available while you are testing, that can really be a problem. Then you have more code and dependencies. Typically hardware acceleration is done so that you have a device driver somewhere, then a library which knows how to use the device driver, and then an application which uses the library; the library becomes a new dependency of the application. Then there is the deployment burden, which is a real problem especially in the cloud world, where you don't really control the hardware that much; later in this presentation we will show how this can be done. And one extra thing here is the performance measurement burden: once you have the acceleration in place, when you have everything done, you somehow need to convince yourself that it actually makes sense, that you are getting more performance out of it. That is also not necessarily the simplest thing.

And here is the problem domain where I'm working myself.
We have the cloud vendors, say Kubernetes, where you have resources, and the resources are CPU and memory. In a sense it's a good thing that they are there, but they are on a pretty high level. That's also the way Kubernetes wants to have them, because the idea is to abstract the hardware away to some extent. But then you have hardware vendors who have cool features, complex devices with all kinds of tuning knobs, and they want to expose those to the workloads somehow, so the workloads can do what they need to do in the optimal way. And between the two there's a gap. I think it's my job to try to bridge that gap. At Intel I talk to the library developers and the driver developers to make them understand what is required of them and of the libraries: what kind of complex environments they must support, what kind of dependencies they can have, what the documentation should look like. And I also talk to cloud vendors and try to tell them what is available and what is possible.

Then let's get to the hardware itself. Of course hardware can be very different in complexity. I think the easiest case we are accelerating in Envoy is AVX-512 instructions; they are being used in a couple of places, one of them being the Hyperscan extension and another the CryptoMB extension. In order to use them, all you really need is a CPU which can run those instructions. That's it. But when you get to FPGA devices, it's a completely different ballgame.

When you start doing hardware acceleration, you will also find tricky hardware and software interactions. I have CryptoMB as an example. The way the extension works is that when RSA operations come in, we put the data which we need to sign or decrypt into an array. The array has eight slots, and when it's full we run a single "single instruction, multiple data" instruction and process all of it at once; that's where the performance improvement comes from. If the buffer isn't full, or at least nearly full, you are not going to see much improvement at all. But if connections are coming in at a slow pace, you find that you can't afford to wait for the whole buffer to fill up. And since in Envoy every worker thread has a separate array of size 8, if you scale out your Envoy to run on, say, 40 cores with 40 worker threads, you suddenly have 40 times 8, that is 320, slots of data to be gathered before everything can be processed. Then you start seeing latency problems, and maybe even throughput problems, which is counter-intuitive, because you just started running on a much bigger machine.
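To make the batching idea concrete, here is a minimal C++ sketch of that buffer-of-eight mechanism. This is not the actual CryptoMB code; the names (`PendingOp`, `MultiBufferBatch`) are hypothetical, and in the real implementation a full buffer is handed to an AVX-512 multi-buffer RSA routine while a partially full buffer is flushed from an event-loop timer:

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// One pending RSA operation (sign or decrypt) waiting for a batch slot.
struct PendingOp {
  std::vector<uint8_t> input;                         // data to sign/decrypt
  std::function<void(std::vector<uint8_t>)> on_done;  // completion callback
};

// Per-worker-thread buffer of up to eight operations. A full buffer is
// processed in one multi-buffer pass; a partially full buffer is flushed
// by a timer (not shown) so slow connection rates do not wait forever.
class MultiBufferBatch {
 public:
  static constexpr std::size_t kSlots = 8;

  void enqueue(PendingOp op) {
    slots_.push_back(std::move(op));
    if (slots_.size() == kSlots) {
      flush();  // full batch: one SIMD pass amortized over 8 operations
    }
  }

  void flush() {
    // Placeholder: the real code hands all slots to an AVX-512
    // multi-buffer RSA routine that computes the lanes in parallel.
    for (auto& op : slots_) {
      op.on_done(std::move(op.input));  // echo the input for the sketch
    }
    slots_.clear();
  }

 private:
  std::vector<PendingOp> slots_;
};
```

The timer is exactly the latency trade-off described above: with one such buffer per worker thread, a 40-worker Envoy can have up to 320 operations parked in half-empty buffers, waiting for either more handshakes or a timer expiry.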
Okay, so this slide is where the hardware enabling stuff can be added in the cloud world. At the bottom there is the kernel. That's the obvious place, and people ask me: why don't you just add the support in the kernel and make it completely transparent, why do you need anything else? The problem with the kernel is that it works in some cases but not in others, because typically many of the things which need acceleration don't happen at the kernel level. An IPsec tunnel is one example of a case where we actually can accelerate things without applications having to know anything about it, because IPsec can just be made fast.

Then you have libraries. This is a better place to do acceleration, because here you have more control over what you're going to do in user space. A library also knows something about what it's trying to do, and it runs in process context. The problem is how the library is actually built. For example, OpenSSL is one way you might do cryptographic acceleration. OpenSSL has a concept called engines, which are shared objects that can be dynamically loaded into any application using OpenSSL. You can configure an engine from the outside using an OpenSSL configuration file, and it actually works really well. The problem is that in order to get a performance benefit from it, you typically need to run your application in asynchronous mode, and we will come back to this in a moment. But how many OpenSSL applications run in asynchronous mode by default, or even have a switch to turn it on? The answer is: not many.
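As a hedged illustration of what that switch looks like, this sketch loads an OpenSSL engine by id and puts an SSL_CTX into asynchronous mode. The engine id "qatengine" is used as an example, and error handling is kept minimal:

```cpp
#include <openssl/engine.h>
#include <openssl/ssl.h>

// Load an engine by id and create a server SSL_CTX in asynchronous mode.
// Returns nullptr on failure.
SSL_CTX* make_async_server_ctx() {
  ENGINE_load_builtin_engines();
  ENGINE* e = ENGINE_by_id("qatengine");  // example engine id
  if (e == nullptr || ENGINE_init(e) != 1) {
    return nullptr;
  }
  ENGINE_set_default_RSA(e);  // route RSA operations through the engine

  SSL_CTX* ctx = SSL_CTX_new(TLS_server_method());
  if (ctx != nullptr) {
    // With SSL_MODE_ASYNC, a handshake can return SSL_ERROR_WANT_ASYNC;
    // the caller is then free to do other work and retry the handshake
    // once the engine reports that the offloaded operation finished.
    SSL_CTX_set_mode(ctx, SSL_MODE_ASYNC);
  }
  return ctx;
}
```

Without that mode the engine still works, but the application blocks on every operation and sees little of the benefit, which is the point being made above.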
Then comes the middleware. This is maybe the most interesting part, because this is where Envoy comes into the picture. The idea is that the middleware does things for many other applications. Envoy was called earlier today the Swiss Army knife of network operations, and here it really works as one on behalf of a lot of other software: if you are running a service mesh and you can accelerate Envoy, you get the service mesh accelerated, and if another service mesh also uses Envoy, it gets accelerated too. This makes Envoy a very promising place to do hardware acceleration. Another place is container runtimes, because they run the containers, so there are many opportunities to do acceleration or other hardware-related things there.

Finally, at the top, you have the applications. In a sense that's the typical place: if you have a computer game, it knows how to do the acceleration using the GPU, because it knows what it needs to do. You can also get real performance data out of an application, which you typically can't get from the pieces lower in the stack: you can run an application against specific benchmarks and see how much benefit the end user would actually see. The problem, of course, is that there are a lot of applications in the world, written in all possible programming languages and frameworks, so it's not a realistic goal to add hardware support to even a subset of them.

Then, on to the actual technologies. First, Intel QuickAssist Technology (QAT). This is present in 4th Gen Intel Xeon Scalable processors, which means it's a feature coming in the next generation of processors; it is not yet available in the processors you can buy from the shop today. During the summer, TLS handshake acceleration using QAT was merged into Envoy as a contrib extension. There is also support for using QAT to accelerate compression, but that's not part of Envoy yet; it's more experimental code.

So what does QAT really do, and how is it used by Envoy? Envoy has a concept called private key providers; it's an extension class. What a private key provider can do is turn synchronous handshakes asynchronous. Typically, if you think of BoringSSL, which has the SSL_do_handshake function: you call it, it runs synchronously, and it returns a success or error value. That's not very good if you want to do hardware acceleration, because acceleration depends so much on being able to run things in parallel. That's how QAT gets its speed, and that's also how CryptoMB works; remember, we were just talking about that array of size 8 where cryptographic data is gathered to be operated on.

What the private key provider does is this: when the handshake comes in, BoringSSL, with the private key methods that the private key provider has registered, calls a function inside the private key provider. That function executes and can return a value meaning "in progress", which means SSL_do_handshake also returns "in progress", and Envoy is left waiting for the handshake; the worker thread can at this point do other work and be useful in some other way. Then there's a big black box of offloaded processing, which in this case means the special hardware is doing something. When it's done, it calls a callback, Envoy asks for the handshake again with a second call to SSL_do_handshake, a completion function is run, the private key provider reports that the handshake operation is complete, and then it's done.

How does this compare to OpenSSL engines? The engines can do a bit more: they can also do key exchange operations, and they can do symmetric encryption, meaning AES. For both CryptoMB and QAT, what is done is mostly just RSA signing and decryption operations.
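The mechanism underneath is BoringSSL's private key method hook. Here is a hedged sketch of how that retry/complete flow fits together; the `offload_*` helpers are hypothetical glue to an accelerator library and are left as declarations, since the point is the control flow, not the offload itself:

```cpp
#include <openssl/ssl.h>  // BoringSSL headers

// Hypothetical accelerator glue (declarations only for this sketch).
void offload_enqueue_sign(SSL* ssl, uint16_t sig_alg, const uint8_t* in,
                          size_t in_len);
void offload_enqueue_decrypt(SSL* ssl, const uint8_t* in, size_t in_len);
bool offload_ready(SSL* ssl);
bool offload_take_result(SSL* ssl, uint8_t* out, size_t* out_len,
                         size_t max_out);

// sign() and decrypt() only enqueue the operation and answer "retry",
// which makes SSL_do_handshake() return early so the worker thread can
// do other work. Once the hardware callback fires, Envoy retries the
// handshake and BoringSSL calls complete() to pick up the result.
static ssl_private_key_result_t my_sign(SSL* ssl, uint8_t* out,
                                        size_t* out_len, size_t max_out,
                                        uint16_t sig_alg, const uint8_t* in,
                                        size_t in_len) {
  offload_enqueue_sign(ssl, sig_alg, in, in_len);
  return ssl_private_key_retry;  // handshake reported as "in progress"
}

static ssl_private_key_result_t my_decrypt(SSL* ssl, uint8_t* out,
                                           size_t* out_len, size_t max_out,
                                           const uint8_t* in, size_t in_len) {
  offload_enqueue_decrypt(ssl, in, in_len);
  return ssl_private_key_retry;
}

static ssl_private_key_result_t my_complete(SSL* ssl, uint8_t* out,
                                            size_t* out_len, size_t max_out) {
  if (!offload_ready(ssl)) {
    return ssl_private_key_retry;  // hardware not done yet
  }
  return offload_take_result(ssl, out, out_len, max_out)
             ? ssl_private_key_success
             : ssl_private_key_failure;
}

static const SSL_PRIVATE_KEY_METHOD kAsyncKeyMethod = {my_sign, my_decrypt,
                                                       my_complete};
// A provider registers this per connection with
// SSL_set_private_key_method(ssl, &kAsyncKeyMethod).
```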
Okay, and now comes the full Kubernetes stack part: how a QAT device, or any other hardware accelerator device, can be exposed to an Envoy container, and how Envoy can use it. First of all we have the control plane. The control plane has two jobs here: it needs to create the extended Kubernetes resource, the type of resource that needs to be added to the workload, and it needs to configure Envoy so that Envoy knows how to use that particular resource.

If you look at the stack below, you have the QAT driver first; that's the bottom part. On top of that you have a Kubernetes device plugin. Let's do a quick show of hands: how many of you have used device plugins in Kubernetes? Okay, a short recap then. Device plugins typically run as DaemonSets, so they run on every node, or of course you can run them only on the nodes which happen to have the hardware. The first thing they do is enumerate the hardware: they know how to talk to the device driver, or how to get the information from the sysfs file system or somewhere, and they add to the node description an entry in the resources list saying how many of a particular resource the node has. For example, the node could say: I have 32 QAT encryption resources. The next step is that when you have a resource request in a Kubernetes deployment, the Kubernetes scheduler knows, based on this data on the node, that it can actually schedule the workload there; so the workload is scheduled to a node which has the QAT resources available.

Then comes the second part of how the device plugin works. Kubelet talks to the device plugin and tells it: now you need to allocate one QAT virtual function to this particular workload. The device plugin does that, and kubelet passes to the runtime the data the device plugin returned: create this particular device file in the container, set these environment variables, set these volume mounts, do those things. After that you get the green box, the Intel QAT virtual function device, in the container, and it's available for Envoy to use.

The second job of the control plane is to understand which containers actually have these QAT resources, and thus the device files, available in them, so that it can configure Envoy to take those resources into use. So the control plane needs to synchronize two things: the Kubernetes deployment of the workload, and the QAT configuration. In this case it would send the private key provider specification over the SDS protocol, and after that Envoy would start, or maybe it was running already; it would then use qatlib, the library that uses the QAT device file, to actually do the acceleration. So this is the complexity which gets added, and there really is some, but it mostly falls to the control planes to implement all this. In Istio we have support for templates which you can set for different workloads: you just add an annotation to the workload, and the annotation says to set up QAT for this particular workload; after that it adds the required resources and does the configuration as it should.

Okay, then about the acceleration itself. What affects, say, TLS acceleration with QAT and CryptoMB? The answer is many factors, but the key thing is that you need to do a lot of cryptography to see a lot of benefit. If you just have a few new TLS handshakes coming in every now and then, you don't really need this. But if you have a lot of operations to do, a lot of new TLS handshakes coming in, then you start seeing the benefits, because there are that many more operations to be done. The same goes for the TLS keys: if you have a long RSA key, you find yourself doing a lot more cryptography, and you start getting a lot more benefit from the acceleration. Another thing is that if you run Envoy on a smaller CPU set, you can see the performance benefits more clearly, because the Envoy CPUs become saturated sooner than in the accelerated case.
Okay, then I have some Intel QAT performance numbers. We have a test setup, but the important thing I was asked to say is that the numbers you will see were measured on pre-production hardware; the real hardware might show different numbers. Especially important is that these are Envoy QAT numbers, so you can't extrapolate from these what QAT performance might look like in a different setup.

This is measured with a k6 load generator, and k6 is set up so that it just generates new TLS connections in an open fashion. Envoy is running with an RSA 2K key. We have three test configurations: QAT, set up using a single QAT VF, or instance; CryptoMB, running with AVX-512 instructions; and the default, which is no acceleration. Envoy is pinned to a CPU set of limited size, and finally Envoy is configured to return just an empty response whenever it handles a request. The way Envoy is pinned to the CPU sets is that we are using hyper-threads, and the hyper-threads come from sibling cores, meaning that if you are running on four hyper-threads, for example, two real physical cores are being used.

This is the requests-per-second throughput: the X axis is the number of CPU threads, meaning the hyper-threads, and the Y axis is the number of new handshakes which could be processed. Maybe one takeaway here: the reason why the gray bar, the plain-CPU case, is not linear, even though Envoy actually scales close to linearly with plain CPU processing, is those sibling cores; the case of two CPU threads is roughly the same as one, because they come from the same physical core. QAT performs well, but CryptoMB starts to take over when we reach a hyper-thread count of 8. I think more important than the absolute numbers here are the relations and how the trends go for the bars.

But I think these are the more important graphs. On the left side you have latency, and on the right side CPU utilization; this data is from the same set as the previous slide, for the case of four CPU hyper-threads. What you see for latency is that the last value for the non-accelerated and CryptoMB cases just rises high; in a sense you should never measure latency on a fully loaded system, it says so in the Envoy benchmarking guide, so you should only look at the numbers below that point. What we can see is that the acceleration works pretty well: you get better latency for a long time before QAT reaches the level where the non-accelerated case already was. But I think the really interesting part is the right-hand side, where you see that Envoy without acceleration behaves very linearly: as the number of requests per second increases, the CPU usage scales linearly. If you are accelerating, you don't need to use that much CPU; instead you can use the CPU for doing something else, such as running HTTP filters and processing the data in some way. And that's maybe the key benefit of offloading things: you get more speed, possibly better latency, but above all you get more room on your CPUs for other processing.
The next technology is Intel DLB, the Dynamic Load Balancer. That's also an Envoy contrib extension. It provides hardware-managed producer/consumer queues: a load balancer without having to do the balancing in software. If you set exact connection balancing in Envoy, it means that when a connection comes in, the worker thread which happens to receive it (there are some steps before it gets to that worker thread) takes a lock, finds out which worker thread has the least number of connections it's processing, gives the connection to that worker thread, and then releases the lock. The idea with DLB is that instead of having to do the locking, there is a hardware device, a PCI device on the processor, that actually does the load balancing. All the worker thread needs to do is write to a socket; the request goes to the DLB device, which decides with its own algorithm which thread should now get the connection, and then sends it on to a socket of that thread, which can then take the connection for processing. So the idea is that you save on locking, and if you have a large number of connections to balance, this might make sense.
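For contrast, here is a software sketch of that "exact balance" idea as just described (simplified, not the actual Envoy class). Every rebalance takes a global lock to scan for the worker with the fewest connections; DLB's pitch is to replace this locked scan with hardware-managed queues, so a worker just enqueues the connection and the device picks the target thread:

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

class ExactBalancer {
 public:
  explicit ExactBalancer(std::size_t workers) : counts_(workers, 0) {}

  // Called on the thread that accepted the connection.
  std::size_t pick_worker() {
    std::lock_guard<std::mutex> lock(mu_);  // the contention DLB removes
    std::size_t best = 0;
    for (std::size_t i = 1; i < counts_.size(); ++i) {
      if (counts_[i] < counts_[best]) {
        best = i;
      }
    }
    ++counts_[best];
    return best;  // hand the connection to this worker's queue
  }

  void on_connection_closed(std::size_t worker) {
    std::lock_guard<std::mutex> lock(mu_);
    --counts_[worker];
  }

 private:
  std::mutex mu_;
  std::vector<std::size_t> counts_;  // active connections per worker thread
};
```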
Then, at the end, a few words about the Envoy contrib experience. The CryptoMB extension was the first one I wrote, and it came into the world right at the point where the Envoy contrib process was being introduced. At first I was thinking, okay, now it's a second-class citizen, but then I found out that it's actually a pretty good deal to have it in contrib. It turned out that many of our customers are compiling their own Envoy anyway, and even though CryptoMB wasn't in the main Envoy images, it was easy enough to compile in, test out, and also take into production use. The QAT and DLB extensions were put into contrib in the same way.

However, plans have been changing since then, because, as I said in the beginning, having these hardware-specific extensions in Envoy contrib is actually difficult. The biggest reason is that it's very difficult to test those features, to see whether they actually work and what kind of corner cases there could be, because there just isn't hardware available in CI which could run them. Right now there isn't really a good solution to this problem, especially because the hardware needed for DLB and QAT is not out yet. So Intel started a new Envoy fork, at github.com/intel/envoy. The idea is not to fork Envoy and run with it, but rather to provide a place that gives these extensions some time to grow more mature, and to give our customers and other interested parties a chance to try them out if they like. The plan for the future is that for everything we have in the fork, at some point we would drop the extensions if they don't find users or if they somehow remain incompatible with Envoy, and the rest we are going to contribute to Envoy once we have the real-world testing done with customers.

Upstream contrib is a much better place than a private repository. The reason, of course, is that users want to have stuff in the upstream repository; whether it's in the contrib directory doesn't really matter. Private repositories are a bit suspicious, because if there is, say, a security issue in Envoy, it will take some time before the fix lands in the private repository, and that's not good. Also, you know that real Envoy maintainers have reviewed the code in the Envoy main repository, and that's a big plus. And anything in Envoy contrib is included in the Envoy contrib Docker images, which is way better than having the images in some personal namespace of my own when you want to ask somebody to actually try these things out. And then there's the selfish thing: if you have code upstream, it doesn't break that easily, because it will be taken into account when interfaces are changed; when things are in motion upstream, my code keeps working without me even noticing.

And some notices and disclaimers for the performance results you saw, and this was the test setup. So, thank you everybody. Any questions?

Sorry, I didn't follow. The question was that AVX-512 has a known behavior where the core frequency drops when the wider instructions are being used, which is due to many reasons, thermals definitely being one of them, and whether that was seen here, or what the drop was in this particular test case. The answer is: I didn't measure it. The tests were done by me, so I take full responsibility for that.

The next question is about processor topology: what happens when you run an AVX-512 workload and some other workload on the same CPU. The processor has an internal topology; I'm not talking about this particular processor, but say the current 3rd generation ones. The frequency drop mostly depends on how many of these wider instructions are being executed at once, how dense they are; the core then goes to some license level depending on that, and not all of the cores go to that level. So the answer is that it might affect the other workload, or it might not. There have been attempts to mitigate this, for example by introducing CPU pools, so that you would run one thing on an AVX pool and something else on another pool, but they didn't catch on; I think that was part of this gap between the cloud and hardware thinking. Any other questions?

The question was about performance analysis: how do you know how much time you are spending on cryptography compared to something else, and where your CPU cycles are going. There has been work in Envoy to enable the symbols for perf measurement, so you can make flame graphs which show you, for example, that 80% of the processing time is being spent in one place, and that's actually pretty close to the real number: if you are just doing HTTPS requests, running them very fast, all over new connections, then in my experience about 80% is the part Envoy uses for doing the RSA.