Hello, everyone. Welcome to our talk. Today, my colleague Zhang Fan and I will introduce the Zero Trust Cloud VPN based on VPP and WireGuard. My name is Ni Hongjun. I'm a software engineer from Intel, working on VPP cloud networking and network security.

This is today's agenda. First, I will introduce the industry challenges for network security, and then we will introduce the Zero Trust Network Access architecture and its key functions. Then we will propose our reference architecture for the secure gateway and illustrate the details of each component. After that, my colleague Zhang Fan will introduce VPP and WireGuard, and then the WireGuard optimizations. Finally, we will summarize all the topics.

First, the industry challenges for networking and security. As you know, with the epidemic, more employees are working from home. As a result, more users, devices, application services and data are moving from the traditional enterprise perimeter to different locations, such as the edge, the home and the data center. Here are some numbers that represent the trend. More than 50% of companies have more employees working at home and remotely. More than 60% of workers use their own devices when they are working at home. More than 80% of devices are not managed by the enterprise IT administrators. And more than 50% of company application services are running in the cloud right now. Yet around 10% of organizations have reported that they are not aware of what these devices are accessing on their networks. All of this poses many challenges to enterprise networks, so we need gateways and solutions to address these problems.

Okay, this is the Zero Trust Network Access architecture that addresses these industry challenges. It was first proposed about ten years ago, but in recent years it has become more popular and is becoming predominant in the industry. There are some key components in this architecture. First is the secure gateway. It builds secure channels with the remote employees, and the remote employees access legacy and internal services through the secure gateway. New applications or services can run directly on private servers, and those private servers can communicate with the employees directly. Some clients instead access a load balancer over a plain connection, and the load balancer provides services to these clients via the application servers; the load balancers and application servers build secure channels with each other. The control plane is a central point, which provides access grants for the clients, service discovery and key management. It also provides key distribution to the different parties.

Okay, here is the function split for Zero Trust Network Access. There are two main parts: the control plane on the top and the data plane on the bottom. Let me explain the control plane first. The control plane provides rich features; I will explain them one by one. For user identity, it gets IDs from third-party identity providers, and these IDs are leveraged to verify that a client is a valid one. For mutual authentication, to build a secure channel, the client and the control plane first need to authenticate each other to make sure the peers are valid.
The trusted execution environment provides a secure environment for running applications such as Vault, and for storing and protecting keys. Message security is used to build a secure control channel between the control plane and the data plane. For automatic service distribution, we leverage the service discovery component to discover each party automatically. Policy control provides policies to the different data-plane components and also verifies the clients. Key management stores the keys from the data plane, and key distribution delivers keys to the different data planes.

On the bottom is the data plane, which also provides rich features. Besides the normal forwarding and routing features, it provides message security, which is used to build a secure control channel to the control plane and to exchange keys with it. Service discovery likewise provides automatic discovery among the different data planes. DDoS protection provides defense against DDoS attacks. The data security feature builds a secure data channel between the different network functions, so that traffic can be delivered inside that channel. Policy enforcement works as an enforcement point, which accepts policies from the control plane and executes them on each flow. Data inspection identifies the applications to provide visibility for the data plane. And threat detection detects malware and defends against it.

Now we propose a new secure gateway reference architecture for Zero Trust Network Access. There are three network functions here; we will explain them one by one. On the left side is the client. Its key features are WireGuard in the operating system and the wgsd client, the WireGuard service discovery component. The client can run on different operating systems: Linux, Windows, iOS, Android, etc.

On the right side is the secure gateway. Here we leverage VPP, an open source project, to provide the rich features for the secure gateway. One key feature is WireGuard, which provides the secure channels towards the client and the controller. We leverage TADK to provide data inspection and threat detection. We integrate different control-plane daemons for key exchange, such as the wgsd client to exchange information with the controller, and we use the FRRouting daemon to provide routing features, with Snort for additional inspection and threat detection.

On the top is the controller, with several key features. It leverages the Linux kernel WireGuard to build secure control channels between the client and the controller, and between the secure gateway and the controller. The CoreDNS-based wgsd server provides the DNS service for service discovery, and for key exchange and distribution between the client and the secure gateway. We leverage Intel SGX to provide the trusted execution environment; Vault runs inside it, gets signed keys and certificates from the CA, and gets IDs from the third-party identity providers.

So that's the reference architecture for our solution.
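To make the service-discovery step concrete, here is a minimal sketch of how a client could look up a WireGuard peer through DNS-SD, the mechanism the wgsd plugin builds on: peers are published under a "_wireguard._udp" service zone, and the SRV record carries the peer's endpoint. The zone name below is hypothetical, and real wgsd additionally publishes the peer's public key and allowed IPs in accompanying records; this sketch only resolves the SRV endpoint with glibc's resolver.

```c
/* Sketch: resolve a WireGuard peer endpoint via DNS-SD, as a wgsd-style
 * client would.  Zone name is a placeholder.  Build: cc wgsd_srv.c -lresolv */
#include <stdio.h>
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <resolv.h>

int main(void)
{
    unsigned char answer[4096];
    ns_msg handle;
    ns_rr rr;

    res_init();
    int len = res_query("_wireguard._udp.zt.example.org", ns_c_in, ns_t_srv,
                        answer, sizeof(answer));
    if (len < 0) {
        fprintf(stderr, "res_query failed\n");
        return 1;
    }
    ns_initparse(answer, len, &handle);
    for (int i = 0; i < ns_msg_count(handle, ns_s_an); i++) {
        if (ns_parserr(&handle, ns_s_an, i, &rr) < 0 || ns_rr_type(rr) != ns_t_srv)
            continue;
        /* SRV RDATA layout: priority(2) weight(2) port(2) target(name) */
        const unsigned char *rd = ns_rr_rdata(rr);
        unsigned port = ns_get16(rd + 4);
        char target[NS_MAXDNAME];
        dn_expand(ns_msg_base(handle), ns_msg_end(handle), rd + 6,
                  target, sizeof(target));
        printf("discovered WireGuard endpoint: %s:%u\n", target, port);
    }
    return 0;
}
```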
Here's the roadmap for the secure gateway. There are many layers; we will explain them from the bottom to the top.

On the bottom is the hardware layer. Here we leverage the CPU, NIC, IPU, QAT and Tofino switch drivers to provide a high-performance data plane. The second layer is the operating systems layer: the secure gateway can run on different operating systems such as Debian, Ubuntu, Red Hat, SUSE, etc. The third layer is the common libraries layer. Here we leverage open source projects such as DPDK, VPP and TADK to provide the basic features for the secure gateway. On the network layer, we provide IPv4 and IPv6 routing and integrate with the FRRouting daemon for complete routing features; we also provide elephant-flow detection and distribution. On the connectivity layer, we offer different secure protocols such as IPsec, WireGuard, SSL, QUIC, etc. We also integrate different control-plane daemons, such as wgsd and strongSwan, to provide the control-plane message exchange. For the security services layer, we leverage TADK to offer data inspection and threat detection; Snort is also being integrated with the data plane. For the management layer, we provide different northbound interfaces such as NETCONF and RESTCONF, and we provide agents to integrate with different controllers and orchestration projects, for example OpenNESS for edge services, Kubernetes for cloud integration, and OpenDaylight as the SDN controller.

Okay, now we will explain the detailed data flow. There are many steps; we will walk through them step by step. In step one, we leverage Intel SGX to provide the trusted execution environment, and Vault runs inside that trusted execution environment. In step two, Vault gets signed keys and certificate credentials from the CA. In step three, Vault gets the detailed IDs from external third-party providers such as Azure AD, G Suite, WeChat, DingTalk, etc. In step four, the CoreDNS-based wgsd server runs on the controller; it provides service discovery and key exchange to the different data planes. In step five, the wgsd client runs on the client and on the security gateway separately. In step six, the client builds a secure WireGuard control channel with the controller, and the security gateway builds its own separate WireGuard channel with the controller. Once these secure WireGuard control channels are up, the client, the secure gateway and the controller exchange messages through them. In step seven, through the WireGuard channel between the client and the controller, the client sends a DNS discovery query to the controller. In step eight, when the controller gets this query from the client, it verifies the client ID from the message against the external third-party identity providers, to make sure the client is a valid one. In step nine, the controller checks the policy control to make sure the client has the right to access the requested services. In step 10, if the result is okay, the controller distributes the WireGuard peer information between the client and the secure gateway. In step 11, the client initiates the handshake to the secure gateway to build the secure data channel. Finally, in step 12, the client and the secure gateway can deliver traffic between each other inside the secure data channel.
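As a rough illustration of steps 10 and 11 from the client's side, the sketch below installs the peer information received from the controller into a kernel WireGuard interface; the first packet sent through the tunnel then triggers the Noise handshake toward the gateway. The interface name, key, and endpoint are placeholders, and shelling out to wg(8) stands in for what a production client would do over the WireGuard netlink API.

```c
/* Hypothetical client-side glue for steps 10-11: apply the peer data returned
 * by the controller so the kernel can initiate the WireGuard handshake.
 * All values are placeholders; error handling is minimal for brevity. */
#include <stdio.h>
#include <stdlib.h>

static int add_wg_peer(const char *ifname, const char *pubkey,
                       const char *endpoint, const char *allowed_ips)
{
    char cmd[512];
    snprintf(cmd, sizeof(cmd),
             "wg set %s peer %s endpoint %s allowed-ips %s persistent-keepalive 25",
             ifname, pubkey, endpoint, allowed_ips);
    return system(cmd); /* returns 0 on success */
}

int main(void)
{
    /* These values would come from the controller's DNS response (steps 7-10). */
    return add_wg_peer("wg0",
                       "gAtewayPubKeyBase64ExampleOnly0000000000000=",
                       "203.0.113.10:51820",
                       "10.8.0.0/24");
}
```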
Here we will explain how Vault runs on SGX, taking Auth0 as the example identity provider. There are two areas here. One is the untrusted area, on the bottom, where the host operating system, such as Linux, runs. On top of the host Linux is the trusted, protected area, running within SGX to protect customer applications. Occlum is used to provide a memory-safe library operating system (LibOS). It enables legacy applications to run in the trusted execution environment with little or no modification of the source code, so customer applications can run on it with little effort. Here we take Vault as the example: Vault runs on Occlum transparently.

Let me explain the details. In step one, when a client wants to be verified, it sends a message to Vault to authenticate with its Auth0 credential. In step two, Vault communicates with the external third-party identity provider, Auth0 here, to verify that the credential is a valid one. If the credential is valid, Vault accesses the policy controller to get policies and attaches these policies to a token. Then Vault returns the token to the client.

Okay, that's all for my part. Now I welcome my colleague Zhang Fan to illustrate more about the VPP WireGuard implementation, the WireGuard optimizations, and other features. Welcome, Zhang Fan.

Thank you. Thanks, Hongjun, for the nice speech. Now it's my turn. My name is Fan Zhang. I'm a network software engineer at Intel, and I've been working on crypto-related work for DPDK and FD.io VPP for over six years. Today I will be introducing the FD.io VPP WireGuard optimization.

Before getting into the details, I want to briefly describe what FD.io and VPP are. For those who don't know, FD.io is an open source project under the Linux Foundation umbrella. FD.io hosts a number of open source projects, and VPP is the biggest one of them. So what is VPP? VPP is the acronym for Vector Packet Processing. It is a universal Layer 2 to Layer 4 network stack data-plane application. It has Linux and FreeBSD support, and it supports a great number of user-space, kernel-bypass network interface card drivers. To make it run well in container and virtualization environments, it also has kernel interfaces such as netmap and TAP. So it can easily be configured as a network appliance, network infrastructure, VNF or CNF application.

Now the FD.io VPP architecture. As you can see in the graph on the right, it is a pluggable, easy-to-understand and easy-to-extend architecture. Every sub-functionality block is organized inside VPP as a graph node. You can treat a graph node as the entry for packet input, or as a middle-layer processing stage, and have packets processed by linking graph nodes of certain functionality below the packet-input graph node. This makes the graph-node architecture very easy to reconfigure and reorganize, and it is easy to plug in your own new graph node.
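To give a feel for this graph-node architecture, here is a minimal sketch of a VPP plugin node. It is heavily simplified compared to real VPP nodes (no dual-loop unrolling, no tracing, no error counters), the node name is made up, and it simply forwards every packet in the incoming frame to a single next node.

```c
/* Minimal sketch of a VPP plugin graph node: receive a frame (a vector of
 * buffer indices), touch each packet, and hand the whole frame to one next
 * node.  Real nodes pick a next index per packet and unroll their loops. */
#include <vlib/vlib.h>
#include <vnet/vnet.h>

typedef enum { MY_NEXT_DROP, MY_N_NEXT } my_next_t;

VLIB_NODE_FN (my_node) (vlib_main_t *vm, vlib_node_runtime_t *node,
                        vlib_frame_t *frame)
{
  u32 *from = vlib_frame_vector_args (frame);
  u32 n_left = frame->n_vectors;

  while (n_left > 0)
    {
      vlib_buffer_t *b = vlib_get_buffer (vm, from[0]);
      (void) b; /* ... inspect or rewrite the packet here ... */
      from += 1;
      n_left -= 1;
    }

  /* For brevity, everything goes to the drop next node. */
  vlib_buffer_enqueue_to_single_next (vm, node,
                                      vlib_frame_vector_args (frame),
                                      MY_NEXT_DROP, frame->n_vectors);
  return frame->n_vectors;
}

VLIB_REGISTER_NODE (my_node) = {
  .name = "my-node",
  .vector_size = sizeof (u32),
  .type = VLIB_NODE_TYPE_INTERNAL,
  .n_next_nodes = MY_N_NEXT,
  .next_nodes = { [MY_NEXT_DROP] = "error-drop" },
};
```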
As a user, you can easily implement your own plugin to take full control: replace an existing graph node with your own plugin graph node, or insert your own graph node between any existing graph nodes to reorganize the processing pipeline for your own purposes. Plugin graph nodes are equal citizens with all existing graph nodes, which gives quite a lot of flexibility.

Performance is utterly important for network stack processing. With VPP, you can easily achieve L2 cross-connect forwarding at over 16 million packets per second per CPU core. With the latest Ice Lake CPUs, you can easily achieve over 20 million packets per second per core for the L3 forwarding application. It is also deterministic: it can achieve zero packet drops, and the latency is roughly 15 microseconds. We ensure the performance and latency results stay optimal at all times through daily, weekly and monthly tests under another open source project, called CSIT, which runs those tests constantly. It also has great scalability. VPP achieves linear scaling with core and thread counts, which means that by adding a core you get a fixed amount of extra throughput. It also supports millions of concurrent L2 or L3 table entries, which means a million-flow use case works perfectly under VPP.

It also has a very friendly interface for developers. You can check the runtime counters for literally everything inside VPP: the cycle cost for each graph node to process a packet, the throughput, the IPC data, and the error counters for any packet coming into VPP. You also get a full pipeline tracing facility by adding a tracer at the input node, which means you can see the whole lifetime of a packet entering and exiting VPP. VPP also has multi-language API bindings implemented inside it, making it work nicely with other languages. Finally, VPP has its own command-line interface, which you can use to easily inspect and reconfigure the whole VPP processing pipeline.

Coming to VPP WireGuard. VPP WireGuard, as its own set of graph nodes, has an API interface and a CLI command interface that allow you to do all kinds of WireGuard configuration inside VPP. On the right-hand side, the data structures of WireGuard are abstracted in the three yellow boxes: a timer wheel controlling each and every peer's lifetime; the peer data structure, which contains all the IP and key information for a peer; and the interface data structure, through which it interacts with the other graph nodes. In the blue boxes are the implementations of the WireGuard control path. You can handle the IP-binding cookies easily with VPP; all the key generation, using the Curve25519 elliptic-curve algorithm and the BLAKE2s hash, can be done inside VPP; and the key exchange on top of the Noise protocol happens automatically inside VPP. These help you communicate with other WireGuard peers easily. When the peer handshake is finished and the keys are exchanged, it comes to the red boxes of data-path processing. VPP WireGuard currently supports outbound encap and inbound decap, along with all the related stack operations including the crypto. For the crypto, VPP WireGuard utilizes VPP's crypto infrastructure.

The crypto infrastructure inside VPP is a shim layer that binds different crypto engines for different crypto algorithms. Currently, to support WireGuard, we have three crypto engines: intel-ipsec-mb (the IPsec multi-buffer library), OpenSSL, and Intel QAT, the QuickAssist Technology dedicated crypto accelerator card. With the help of intel-ipsec-mb and QAT, we can do ChaCha20 encryption and decryption, and Poly1305 authentication, tag generation and verification.
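As a standalone illustration of the AEAD those engines implement, here is a sketch of the ChaCha20-Poly1305 encryption that protects WireGuard's data channel, written against OpenSSL's EVP API (one of the three engines just mentioned). The key and nonce are dummies; in WireGuard they come from the Noise handshake and a per-packet counter.

```c
/* Sketch: ChaCha20-Poly1305 AEAD with OpenSSL EVP, the operation VPP's
 * crypto engines perform on each WireGuard data packet.  Dummy key/nonce. */
#include <stdio.h>
#include <openssl/evp.h>

int main(void)
{
    unsigned char key[32] = {0};   /* session key (from the Noise handshake) */
    unsigned char nonce[12] = {0}; /* 64-bit send counter, padded            */
    unsigned char pt[] = "inner IP packet";
    unsigned char ct[sizeof(pt)], tag[16];
    int outl, tmplen;

    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    EVP_EncryptInit_ex(ctx, EVP_chacha20_poly1305(), NULL, NULL, NULL);
    EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_AEAD_SET_IVLEN, sizeof(nonce), NULL);
    EVP_EncryptInit_ex(ctx, NULL, NULL, key, nonce);
    EVP_EncryptUpdate(ctx, ct, &outl, pt, sizeof(pt));
    EVP_EncryptFinal_ex(ctx, ct + outl, &tmplen);
    EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_AEAD_GET_TAG, sizeof(tag), tag);
    EVP_CIPHER_CTX_free(ctx);

    printf("%zu ciphertext bytes, 16-byte Poly1305 tag appended\n", sizeof(ct));
    return 0;
}
```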
We have done a thorough investigation of the VPP WireGuard implementation code, and we found a few places you will probably find interesting as well. First, the VPP WireGuard implementation was ported from the OpenBSD WireGuard implementation, which means it was written with kernel packet-processing techniques in mind. It used the heavyweight OpenSSL library for the crypto operations, the ChaCha and Poly encryption and authentication respectively. It had no batch processing and no packet-buffer prefetching to reduce data and instruction cache misses, which carries a performance penalty. It also had no hardware crypto acceleration support, such as the Intel QuickAssist Technology card.

With those findings in mind, we started to optimize VPP WireGuard. So what did we do? First, we enabled Intel AVX-512 instructions to accelerate the ChaCha20-Poly1305 crypto processing. Compared to AVX2, whose registers are 256 bits wide, AVX-512 supports 512-bit registers, which allows us to push twice as much data through in a single CPU cycle, meaning double the throughput compared to the AVX2 implementation of the ChaCha and Poly operations. But AVX-512 is CPU-architecture dependent: code written with AVX-512 won't run on a CPU architecture that doesn't support AVX-512, and likewise AVX2 code won't run on CPU architectures without AVX2 enabled. We don't want a VPP that runs on one CPU architecture but not another, or that runs best on one CPU architecture and performs poorly on another. To overcome this, we adopted the Intel IPsec multi-buffer library (intel-ipsec-mb) into VPP: we wrapped the intel-ipsec-mb API in an ipsec-mb crypto engine inside VPP, under the crypto infrastructure. In the VPP 21 release series, we added ChaCha-Poly support to the intel-ipsec-mb engine. The beauty of this is that intel-ipsec-mb carries the best optimization per crypto algorithm for every CPU architecture it supports: the best performance you can squeeze from the crypto on one CPU architecture, you get the same way on any other, so it runs equally best on all CPU architectures. Of course the absolute throughput won't be equal, but efficiency-wise it is. In the end, this gained us about 20% performance on the ChaCha20 and Poly1305 operations.
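The run-everywhere idea can be illustrated with a toy runtime dispatch in plain C: compile the wide-vector variant behind a target attribute and select it only when the CPU reports support. XOR stands in for the real hand-tuned ChaCha20/Poly1305 kernels; intel-ipsec-mb performs this per-architecture selection internally with optimized assembly.

```c
/* Toy illustration of per-CPU-architecture dispatch: one binary, best code
 * path chosen at runtime.  XOR is a stand-in for real crypto kernels. */
#include <stdio.h>
#include <stddef.h>

__attribute__((target("avx512f"))) /* compiler may emit 512-bit vector ops */
static void xor_avx512(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++) dst[i] ^= src[i];
}

static void xor_scalar(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++) dst[i] ^= src[i];
}

int main(void)
{
    unsigned char a[256] = {1}, b[256] = {2};
    int have_avx512 = __builtin_cpu_supports("avx512f");

    if (have_avx512)
        xor_avx512(a, b, sizeof(a));
    else
        xor_scalar(a, b, sizeof(a));
    printf("dispatched to the %s path\n", have_avx512 ? "AVX-512" : "scalar");
    return 0;
}
```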
But we didn't end there. As we just discussed, the existing VPP WireGuard implementation had no batch processing. The lack of batching means that when you process a packet on the outbound path, for example, you first process the WireGuard encap, then you process the crypto. When you move on to the next packet, the crypto instructions already stored in the CPU's instruction cache get flushed out by the WireGuard encap processing for that next packet, and the CPU needs more time to reload those instructions from memory again. To batch things up, we split the whole WireGuard implementation into two stages: first we do the stack processing for a full burst of up to 256 packets, then we do the crypto for the same full burst of up to 256 packets. Now we have the best instruction-cache utilization for those operations. Then we do prefetches. Prefetching gets the packet data into the CPU's data cache before it is needed: while we process the current packet, we prefetch the next packet into the cache, so when the next packet is to be processed, its data is already there. Batching plus prefetching optimized the WireGuard performance by another 50%; a small sketch of this pattern appears at the end of this part.

Again, we didn't stop there. With batching enabled, we found the performance could be even higher by enabling QAT co-processor assisted processing. AVX-512 and intel-ipsec-mb help you perform the best when you use CPU instructions to process the ChaCha-Poly crypto operations, but that still costs cycles. A CPU has a fixed number of cycles available per second, which means that if you spend those cycles on crypto, you have fewer cycles left to process packets. If we can offload the whole crypto processing to dedicated hardware that specifically processes crypto, such as the Intel QAT card, we can squeeze out more cycles to process the stack. So by offloading the crypto processing to QAT, or to other CPU threads, we save more CPU cycles to process more packets.

What we did is shown in the graph on the right. First, when the packets come in, we do a FIB lookup. When we find that these packets need to be processed by WireGuard outbound, we push them into the WireGuard output tunnel graph node. Within that graph node, for every packet we first do a peer lookup. If the peer is alive and the keys have not expired, we enqueue the packet into a crypto frame and, through the VPP async crypto infrastructure, call an enqueue handler, which in the end assembles a crypto request that QAT can recognize. To do that, we use the DPDK cryptodev API under VPP to enable this crypto offload. Once the enqueue is finished, we can basically forget about this packet and start processing the next packet in the burst. When all packets are enqueued to QAT, the crypto dispatch input node, which runs in polling mode, constantly polls the QAT completion queues. When QAT has some jobs finished, the crypto dispatch node calls the crypto infrastructure's dequeue handler and gets those packets back. When the packets are back, we push them into the WireGuard output post graph node, at which point all the WireGuard processing is finished, and we can send the packets out to the internet. By enabling the QAT acceleration, we achieved more than three times the throughput of the intel-ipsec-mb accelerated WireGuard.
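Here is the promised sketch of the two-stage batching and prefetching pattern. wg_encap() and wg_crypto() are stand-ins for the real graph-node work; the point is that each stage's instructions stay resident in the instruction cache across the whole burst, while the prefetches hide the data-cache misses for the next packet.

```c
/* Sketch of two-stage batch processing with prefetch, as described above:
 * stage 1 does all encap work for a burst of up to 256 packets, stage 2
 * does all crypto for the same burst.  Stand-in packet work for brevity. */
#include <stddef.h>

#define BURST 256

struct pkt { unsigned char *data; size_t len; };

static void wg_encap(struct pkt *p)  { (void) p; /* headers, counters */ }
static void wg_crypto(struct pkt *p) { (void) p; /* ChaCha20-Poly1305 */ }

static void wg_output_burst(struct pkt *pkts, int n)
{
    if (n > BURST) n = BURST;

    /* Stage 1: encap the whole burst, prefetching one packet ahead. */
    for (int i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch(pkts[i + 1].data, 1 /* write */, 3);
        wg_encap(&pkts[i]);
    }
    /* Stage 2: crypto for the same burst; its instructions stay hot. */
    for (int i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch(pkts[i + 1].data, 1, 3);
        wg_crypto(&pkts[i]);
    }
}

int main(void)
{
    static unsigned char bufs[4][64];
    struct pkt pkts[4] = {
        { bufs[0], 64 }, { bufs[1], 64 }, { bufs[2], 64 }, { bufs[3], 64 },
    };
    wg_output_burst(pkts, 4);
    return 0;
}
```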
Okay, so what have we talked about today? We have Zero Trust networking, which is a promising domain in enterprise network transformation. We have the controller acting as a central point for service discovery, access grants, key distribution and so on. We have the VPN acting as a gateway, which builds secure channels with the clients. We have the TEE, which leverages the CPU feature to build a secure execution environment. To accelerate the data path, we enabled the intel-ipsec-mb library's AVX-512 vector instruction support for WireGuard, and we also leveraged hardware-accelerated crypto processing, such as the Intel QuickAssist Technology card, to accelerate the WireGuard data path. We could not have finished this project alone, so we thank the people listed on this page for helping and supporting this project to its current stage. And that's it. Time for questions and answers. Back to the host. Thank you.