Hi, everyone. Let me start with a short introduction: how did we come to kernel TLS handshakes? Our business is custom software development, and we had a request from Positive Technologies to develop a web application firewall. That firewall was mentioned in the Gartner Magic Quadrant in 2015 as a Visionary. Web application firewalls are typically built on top of HTTPS proxies like Nginx or HAProxy, and they provide extended security features such as filtering of web attacks or L7 DDoS attacks. During that project, we realized that while modern HTTPS proxies are good for content delivery, they were never designed for filtering massive malicious HTTPS traffic. And since we have an excellent firewall in the Linux kernel for the TCP and IP protocols, why not extend TCP/IP with HTTPS and get a similar firewall for HTTPS? This is how we came to the Tempesta FW project: a hybrid of an HTTPS proxy and a firewall, embedded directly into the Linux TCP/IP stack. Tempesta TLS is a part of Tempesta FW, and Tempesta FW is intended as an open-source alternative to application delivery controllers such as F5 BIG-IP or Fortinet appliances.

How many TLS handshakes an application can establish per second is an important measure for this kind of application and appliance. Why is this important? First of all, because there are specific DDoS attacks against TLS handshakes: a TLS handshake is a very computationally expensive process, so it is a good target for DDoS attacks. The second reason is performance. Being in the Linux kernel, we provide much faster TLS handshakes, even for abbreviated TLS handshakes, and we also employ a lot of modern research in elliptic curve cryptography to get even faster handshakes.

It's not only about performance, but also about security. It is beneficial from a security point of view to split the TLS layer from the application layer. The TLS layer manages all the security-sensitive data, like the private key or session keys, while the application layer, such as a worker process of Varnish or Nginx, is a typical target for attacks. If the worker process is compromised, the security-sensitive data in the TLS layer doesn't leak. Varnish uses this architecture: it splits the worker process from Hitch, which just terminates TLS connections. A similar architecture at Cloudflare helped them avoid leaking any security-sensitive data during the Cloudbleed incident.

During this presentation I will mostly focus on the NIST curve P-256, because it is the most widespread and most used curve. While we have the much faster Bernstein curve, Curve25519, it cannot be used for certificates. If we look at the performance problems of OpenSSL under massive TLS handshakes, we see that while the mathematics of elliptic curves is the theoretical bottleneck of a handshake, in real life memory allocations and copies take most of the time during TLS handshakes.

Our code is based on mbed TLS, and we did a lot of work to make it faster: more than 30 times faster than the original mbed TLS version. We also use parts of the WolfSSL library, and during our work we reported a security vulnerability in WolfSSL. All the libraries have built-in tools for cryptography performance measurements: openssl speed, the WolfSSL benchmark, and the Tempesta TLS benchmark tool.
The tools are directly comparable between OpenSSL and WolfSSL, and we see that WolfSSL shows much better numbers than OpenSSL, mostly because it doesn't use as many memory allocations and copies as OpenSSL. However, for Tempesta TLS a direct comparison isn't fair, because in Tempesta TLS we benchmark the full set of operations used in a TLS handshake. This means we also include ephemeral key generation in the benchmarked operation, and OpenSSL and WolfSSL don't do this.

For a fairer comparison, let's run a demo comparing Tempesta TLS, Nginx 1.14 with OpenSSL, and Nginx 1.17 with the current WolfSSL version. During the demo I will use TLS 1.2 with full handshakes and abbreviated handshakes, and I will use our open-source tool tls-perf, which simply establishes as many TLS connections as possible using OpenSSL. Let's switch to the terminal. First, let's run a benchmark against the OpenSSL Nginx with abbreviated handshakes, with session tickets on. In this benchmark I use 2,000 concurrent connections in 4 threads. Now let's try WolfSSL with Nginx. The machine under test uses two CPUs, so it's 4 CPUs for the benchmark and 2 CPUs for the system under test. So far we have the results: about 5,000 handshakes per second for OpenSSL, comparable numbers for WolfSSL, and 12,000 for Tempesta TLS, more than two times the OpenSSL result. Also, pay attention to the tail latency, the 95th-percentile latency: Tempesta shows generally much lower latency. Now let's try the same benchmark with full handshakes. Here we go, and lastly Tempesta TLS. We see that with full handshakes WolfSSL is about 50% faster than OpenSSL, but Tempesta TLS is still much faster than both implementations. Okay, let's switch back to the presentation.

If we compare the numbers with proprietary vendors, you can find a video from an F5 engineer about a performance comparison between BIG-IP and Nginx/OpenSSL over DPDK. F5 is about 30 to 50% faster; however, in that comparison OpenSSL is surely the bottleneck, and it doesn't matter whether you run over DPDK or plain Linux I/O, because network I/O isn't the bottleneck in the benchmark. If we compare with Avi Vantage (Avi Networks, acquired by VMware), we see that they provide roughly the same performance numbers as basic OpenSSL.

Why is Tempesta TLS faster? First, because we have no context switches, no memory allocations, and optimized network I/O. You can find more about the Tempesta TLS architecture in one of our Netdev papers. But we also use some cutting-edge research from elliptic curve cryptography.

NIST P-256 is defined over a finite field with the prime p = 2^256 - 2^224 + 2^192 + 2^96 - 1. Pay attention to the form of the prime, which is just a sum of powers of 2; this will be important for Montgomery modular reduction in the later slides. In this group we define operations for point addition and point doubling, and these two operations are used to express the multiplication of a point by a scalar. For ECDSA, we multiply a fixed, known point by a secret random value. For elliptic curve Diffie-Hellman (ECDH), we use a point as the public key and a scalar value as the private key; in this case we each time multiply an unknown elliptic curve point, the peer's public key. In the most straightforward case, multiplication of an elliptic curve point by a scalar can be expressed as a loop over each bit of the scalar, so a loop of 256 iterations: each time we double the value, and if the current bit is one, we also make an addition. This is just what we have in binary multiplication.
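To make that naive loop concrete, here is a minimal C sketch. It is illustrative only, not the Tempesta TLS code: struct ec_point and the ec_point_set_infinity(), ec_point_double(), and ec_point_add() helpers are assumed names for the group operations.

    #include <stdint.h>

    /* An assumed point type: 4 x 64-bit limbs per coordinate. */
    struct ec_point { uint64_t x[4], y[4], z[4]; };

    /* Assumed helpers implementing the group operations. */
    void ec_point_set_infinity(struct ec_point *p);
    void ec_point_double(struct ec_point *p);
    void ec_point_add(struct ec_point *r, const struct ec_point *q);

    /*
     * r = k * p for a 256-bit scalar k (little-endian limbs).
     * Right-to-left double-and-add: one doubling per bit, one
     * addition per set bit, exactly as in binary multiplication.
     */
    void ec_mul_naive(struct ec_point *r, const struct ec_point *p,
                      const uint64_t k[4])
    {
        struct ec_point q = *p;

        ec_point_set_infinity(r);
        for (int i = 0; i < 256; i++) {
            if ((k[i / 64] >> (i % 64)) & 1)
                ec_point_add(r, &q);    /* secret-dependent branch! */
            ec_point_double(&q);
        }
    }

Note the branch on the scalar bit: this is exactly the kind of secret-dependent condition that makes the naive algorithm non-constant-time, which we will come back to when we talk about side channels.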
Since in ECDSA we use a fixed point, we can precompute the point-doubled values and store them in a table. Fixed-point multiplication is always faster than the unknown-point multiplication used in ECDH. However, ECDHE with ephemeral keys also needs to generate an ephemeral key on each handshake, which means ECDHE also uses a fixed-point multiplication. In total, in each TLS handshake we have two fixed-point multiplications and one unknown-point multiplication.

We express this multiplication of a point by a scalar via point additions and point doublings. While we deal with points, the points can be defined in different coordinates. The basic affine coordinates x and y aren't the only possible ones, and they aren't the most efficient. Typically, TLS libraries work in the Jacobian coordinate system, which provides more efficient formulas. Costs are estimated in M and S, which are essentially field multiplication and squaring. On this slide we see the formula for addition in Jacobian coordinates: we compute point C's coordinates from points A and B, and there are plenty of squarings and multiplications. The estimate for this formula is 11 multiplications and 5 squarings. Multiplication is the most basic operation, and squaring is usually estimated in terms of multiplication: with the fastest, Montgomery multiplication, a squaring typically costs about 80% of a multiplication; with slower multiplication using the FIPS modular reduction, it rises to about 90% of a multiplication. Modular inversion, a very slow operation, is typically estimated at about 100 multiplications.

All the operations are performed in a prime field, so we have modular arithmetic. At the top of the slide are examples of modular operations. If you compare the right-hand sides of the operations, we always have non-negative numbers between 0 and 29: the results stay below the modulus. Typically, Montgomery modular reduction is used as the fastest modular reduction. The only reason we have the FIPS modular reduction in this presentation is that the original mbed TLS code uses this type of modular reduction. We started from that code base and did our best to apply all the research we could find to make the reduction faster, and the best result we achieved was still about 65% slower than Montgomery reduction. So we ended up transitioning to Montgomery reduction. The idea behind Montgomery reduction and Montgomery multiplication is that we combine a very cheap intermediate reduction with the arithmetic operations: multiplications, subtractions, and so on. First of all, we multiply all our operands by a large power of 2; in the current implementation of the TLS library this is 2^256, the smallest power of 2 larger than our prime. Then we can perform all the intermediate arithmetic operations with modular reduction by a power of 2, which is much faster than modular reduction by the final prime. A small code sketch of this follows below.

So, if we put together what we've discussed so far, we actually have a layered mathematical structure. At the top, we have multiplication of an elliptic curve point by a scalar, which is expressed through point addition and point doubling. These algorithms have different formulas in different coordinate systems. And at the bottom layer, we have multiplications of big integers (field elements) with modular reduction. We can apply different algorithms at each layer, and we have to solve some trade-offs to build the fastest balanced system across all the layers.
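To illustrate the Montgomery idea, here is a word-sized sketch in C: one 64-bit limb with R = 2^64, rather than the four-limb 2^256 case, so it is a sketch of the technique under those assumptions, not the Tempesta TLS code.

    #include <stdint.h>

    typedef unsigned __int128 u128;

    /* Compute -n^{-1} mod 2^64 by Newton iteration (n must be odd). */
    static uint64_t mont_ninv(uint64_t n)
    {
        uint64_t x = n;            /* correct to 3 bits for odd n */
        for (int i = 0; i < 5; i++)
            x *= 2 - n * x;        /* each step doubles the precision */
        return (uint64_t)0 - x;    /* return -n^{-1} mod 2^64 */
    }

    /*
     * Montgomery multiplication: given aR mod n and bR mod n, return
     * abR mod n. All intermediate reductions are modulo R = 2^64
     * (plain truncations and shifts), which is the whole point.
     * Assumes n < 2^63 so the 128-bit sum below cannot overflow.
     */
    static uint64_t mont_mul(uint64_t a, uint64_t b,
                             uint64_t n, uint64_t ninv)
    {
        u128 t = (u128)a * b;               /* full product, < n^2 */
        uint64_t m = (uint64_t)t * ninv;    /* m = (t mod R) * n' mod R */
        u128 u = (t + (u128)m * n) >> 64;   /* exact division by R */

        /* One conditional subtraction keeps the result in [0, n). */
        return u >= n ? (uint64_t)(u - n) : (uint64_t)u;
    }

For P-256 the structure is the same, just unrolled over four limbs with R = 2^256: all the intermediate reductions are truncations and shifts by a power of 2, which is why this is so much faster than reducing by the prime each time.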
So, we need to find the balance between all the layers. For example, if we use the more advanced point multiplication algorithm that we considered on the first slide about elliptic curve multiplication, the algorithm uses 52 point additions for an unknown-point multiplication. If we use Jacobian coordinates, we pay 11 multiplications and 5 squarings per addition. However, there is a more efficient system of mixed coordinates, where one point is in affine coordinates and the second is in Jacobian coordinates; if we do the addition in mixed coordinates, we pay only 8 multiplications and 3 squarings. The catch is that to work with mixed coordinates, we have to use slow modular inversion to normalize coordinates back to affine. Inversion is a very slow operation, and if we use a basic inversion, as WolfSSL or OpenSSL do, there is absolutely no point in using the mixed affine-Jacobian coordinate system: thanks to the slow inversion, the implementation will always be slower. However, with a faster modular inversion, like the Bernstein inversion used in Tempesta TLS, we can consider mixed-coordinate point addition. Unfortunately, in this case mixed coordinates and Jacobian coordinates perform about the same. But if we also move from the FIPS modular reduction to Montgomery arithmetic, we get much better performance.

If we talk about performance, we should also consider security. The problem is side-channel attacks: an attacker can observe the execution time of the point multiplication algorithm, and on different input data the algorithm can take different time, so the attacker can recover secret values from the time the algorithm needs to perform the operation. We saw that the most naive point multiplication algorithm has a condition, so it is not constant-time. Most of the libraries therefore use constant-time algorithms. mbed TLS goes further and combines constant-time algorithms with point randomization. Point randomization allows you to use a non-constant-time algorithm: before you start your cryptographically sensitive computation, you multiply your input data by some random value; then you execute the non-constant-time algorithm; and afterwards you revert the result by dividing by that random value. Thanks to the randomization, an attacker cannot recover the original secret value. Another option is to introduce dummy operations into non-constant-time algorithms, but I haven't seen any real implementation using this. If we use point randomization with a non-constant-time algorithm, then, for example, modular inversion can use almost three times fewer iterations. This is a significant difference.

The good news is that modern CPUs provide the very fast RDRAND instruction, giving access to fast random number generation. The bad news is the SRBDS attack, whose mitigation almost kills RDRAND performance. And the last piece of good news is that the latest Intel processors don't have the vulnerability, so you can probably trust the latest Intel CPUs. So, in the end, if you trust your CPU vendor and your current CPU version, you can use point randomization in Tempesta TLS. If you don't, you can compile Tempesta TLS with constant-time cryptography and be safe and secure.
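As a sketch of one common form of point randomization, here is projective (Jacobian) coordinate blinding: since (X : Y : Z) and (l^2*X : l^3*Y : l*Z) describe the same affine point, multiplying by a random nonzero l masks all the intermediate values the non-constant-time algorithm touches. The fe256 type and the fp_*() helpers are assumed names for illustration, not Tempesta TLS's actual API.

    #include <stdint.h>

    typedef uint64_t fe256[4];  /* field element: 4 x 64-bit limbs */

    /* Assumed field helpers, all working modulo the curve prime and
     * assumed to allow r to alias an input. */
    void fp_mul(fe256 r, const fe256 a, const fe256 b); /* r = a * b */
    void fp_sqr(fe256 r, const fe256 a);                /* r = a^2   */
    void fp_rand_nonzero(fe256 r);  /* random in [1, p), e.g. RDRAND-backed */

    struct ec_jacobian { fe256 X, Y, Z; };

    /*
     * Re-blind the Jacobian representation in place: the affine point
     * is unchanged, but every internal value is now masked by l.
     */
    void ec_point_randomize(struct ec_jacobian *p)
    {
        fe256 l, l2, l3;

        fp_rand_nonzero(l);
        fp_sqr(l2, l);          /* l^2 */
        fp_mul(l3, l2, l);      /* l^3 */
        fp_mul(p->X, p->X, l2); /* X' = l^2 * X */
        fp_mul(p->Y, p->Y, l3); /* Y' = l^3 * Y */
        fp_mul(p->Z, p->Z, l);  /* Z' = l * Z  */
    }

The "divide back by the random value" step comes for free in this form: converting from Jacobian back to affine coordinates (x = X/Z^2, y = Y/Z^3) already cancels the blinding factor.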
One more aspect of side-channel attacks is the precomputed table: as I mentioned before, fixed-point multiplication uses a precomputed table. The table is very small in mbed TLS, but quite large in OpenSSL and WolfSSL. In their case, the table is much larger than the level 1 CPU cache. It means that if you directly access different entries of the table, an attacker can still measure the different access times caused by cache misses. That was the subject of our security report for WolfSSL; after the fix, they scan the whole table each time they access it, just as OpenSSL does.

The last thing about the mathematics is that modern TLS libraries use big-integer arithmetic and special wrappers around it. The arithmetic is well described in the book by Tom St Denis. On the slide is an example from the Linux kernel: the descriptor of a big integer. Big integers work with limbs; limbs are typically unsigned long variables, and a big integer is an array of unsigned longs. For example, for our 256-bit prime there are four limbs, and each limb is a 64-bit unsigned long. However, mbed TLS overuses the wrapper and pays a very significant cost for memory allocations, copies, and so on. More efficient TLS libraries use per-size specialized code for each operand size, and actually all the efficient TLS libraries use assembly implementations.

As an example, consider a simple function that adds two big integers of two limbs each. First we sum the less significant limbs; next we sum the more significant limbs and add the carry from the previous addition. That's maybe not such straightforward code, and you need some time to convince yourself that the carry handling is actually correct. However, if we move to assembly, modern CPUs provide instructions that deal with the carry automatically: we add the less significant limbs, and then use a specific instruction, ADC (add with carry), which automatically adds the carry from the previous addition into the result of the next one. Sometimes assembly code looks much more straightforward than C code when you deal with carries, as the sketch below shows.
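Here is a sketch of both variants for two limbs; this is illustrative code, not copied from any of the libraries.

    #include <stdint.h>

    /* r = a + b for 2-limb little-endian big integers; returns the
     * final carry. Note how much care the carry logic needs in C. */
    uint64_t add2_c(uint64_t r[2], const uint64_t a[2], const uint64_t b[2])
    {
        r[0] = a[0] + b[0];
        uint64_t c = r[0] < a[0];            /* wrap-around means carry */
        r[1] = a[1] + b[1] + c;
        /* carry out if a[1]+b[1] wrapped, or adding c wrapped the sum */
        return (r[1] < a[1]) | (c & (r[1] == a[1]));
    }

    /* The same addition on x86-64: ADC adds the CPU carry flag in
     * automatically, so the code mirrors the math one-to-one. */
    uint64_t add2_asm(uint64_t r[2], const uint64_t a[2], const uint64_t b[2])
    {
        uint64_t lo = a[0], hi = a[1], carry = 0;

        __asm__("addq %3, %0\n\t"    /* lo += b[0], sets carry flag  */
                "adcq %4, %1\n\t"    /* hi += b[1] + CF              */
                "adcq $0, %2"        /* carry = final CF             */
                : "+r"(lo), "+r"(hi), "+r"(carry)
                : "rm"(b[0]), "rm"(b[1])
                : "cc");
        r[0] = lo;
        r[1] = hi;
        return carry;
    }

With four limbs, multiplication, and subtraction in the mix, the carry bookkeeping in C grows much worse while the assembly stays nearly one instruction per limb, which is why all the fast libraries drop to assembly for these primitives.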
There is still some research to do for us. First of all, the Ice Lake CPU family significantly improved the AVX-512 extension, and now you don't have to pay significant downclocking for these operations. There is a lot of research on cryptographic operations using this CPU extension, and we're going to explore those algorithms in Tempesta TLS.

Also, we're going to go upstream with Tempesta TLS: it is supposed to be proposed for Linux kernel inclusion. If this happens, then Nginx, HAProxy, or whatever application-layer HTTPS proxy you use can benefit from faster HTTPS TLS handshakes. We discussed the proposal at the last Netdev conference, and there is a GitHub issue with a description of the proposed API. We propose to include only server-side operations, and since our TLS implementation in the kernel is quite restricted, we need a special mechanism to fall back to a user-space implementation if we don't have enough features to establish the current client connection.

There is also more to do before we go upstream. First of all, we need to improve the performance of Tempesta TLS: our performance optimization isn't finished, and while Tempesta TLS is already much faster than WolfSSL and OpenSSL, there is plenty of room to improve performance even further. Also, we need to implement TLS 1.3; at the moment we have only TLS 1.2. There are some issues with the asymmetric key management API in the Linux kernel, and unfortunately some of the crypto algorithms in the Linux kernel aren't as efficient as they should be. For example, SHA hashing is much slower than what OpenSSL has, and hashing is crucial for TLS handshakes; this is one of the biggest hotspots in Tempesta TLS at the moment.

You can find more about Tempesta TLS in our Netdev papers, and that's all. If you're interested in fast kernel TLS handshakes, we definitely want to hear from you. And since there is plenty of work on performance and on TLS 1.3 before moving to the kernel upstream, any support or contributions are very welcome. So, thank you.