Let's get started with my presentation. This talk is about TCP socket lookup degradation and per-netns ehash. I'm Kuniyuki, working for AWS on the Amazon Linux kernel team. My mission is to make the Amazon Linux kernel stable, performant, and secure. We maintain our own kernels and package them for our customers, and we troubleshoot kernel-related issues reported by external and internal customers. Today I will talk about TCP per-netns ehash, which improves some of our workloads. What I want to talk about is basically three parts: how TCP sockets are managed, TCP socket lookup degradation, and TCP per-netns ehash.

Let's move on to the first part. When an application communicates over TCP/IP, the code looks like this. At first, both server and client call the socket() system call. Then the server application calls bind() to assign an IP address and port to the socket, and the following listen() system call makes the server start. Then the client calls connect() to establish a connection to the destination IP address and port. In this part, I will walk you through these four system calls to describe how the kernel manages a socket.

The first is the socket() system call. Depending on the arguments, it allocates one of various structs that represent the specified protocol. They all have the same base struct as the first member, struct sock, which has enough fields to represent a socket at the network layer. After that, the kernel returns the corresponding file descriptor to user space. Then the application can access the socket via the file descriptor, but other processes cannot, unless the same file descriptor is shared.

Next is the bind() system call. When we call bind() with an IP address and port, the kernel sets them on the socket. Then it checks for conflicting sockets in bhash. bhash is a hash table, and all bound sockets are linked into bhash. Here the kernel checks whether a socket in bhash conflicts with the specified address and port, and whether the port can be shared with other sockets. For example, if a socket is already bound to the wildcard address, 0.0.0.0, on the same port, that bound socket conflicts with the bind request, so the bind() system call will fail with an EADDRINUSE error. As bhash is a hash table, when the kernel looks up a socket in bhash, it computes a hash from the local port and the networking namespace; it is calculated by inet_bhashfn(). bhash is composed of two levels of lists: the kernel first looks up a bucket of sockets with the same networking namespace and local port, then it iterates over those sockets and checks each socket's local address, local port, SO_REUSEADDR option, SO_REUSEPORT option, and so on, and decides whether the bind() system call is valid or not. Finally, if there is no conflict, the kernel puts the socket into bhash. The socket is prepended to the corresponding bucket, and the bind() system call returns zero, which means the socket is bound to the IP address and port and also linked into bhash.
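To make that bind-time validation concrete, here is a rough userspace sketch of the idea. All names and structures here are hypothetical, not the kernel's; the real logic lives around inet_csk_get_port() and handles many more cases.

    /* Illustrative sketch of a bhash-style bind conflict check.
     * All names are hypothetical; the real kernel logic is in
     * inet_csk_get_port() and friends, and is far more involved. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct bound_sock {
        uint32_t addr;              /* local IPv4 address (0 = wildcard) */
        uint16_t port;              /* local port */
        bool reuseaddr;             /* SO_REUSEADDR set? */
        struct bound_sock *next;
    };

    struct bucket {
        struct bound_sock *head;    /* sockets sharing hash(netns, port) */
    };

    static unsigned int bhash_fn(unsigned long netns, uint16_t port,
                                 unsigned int mask)
    {
        /* The key is only (netns, port): every address bound to the
         * same port in one netns lands in the same bucket. */
        return (unsigned int)((netns ^ port) & mask);
    }

    /* Returns true if binding (addr, port) would conflict (EADDRINUSE). */
    static bool bind_conflict(struct bucket *tbl, unsigned long netns,
                              uint32_t addr, uint16_t port, bool reuseaddr,
                              unsigned int mask)
    {
        struct bucket *b = &tbl[bhash_fn(netns, port, mask)];

        for (struct bound_sock *sk = b->head; sk; sk = sk->next) {
            if (sk->port != port)
                continue;           /* another port hashed into this bucket */
            /* A wildcard bind (0.0.0.0) overlaps every address. */
            if (sk->addr != addr && sk->addr != 0 && addr != 0)
                continue;
            /* Overlapping binds are allowed only if both sides opted in
             * with SO_REUSEADDR (grossly simplified vs. the kernel). */
            if (!(sk->reuseaddr && reuseaddr))
                return true;
        }
        return false;
    }

    int main(void)
    {
        struct bucket tbl[256] = { { 0 } };
        struct bound_sock wild = { .addr = 0, .port = 8080 };

        tbl[bhash_fn(1, 8080, 255)].head = &wild;

        /* Binding 127.0.0.1:8080 conflicts with the wildcard socket. */
        printf("conflict: %d\n",
               bind_conflict(tbl, 1, 0x7f000001, 8080, false, 255));
        return 0;
    }

The thing to notice is the hash key: because only the netns and the port feed the hash, every socket bound to the same port lands in one bucket, no matter which address it uses. That detail comes back later in the talk.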
Next is the listen() system call. When we call listen(), it changes the socket state to TCP_LISTEN, but at this moment the socket does not actually work as a server yet; this is preparation for the next step. The kernel checks bhash again for conflicts, because another socket could have started listening on the same port and IP address. Let's say two sockets have the SO_REUSEADDR option: they can be bound to the same address and port, but they cannot both listen on the same port. If there is no conflict, the kernel puts the socket into lhash.

lhash is also a hash table, and the kernel calculates an index from the socket's netns and local port. Unlike bhash, each lhash bucket is directly linked to the sockets, and the new socket is again prepended to the list. Then the socket actually starts working as a server, and incoming SYN packets will be distributed to this socket.

The next is the connect() system call. When we call connect(), the socket is not bound to any local address or port in this case, so the kernel selects a source IP address based on the destination address. Now the socket has a three-tuple: destination IP address, destination port, and source IP address. So the kernel searches bhash for an available local port, and if an available port is found, the kernel copies the four-tuple to the socket. Finally, it puts the socket into bhash and ehash. ehash's key is a five-tuple: netns, local address, local port, foreign address, and foreign port. As with lhash, each ehash bucket is directly linked to the sockets, and this time too the socket is prepended to the list. Then the kernel starts the three-way handshake to initiate a TCP connection.

The client sends a SYN packet to the server, and the server looks up a socket in ehash first. This time, however, no socket that matches the five-tuple is found, so the kernel looks for a listening socket that matches the packet's destination address and port. If a proper listening socket is found, the kernel creates a mini socket, which has fewer fields than struct sock, to mitigate SYN flooding: the kernel does not allocate a full socket for a client until the three-way handshake completes and the connection is proved not to be malicious. So when receiving the SYN packet, it allocates a mini socket and puts it into ehash so that it will be looked up by the following ACK from the client. Then the server responds to the client with a SYN+ACK. When the client receives the SYN+ACK, it finds the corresponding socket in ehash, the one that initiated the three-way handshake, and sends out the ACK for the SYN+ACK. Finally, when the server receives the ACK, it finds the mini socket that was created when receiving the SYN packet. Then the kernel creates a full socket, copies all attributes from the listening socket, and puts it into ehash. But this time there is a mini socket that has the same five-tuple, so it is replaced by the full socket. After the three-way handshake completes, the connect() system call returns zero.

To wrap up this section: the kernel has three hash tables, bhash, lhash, and ehash. All bound sockets exist in bhash to validate following bind() system calls, and lhash and ehash hold listening and establishing or established sockets respectively; they are used to look up the socket responsible for an incoming packet.
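Putting that together, the per-packet ehash lookup looks roughly like this sketch. Again, all names here are mine, not the kernel's; the real entry point is __inet_lookup_established().

    /* Illustrative per-packet ehash lookup; names are hypothetical
     * (the kernel's entry point is __inet_lookup_established()). */
    #include <stdint.h>
    #include <stdio.h>

    struct est_sock {
        uint32_t saddr, daddr;      /* local / foreign address */
        uint16_t sport, dport;      /* local / foreign port */
        struct est_sock *next;
    };

    static unsigned int ehash_fn(unsigned long netns, uint32_t saddr,
                                 uint16_t sport, uint32_t daddr,
                                 uint16_t dport, unsigned int mask)
    {
        /* The key is the four-tuple plus netns, so sockets spread
         * evenly over the buckets. */
        return (unsigned int)((netns ^ saddr ^ daddr ^
                               ((uint32_t)sport << 16 | dport)) & mask);
    }

    static struct est_sock *ehash_lookup(struct est_sock **tbl,
                                         unsigned long netns,
                                         uint32_t saddr, uint16_t sport,
                                         uint32_t daddr, uint16_t dport,
                                         unsigned int mask)
    {
        unsigned int slot = ehash_fn(netns, saddr, sport, daddr, dport, mask);

        /* This chain is walked for every incoming segment, so its
         * length directly bounds the per-packet lookup cost. */
        for (struct est_sock *sk = tbl[slot]; sk; sk = sk->next)
            if (sk->saddr == saddr && sk->sport == sport &&
                sk->daddr == daddr && sk->dport == dport)
                return sk;          /* full match on the four-tuple */

        return NULL;                /* fall back to the listener lookup */
    }

    int main(void)
    {
        struct est_sock *tbl[256] = { 0 };
        struct est_sock sk = { .saddr = 1, .sport = 8080,
                               .daddr = 2, .dport = 40000 };

        tbl[ehash_fn(1, 1, 8080, 2, 40000, 255)] = &sk;    /* "prepend" */
        printf("found: %d\n",
               ehash_lookup(tbl, 1, 1, 8080, 2, 40000, 255) == &sk);
        return 0;
    }

Keep in mind that this walk happens for every incoming segment; that is exactly why the chain length matters so much in what follows.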
But this is the story before Linux 4.16, and some problems were found under certain heavy workloads. That is what I talk about in the next section: TCP socket lookup degradation.

bhash, lhash, and ehash are basically all arrays of linked lists, so the longer a list gets, the longer a lookup over it takes. It means the performance of such lookups can drop with the length of the list. So we have to know when and how a list gets long, and that depends on each hash key.

Let's take a look at lhash first. Its hash is computed from the netns and the local port, but within a single networking namespace it is effectively a one-tuple hash. Let's say we create multiple sockets in the same networking namespace and they are bound to different addresses with the same port: then all of the sockets are put into the same bucket in lhash. This happens on well-known ports like 80 or 443, and the SO_REUSEPORT option makes the list even longer. SO_REUSEPORT enables multiple sockets to listen on the same port; this is often used in nginx or the Apache web server. In this diagram, the socket listening on IP1 and the port is placed at the tail of the list, which is the oldest socket, because we prepend new sockets to the list. So looking up an old socket takes longer than the others. In such a situation, looking up the listener for a SYN packet takes much time, and thus the kernel gets prone to SYN flooding attacks.

Actually, this issue was reported by Meta, and they added changes that introduced lhash2. Since 4.16, its hash key is calculated from the netns, local port, and local address: a three-tuple. Adding an additional element to the hash key breaks the long list down into small lists that live in different buckets. So let's say multiple sockets are listening on the same port: if their local addresses are different, they always end up in different lists in lhash2. Initially, lhash2 coexisted with the old lhash, which was kept for its remaining users such as /proc/net/tcp. And finally, in 5.19, lhash was retired, as lhash2 computes the hash with the socket's address; a lookup requires at most two bucket walks to find the corresponding listener, the second one for the case where the listener is listening on the wildcard address. So lhash had a problem due to its port-only hash key, and the kernel resolved the issue by computing the hash with the local address as well.
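As a side note, the SO_REUSEPORT pattern that produces those long listener lists looks roughly like this in an application. This is a minimal sketch with error handling omitted, and it uses port 8080 so it runs unprivileged.

    /* Minimal sketch: several listeners sharing one port via SO_REUSEPORT,
     * the pattern nginx/Apache-style workers use. Error handling omitted. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define NR_LISTENERS 4

    int main(void)
    {
        struct sockaddr_in addr;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(8080);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);

        for (int i = 0; i < NR_LISTENERS; i++) {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            int one = 1;

            /* Without SO_REUSEPORT, the second bind() would fail with
             * EADDRINUSE. With it, all listeners share port 8080 and,
             * before lhash2, piled up in a single listener-hash bucket. */
            setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
            bind(fd, (struct sockaddr *)&addr, sizeof(addr));
            listen(fd, 128);
        }

        pause();    /* keep the listeners alive */
        return 0;
    }

Each of these listeners is a separate socket in the same netns on the same port, which is exactly the situation where a port-only hash key degenerates into one long list.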
Here you will notice that bhash has the same hash key as the old lhash, so there could be a similar problem, and indeed there is the same kind of problem: bhash also gets longer if multiple sockets are bound to different IP addresses and the same port in the same networking namespace. In the case of bhash, in addition to the SO_REUSEPORT option, the SO_REUSEADDR option can make things worse. With the SO_REUSEADDR option, multiple sockets can be bound to the same address and port. The option is primarily for client use cases, like an FTP server: when we have an FTP server running in active mode, it creates a client socket for the data connection, and its local port is always 20, so it has to bind sockets to the same local port with the SO_REUSEADDR option. But I think this problem usually happens with server workloads using SO_REUSEPORT.

There are also two differences between lhash and bhash. When iterating over the list, we use RCU for lhash, but we have to acquire a spinlock for bhash. The other point is that we have to iterate over the entire list before the bind() system call can succeed. So if a bhash list gets long, the spinlock is held for a long time. bhash is basically used in process context, in the bind(), listen(), and connect() system calls, so holding the lock does not seem to cause a problem, but it does, as reported by Meta again. If bhash has a long list, let's say on port 80, then a new bind() system call for port 80 traverses the whole list, and the spinlock is held longer. What if a new connection completes the three-way handshake on port 80 at the same time? Then the kernel has to add a new socket into bhash, and this is done in softirq, that is, in interrupt context. So a bind() system call could block the interrupt context for a long time. And finally, the interrupt ends up in a CPU soft lockup, and the connection will be dropped.

This was fixed in 6.1 with bhash2. As with lhash2, the hash key is calculated from the netns, port, and IP address. This distributes sockets with different addresses to different hash buckets. However, bhash2 does not completely replace bhash: when binding a socket to the wildcard address, we have to check all sockets on the same port but with different addresses, so we still need a list that holds all sockets on the same port. This is why we can't remove bhash.

This way, bhash and lhash got improved hash keys. The remaining table is ehash. It uses the five-tuple for the hash, and the five-tuple is the full identity that we can get from an incoming packet. So in ehash, all sockets are distributed well over the whole set of buckets, and there is no tendency for a specific list to get longer. If an ehash list is long, it just means ehash is overpopulated by too many sockets. Even then, we will see some problems. While the lookups in bhash and lhash are basically one-shot, we do an ehash lookup for every incoming packet, so the performance of a TCP connection degrades continuously with the list length. If the connections are long-lived, like WebSocket or database connections, the effect is not negligible.

And there is another problem, related to networking namespaces. Let's say two processes are communicating in different networking namespaces, and yet another netns puts a ton of sockets into ehash. As ehash is a global hash table, all networking namespaces share it, so a netns cannot isolate its workload in this case. This can be caused by a single noisy neighbour, but on a multi-tenant system with thousands of networking namespaces, where each process has only a few TCP sockets, we could see the same issue.

Also, there is yet another problem: netns dismantling cannot catch up with socket growth. When the refcount of a networking namespace reaches zero, it is queued up for a single kworker thread that frees netns in batches. When the kworker destroys a networking namespace, inet_twsk_purge() is called, which traverses the whole ehash and frees the time-wait sockets in that networking namespace. So if ehash has too many sockets and netns are created at high frequency, the kworker cannot catch up with this kind of workload.

And I did a simple test to measure how much the ehash list length degrades performance. In the graph, I ran iperf3 as client and server in different networking namespaces, and just added a ton of sockets, in yet other networking namespaces, that do not transfer any data. The size of ehash is usually calculated automatically at boot time based on the memory size, and the size can be checked in dmesg. The size can also be specified by a boot parameter, thash_entries. In the test, I configured the size as 1Mi and measured the performance. When the list length is 1, the throughput is 50 Gbps, but when the length is 64, the performance drops by about 10%, and when the length goes over 300, the throughput is down by 50%. This is an extreme case, though; still, a 10% degradation can easily happen on a multi-tenant system.

So that's why I added TCP per-netns ehash. Per-netns ehash is available from 6.1, and it implements an ehash for each networking namespace, which is not shared by other networking namespaces. It resolves the problems described before. The hash key is the same five-tuple, to reuse the same infrastructure in the TCP stack, and we can control the size of the per-netns ehash by sysctl.

I added two sysctl nodes related to per-netns ehash. One is tcp_child_ehash_entries, which is the size of the per-netns ehash in a child netns. This node has to be set before creating a new networking namespace, and the following clone() or unshare() calls see the value and create a new netns with a per-netns ehash of that size. The default value is zero, and then there is no behavioural change, so the global ehash is used for the child netns. This is because the best size varies depending on the workload, and there is no one-size-fits-all value. The other node is tcp_ehash_entries, which shows the size of the current networking namespace's ehash. If the value is negative, it means the netns does not use a per-netns ehash and uses the global one. This can be used to test whether a netns uses a per-netns ehash or not.

Let's see how to use the sysctl nodes. First, in the initial netns, we can see that tcp_ehash_entries matches the value in dmesg, and tcp_child_ehash_entries is zero by default. In this case, tcp_ehash_entries in a child netns is a negative value. And if we set a positive value to tcp_child_ehash_entries, the new netns has its own ehash.
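Here is a small sketch of that sequence in C. The two /proc/sys paths are the real sysctl nodes; the rest is illustrative, error handling is omitted, and the program needs privileges to create a netns.

    /* Sketch: give a new netns its own ehash by setting
     * tcp_child_ehash_entries before unshare(CLONE_NEWNET).
     * The /proc/sys paths are the real nodes; error handling omitted,
     * and CAP_SYS_ADMIN/CAP_NET_ADMIN are required for unshare(). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void write_sysctl(const char *path, const char *val)
    {
        int fd = open(path, O_WRONLY);

        write(fd, val, strlen(val));
        close(fd);
    }

    int main(void)
    {
        char buf[32];
        int fd, n;

        /* Must be set *before* the netns is created: the following
         * clone()/unshare() picks the value up at creation time. */
        write_sysctl("/proc/sys/net/ipv4/tcp_child_ehash_entries", "131072");

        unshare(CLONE_NEWNET);      /* enter a fresh netns */

        /* Read-only node: a positive value means this netns owns its
         * ehash; a negative one means it shares the global table. */
        fd = open("/proc/sys/net/ipv4/tcp_ehash_entries", O_RDONLY);
        n = read(fd, buf, sizeof(buf) - 1);
        buf[n > 0 ? n : 0] = '\0';
        printf("per-netns ehash size: %s", buf);
        close(fd);
        return 0;
    }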
The maximum value is 16Mi, and if the value is not a power of two, it will be rounded up. If the system is running out of memory when the netns is created, it falls back to using the global ehash, and when that happens, we can check it by sysctl or by a warning in dmesg.

I originally assumed the users of per-netns ehash would be a bunch of small workloads on a multi-tenant system, so a single fixed size would be enough. But there may be users who want to set different sizes for different networking namespaces. In that case, two sysctl commands can race, and a process could see an unintended size in its new netns. If you want to use the node this way, an additional unshare is needed to create dedicated sysctl nodes, which are then not modified by other networking namespaces.

There is another notable thing: the default values of some sysctl nodes, such as tcp_max_tw_buckets and tcp_max_syn_backlog, are calculated based on the per-netns ehash size, so this requires care when tuning.

And another notable thing is the NUMA policy. The global ehash is allocated at boot time and spread over all available NUMA nodes, but the per-netns ehash is allocated based on the current NUMA policy. So, for example, on a test machine with two NUMA nodes, the global ehash is spread over the first and second nodes, but by default the per-netns ehash is allocated on a single node, unless you change the NUMA policy with numactl or something like that.

So this is the summary of per-netns ehash. It's an optional hash table of connection sockets for each networking namespace. We can control the size by sysctl, and that requires additional tuning related to the sysctl nodes and the NUMA policy. It will improve socket lookup performance when you have too many sockets in ehash or long-lived connections, and it helps netns dismantling with short-lived connections. And the same feature is available for UDP from 6.2.

At last, I would like to thank the networking maintainers. They gave great feedback and reviewed thoroughly; I appreciate their help. Okay, that's all for my presentation. Thank you.