Hello, everyone. The title of our presentation is Towards Low-Latency Implementation of Linear Layers. This is joint work with Wang Weijia, Fan Yanhong, Wu Lixuan, Sun Ling, and Wang Meiqi. The motivation of this paper is the current demand from devices with limited resources, such as Internet of Things devices and radio-frequency identification tags. Because these resource restrictions lead to new security threats, we should use lightweight cryptography to ensure secure encryption and extend cryptographic applications to these devices. There are many criteria for designing lightweight primitives, and the most popular one is the number of gate equivalents (GE) required to implement a cipher, because it nicely approximates the circuit area. Meanwhile, another criterion, latency, is also crucial and has been attracting more and more attention, because it plays an important role in the energy consumption of ciphers. Generally speaking, we have two directions. On the one hand, we can design new lightweight ciphers; SIMON, SPECK, and Midori are several well-known block ciphers. On the other hand, we can optimize the components of existing ciphers. This plays an important role in applications on devices with limited resources, and it has more practical significance. Lightweight components are popular choices when designing lightweight primitives: for the non-linear layer, the S-box plays an important role, and for the linear layer, we usually use an MDS matrix or a near-MDS matrix. This paper follows the second line of work and focuses on the hardware implementation of linear layers, which provide diffusion for many cryptographic primitives. Before introducing our heuristics, we should make clear the metrics used in the proposed algorithms to optimize linear layers. There are two metrics that we focus on. The first metric is the circuit area. We represent it by the number of XOR operations needed to implement the linear layer.
The more XOR operations we use, the larger the circuit area is. Therefore, we focus on reducing the number of XOR operations. To better estimate this number, three kinds of XOR metrics have been proposed for evaluating the implementation of a binary matrix. The first is the d-XOR metric, which counts the Hamming weight of the matrix minus the number of its rows, where the Hamming weight is the number of ones in the matrix. The second, the s-XOR metric, counts the minimum number of in-place operations of the form xi = xi ⊕ xj. The third, the g-XOR metric, counts the minimum number of operations of the form t = xj1 ⊕ xj2, which may generate new values. Let us take the 3 × 5 matrix as an example. Its Hamming weight is 11 and its row number is 3, so its d-XOR count is 8. Actually, this is the worst circuit, because we waste many XOR operations. For the s-XOR metric, it has been proven NP-hard to find the minimum number, so it is hard to find the optimal solution, especially for matrices of larger size. For this 3 × 5 matrix, we can find an implementation with 4 s-XORs. The g-XOR metric can generate new values and save them as intermediate values; for the 3 × 5 matrix, we also need 4 g-XORs to implement it. The d-XOR metric is straightforward, since it uses a direct count to estimate the cost; for further optimization, the s-XOR and g-XOR metrics are used to evaluate matrices. Comparing s-XOR and g-XOR: g-XOR can generate new intermediate values, whereas s-XOR always overwrites original values, and once a value is overwritten the original value can no longer be used. In addition, an s-XOR circuit can always be transformed into a g-XOR circuit. Therefore, we always use g-XOR in this paper. The other metric is latency. More latency means more time to execute the encryption. We use the depth of the matrix's circuit to measure it.
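As a quick sketch, the d-XOR count is easy to compute. The 3 × 5 matrix below is a hypothetical example chosen only to match the numbers in the talk (Hamming weight 11, 3 rows), not necessarily the matrix on the slides:

```python
def hamming_weight(matrix):
    # Number of ones in the matrix.
    return sum(sum(row) for row in matrix)

def d_xor(matrix):
    # d-XOR metric: Hamming weight minus the number of rows; an output
    # of weight w costs w - 1 XORs in the naive direct implementation.
    return hamming_weight(matrix) - len(matrix)

M = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 0, 0, 0],
]
print(hamming_weight(M))  # 11
print(d_xor(M))           # 8
```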
The depth is defined as the maximum number of XOR operations along any path in the circuit. In this paper, we always focus on the minimum depth of the circuit. A circuit with the minimum depth can execute the encryption faster in hardware, and our experiments confirm this. Let us give more detailed definitions. Given the matrix, the input is x0 to x(n−1), and the output is y0 to y(n−1). Each output value yi satisfies yi = a(i,0)·x0 ⊕ … ⊕ a(i,n−1)·x(n−1), and is associated with the vector (a(i,0), …, a(i,n−1)). This vector is called the node yi. Then we can compute the minimum depths of a node and of the matrix using the following equations: the minimum depth of a node is the ceiling of the base-2 logarithm of its Hamming weight, and the minimum depth of the matrix is the maximum of the minimum depths of its nodes. We ensure that the depth of our implementation of a matrix equals its minimum depth, which leads to the best latency. We again use the 3 × 5 matrix as an example. The depth of the first row is 3, the depth of the second row is 2, and the depth of the third row is 1. Therefore, the depth of this matrix is 3. Next, we propose the backward framework. The backward framework is an approach that searches for a solution starting from the target nodes, choosing a node iteratively and splitting it into two nodes until all nodes are unit nodes. The target nodes are the output values of the matrix, and the unit nodes are its input values. The backward framework returns a directed graph. In the graph, the indegree of each node is 0 or 2: every unit node has indegree 0, and every non-unit node has indegree 2 and represents an XOR operation. Then we solve two fundamental problems. The first is how to split nodes with respect to the minimum depth, for example how to split Y0 into T0 and T1. The second is how to ensure that the output the framework returns has the minimum depth.
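The two depth equations can be illustrated as follows. The matrix here is a hypothetical 3 × 5 example chosen to match the row depths 3, 2, 1 stated above, not the one from the slides:

```python
import math

def node_min_depth(row):
    # Minimum depth of a node: ceil(log2 of its Hamming weight), since a
    # balanced XOR tree over hw(row) inputs reaches exactly that depth.
    return math.ceil(math.log2(sum(row)))

def matrix_min_depth(matrix):
    # Minimum depth of the matrix: the maximum over its nodes.
    return max(node_min_depth(row) for row in matrix)

M = [
    [1, 1, 1, 1, 1],  # weight 5 -> depth 3
    [1, 1, 1, 1, 0],  # weight 4 -> depth 2
    [1, 1, 0, 0, 0],  # weight 2 -> depth 1
]
print([node_min_depth(r) for r in M])  # [3, 2, 1]
print(matrix_min_depth(M))             # 3
```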
Proposition 1 helps us execute the splitting process and solves the first problem: for any node Y, we can always find two nodes with smaller depth to split Y. We give an example. Suppose that Y = (1, 1, 1, 1, 1, 0), whose depth is 3. Since the minimum depth of each part must be at most 2, the Hamming weight of the first part can be 3 or 4. We choose randomly between 3 and 4, and we choose 3. The new node is A = (1, 1, 1, 0, 0, 0), with depth 2. Then the other part B = (0, 0, 0, 1, 1, 0) can be computed, with depth 1. To solve the second problem, we first define two sets. The working set contains the nodes that we need to split. The predecessor set contains the nodes that we do not split at this stage. Note that the nodes in the predecessor set can be reused. Then we can see that, given a matrix, we can always find a graph from the target nodes to the unit nodes, because the framework on the right ensures that the output has the minimum depth. For a given matrix, the working set initially contains all the target nodes. Then we split the nodes in the working set iteratively until the working set is empty. Next, we move all the predecessor nodes into the working set. We repeat these procedures until the working set contains only unit nodes, which means that we have finished the splitting. The example shows the procedure: the depth of the working set decreases from 3 to 0, and each time we pull the nodes from the predecessor set into the working set, the depth decreases by 1. When the depth of the working set is 0, the framework terminates. Then we propose the heuristics. Heuristics are necessary because an exhaustive search is infeasible. In the graph, every non-unit node represents an XOR gate, and some nodes in the graph can be reused. Therefore, we hope to reuse nodes as much as possible. We give five rules for splitting nodes so as to reuse the nodes in the predecessor set. Rule one deals with nodes with smaller depth.
If some nodes in the working set have smaller depth, we move them from the working set into the predecessor set. Rule two deals with nodes that can be split directly: we do not need to generate new nodes, and simply use nodes in the predecessor set to split them. Rule three requires us to generate one new node to split a node in the working set, because the nodes in the predecessor set alone cannot split it. Rule four requires us to generate three new nodes to split two nodes, because no nodes in the predecessor set can be used. Rule five is the default method: we use two new nodes to split one node. It is the most expensive way, and we try to avoid it. The costs are shown on the right; we can compare them and give the order. Now we compare the two frameworks used to optimize linear layers. We first introduce the forward framework, which is very intuitive. The forward framework begins with the input values x0, x1, …, xm. It then chooses two values, combines them, and generates new values, such as p0, p1, …, pm. We repeat this procedure iteratively; suppose that we then generate q0, q1, …, qm. Finally, every output value yi is generated, and we obtain a circuit from input values to output values. The key point of this framework is that the search for the circuit begins with the inputs, and how the outputs will be generated is unknown. Then we introduce the framework proposed by this paper. The backward framework takes the opposite view: we first deal with every output value yi. The goal of the backward framework is to split every output value into input values. Generally, we cannot obtain the input values directly; some intermediate values are necessary. Therefore, we use values qi to split yi, then use values pi to split qi, and finally use the input values xi to split pi. We thus obtain an implementation of the matrix from output values to input values. Of course, because the XOR operation is invertible, we can convert it into a circuit from input values to output values with the same number of XOR operations.
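To make the backward framework concrete, here is a minimal sketch under our own simplifying assumptions: nodes are 0/1 tuples, every split is a balanced cut of the ones (which, in the spirit of Proposition 1, always yields two parts of smaller depth), and a node is reused whenever it has already been split. The paper's five rules choose splits far more carefully; all names and the example matrix are ours:

```python
import math

def min_depth(node):
    # Minimum depth of a node: ceiling of log2 of its Hamming weight.
    return math.ceil(math.log2(sum(node)))

def xor(a, b):
    return tuple(x ^ y for x, y in zip(a, b))

def balanced_split(y):
    # Cut the ones of y into two halves; each half carries at most
    # 2^(d-1) ones, so both parts have depth at most depth(y) - 1.
    ones = [i for i, bit in enumerate(y) if bit]
    half = (len(ones) + 1) // 2
    a = tuple(1 if i in ones[:half] else 0 for i in range(len(y)))
    return a, xor(y, a)

def backward_sketch(matrix):
    # Split nodes in the working set until only unit nodes remain,
    # reusing any node that was already split earlier.
    working = [tuple(r) for r in matrix if sum(r) > 1]
    gates = {}                      # node -> (a, b) with node = a xor b
    while working:
        y = working.pop()
        if y in gates:              # already split: reuse this node
            continue
        a, b = balanced_split(y)
        gates[y] = (a, b)
        working += [t for t in (a, b) if sum(t) > 1]
    return gates                    # one XOR gate per split node

M = [(1, 1, 1, 1, 1), (1, 1, 1, 1, 0), (1, 1, 0, 0, 0)]
g = backward_sketch(M)

def depth(node):
    # Depth of a node in the generated circuit.
    return 0 if sum(node) == 1 else 1 + max(depth(t) for t in g[node])

print(len(g))                                           # 6 XOR gates
print(all(depth(tuple(r)) == min_depth(r) for r in M))  # True
```

Even this naive policy reuses the shared node (1, 1, 0, 0, 0) across several splits and attains the matrix's minimum depth of 3; the five rules exist to push the gate count down much further.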
Next, we compare these two frameworks. The forward framework maintains a base set and chooses two values to combine directly; when every output value can be computed, it stops. Finally, we obtain a circuit built by combining from inputs to outputs. The backward framework maintains a set, chooses a node from the set, and splits it into two nodes; when no values remain to be split, it stops. Finally, we obtain a circuit built by splitting from output values to input values. In addition, we introduce the advantage of the backward framework with respect to the minimum depth. Suppose that we have the input values; the problem is how to decide which values to combine. According to existing heuristics, we only select the choices that reduce the distance to the output values. Although many heuristics perform well without a limitation on depth, it is difficult to judge which choice performs better with respect to the minimum depth, because if a new value exceeds the minimum depth, we must discard it, even if it would be a good choice for reducing the number of XOR operations. For the backward framework, it is easy to control the depth of every node, because we can split a value using values with smaller depth, and the circuit generated by the backward framework always has the minimum depth. We give an example. Consider a matrix to be implemented whose minimum depth is 3. If we use a forward framework such as the BP algorithm, we first generate T1, T2, and T3; because they are the output values, they have higher priority in the BP algorithm. The depth of T3 is 3, which means that T3 cannot be used to generate any further values. Finally, the circuit needs 11 XOR operations. In the backward framework, however, we can obtain a circuit in which the depth of Y4 is 2; then Y4 can be reused to generate Y0, Y1, and Y2. Finally, we can generate a circuit with only 9 XOR operations. We also apply our algorithm to many matrices in the literature, some of which are used as linear layers elsewhere, as well as the 4,256 MDS matrices proposed in LS19. The results are as follows.
This table shows the implementations found by our algorithm. Our algorithm can find better implementations in many cases, and for every result the depth reaches the minimum depth. The next table shows the results for many MDS matrices: we can optimize more than half of the matrices beyond previous results. The last table shows the results for the AES MixColumns matrix in hardware. We find that power is correlated with latency: although some implementations have fewer XOR operations, their power and latency are larger. Our implementations have the smallest power and latency. That's all. Thanks for your attention.