Hello, everyone. The title of our presentation is Optimizing Implementations of Linear Layers. This is joint work with Xiangyong Zeng, Da Lin, Zhenzhen Bao, and Shasha Zhang.

The motivation of this paper is the current demand for security in resource-constrained scenarios. Basically, we have two solutions for this: we can either design new lightweight ciphers, or optimize the implementations of existing ciphers. Even though we do not have an exact definition of "lightweight", a lightweight design in hardware usually has a low circuit area. SIMON, SPECK, PRESENT, RECTANGLE, LED, and Midori are several well-known lightweight block ciphers. Apart from dedicated designs such as the SIMON block cipher, lightweight components are a more popular choice for designing lightweight primitives. There have been many designs of lightweight MDS or near-MDS matrices. However, if an S-box is used as the nonlinear layer, it seems that 4-bit S-boxes are the most popular choice. On the other hand, optimizing already existing ciphers, such as AES, has practical significance, because these ciphers have been standardized and deployed extensively. Gladman proposed a search algorithm to find bitsliced implementations of S-boxes. A SAT-based method was proposed by Stoffelen to find optimal implementations of S-boxes for several criteria. Jean presented an exhaustive search method for small S-boxes. As for the linear layers, there are two methods for optimization, the so-called local optimization and global optimization. Local optimization focuses on optimizing the elements of a block matrix, using different bases of a finite field or reusing some intermediate values. In global optimization, however, we first convert the block matrix into a binary matrix and then explore heuristics to reduce the cost. Following this line of research, in this talk we focus on global optimization and present a heuristic search algorithm.

Before introducing our heuristic, we should make clear the metrics we can use to estimate the cost of linear layers. Since linear layers can always be represented by binary matrices and implemented by XOR operations, it is straightforward to use the XOR count to estimate the cost. Generally, we have three kinds of XOR counts. The first is the so-called direct XOR. The direct XOR count of a matrix is the Hamming weight of the matrix minus its number of rows. Under this metric, a lightweight matrix is a sparse matrix. Let's take this 4×4 matrix as an example: its Hamming weight is 10, minus its number of rows, 4, so its direct XOR count is 6. The second is what we call the general XOR. It is the minimum number of operations of the form x_i = x_j + x_k. However, it has been proven NP-hard to find the minimal general XOR count, so it is hard to always find optimal solutions, especially for large matrices. For the same 4×4 matrix, we can find an implementation using only 5 g-XORs. The sequential XOR is another kind of XOR count. It counts the minimal number of in-place operations of the form x_i = x_i + x_j. This corresponds to an optimal way of performing Gauss-Jordan elimination. Taking the 4×4 matrix as an example again, this matrix can be implemented using 5 s-XORs. In addition, the sequential XOR metric has several advantages. First, it can be efficient in bitsliced implementations: due to the destructive nature of the XOR instruction, which overwrites one of its operands, one g-XOR requires two assembly instructions, that is, a copy in the first step and an XOR in the second step. However, one s-XOR requires only one assembly instruction. Another advantage of the s-XOR is that it is quite friendly for quantum computing, because an s-XOR operation can be implemented by a CNOT gate, whereas a general XOR would be much more expensive to implement. In this talk, we focus on the sequential XOR in the following.
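To make the three counts concrete, here is a minimal Python sketch. Since the slide's 4×4 matrix is not reproduced in the transcript, it uses a stand-in invertible matrix M with the same parameters: Hamming weight 10, hence direct XOR count 6, and a five-operation in-place program.

```python
# Stand-in 4x4 invertible binary matrix (the slide's matrix is not in the
# transcript): Hamming weight 10, so its direct XOR count is 10 - 4 = 6.
M = [[1, 1, 1, 0],
     [0, 1, 1, 1],
     [1, 0, 1, 1],
     [0, 0, 0, 1]]

def direct_xor_count(M):
    # Direct XOR count: Hamming weight minus the number of rows.
    return sum(map(sum, M)) - len(M)

def sxor_program(x):
    # A five-operation in-place (s-XOR) program computing M*x: every step
    # has the form x_i = x_i + x_j, one operation fewer than direct XOR.
    x = x[:]
    x[1] ^= x[2]
    x[2] ^= x[0]
    x[0] ^= x[1]
    x[2] ^= x[3]
    x[1] ^= x[3]
    return x

def mat_vec(M, x):
    # Reference matrix-vector product over GF(2).
    return [sum(m * v for m, v in zip(row, x)) % 2 for row in M]

assert direct_xor_count(M) == 6
for n in range(16):  # check the program against M on all 16 inputs
    x = [(n >> b) & 1 for b in range(4)]
    assert sxor_program(x) == mat_vec(M, x)
```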
We know that there are three kinds of elementary operations in linear algebra, and if an elementary operation is performed on an identity matrix, we get an elementary matrix. Besides, performing an elementary row operation on a matrix is equivalent to left-multiplying by the corresponding elementary matrix, and performing an elementary column operation is equivalent to right-multiplying by the corresponding elementary matrix.

The first elementary operation is interchanging two rows or columns of a matrix. If we interchange the first and second rows of the identity matrix, we get E_{1↔2}; we call this kind of matrix a type-1 elementary matrix. The second elementary operation is multiplying a row or column by a nonzero number. Because we only consider matrices over the binary field, the only nonzero number is 1, so a type-2 elementary operation does not change a matrix, and we will not consider this type of elementary matrix in the following. The last elementary operation is adding a row (or column) multiplied by a nonzero number to another; over the binary field, this is simply adding one row to another. E_{1+2} is an example of a type-3 elementary matrix.

Now we consider the cost of each elementary matrix. Consider this type-1 elementary matrix as an example: multiplying it with an input vector, the resulting output is just a rearrangement of the input bits, so it has low cost in a hardware implementation. Similarly, consider this type-3 elementary matrix: multiplying it with an input vector, the output vector just adds the second bit to the first bit, and this requires one s-XOR.

From linear algebra, we know that any invertible matrix can be transformed into an identity matrix using elementary operations, so any invertible matrix can be decomposed as a product of elementary matrices. Because we only consider matrices over the binary field, this theorem can be adapted as follows: any invertible binary matrix can be transformed into an identity matrix by applying a series of type-1 and type-3 elementary operations, so any invertible binary matrix can be decomposed as a product of type-1 and type-3 elementary matrices. Note that type-1 matrices have low cost, and each type-3 matrix costs one s-XOR, so the s-XOR cost of a matrix equals the number of type-3 matrices in its decomposition. In this case, if we can reduce the number of type-3 matrices, we can reduce the cost.

Before we try to reduce the cost, we have to decompose the matrix first. It is easy to get a matrix decomposition by Gaussian elimination; that is, we can perform only elementary row operations to transform a matrix into an identity matrix, or we can perform only elementary column operations. However, experimental results show that these two ways are not very good, because the initial decomposition contains too many type-3 matrices. So we present a hybrid method: at each step we try all possible elementary operations and pick the one that removes the most ones. This is because we want to reach the identity matrix in the end, and the identity matrix contains the fewest ones, so if we reduce the number of ones as much as possible in each step, we can expect a shorter matrix decomposition. However, this hybrid method has two problems. The first is that we may have multiple choices in a certain step; if this happens, we pick one at random. The other problem is that we may encounter a case where the number of ones in the matrix increases no matter which elementary operation is performed; in that case the procedure would fall into an infinite loop, so if this happens, we switch to pure elementary row operations or column operations for the following steps.
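Here is a minimal Python sketch of such a hybrid method, under the assumptions above; the names and the stopping condition are ours, not the paper's. The greedy step tries every type-3 row and column addition and keeps one that removes the most ones, breaking ties at random, and the fallback to pure row or column operations is only signalled, not implemented.

```python
import random

def is_permutation(M):
    # An invertible binary matrix whose rows all have weight 1 is a
    # permutation matrix; type-1 (swap) matrices are essentially free.
    return all(sum(row) == 1 for row in M)

def hybrid_decompose(M):
    """Greedy sketch of the hybrid method: repeatedly apply the type-3 row
    or column addition that removes the most ones, until only a free
    type-1 rearrangement remains.  Each recorded operation costs 1 s-XOR.
    The talk's fallback to pure row/column operations (taken when no
    operation reduces the weight) is only signalled here."""
    n = len(M)
    M = [row[:] for row in M]
    ops = []
    while not is_permutation(M):
        best, candidates = 0, []
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # Weight change if row_i += row_j (left-multiply by E_{i+j}).
                dr = sum((M[i][k] ^ M[j][k]) - M[i][k] for k in range(n))
                # Weight change if col_i += col_j (right-multiply by E_{i+j}).
                dc = sum((M[k][i] ^ M[k][j]) - M[k][i] for k in range(n))
                for delta, op in ((dr, ("row", i, j)), (dc, ("col", i, j))):
                    if delta < best:
                        best, candidates = delta, [op]
                    elif delta == best and delta < 0:
                        candidates.append(op)
        if not candidates:
            raise NotImplementedError("fall back to pure row/column operations")
        kind, i, j = random.choice(candidates)  # random tie-breaking
        ops.append((kind, i, j))
        if kind == "row":
            M[i] = [a ^ b for a, b in zip(M[i], M[j])]
        else:
            for row in M:
                row[i] ^= row[j]
    return ops
```

Stopping at a permutation matrix rather than the identity matches the talk's convention that type-1 matrices have essentially no cost.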
After we get a matrix decomposition, we can always rearrange the order of the elementary matrices. This property allows us to swap the positions of a type-1 and a type-3 matrix, with the type-3 matrix slightly modified. Based on this property, any invertible matrix can be decomposed as a series of type-1 and type-3 matrices where all the type-3 matrices are on the left and all the type-1 matrices are on the right. For the sake of simplicity, we will not consider the type-1 matrices, because they have low cost, and we will always assume that an invertible matrix is expressed as a product of type-3 matrices.

Now we have a matrix decomposition with only type-3 matrices, and the cost equals the length of the decomposition. Next, we present seven reduction rules to reduce the length of the decomposition. Each rule listed in this property can be used to reduce the cost by one XOR; the proof can be found in the paper. However, we present an example here to illustrate the first rule. Multiply the matrices on the left side of the equation with a column vector. Since E_{i+j} adds x_j to x_i, after this step x_i is updated to x_i + x_j. Then E_{k+j} adds x_j to x_k, and E_{k+i} adds x_i to x_k. So x_i is updated to x_i + x_j, and x_k is updated to x_k + x_i in the end. This is equivalent to adding x_i to x_k first and then adding x_j to x_i, and these two steps correspond to E_{i+j} times E_{k+i}. So rule 1 holds.

Since matrix multiplication does not in general satisfy the commutative law, we can only choose three or two consecutive matrices in the decomposition and check whether they match the reduction rules. But this may not reduce the cost to a large extent. This property presents three special cases where matrix multiplication does fulfill the commutative law. We present an example here that shows how we can combine the three commutative cases with the reduction rules. In this example, we assume that M is decomposed into four type-3 matrices. If we look at the first or the last three consecutive matrices, they do not match any reduction rule. However, the first two matrices commute with each other, so we can swap them. After this step, we find that the last three consecutive matrices in the new decomposition match rule 1, so we can use rule 1 to update the decomposition and reduce the cost by one XOR.

We present a framework here that illustrates our reduction algorithm. The algorithm takes a matrix decomposition as input and picks three matrices from the decomposition; we denote these three matrices by the white, gray, and black bars. Then we check whether these three matrices match one of the reduction rules. If this is the case, we check whether the white matrix commutes with the matrices in the orange block; if so, we can move the white matrix backward to the bottom of the orange block. At last, we check whether the black matrix commutes with the matrices in the blue block; if so, we can move the black matrix backward to the top of the blue block. Now we have three consecutive matrices in the decomposition that match a reduction rule, so we can easily reduce the cost by 1.
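Written out, the derivation of rule 1 above says E_{k+i} · E_{k+j} · E_{i+j} = E_{i+j} · E_{k+i}, where the rightmost matrix acts on the vector first. The following sketch checks this instance of the rule over GF(2); the encoding of E_{i+j} follows the description in the talk, not the paper's exact notation.

```python
def E(n, i, j):
    # Type-3 elementary matrix E_{i+j}: the identity plus a 1 at (i, j),
    # i.e. left-multiplication adds row j to row i (x_i += x_j).
    M = [[int(r == c) for c in range(n)] for r in range(n)]
    M[i][j] = 1
    return M

def mul(A, B):
    # Matrix product over GF(2).
    n = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(n)) % 2
             for c in range(n)] for r in range(n)]

# Rule 1 with (i, j, k) = (0, 1, 2): three type-3 matrices collapse to two.
n, i, j, k = 3, 0, 1, 2
lhs = mul(mul(E(n, k, i), E(n, k, j)), E(n, i, j))
rhs = mul(E(n, i, j), E(n, k, i))
assert lhs == rhs  # one s-XOR saved
```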
Using the reduction algorithm as a subroutine, our search algorithm can be devised. We first decompose the given matrix M using the hybrid method. Then we pick a segment from the decomposition; we use an orange block to denote this segment. This segment contains several type-3 matrices. We can multiply these matrices to get an invertible matrix, and then decompose this matrix using the hybrid method again to get an equivalent decomposition; so the green block is an equivalent decomposition of the orange block. Then the reduction algorithm is applied to this decomposition to reduce the cost. We can repeat this procedure many times until we find good implementations.

This table shows the applications of our search algorithm. Our search algorithm can find better implementations in most cases. In particular, the cost of AES MixColumns is 92 XORs. Here we show another advantage of our search algorithm. Since it is based on matrix decomposition, we can easily check that if M is decomposed into n type-3 matrices, the inverse of M can also be decomposed into n type-3 matrices, so the cost of a matrix and its inverse are equal. As a direct application, we can implement the inverse AES MixColumns with 92 XORs as well.
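One way to see this property at the level of implementations: over GF(2) every type-3 operation x_i = x_i + x_j is its own inverse, so running an s-XOR program in reverse order implements the inverse matrix at the same cost. A small sketch, reusing the stand-in five-operation program from the beginning of the talk:

```python
# Over GF(2) each in-place operation x_i ^= x_j is an involution, so
# reversing an s-XOR program yields a program for the inverse matrix
# of the same length.  We reuse the stand-in 5-operation program above.
program = [(1, 2), (2, 0), (0, 1), (2, 3), (1, 3)]  # pairs (i, j): x_i ^= x_j

def run(program, x):
    x = x[:]
    for i, j in program:
        x[i] ^= x[j]
    return x

inverse_program = list(reversed(program))  # same cost: 5 s-XORs
for n in range(16):
    x = [(n >> b) & 1 for b in range(4)]
    assert run(inverse_program, run(program, x)) == x
```

That's all. Thank you.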