Hello, my name is Zhang Zhendong. I'm a PhD student of Professor Liu Peng at Zhejiang University. Today I'm glad to share our work on a hybrid CPU-FPGA solution to password recovery for sha256crypt.

sha256crypt is a widely used key derivation function adopted by many Linux distributions. Its key role is to prevent the clear-text password from being easily obtained by malicious attackers. To this end, the clear-text password is processed by the KDF and stored as a password hash. By comparing password hashes, the system knows whether the user's password is correct. Meanwhile, since sha256crypt is a one-way function, attackers cannot directly recover the clear-text password even if they obtain the password hash. This property makes brute-force attack the only way to recover the password from the password hash.

So how does a brute-force attack work? This diagram shows it. The original password P is processed by the KDF, generating the password hash H. The attacker applies the KDF to every password in the search space, then matches each generated hash against the password hash H. If H_x matches H, the original password is P_x. The time cost of a brute-force attack depends on two factors. The first is the KDF execution time, which cannot be reduced, since KDFs are deliberately designed to be time-consuming. The second is the size of the search space, which is huge in most scenarios. For example, to achieve a recovery probability of 80%, about one billion password trials are needed. Attackers therefore pay an enormous cost in time and energy, and the same holds for users who want to retrieve their own forgotten passwords. That's why we need a fast and energy-efficient accelerator. However, sha256crypt has some features that make it difficult to crack on special-purpose hardware. First, it has random execution paths for different passwords.
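As an aside, the brute-force procedure in the diagram can be sketched in a few lines. This is a toy illustration only: `kdf` here is plain SHA-256 standing in for sha256crypt, and all names are hypothetical, not code from the accelerator.

```python
import hashlib

def kdf(password: bytes) -> bytes:
    # Stand-in KDF for illustration; the real target is sha256crypt,
    # which also takes a salt and an iteration count and is far slower.
    return hashlib.sha256(password).digest()

def brute_force(target_hash: bytes, search_space):
    # Apply the KDF to every candidate P_x and compare H_x against H.
    for candidate in search_space:
        if kdf(candidate) == target_hash:
            return candidate  # original password found
    return None

# Tiny demonstration search space.
space = [b"123456", b"qwerty", b"hunter2", b"letmein"]
target = kdf(b"hunter2")
recovered = brute_force(target, space)
```

In reality the search space holds billions of candidates and the KDF is deliberately slow, which is exactly the cost the accelerator attacks.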
For example, passwords 1, 2 and 3 may have different and unpredictable numbers of message blocks. With such features, sha256crypt is hard to accelerate with pipelining. Second, it has a complex data access pattern, which makes it hard to design a high-bandwidth datapath from the RAM to the compute unit. As a result, current solutions are limited to low-bandwidth schemes, typically 32 bits, such as GPUs or the sha256crypt accelerator in John the Ripper, which means at least 16 cycles are needed to generate a 512-bit message block.

In this paper, we analyze the structure of sha256crypt and demonstrate the difficulties of accelerating it with special-purpose hardware: the data dependency that stalls the pipeline, the random execution paths, and the complex data access pattern. We also found a structural weakness in sha256crypt that makes it possible to remove the randomness. Based on these difficulties and this weakness, we propose a fast and energy-efficient sha256crypt accelerator built on several techniques: group schedule, look-ahead execution, and datapath pruning and multiplexing. Compared with state-of-the-art works, our accelerator achieves significant improvements in performance and energy efficiency.

To find out what is so special about sha256crypt, we should first take a deep look into its details. sha256crypt is specified as follows. It takes a password string, a random salt string, and an iteration count n as input, and outputs the password hash h. Inside the sha256crypt function, we first need to generate four messages and calculate their SHA-256 digests. For the sake of illustration, we assume the password is 6 bytes and the salt is 8 bytes. Message A is generated by appending the password, the salt, and the password again. The number marked on top of each segment represents the size of that segment.
Then we calculate the SHA-256 digest of message A and get digest A. Message B is generated by appending the password, the salt, and the first 6 bytes of digest A. Then we consider the length of the password in binary form: starting from the least significant bit, if a bit is 0 we append the password string, and if it is 1 we append digest A. Then we calculate the digest and get digest B. Message P is generated by appending the password string 6 times. Then we calculate digest P and take its first 6 bytes to get temp P. Message S is generated by appending the salt string 16 + db0 times, where db0 is the value of the first byte of digest B. Then we calculate digest S and take its first 8 bytes to get temp S.

The next step is to iteratively generate message C and calculate digest C. First, digest C is initialized with digest B. Then digest C, temp P, temp S, and the iteration counter i are processed by the crypt rules to generate message C. Next, we calculate the digest of message C and update digest C. After each round, the iteration counter i is increased by 1; when i equals n, the iteration is finished.

The last step is to generate the password hash. Here is an example of a sha256crypt password hash. There are four segments, and adjacent segments are separated by a $ character. The first segment is the ID, which is 5 for sha256crypt. The second segment is the total number of iterations; the default value is 5,000. The third part is the salt string, and the last part is the Base64-encoded digest C. This step can be easily reversed, so it is not implemented in the accelerator.

To accelerate the brute-force attack on sha256crypt, it is important that the accelerator have high parallelism, because brute-force attack is easy to parallelize. Typically, there are two kinds of design to improve parallelism. One is to build the accelerator from many relatively simple processing cores.
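The generation of messages A, B, P, and S and the iteration described above can be sketched in software. This is an unofficial sketch with illustrative names: the per-round mixing rule, which the talk leaves to "the crypt rules", is filled in here from the public sha256crypt reference description, and the sketch has not been checked against official test vectors.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def sha256crypt_digest(password: bytes, salt: bytes, rounds: int = 5000) -> bytes:
    # Message A: password + salt + password.
    digest_a = sha256(password + salt + password)
    # Message B: password + salt + first len(password) bytes of digest A,
    # then one append per bit of len(password), starting from the LSB:
    # the password for a 0 bit, digest A for a 1 bit.
    msg_b = password + salt + digest_a[:len(password)]
    n = len(password)
    while n > 0:
        msg_b += digest_a if n & 1 else password
        n >>= 1
    digest_b = sha256(msg_b)
    # Message P: the password repeated len(password) times; temp P is
    # the first len(password) bytes of its digest.
    temp_p = sha256(password * len(password))[:len(password)]
    # Message S: the salt repeated 16 + db0 times, where db0 is the first
    # byte of digest B; temp S is the first len(salt) bytes of its digest.
    db0 = digest_b[0]
    temp_s = sha256(salt * (16 + db0))[:len(salt)]
    # Iteration: digest C starts as digest B; each round mixes digest C,
    # temp P and temp S depending on the round counter i (mixing rule
    # taken from the public sha256crypt reference, not from the talk).
    digest_c = digest_b
    for i in range(rounds):
        msg_c = temp_p if i & 1 else digest_c
        if i % 3:
            msg_c += temp_s
        if i % 7:
            msg_c += temp_p
        msg_c += digest_c if i & 1 else temp_p
        digest_c = sha256(msg_c)
    return digest_c
```

Note how the block counts depend on the password length, the salt length, and the random byte db0, which is exactly the randomness the accelerator has to tame.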
Examples of the first kind are GPUs and the FPGA-based sha256crypt accelerator in JtR. In such a design, each core has its own control logic, which makes scheduling easy but costs more hardware resources. The other kind of design builds the accelerator around a deep pipeline. With a deep pipeline, the scheduling of different passwords is unified; as a result, it needs less control logic than the multi-core design and is therefore more efficient. This kind of design has been adopted by many researchers and achieves good performance and energy efficiency, for example the WPA2 accelerator presented at CHES 2016 and the RAR5 accelerator presented in TC 2019.

However, the pipeline has its limitations. Almost all pipeline-based accelerators target PBKDF-based algorithms, which have a regular data access pattern and unified execution paths for different passwords. These features make the pipeline easy to schedule. For KDFs like sha256crypt, however, the complex data access pattern and the random execution paths make pipeline scheduling very difficult.

To understand why, let's view sha256crypt from another perspective. As we can see, the operations in sha256crypt can be classified into two categories. First, we generate a message by some rules. Second, we calculate the SHA-256 digest of the message. We also want to review the structure of the SHA-256 function, which is shown here. The input message is first padded with a one bit followed by several zeros and the 64-bit message length. The padded message is then divided into 64-byte blocks. Each block is processed by the block-transform function, producing a temporary digest that is used as the state of the next block-transform function. If we combine the details of the SHA-256 function with the details of the sha256crypt function, the operations in sha256crypt can be further classified into the following categories. First, we generate a 64-byte block.
Second, we calculate the digest of the block with the block-transform function. In this way, the execution of sha256crypt is abstracted as the generation and transformation of a series of message blocks, as shown here.

Now we can see why it is difficult to accelerate sha256crypt with a pipeline. The first problem is the data dependency between adjacent message blocks: the generation of a block depends on the digest of its previous block. With a 64-stage pipeline, the pipeline has to stall for 64 cycles until the digest computation finishes, which drastically slows down the accelerator. The second problem is the random execution path. As mentioned before, the number of blocks is decided by the length of the message, and too many factors affect that length: the length of the password, the length of the salt, and even the random byte db0. If passwords have different numbers of blocks to process, scheduling the pipeline becomes very difficult. The last problem is the complex data access pattern. Generating a block amounts to selecting 64 bytes from the data buffer, but there is no simple rule for the hardware to know which byte of the input sources should be placed at which byte of the message block. If we want to generate a block in one cycle, a total of 64 218-to-1 multiplexers is needed, which introduces great overhead in hardware resources and increases the critical-path latency.

Now we come to how to solve these problems. By classifying the operations in sha256crypt as generating and transforming blocks, we found that this process is like a factory with two workers and a warehouse. The first worker selects 64 raw materials from the warehouse, and the second worker transforms the raw materials into products. The products are then put back into the warehouse as new raw materials.
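Recall the SHA-256 structure described a moment ago: the message is padded with a one bit, zeros, and the 64-bit message length, then split into 64-byte blocks. A minimal sketch (`pad_and_split` is a hypothetical helper name, not part of any library):

```python
def pad_and_split(message: bytes):
    """Pad a message as SHA-256 does (0x80, zeros, then the 64-bit
    big-endian length in bits) and split it into 64-byte blocks."""
    bit_len = len(message) * 8
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)  # pad up to 56 mod 64
    padded += bit_len.to_bytes(8, "big")           # 64-bit length field
    return [padded[i:i + 64] for i in range(0, len(padded), 64)]

blocks = pad_and_split(b"A" * 20)  # e.g. the 20-byte message A
```

A 20-byte message fits in a single 64-byte block, while longer messages spill into more blocks, which is why the block count varies from password to password.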
This factory inspired the design of our accelerator. The core of our accelerator is mainly composed of three parts: the data buffer, the data dispatch unit, and the block transform unit. The data buffer is like the warehouse: it stores all the variables in sha256crypt. The data dispatch unit is like worker one: it generates a 64-byte block each cycle. The pipelined block transform unit is like worker two: it transforms the block and outputs the digest. The digest is then stored in the data buffer as a new input source. On top of this basic architecture, we apply several techniques to solve the difficulties mentioned above.

To solve the data dependency problem, we use the group schedule technique. Noticing that there is no data dependency between blocks of different passwords, we can process a group of passwords together. Assuming a three-stage pipeline, to avoid pipeline stalls we can group three passwords and feed the pipeline in the following order. In our implementation, we choose a 64-stage pipeline. Without group scheduling, the pipeline stalls due to the data dependency and processes only one block every 64 cycles, which is only about a 2% utilization rate. If we group 2,048 passwords together, the pipeline can process 2,048 blocks in 2,112 cycles, a utilization rate close to 97%.

The look-ahead execution technique is proposed to solve the random execution path. As mentioned above, sha256crypt has random execution paths because different passwords have different, unpredictable numbers of blocks to process. This inconsistency in the number of blocks comes from two sources. For messages A, B, C, and P, it comes from the variation in password length, so it can easily be removed by sorting the passwords and grouping them by length. For message S, it comes from the randomness of db0.
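The group-schedule utilization figures quoted above can be double-checked with a quick calculation. This is a simplified model (it ignores any refill overhead between groups); the names are illustrative only.

```python
PIPELINE_DEPTH = 64

def grouped_cycles(group_size: int, depth: int = PIPELINE_DEPTH) -> int:
    # G independent passwords rotate through a D-stage pipeline: after
    # the pipeline fills once, one block completes per cycle, so G
    # blocks take about G + D cycles.
    return group_size + depth

# Stalled case (no grouping): one block finishes every 64 cycles.
util_stalled = 1 / PIPELINE_DEPTH            # about 1.6%, roughly the 2% quoted
# Grouped case: 2048 blocks in 2048 + 64 = 2112 cycles.
util_grouped = 2048 / grouped_cycles(2048)   # close to 97%
```

The larger the group relative to the pipeline depth, the closer utilization gets to 100%, which is why a 2,048-password group is chosen for a 64-stage pipeline.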
To remove this remaining inconsistency, we propose look-ahead execution, which is based on the following observations. First, there is only one salt when cracking one password hash. Second, there are only 256 possible values of db0, which means we can calculate all 256 possible values of digest S in advance. The hardware implementation of look-ahead execution is very simple. We first calculate all possible values of digest S on the CPU and store them in the LE buffer. Once the calculation of digest B is finished, db0 is used as the address into the LE buffer; the corresponding value of digest S is read out and stored in the digest S buffer. With the look-ahead execution technique, the calculation of digest S is skipped on the hardware, and the randomness of the execution path is removed.

To solve the problem caused by the complex data access pattern, we provide an efficient design of the data dispatch unit. Consider an intuitive design where each byte of the message block is connected to all bytes of the input sources by a 218-to-1 multiplexer, and each multiplexer is controlled by a control signal from a finite state machine. Each cycle, the state machine issues a group of control signals, and 64 bytes are selected from the input sources to compose a block. Since each control signal needs 8 bits and 815 states are required to support passwords from 6 to 16 bytes, a total of 52 kilobytes of memory is needed for the finite state machine. We also note that there are about 14,000 connections, which would cost about 50% of the LUTs on our experimental platform. Such a big overhead is obviously unacceptable.

As a result, we propose the datapath pruning and spatial-temporal multiplexing techniques. The datapath pruning technique is based on the following observations. First, some variables can share the same buffer, for example digest B and digest C, because they are never accessed at the same time.
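Before moving on, the LE buffer behind look-ahead execution is cheap to sketch in software. The function names here are illustrative, with an 8-byte salt as in the running example; message S follows the construction given earlier in the talk.

```python
import hashlib

def build_le_buffer(salt: bytes):
    # Precompute digest S for all 256 possible values of db0 on the CPU.
    # Message S is the salt repeated (16 + db0) times, and temp S is the
    # first len(salt) bytes of its digest.
    return [hashlib.sha256(salt * (16 + db0)).digest()[:len(salt)]
            for db0 in range(256)]

salt = b"saltsalt"
le_buffer = build_le_buffer(salt)

def lookup_temp_s(digest_b: bytes) -> bytes:
    # On the hardware, db0 (the first byte of digest B) simply addresses
    # the LE buffer; the digest S computation itself is skipped.
    return le_buffer[digest_b[0]]
```

Since there is a single salt per cracking task, this 256-entry table is computed once and reused for every password in the batch.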
Following the first observation, we reuse one buffer for multiple variables, reducing the number of bytes in the input sources. Second, a byte of the block does not have to be connected to all bytes of the input sources, which means the scale of each multiplexer can be reduced; we customize the size of each multiplexer so that only the possible candidates are connected to it. The datapath heat map shows the number of connections for each multiplexer before and after applying the datapath pruning technique: the total number of connections is reduced from about 14,000 to about 3,000. We also found that when the password length is fixed at 6 bytes, the total number of connections in the datapath drops to about 400. This fact drove us to exploit the temporal locality of the datapath and propose the spatial-temporal multiplexing technique, which is based on the reconfigurability of the FPGA. For each password length from 6 to 16 bytes, we prune the datapath and generate a bitstream file. Every time the password length changes, the FPGA is reconfigured with the corresponding bitstream. With this technique, the whole datapath is separated and distributed across different bitstreams, and the datapath overhead on each bitstream is reduced.

Here we show the complete design of the core of our accelerator, which combines the techniques proposed above. Our experimental platform is based on the Zynq ZC703 SoC. Each node of the system has an ARM Cortex-A9 CPU, a 7-series FPGA, and an SD card. The CPU is responsible for the top-level scheduling and the calculation of digest S. Based on the available hardware resources, we placed two sha256crypt accelerating cores on the FPGA, each working at 220 MHz. The SD card is used to store the bitstream files.

First, we compare our work with the FPGA-based non-pipelined implementation, the sha256crypt accelerator in John the Ripper.
For a fair comparison, we reproduced the sha256crypt accelerator from JtR on our platform. Our accelerator achieves 1.74 times the block throughput and 1.64 times the energy efficiency. The LUTs used as computing logic account for about 66% in JtR, while our work has a proportion of 88%, which means more hardware resources in our design are used for computing rather than for block generation or control logic. The resource efficiency of our accelerator is 1.69 times better than JtR's. We also compared our work with Hashcat running on an NVIDIA GTX 1080 Ti GPU. The results show that our accelerator achieves a 2.54 times improvement in energy efficiency over the GTX 1080 Ti. Although the block throughput of a single node is only 0.15 times that of the GTX 1080 Ti, when we test our design on a 16-node cluster we achieve 2.41 times the block throughput while keeping the 2.54 times energy efficiency. To validate the adaptability of the techniques proposed in this paper, we also implemented sha512crypt on the same platform and compared it with the GTX 1080 Ti; our work again shows a significant improvement in performance and energy efficiency.

Finally, let's draw the conclusion. In this paper, we proposed a hybrid CPU-FPGA-based sha256crypt accelerator. It adopts a pipeline to improve parallelism. Group schedule is used to remove the data dependency that stalls the pipeline. Look-ahead execution is used to eliminate the non-unified execution paths. Datapath pruning and spatial-temporal multiplexing are applied to reduce the resource overhead. We also found a structural weakness in the sha256crypt algorithm: the calculation of digest S can be finished in advance and then reused for all passwords. Attackers may leverage this weakness to build more efficient cracking hardware. That's all. Thanks for listening.