Hello, everyone. Welcome to this presentation. My name is Quan Quan Tan. This presentation is prepared as a pre-recorded video for Eurocrypt 2021. The title of this talk is "A Deeper Look at Machine Learning-Based Cryptanalysis". This is a joint work with Adrien Benamira, David Gerault, Thomas Peyrin, and myself. I will be presenting the first half and Adrien will be presenting the second half.

This talk is divided into 3 main parts. The first part will be the preliminaries, where we will cover the background required to understand this presentation, in particular a previous work at CRYPTO 2019 by Aron Gohr. The next part will explore the intuition behind Gohr's neural distinguishers from a cryptanalysis perspective. Lastly, we will focus on how we can mix a machine learning pipeline and traditional cryptanalysis to replace the neural distinguishers.

This is the round function of the Speck cipher. Speck is an ARX cipher and has a Feistel-like structure. The S functions over here represent the bitwise rotations. The alphas and betas are parameters that change depending on the block size. The boxed plus over here represents the modular addition modulo 2^n, where n is the word size. Since for this presentation we are only focusing on the smallest cipher in the Speck family, we will fix the block size to be 32, and therefore alpha and beta to be 7 and 2. Note that for the Speck cipher, the DDT is usually too large to be constructed. One alternative would be to approximate the DDT experimentally by keeping only the more probable differences.

At CRYPTO 2019, Aron Gohr published a paper called "Improving Attacks on Round-Reduced Speck32/64 Using Deep Learning". We will summarise the main points that are relevant to this presentation. Firstly, Gohr created a new neural network architecture and trained it to obtain neural distinguishers for various rounds of Speck. Essentially, the aim of the deep learning distinguisher is to distinguish ciphertexts that come from plaintext pairs with a fixed input difference from those that come from a random input difference. Gohr did a comparison with his pure differential distinguishers, which are essentially large DDTs that span n rounds. The neural distinguishers did a better job for all of 5 to 8 rounds. One notable result is that Gohr significantly improved the time complexity of the 11-round key recovery attack compared to the previous best, performed by Dinur at SAC 2014.

This is the structure of Gohr's neural distinguisher. The input to the distinguisher is a pair of ciphertexts encoded in binary. It passes through multiple convolutional blocks before moving on to the prediction head and eventually outputting a single value between 0 and 1. This is the score that the neural distinguisher gives the ciphertext pair. If the score is more than or equal to 0.5, the pair is considered real, that is, coming from the fixed input difference. Otherwise, it is considered random.

These are the results of Gohr's pure differential distinguishers (PDD) and neural distinguishers (ND). You can see that for each round the NDs actually perform better than the PDDs. Now, we move on to the next part. In this part we aim to explore 2 questions. The first: what type of cryptanalysis are Gohr's neural distinguishers learning? And the second: can we actually replicate the results without using the neural network, but using techniques that cryptanalysts are more familiar with?
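Before we dive in, a minimal sketch of the Speck32/64 round function described above may be useful as a reference. It assumes 16-bit words and the rotation amounts alpha = 7 and beta = 2 fixed earlier; the helper names are ours.

```python
# A minimal sketch of one Speck32/64 round, assuming 16-bit words and the
# rotation amounts alpha = 7, beta = 2 fixed in this talk.
MASK = 0xFFFF  # 16-bit words, so the modular addition is taken mod 2^16

def ror(x, r):
    """Rotate a 16-bit word right by r positions."""
    return ((x >> r) | (x << (16 - r))) & MASK

def rol(x, r):
    """Rotate a 16-bit word left by r positions."""
    return ((x << r) | (x >> (16 - r))) & MASK

def speck_round(left, right, subkey, alpha=7, beta=2):
    """ARX round: rotate, add mod 2^16, mix in the subkey, rotate, XOR."""
    left = (ror(left, alpha) + right) & MASK  # the boxed-plus addition
    left ^= subkey
    right = rol(right, beta) ^ left
    return left, right
```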
To answer the first question, we conducted multiple experiments in an attempt to reverse engineer what the neural distinguishers are looking out for. The ciphertext pairs Gohr used for his neural distinguishers come from plaintext pairs with a particular input difference having just a single active bit in the left word. While this is the best input difference for differential characteristics over 3 and 4 rounds of the Speck round function, it is not for the case of 5 rounds. For 5 rounds, the best differential characteristic starts from a different input difference. As a first try, we retrained Gohr's neural distinguisher using this 5-round-optimal input difference. The accuracy was around 76%, compared to around 93% with the original input difference. This is actually quite counter-intuitive, as we would expect the best differential characteristic to yield the same, if not better, results.

The next experiment we did was to ensure fair play between the pure differential distinguishers and the neural distinguishers. Since a DDT only assesses the difference and nothing else, the neural distinguisher should only have access to the difference as well. Therefore, we retrained the neural distinguisher with just the difference instead of the entire ciphertext pair. The accuracy in this case fell by a few percentage points. This means that while the bulk of the cases may be explained by the output difference distribution, some of them have to be explained in some other manner.

With that, we decided to do some reverse engineering based on the scores that the neural distinguisher gives. Here, we want to know whether a strong difference means everything to the neural distinguisher. We send multiple ciphertext pairs into the neural distinguisher and then partition them based on their scores. In particular, we put the ones with very high scores and the ones with very low scores into the buckets G and B respectively. Next, we computed the difference for each ciphertext pair and recorded the top 1000 most common differences. For each of these differences, we again created a set of 1000 random pairs with this particular difference and put them into the set Di. Then we sent every one of them through the neural distinguisher once again. Essentially, in this last phase we are sending a total of 1 million random pairs that come from 1000 unique differences.

Here are the results of the experiment. For each given difference, that is, if you look into each Di, about three quarters of the ciphertext pairs have a score that is more than 0.5. However, we also note that there are some exceptions where only about 38% have a score above 0.5. Intuitively, we would expect a decreasing percentage trend as we go from the most common difference to the least common difference. However, this is not what we observe. Therefore, we cannot really say that if a difference is more probable, then the neural distinguisher is more likely to recognize it.

Next, we repeated the experiment with some changes, highlighted in red over here. Basically, before we compute the difference, we first decrypt by 2 rounds using the actual key. Then we rank and split the pairs into sets based on their difference once again. Lastly, we encrypt for 2 more rounds with another random key before we actually send them for evaluation. We also did another variant where we only decrypt by 1 round over here, and therefore encrypt by 1 round over here as well.
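To make the bucketing procedure concrete, here is a small sketch of it in Python. The distinguisher is assumed to be available through a hypothetical nd_score(pair) handle returning a value in [0, 1]; the bucket thresholds and helper names are our assumptions.

```python
# Sketch of the score-bucketing experiment; nd_score is a hypothetical
# handle to the trained neural distinguisher, returning a score in [0, 1].
from collections import Counter
import random

def random_pairs_with_diff(diff, n):
    """n random 32-bit ciphertext pairs sharing the XOR difference `diff`."""
    return [(c, c ^ diff) for c in (random.getrandbits(32) for _ in range(n))]

def bucket_experiment(pairs, nd_score, top=1000, per_diff=1000):
    scored = [(p, nd_score(p)) for p in pairs]
    G = [p for p, s in scored if s >= 0.9]  # very high scores (threshold ours)
    B = [p for p, s in scored if s <= 0.1]  # very low scores (threshold ours)
    # Record the top 1000 most common differences among the high-score pairs.
    common = Counter(p[0] ^ p[1] for p in G).most_common(top)
    # For each difference, re-score 1000 fresh random pairs with that
    # difference: 1000 differences x 1000 pairs = 10^6 evaluations in total.
    rates = {}
    for diff, _count in common:
        Di = random_pairs_with_diff(diff, per_diff)
        rates[diff] = sum(nd_score(p) >= 0.5 for p in Di) / per_diff
    return G, B, rates  # fraction of pairs scored "real", per difference
```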
In both experiments, whether we decrypt by 1 or 2 rounds, we notice that almost all of the pairs have a score that is more than 0.5. And if we compare the true positive rates in both experiments, the experiment where we decrypt by 2 rounds matches the neural distinguisher's most closely. Therefore, we decided to venture further into it.

Now, we would like to find whether there are any biases at this particular line over here, which is the difference after we decrypt by 2 rounds. After evaluating the biases of the ciphertext pairs in G and B respectively, this is the plot that we obtain. We can see that there are real biases for the ciphertext pairs in G when we compare them to the ones in B. We locate the positions where G is the most biased, and we form this particular truncated differential, which we call TD3. This TD3 can also be explained when we trace the biases starting from the round 0 difference all the way until the end of round 3, as shown here. The bit positions that TD3 fixes have a low probability of having a carry propagated to them. With that, we make our first assumption, that the neural network has the ability to determine the difference of certain bits at round 3 and round 4 with very high accuracy, and we make our conjecture that the 5-round neural distinguisher is actually testing for TD3. To verify our assumption 1, we retrained the neural distinguisher with ciphertext pairs that actually satisfy TD3, and we obtained an accuracy of 96.57%; in terms of the true positive rate, it is almost 100%.

Based on what we have learned about what the neural distinguisher is detecting, we present a distinguisher with similar properties to the neural distinguisher: the average key rank distinguisher. In this distinguisher we use a DDT; however, we require the DDT to be at round n-1 for an n-round distinguisher, and we only keep some of the bit positions based on this particular mask, so it is really a masked DDT instead. Also, note that this DDT is generated using a data set of size 10^7.

The idea of the distinguisher is that we first decrypt the last round using all possible subkeys. Note that "all possible subkeys" is actually fewer than 2^16, because we are only interested in some bit positions, and those bit positions are given by the mask over here. In this particular case, we only require 2^12 different subkeys. After we have decrypted using the last-round subkey, we look up the difference at this red line over here. Then, we find the probability of this difference using the masked DDT that we have previously prepared. For each ciphertext pair, we compute the average of these probabilities. If the average probability is higher than that of a uniform distribution, the ciphertext pair comes from the real distribution. Note that the real distribution indicates that a ciphertext pair comes from a plaintext pair whose input difference is the fixed input difference that we require. Otherwise, we say that the ciphertext pair comes from the random distribution, which means that the ciphertext pair comes from a plaintext pair whose input difference is basically just random.

There are actually several considerations we took into account when constructing our average key rank distinguisher. Firstly, we want to use the same amount of data as Gohr's neural network. Gohr's neural distinguisher took 10^7 ciphertext pairs to train, and that is why, in the preparation of our masked DDT, we also use 10^7 pairs.
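As a rough illustration of this procedure, here is a sketch in Python. The handles masked_ddt, diff_mask, mask_bits, and decrypt_one_round are hypothetical names for the objects just described; ciphertexts are treated as packed 32-bit integers.

```python
# Sketch of the average key rank distinguisher; all names are ours.
from itertools import product

def candidate_subkeys(mask_bits):
    """Enumerate subkeys that differ only on the masked bit positions,
    e.g. 2^12 candidates for 12 relevant bits instead of all 2^16 subkeys."""
    for bits in product((0, 1), repeat=len(mask_bits)):
        yield sum(b << pos for b, pos in zip(bits, mask_bits))

def average_key_score(c0, c1, decrypt_one_round, masked_ddt,
                      diff_mask, mask_bits):
    """Decrypt the last round under every candidate subkey, look up the
    masked (n-1)-round difference in the masked DDT, and average."""
    probs = []
    for k in candidate_subkeys(mask_bits):
        d0, d1 = decrypt_one_round(c0, k), decrypt_one_round(c1, k)
        probs.append(masked_ddt.get((d0 ^ d1) & diff_mask, 0.0))
    return sum(probs) / len(probs)

# Decision rule: label the pair "real" (fixed input difference) if the
# averaged probability beats the uniform baseline, "random" otherwise.
def is_real(c0, c1, uniform_prob, **handles):
    return average_key_score(c0, c1, **handles) > uniform_prob
```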
Next, we would like to match the neural distinguisher's time complexity. In the case of the neural distinguisher, the deep neural network essentially takes 2^17 multiplications. For ours, we have to perform 2^13 one-round decryptions of the Speck round function, together with a total of 2^13 table lookups into the masked DDT.

These are the results of the average key rank distinguisher against the neural distinguisher from 5 to 7 rounds. In all the cases, we actually obtain results that are better than the neural distinguisher's. Another interesting thing is this degree of closeness, which measures how well both distinguishers actually agree with each other. If you look at the diagonals, they make up almost 98% of the results.

Now, we can go back to answering our main questions. Can we actually replicate the results? Yes: the degree of closeness between the two distinguishers is extremely high, which convinces us that they are actually testing for very similar properties. As for what type of cryptanalysis the neural distinguisher is learning, we are expecting something along the lines of differential-linear cryptanalysis. However, unlike traditional cryptanalysis, which relies on independence assumptions among characteristics, the neural distinguisher is able to take all of them, and all of the correlations, into consideration. For the next part, Adrien will be taking over.

Welcome to the second part of the presentation, in which we will focus on exploring the neural distinguisher from a machine learning perspective. In this part, we aim to explore two questions. The first one is: can Gohr's neural distinguisher be replaced by a strategy inspired by both differential cryptanalysis and machine learning? The second one is: can this new strategy be applied to more rounds, or to another cipher?

To answer these questions, we first need to analyze the neural distinguisher architecture. It is composed of three blocks. The first one, here, takes as input the two ciphertexts C0 and C1. As the first block is a one-layer CNN (convolutional neural network) with kernel size 1, we suppose that it just does an input transformation that we need to characterize. The second block is composed of 10 sub-blocks, each of which is a two-layer CNN with kernel size 3. This part is the hardest to explain. At the end, we have a vector F, which is the features vector, and each element of this features vector is a highly nonlinear function of the input. Finally, the last block takes as input the features F and outputs the score of the neural distinguisher. It is composed of two dense layers, also called an MLP, for multi-layer perceptron.

Our objective here is to replace each of these individual blocks by a more interpretable one, coming either from the machine learning or from the cryptanalysis point of view. We are going to start with block 1 and block 3. Block 3 can be replaced by any other ensemble classifier: for example, the MLP block can be replaced by a random forest or gradient boosting. The first block can actually be replaced by a linear combination of the input. We chose to fix our choice on Delta L, Delta V, V0, and V1, whose definitions are given above. You can formally prove this transformation by establishing the truth table of the first layer and thereby exhibiting the linear input transformation. So, what we have done so far: we managed to replace the first block by the linear input transformation, and the third block can be any ensemble machine learning classifier.
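A small sketch of this input transformation follows. The authoritative definitions of V0 and V1 are the ones given on the slide; the reading used here, Vi = ror(Li ^ Ri, beta), which undoes the key-independent linear part of the last Speck round, is our assumption.

```python
# Sketch of the linear input transformation replacing block 1. We assume
# Vi = ror(Li ^ Ri, beta); the authoritative definition is on the slide.
MASK = 0xFFFF
BETA = 2

def ror(x, r):
    """Rotate a 16-bit word right by r positions."""
    return ((x >> r) | (x << (16 - r))) & MASK

def input_transform(l0, r0, l1, r1):
    """Map a ciphertext pair (L0,R0), (L1,R1) to (Delta L, Delta V, V0, V1)."""
    v0 = ror(l0 ^ r0, BETA)
    v1 = ror(l1 ^ r1, BETA)
    return (l0 ^ l1, v0 ^ v1, v0, v1)
```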
That is why we call this pipeline a machine learning pipeline. Now our objective is to approximate the highly nonlinear vector F from a cryptanalysis point of view. We are going to take the time to explain how we managed to replace block 2.

The first interesting experiment, which we have already seen, is that if the input of the neural distinguisher is only the difference between C0 and C1, like for the DDT, then the neural distinguisher performs like the pure differential distinguisher. Therefore, our first assumption is that block 2 is able to compute an approximation of the DDT, but with a different input: instead of taking the difference between C0 and C1, it takes Delta L, Delta V, V0, and V1. But there is a limit. Because this new input is 64 bits and there are only 10^7 training samples, such a table is not tractable. Moreover, the neural distinguisher is actually pretty small, only about 100 KB. Therefore, we think that the neural distinguisher is able to compress the DDT.

Now, we introduce a mask M with a low Hamming weight. Instead of taking the full input Delta L, Delta V, V0, and V1, we apply the mask to the input and then we compute the distribution table on this masked input, which gives us the probability of being real given this masked input. The second limit is that if we have only one mask, we of course obtain only one probability, whereas the features vector F that we want to approximate has more than one component. Therefore, we need multiple masks M. We then obtain F-tilde, and F-tilde is actually a good approximation of F. How do we get these masks? We find them experimentally, by trying many very different masks and keeping the ones that perform best. With that, we obtain our final construction: to replace block 2, that is, sub-blocks 2.1 to 2.10, we apply the selected masks to the transformed input and look up the corresponding distribution tables. We call this construction the M-ODT, for Masked Output Distribution Table. We tried this approach, and it works.

So, finally, what do we have? We have seen that block 1 applies a linear transformation to the input, block 3 can be replaced by any ensemble classifier, and the 10 sub-blocks, which are actually 20 convolutional layers, can be replaced by the M-ODT. We tried replacing these blocks one at a time. If we replace block 1 with the linear input transformation, we lose only 0.3% of accuracy and stay very close to the neural distinguisher; and if we replace block 1 and block 2 with the M-ODT, we actually get 92.3%, which is very close to the neural distinguisher's overall accuracy. So, the M-ODT approximates the features vector well. Here you can see the comparison with the neural distinguisher for 5 and 6 rounds: we are very close to the neural distinguisher, and clearly above the pure differential distinguisher.
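As an illustration of the full pipeline, here is a sketch of the M-ODT feature extraction feeding an ensemble classifier. The packing order, the default probability for unseen entries, and all names (tables, masks, clf) are our assumptions.

```python
# Sketch of the M-ODT feature extraction; `tables[m]` is assumed to map a
# masked 64-bit input to the empirical probability of the "real" class,
# estimated from the 10^7 training samples. All names/choices are ours.

def pack(dl, dv, v0, v1):
    """Pack the four 16-bit words (Delta L, Delta V, V0, V1) into 64 bits."""
    return (dl << 48) | (dv << 32) | (v0 << 16) | v1

def modt_features(dl, dv, v0, v1, masks, tables):
    """One probability per low-Hamming-weight mask: the vector F-tilde."""
    x = pack(dl, dv, v0, v1)
    return [tables[m].get(x & m, 0.5) for m in masks]  # 0.5 if never observed

# The F-tilde vectors then train any ensemble classifier (block 3), e.g.:
#   clf.fit([modt_features(*input_transform(*p), masks, tables) for p in pairs],
#           labels)
```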
In Table 3, we now measure the closeness between the neural distinguisher and our pipeline. Looking at how the two behave on the same ciphertext pairs, they give the same answer in 75.5% of the cases considered, which is actually very high. For the remaining cases, you can see in the table how often the two distinguishers give different answers, in each direction.

Now, for the conclusion, we can answer the first question: can Gohr's neural distinguisher be replaced by a strategy inspired by both differential cryptanalysis and machine learning? Yes: we replaced the neural distinguisher with a fully interpretable pipeline whose accuracy is very close to that of the neural distinguisher. And can this strategy be applied to more rounds, or to another cipher? We have shown that our approach extends to more rounds and can be adapted to another cipher. Thank you very much for your attention.