Hello everyone, I'm Tian Yunpeng from Tsinghua University. First I will introduce our team; here is the list of our team members. Okay, go on. I will first introduce some work of our team: we took first place in all three tracks of last year's NIPS adversarial attacks and defenses competition, and NIPS is a top conference in machine learning. We have published three papers about adversarial examples at top conferences in machine learning and computer vision. Besides, we also have a paper under review at NIPS this year, and all three initial review scores are accept. Here is a list of our publications, and today I will introduce these two works.

The first work is "Boosting Adversarial Attacks with Momentum". This is the attack method we used in last year's NIPS competition. Adversarial examples are maliciously generated examples that fool machine learning models while remaining very similar to the original examples. I will briefly introduce some prior attack methods and then discuss their advantages and disadvantages. Finding a non-targeted adversarial example can be formulated as an optimization problem: we maximize the loss function of the adversarial example with respect to its true label, subject to a constraint on the maximum distance between the adversarial and original examples. The most famous method, proposed by Goodfellow, is the fast gradient sign method (FGSM): it calculates the gradient of the loss function with respect to the input and adds the sign of the gradient to the input. The iterative variant of FGSM applies the sign of the gradient multiple times with a small step size. Optimization-based methods directly optimize the distance between the adversarial example and the original example minus the loss. Adversarial examples have demonstrated good transferability across models; that is, adversarial examples generated for one model can also fool another model.
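The two gradient-based attacks just mentioned can be sketched as follows. This is a minimal NumPy illustration, not the paper's code; `grad_fn` is a hypothetical callable returning the input gradient of the loss, which in a real attack would come from a deep learning framework's autograd:

```python
import numpy as np

def fgsm(x, grad_fn, eps):
    """One-step FGSM: move by eps in the sign of the input gradient."""
    return x + eps * np.sign(grad_fn(x))

def iterative_fgsm(x, grad_fn, eps, num_iter=10):
    """Iterative FGSM: many small sign steps, clipped to the eps-ball."""
    alpha = eps / num_iter                        # step size per iteration
    x_adv = x.copy()
    for _ in range(num_iter):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # L-infinity constraint
    return x_adv
```

Both versions keep the perturbation inside the required maximum-distance (here L-infinity) ball around the original input.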
The transferability of adversarial examples enables black-box attacks, which may cause real security issues in real-world applications. Adversarial examples also have cross-data transferability, which gives universal perturbations. However, there are some limitations for practical black-box attacks. We found that existing attack methods cannot attack a black-box model efficiently. FGSM is fast and generates adversarial examples with good transferability based on the linearity assumption, but that assumption may not hold for large distortions, and it is not powerful enough even against a white-box model, so the success rate of black-box attacks is low. The iterative method has low transferability, because it greedily moves the adversarial example in the gradient direction to maximize the loss function, which may overfit to the model's decision boundary. So the trade-off between transferability and attack ability makes black-box attacks less effective.

Here we have an experiment. We attack Inception v3 with different numbers of iterations, and we measure the success rates against the white-box Inception v3 and the black-box Inception v4, Inception-ResNet v2, and ResNet-152. We can see that the success rate decreases when increasing the number of iterations for black-box attacks. Another way to attack a black-box model is to build a substitute network to characterize the behavior of the black-box model. But this method requires full prediction confidences and a tremendous number of queries, so it is hard to deploy against models trained on large-scale datasets. And we could not use this method in the competition, because we were not allowed to query the defense submissions. So our solution is to alleviate the trade-off between transferability and attack ability. Recall that finding an adversarial example can be regarded as a constrained optimization problem, so we can apply some useful techniques from optimization to the adversarial setting.
The momentum method is widely adopted to accelerate gradient descent and helps to escape from poor local optima, and in SGD it also helps to stabilize the update direction. So we add a momentum term to the iterative method, and we propose the momentum iterative fast gradient sign method (MI-FGSM). This algorithm is very simple. We still calculate the gradient of the loss function with respect to the input in each iteration, but we accumulate a velocity vector g_t in the gradient direction across iterations. In each iteration, we apply the sign of this vector to the adversarial example. μ is the decay factor that controls how much we trust the previous gradients, and the current gradient is normalized by its L1 norm, because we noticed that the scale of the gradients in different iterations varies in magnitude.

Here are some results. We attack Inception v3 and test against a lot of models. MI-FGSM, which is our method, can attack a white-box model with nearly 100% success rate, just like iterative FGSM, but fools black-box models with much higher success rates. We also study the effect of the number of iterations on black-box attacks: we again attack Inception v3 and measure the success rates on several black-box models. We can see that the success rates for attacking a black-box model do not decrease when increasing the number of iterations. So our method can, in some sense, alleviate the trade-off between attack ability and transferability.

Another thing that is crucial for obtaining good results is to attack an ensemble of models. The main assumption is that if an adversarial example remains adversarial for multiple models, it is more likely to be misclassified by other black-box models. We propose to attack an ensemble of models whose logits are fused together. We also compare with other ensemble schemes, such as attacking an ensemble of models whose predictions or losses are fused. We show the results of attacking ensemble models.
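The MI-FGSM update and the logit-fusion ensemble described above can be sketched like this. Again a hedged NumPy sketch with a hypothetical `grad_fn`, not the competition code:

```python
import numpy as np

def mi_fgsm(x, grad_fn, eps, num_iter=10, mu=1.0):
    """Momentum iterative FGSM (MI-FGSM) sketch.

    g accumulates L1-normalized gradients across iterations;
    mu is the decay factor weighting the previously accumulated gradient.
    """
    alpha = eps / num_iter                 # step size per iteration
    g = np.zeros_like(x)                   # accumulated gradient (velocity)
    x_adv = x.copy()
    for _ in range(num_iter):
        grad = grad_fn(x_adv)
        # normalize by the L1 norm: gradient scales vary across iterations
        g = mu * g + grad / max(np.sum(np.abs(grad)), 1e-12)
        x_adv = x_adv + alpha * np.sign(g)
        x_adv = np.clip(x_adv, x - eps, x + eps)   # stay inside the eps-ball
    return x_adv

def fuse_logits(logits_list, weights=None):
    """Ensemble in logits: weighted sum of the member models' logits."""
    if weights is None:
        weights = [1.0 / len(logits_list)] * len(logits_list)
    return sum(w * l for w, l in zip(weights, logits_list))
```

To attack an ensemble, `grad_fn` would differentiate the loss computed on the fused logits rather than on a single model's output.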
Ensembling in logits consistently outperforms ensembling in predictions and ensembling in losses. And by applying the momentum iterative fast gradient sign method to an ensemble of models, we get very high success rates for black-box attacks. For example, when attacking an ensemble of Inception v4, Inception-ResNet v2, and ResNet-152, the generated adversarial examples can fool Inception v3 with a very high success rate.

That finishes the first work. The second work is joint work with Chao Du, that's him, and Professor Jun Zhu. In this work, we propose a new network architecture which can return robust predictions in the adversarial setting. We name the new network the Max-Mahalanobis linear discriminant analysis network, abbreviated as the MM-LDA network. This work was just published at ICML.

First, our motivation. Our first motivation is that almost all popular networks suffer from adversarial attacks, where human-imperceptible perturbations can mislead these high-accuracy networks. Our second motivation: note that a typical feedforward deep net consists of a nonlinear transformation part that maps the input to a hidden feature, and a linear classifier part acting on the hidden feature to give the output. However, most work focuses on designing powerful nonlinear transformation parts, like VGG and ResNet and so on. By contrast, the linear classifier part is under-explored; it is almost always, by default, softmax regression. Therefore, our goal is to design a new network architecture for better performance in the adversarial setting. To achieve this, we decided to substitute a new linear classifier part for softmax regression. Our method comes from two inspirations.
The first inspiration comes from Efron, who showed that if the input is distributed as a mixture of Gaussians, then linear discriminant analysis, abbreviated to LDA, is more efficient than logistic regression. Here, more efficient means that LDA needs less training data than LR to obtain a certain error rate. However, in practice, data points hardly distribute as a mixture of Gaussians in the input space. To solve this, we get a second inspiration from the fact that neural networks are powerful. Prior work on deep generative models has demonstrated that a deep net can learn to transform a simple distribution, for example a mixture of Gaussians or a single Gaussian, into a complex data distribution. So we think the reverse direction should also be feasible, and this is actually what our method does in its nonlinear transformation part. Based on the above analysis, in our method we model the feature distribution as a mixture of Gaussians and then apply LDA on the features to make predictions.

Now a naturally raised question is how to treat the Gaussian parameters. Wan et al. also model the feature distribution as a mixture of Gaussians; however, they treat the Gaussian parameters as extra trainable variables. By contrast, we treat them as hyperparameters calculated by our algorithm, which can provide a theoretical guarantee on the robustness. The induced mixture of Gaussians model is named the Max-Mahalanobis distribution, abbreviated to MMD. Intuitively, the MMD makes the minimal Mahalanobis distance between any two Gaussian components maximal, so samples in different classes are separated the most when the distribution is an MMD. For example, when there are three classes, the Gaussian means of the MMD are the three vertices of an equilateral triangle.

And how does the MMD relate to robustness? Here we first give the definition of robustness. The robustness on a point with label i is defined as the minimal local distance from that point to an example with a different label.
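One simple way to realize such maximally separated Gaussian means is to place them at the vertices of a regular simplex centered at the origin, which gives equal pairwise distances for a fixed mean norm; with three classes this reduces to the equilateral triangle mentioned above. This is my own sketch of the construction, not necessarily the paper's exact algorithm:

```python
import numpy as np

def mm_centers(num_classes, dim, radius=1.0):
    """Class means at the vertices of a centered regular simplex.

    All means have norm `radius` and all pairwise distances are equal,
    so the minimal inter-class distance is as large as possible for
    that norm. Requires dim >= num_classes - 1, since a regular simplex
    on L points spans L-1 dimensions.
    """
    L = num_classes
    assert dim >= L - 1
    E = np.eye(L)
    centered = E - E.mean(axis=0)          # center the vertices at the origin
    centered /= np.linalg.norm(centered, axis=1, keepdims=True)
    means = np.zeros((L, dim))
    means[:, :L] = radius * centered       # embed into the feature space
    return means
```

These means would then be fixed as hyperparameters of the feature distribution rather than trained.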
And we further define the robustness of the classifier as the minimal expected local distance. Then we derive the relation between the expected local distance and the Gaussian parameters. Here Δ_ij is the Mahalanobis distance between the two Gaussian components with labels i and j in the mixture of Gaussians. Further, we find that the robustness can be approximately represented in this simple form. Recalling the property of the MMD, we can conclude that if the feature distributes as an MMD, then the approximate robustness is maximized.

Finally, our experimental results. We first test our method on normal examples; our method achieves comparable performance with typical networks. Note that here we did not specially fine-tune any hyperparameter for our method. Here we show the t-SNE visualization on the test set of CIFAR-10. We can find that the features learned by our method are more orderly distributed in the feature space. Here we show the accuracy under iterative attacks; we test under three different values of perturbation. The results show that our method can largely improve accuracy compared to adversarial training baselines, with much less computational cost. We also test the optimization-based C&W attack. The results show that the C&W attack has to add a much larger perturbation to successfully attack our method. Here the first row is the normal examples, the second row is the adversarial noises crafted on the traditional SR networks, where SR means softmax regression, and the third row is the noises crafted on our method. We can find that our method learns more robust features, such that the optimal attacking strategy the C&W method finds is to weaken the pixels of the normal examples as a whole, rather than adding meaningless noises as for SR networks. Besides, our method has better performance on class-imbalanced datasets, which may be helpful in other areas like fairness.
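For reference, the Mahalanobis distance Δ_ij between two Gaussian components with means μ_i, μ_j and a shared covariance Σ can be computed as follows (a minimal sketch; the shared-covariance case is the one relevant to LDA):

```python
import numpy as np

def mahalanobis(mu_i, mu_j, sigma):
    """Mahalanobis distance between two Gaussians with shared covariance.

    Computes sqrt((mu_i - mu_j)^T Sigma^{-1} (mu_i - mu_j)); with
    Sigma = I this reduces to the Euclidean distance between the means.
    """
    d = np.asarray(mu_i, dtype=float) - np.asarray(mu_j, dtype=float)
    # solve instead of explicitly inverting sigma, for numerical stability
    return float(np.sqrt(d @ np.linalg.solve(sigma, d)))
```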
And finally, in conclusion, our method doesn't introduce extra computational cost. It can largely improve robustness with no loss of accuracy on normal examples. Our method is quite easy to implement, with only a few lines of code. And finally, it is compatible with nearly all popular networks. Thank you. So, any questions?

You know the Gaussian distribution, right? It has a covariance matrix, and it has a mean. The Mahalanobis distance, you can actually look it up on Wikipedia, it has a wiki page. It's a kind of distance, like Euclidean distance or something like that, that measures the similarity between two Gaussian distributions. So, any other questions?

No, no, no. Actually, we want to use this, but maybe this is still a secret, because we need to attend another competition, but I can say something about it. It's not very scalable. I mean, if you want to scale it to a very large dataset like ImageNet, you need a lot of tricks for fine-tuning the training hyperparameters, like the learning rate or weight decay, and we are not very good at that. So we only test it on small or middle-scale datasets like CIFAR-10 or MNIST, because we do not need to fine-tune any hyperparameters for our method; it can get state-of-the-art performance there. Yes. So, any other questions? Okay, thank you.