Many thanks to Rohan and Amok for your excellent presentation, and now let's welcome Weilin Xu from Baidu X-Lab. He will show us some magic tricks for self-driving cars. Let's welcome him.

Thanks for the introduction. Good morning everyone, and thanks for attending our talk. My name is Weilin Xu. I'm a PhD student in computer science at the University of Virginia, and I'm currently an intern researcher at Baidu X-Lab. Today I'm not here to defend Baidu's Deep Speech against the Berkeley attack; I'm going to show you some interesting magic tricks for self-driving cars. This is joint work with my colleagues at Baidu X-Lab, Dr. Zhenyu Zhong and Dr. Yunhan Jia. Sorry, is this working? I'll just use the keyboard.

Before my presentation, I would like to make several clarifications. First, this is just a proof of concept. Second, we are not targeting any autonomous-vehicle vendor; instead, our target is the general computer vision technique that could be used on many self-driving cars. So we are not going to make any big news or cause a PR crisis for anybody. That said, I believe our attack is very practical and has real implications for self-driving cars. We ask you not to reproduce our magic tricks against your neighbor's self-driving car, and we are not responsible for any good or bad consequences.

First of all, let me briefly introduce a typical autonomous-vehicle framework. A self-driving car does not necessarily look very different from the cars we drive every day. It may have the same cabin, the same wheels, and the same powertrain. What it has extra are sensors and actuators, as well as a "brain" that consists of three major components: the perception module, the prediction module, and the planning module. To finish the driving task, a self-driving car requires sensors such as radar, LiDAR, or cameras to perceive the surroundings on the road. For example, it needs to recognize the drivable areas, and it needs to recognize the other objects on the road, such as other cars or pedestrians. The perception module recognizes those objects, but it only knows the situation at that moment. That's why we need the prediction module: with it, we can predict the locations of those objects over the next few seconds. The planning module can then make a plan to drive the vehicle smoothly, avoid the obstacles, and reach the destination.

In this work, we focus on camera-based perception. It should work like this: from the camera input, the object detection model should recognize the size and the location of objects such as cars. Those objects could be very close to the camera or very far from it, and the detection model should recognize all the important ones to keep the driving smooth.

Now we want to show you some magic. Here we have the DEF CON flag; I think everyone here recognizes it. We put it into our scene, here simply on the floor, and you can see that the target perception module recognizes the DEF CON flag as a car with very high confidence. We can change the viewpoint of the camera, and the prediction is still very confident. Let me introduce how we implemented these attacks.

Okay, so this is our target: the YOLO v3 model, which is very famous in computer vision.
I believe some self-driving cars use similar architectures, even though they use different training sets to train the models on their cars. This model takes an input of 416 by 416 pixels in three channels, the RGB channels, so it's a color image input. The YOLO v3 model is huge: it has 147 trainable layers and 62 million trainable parameters in total, so it's a very large and complicated model. The YOLO v3 model can output 3,549 bounding boxes; of course, we only show the bounding boxes with high confidence here, such as the stop sign in the corner. The target model we use here was trained on the MS COCO dataset, so it has 80 classes in total, but we only focus on the important ones, such as person, car, truck, bus, bicycle, and motorcycle, because those are more relevant to self-driving cars.

In order to attack the YOLO v3 model, we need to understand how the model does inference. For an input image, the YOLO model first splits it into grids. The YOLO v3 model actually uses three different grids: 13 by 13, 26 by 26, and 52 by 52. Here we will use the 13 by 13 grid as a running example because it's more visible on slides. For any unit on this grid, the model gives predictions about objects whose center points fall within that yellow unit. These predictions are not made from scratch; they need references, and the technique YOLO v3 uses for this is called anchor boxes. An anchor box is like a reference shape for a specific prediction of the object detector. For each unit of the grid, and for each anchor box, we get a prediction about one specific object, and it is a very long vector. Let me explain how we should interpret those vectors.

The first four scalars describe the bounding box of the detected object. First we determine the center point of the object. We use the location of the yellow unit, here (11, 2) on the grid, and the outputs tx and ty to calculate the center point; in this example, the center point is the green point. Next we use the third and fourth outputs to calculate the object's size. It is calculated relative to one of the anchor boxes, the first one, with pw and ph, the width and height of the anchor box. So here we get that the detected object should have this size. That gives us the location and size of the bounding box, but at this point we don't yet know which object class the prediction is. For that we have 80 more scalars, the class predictions. YOLO v3 uses 80 independent sigmoid outputs to do this class prediction, instead of the more commonly used softmax function. Here we can tell that the stop sign should have the highest probability if the model is a good one. But there are also cases where a bounding box contains no object at all, so the YOLO v3 model has an output called objectness. It is also a sigmoid output: a higher objectness means a higher confidence that there is an object in this bounding box. We multiply the objectness with the class probabilities and get the final confidence for each bounding box. Here the stop sign has the highest confidence, so we know that the YOLO v3 model predicts a stop sign at this location. That's the YOLO v3 model.
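To make that decoding concrete, here is a minimal NumPy sketch of how one raw YOLO v3 prediction vector can be decoded, assuming the (4 + 1 + 80)-scalar layout just described; the function and argument names are illustrative, not the actual Darknet or Baidu code.

```python
# A minimal sketch of decoding one YOLO v3 prediction vector; names and
# values are illustrative assumptions, not the original implementation.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_prediction(raw, cx, cy, anchor_w, anchor_h,
                      grid_size=13, img_size=416):
    """Decode one (4 + 1 + 80)-dimensional raw output vector.

    raw[0:4]  -> tx, ty, tw, th (bounding box)
    raw[4]    -> objectness logit
    raw[5:85] -> 80 class logits (one sigmoid per class, no softmax)
    """
    stride = img_size / grid_size            # pixels per grid cell
    tx, ty, tw, th = raw[0:4]

    # Center point: sigmoid keeps the offset inside the (cx, cy) cell.
    bx = (cx + sigmoid(tx)) * stride
    by = (cy + sigmoid(ty)) * stride

    # Size: the anchor box acts as a reference, scaled exponentially.
    bw = anchor_w * np.exp(tw)
    bh = anchor_h * np.exp(th)

    # Final confidence = objectness * per-class sigmoid score.
    objectness = sigmoid(raw[4])
    class_scores = objectness * sigmoid(raw[5:85])
    best_class = int(np.argmax(class_scores))

    return (bx, by, bw, bh), best_class, class_scores[best_class]
```

The multiplication at the end is the objectness-times-class-probability step that produces the final confidences shown on the slides.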
Now let me introduce our threat model. I think it's different from much adversarial machine learning work, because we don't assume that the perturbation we add is invisible to human eyes. In our threat model, we can put an image patch on the surface of any object in the scene. In this case, we put it on the floor because it's more visible to our camera. Of course, the patch carries perturbations calculated by our algorithm, and the model interprets it very differently from how human vision does.

Let me introduce the details of this attack. First, we need to implement an input-construction pipeline, because we need the algorithm to find the input that drives our objective function to the value we want. We first resize the patch to a specific size so that it fits the location where we want to put it. Then we apply a perspective transformation so that we get a correct viewpoint from the camera. Then we remove the masked pixels of the scene picture and paste the image patch into that location. That's how we construct the input, and the whole pipeline is differentiable, so in our attack we can directly calculate the image we need for a successful attack.

The second thing is the objectives. For different attacks, we should define different objective functions. The first example is the object-creation attack I just showed you in the demo. There are many ways to design an objective function for a specific goal. First, we can do it the easy way: we just want more of a certain object anywhere on the whole image. I'm not going to show you any equations here; I think you have seen plenty these days. I'll just show you some pseudocode. Here, we want the model to predict more car objects from the input, so we take the index of the car class and slice that column out of the class-probability matrix. We use this line of code to tell the algorithm that we want to maximize the probability of the car class for every grid unit. Remember that YOLO v3 also has the objectness output, so we add that to our loss function as well. Then we sum up the two losses, and that's our final loss. This is very easy to implement, but it can be difficult to optimize because there are so many outputs, and the result might not be very good: it predicts two cars from the perturbation, but it doesn't look intuitively correct.

So we should refine the objective function. Because we know how the YOLO v3 model computes its predictions, we can simply run that computation in reverse: we know the exact location where we want the model to predict an object, so we can calculate by hand the target prediction vector for the YOLO v3 model, and then use the mean squared error to push the model's prediction on our input as close as possible to that calculated target. This produces a result like this, which we prefer.

For other attacks, for example object vanishing, we also have many ways to design the objective function, and this is just a very simple version. Suppose somebody doesn't want others to recognize his car. He might exploit the law in California and lease a new car every six months, so he never has to put a license plate on it. If he wants to be more aggressive, he can even put a specialized license plate on it so that other self-driving cars won't recognize it as a car at all. The code is also very simple, because this is the opposite objective: we take the car index as before, remove the minus sign from the two loss variables, and that's the loss function.
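As a rough illustration of these objectives, here is a hedged PyTorch sketch of the three loss designs just described: the naive "more cars everywhere" loss, its sign-flipped vanishing counterpart, and the refined MSE-to-target version. The tensor shapes and the car class index are assumptions for illustration, not the code shown in the talk.

```python
# A hedged sketch of the objective functions described above; shapes and
# the class index are assumptions, not the talk's actual implementation.
import torch
import torch.nn.functional as F

CAR = 2  # index of the "car" class in MS COCO (assumed for illustration)

def appear_loss(class_probs, objectness):
    """Naive object-creation objective.

    class_probs: [num_boxes, 80] sigmoid class probabilities
    objectness:  [num_boxes]     sigmoid objectness scores
    Minimizing this maximizes car probability and objectness everywhere.
    """
    return -(class_probs[:, CAR].sum() + objectness.sum())

def vanish_loss(class_probs, objectness):
    # Object vanishing just removes the minus sign: the optimizer now
    # drives car confidence and objectness down instead of up.
    return class_probs[:, CAR].sum() + objectness.sum()

def targeted_loss(pred_vec, target_vec):
    # Refined objective: run the decoding in reverse to build target_vec
    # by hand, then pull the model's raw prediction toward it.
    return F.mse_loss(pred_vec, target_vec)
```

Because YOLO v3 uses an independent sigmoid per class, slicing out a single class column like this is all it takes to retarget the attack at a different class.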
The result looks like this: the model couldn't recognize the object as a car with that license plate. Another interesting magic trick is transformation: we can make one object class transform into another. For example, we can make a car look like a train to the YOLO v3 model; we just put a different class's probability into the loss function, and then we get a result like this. So it's a different license plate, and it's like a transformer caught in mid-transformation, because the object looks like both a train and a car to the model. It's a similar process.

So we have discussed how we construct the input and how we design the objective function; the next question is how to actually find such an input, which requires some optimization techniques. We found two tricks to be very effective in our attack; both were introduced in the excellent paper by Nicholas Carlini. The first trick is the change of variable. In pixel space we have an interval constraint: pixel values are normalized to between zero and one, and if your computed input value were negative, it would not be a valid pixel in the physical world, so you could not realize the attack physically. This interval constraint is very important for realizing a physical attack. We use the change-of-variable trick to convert the input into tanh space, so the interval constraint is encoded in the parameterization itself, and then we can use many off-the-shelf optimizers such as Adam to do the optimization. The second useful trick is to optimize against the logits instead of the model output. In Carlini's paper, they found that skipping the squashing functions at the last layer, like the sigmoid or softmax, avoids vanishing gradients and helps get better results.

So far we have methods to generate a successful digital attack, but that doesn't mean we can do it in the physical world, because image sensing is not an identity function: you need to print out the image patch, and you need a camera to take a picture of the scene before you get an input for the YOLO v3 model. Printers and cameras have several weaknesses. For example, they have very limited resolution, so even if your calculation says a specific pixel should have a specific value, you may not be able to achieve that after you print the patch out and photograph it with a camera. The printer and the camera can also introduce color distortions, and both add random noise. We need to consider all these factors to realize the physical attack. We found several techniques first introduced by researchers at CMU to be very useful here. For the limited resolution, they introduced a regularization term that smooths the patch, the total variation regularization. For the color distortions, they did manual color management and designed a non-printability loss to encode those color constraints in the optimization. And since in the physical world we might not be able to put the image patch at the exact location that matches the pixels in the digital image, we use one more trick: in every iteration of the optimization, we apply some random transformations, so the generated patch becomes robust to small movements of the image patch.
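Here is a minimal PyTorch sketch, under stated assumptions, of how these tricks fit together: the tanh change of variable, a total variation penalty, a random placement per iteration, and a loss defined on logits. `model_logits`, `attack_loss`, and `paste_with_random_transform` are hypothetical stand-ins, not the actual attack pipeline.

```python
# A minimal sketch of the optimization loop, assuming hypothetical
# stand-ins for the model and objective; not the authors' actual code.
import torch

def total_variation(p):
    # Penalize neighboring-pixel differences so the patch stays smooth
    # enough to survive the printer's and camera's limited resolution.
    return ((p[:, 1:, :] - p[:, :-1, :]).abs().sum()
            + (p[:, :, 1:] - p[:, :, :-1]).abs().sum())

def paste_with_random_transform(scene, patch):
    # Stand-in for the differentiable paste step: place the patch at a
    # random offset each iteration (the real pipeline also resizes and
    # applies a perspective transform).
    y = int(torch.randint(0, scene.shape[1] - patch.shape[1], (1,)))
    x = int(torch.randint(0, scene.shape[2] - patch.shape[2], (1,)))
    out = scene.clone()
    out[:, y:y + patch.shape[1], x:x + patch.shape[2]] = patch
    return out

def model_logits(image):
    # Hypothetical stand-in for a YOLO v3 forward pass that stops
    # before the final sigmoids (Carlini's logit trick).
    return image.sum()

def attack_loss(logits):
    # Hypothetical stand-in for one of the objectives sketched earlier.
    return -logits

scene = torch.rand(3, 416, 416)                    # background image
w = torch.randn(3, 100, 100, requires_grad=True)   # unconstrained variable
opt = torch.optim.Adam([w], lr=0.01)

for _ in range(10):
    # Change of variable: tanh maps w into (-1, 1); rescaling to (0, 1)
    # bakes the valid-pixel interval constraint into the parameterization.
    patch = (torch.tanh(w) + 1) / 2
    composite = paste_with_random_transform(scene, patch)
    loss = attack_loss(model_logits(composite)) + 0.1 * total_variation(patch)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The non-printability loss mentioned above would simply be added as one more term in the same sum; it is omitted here for brevity.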
Okay, so here comes the conclusion. We have shown that magicians can fool object detection models, and so can attackers. So we should be very cautious about self-driving cars that rely on computer vision. Thank you for listening to my talk. I can take any questions if you have some.

Okay, so we were not allowed to show the logo there. If you want, we can sell that location to you and you can put your name there.

Okay, I think that's part of the YOLO v3 model itself, so it's not really about the attack; it's just the way to interpret the results of the YOLO v3 model. To attack the model, we have to follow its design pipeline.

Okay, so what we show on the screen here should probably be called a semi-physical attack, because we display the pictures on a screen, but we use an actual camera to take the picture, and we change the viewpoint of the camera in the video you saw. It's a screen recording from my iPhone. Yes, I run the YOLO v3 model on my mobile phone, and I use the phone's camera to take the video.

It depends, because this is a proof of concept. I think there are many other factors that could change the result of this attack.

Okay, good question. I think this depends on the model, because different models have different input sizes. If the input size is larger, a smaller patch would probably still be effective, because it would cover more distinct pixels in your input.

Okay, David. Reducing the objectness output, or reducing the specific class probability? You can do both. Which one is better is a design choice: try both and find which definition works best for your optimizer, because the loss is a non-convex function and no optimizer can always find the best solution. You should just try.

This is a great question. We only tried the white-box attack here, but we tested these outputs against other models on my mobile phone and found they are still effective, probably because the models are similar enough to reproduce the attack.

Okay, so for this specific example, because we didn't add large movement steps in each iteration, it might have limited robustness to viewpoint changes. But if you really want that property, you can add a larger moving distance in each iteration, and the output should be more robust to that movement.

That gentleman first. Here we use the perspective transform. I think it's a more powerful way to represent this transformation, and the affine transform is a subset of it.

Okay, so this is an offline attack; what we are showing here is not an online attack. The patch is optimized to be recognized as a car by our objective function, but we didn't limit the amount by which we can change the pixel values in our attack. For this specific example, we just ran 10 iterations on my MacBook Pro CPU. We attack a pretrained model, and the 10 iterations here are attack iterations: we take the gradient with respect to the input 10 times and update it. Update the input, yes, the image.

Okay, great question. We have tried printing the image patch out and putting it on another image. That way we didn't take the video of a screen; it's actual paper. We printed it out, and we still observed similar results: it would be recognized as a car sometimes, but the success rate is not as high as this one, because for that one we didn't use the non-printability loss to remove the printing distortions.

Okay, do we have more questions? If not, thanks for attending my talk. Thank you.