And let me welcome our guest today, who is Adrien Gaidon. He studied mathematics and computer science in Grenoble and then gained his PhD from INRIA and Microsoft Research in Paris in 2012. Today, he is the head of machine learning research at the Toyota Research Institute, an institution dedicated to exploring the potential of artificial intelligence for autonomous driving and robotics. His work revolves around applying machine learning to robot autonomy, which, among other things, involves building a robust understanding of the robot's environment. Today, Adrien will talk about exactly that problem: extracting the maximum amount of information from sensor data and then using those data to understand the world around the robot, which at the end of the day is not that different from what we try to do in science. And with that, Adrien, please, the stage is yours. Awesome. Thank you, Philip. Thank you, Peter, for the invitation. It's my pleasure to be here. I have a soft spot for Oxford because I remember during my PhD, I was invited by Andrew Zisserman to spend a couple of weeks there and it was just an amazing experience. It really helped me a lot during my PhD. So it's a pleasure to give back a little bit and share the progress we've made at TRI. We do a lot of stuff across the board, as Philip mentioned, on robotics and autonomous driving, but I thought I would talk mostly about the computer vision work we've done. At the end, if you have questions about the rest or about this, feel free to just ask. So the main research thrust we have in the team, beyond other areas like prediction, planning, et cetera, has been computer vision and in particular self-supervised 3D vision. And so we've done a lot of research covering this big problem that Philip mentioned about understanding the world and making sense of it for a robot. That covers scene understanding, but it also covers behavior modeling and prediction, and learning for planning and control. One of the key unique aspects of our approach at TRI is that we're not going for very cost-intensive manual supervision, because we don't think that's going to work at scale. We're looking at self-supervised learning and other ways to scalably supervise robots and self-driving cars, including simulation and auto-labeling. And we've been quite productive at this, with a lot of papers in recent years, and some of this technology is actually used in production today. So one provocative way of thinking about our research agenda, in the trend of machine learning papers whose titles end with "is all you need", is: one eye is all you need. Of course, that's not true, you need more, but it's an interesting scientific question. Can we learn robot perception from raw videos only? Can we get robust 3D depth estimation from a single monocular camera in particular? That is something that has been exciting us scientifically and has had some positive impact in applications down the line. So just as a reminder, you've maybe heard the term pseudo-LiDAR. It's the same thing as monocular depth estimation, which is about taking a single RGB image and outputting a depth map: per-pixel range estimates. How far is the object that was captured in that pixel? So again, naturally the question, in the age of rich sensor suites, stereo, LiDAR, you name it, is why monocular depth, and in particular from the angle of driving and robotics.
So first and foremost, there's something obvious, which is that the camera is the lingua franca, if you pardon my French. It's the universal sensor, it's everywhere. It's the cheapest sensor you can buy. It's common in robots and cars but also in your mobile phones. The second is that in complex sensor suites like the ones you'll see today, which are common in self-driving cars for instance, you often have, for cost minimization reasons, a wide baseline. We don't litter the car with a bazillion cameras. We try to minimize and optimize the sensor placement, again in order to reduce costs so that you can commercialize products at a very large scale. As a reminder, Toyota sells 10 million cars every year. So at this scale, you're thinking about costs. The other important thing in robotics is redundancy, obviously. No sensor is perfect, not even cameras, not even LiDAR. And so having multiple different sensors that are capable of the same features, the same functionalities, enables you to get robustness from redundancy. And that's related to the Byzantine generals problem, et cetera. And now, why depth in particular? It's a legitimate question, because we care about 3D objects, right? Like detecting where the objects are. So do we care about the range underlying every single pixel? And the answer is yes, because interestingly, even if you just want to detect objects in 3D from vision, the main bottleneck today is monocular depth. If we had good depth estimates, we would have much, much better 3D detection. So it's the scientific bottleneck right now. And so the overarching question in our research is: is robust 3D vision possible thanks to large-scale self-supervision? That's the specific scientific question that part of my team and I have been studying extensively these past few years. So, just as a quick refresher: in supervised learning, you get raw data, which is easy to acquire, right? Just record the sensor stream. You feed it into a model and out of it you get predictions. And you pay a loss to train your model, by gradient descent and backpropagation in the usual way. And that loss is computed by computing an error with respect to ground truth, right? Target values, or labels. And that is the dirty secret that everybody in machine learning knows: what drives performance is how much labeled data you can get, and the quality of that labeled data. In self-supervised learning, it's different. You go from raw data to model to predictions, same as before. But the loss is not paid by comparing to the ground truth. It's paid by comparing to the raw data itself. So how do you do that? Well, obviously, you can't just have a reconstruction objective. You need to inject more inductive priors to help learn the right things. And that's where prior knowledge really comes into play. The reason we're doing this is because we have large volumes of unlabeled but structured data. So we can make some assumptions about the road, about other objects, or about geometry. And there are also multiple sensors, so you can make assumptions about the relative calibration of the different sensors and things like that. One thing that is different in our research when we talk about self-supervised learning is that it's different from the image classification self-supervised learning that you might be familiar with. Because here, we're not talking just about a pre-training task.
We're not just trying to learn a good representation and then use that representation in a fine-tuning or transfer learning way for a downstream task for which we have labels. Here, the task itself, which is predicting depth, is self-supervised. So after the self-supervised learning, the model is ready to be used. So with this brief intro, I'll basically dive right into the three big topics that I wanted to cover and the research that we've done in the team: 3D self-supervision, how to go beyond self-supervision, because that's not a panacea either, and, more importantly, how do we use monocular depth. So it all started actually with a really famous paper from Clément Godard et al. about self-supervised stereo training, where they showed that basically you can use some simple geometric constraints when you have a stereo pair, so two time-synchronized fronto-parallel cameras. You can feed the left image from your stereo pair into a monodepth network and get a depth prediction, or a disparity prediction. And then you can use the equations of projective geometry to warp the pixels from the left image onto the right image, because you know the baseline. And that enables you to reconstruct the right image from the left image. And then you can compare the pixels, which is called a photometric loss. And if your pixels are not aligned in color space, then it means, not that geometry is wrong, obviously, but that your depth network is wrong. And that gives you a useful signal to backpropagate and learn this network. And so that's what you see in the learning objective for self-supervised learning. You're trying to learn these theta hat parameters, and you're minimizing an empirical risk, which is a sum of losses over the different image pairs. And it has multiple terms: the photometric loss, which is the loss that you obtain through view synthesis by comparing pixels, as I just explained, plus a sum of other losses to try to add a bit more structure to your depth maps, for instance by regularizing using edge-aware depth smoothing, or by also regularizing for occlusion. And so that work really inspired us. And we made a paper called SuperDepth, where one of the things that we noticed is that, obviously, the more pixels you have, or the higher your resolution, the easier the problem is. Most papers in research deal with limited resolution, so they haven't really necessarily seen that fact before. But as soon as we started to use our data and started to use it in the cars, we realized that if you want to improve your performance, just increasing your resolution mechanically helps. For self-driving data, it's very important because you want to know how far away things are, even far away things, so that you have the time to react to them, so you need a high resolution. Training these networks at high resolution is not trivial either, because it requires a lot of memory and maybe some different optimization techniques. But if you are able to do that, it improves performance. And so one natural extension of that thought process was to say, well, in a typical depth network, you have an encoder-decoder where you're compressing the information spatially in the encoder, and then you're trying to re-interpolate the missing details, basically, by decoding at progressively higher resolution back towards the input resolution. And again, this is typically done for memory constraints, and also, to some extent, to prevent overfitting and things like that.
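To make those two loss terms a bit more concrete before moving on, here is a minimal PyTorch-style sketch of an L1 photometric term plus an edge-aware smoothness regularizer, roughly in the spirit of what was just described. The function names and the smoothness weight are illustrative, not taken from the papers, which typically also add an SSIM term and occlusion handling.

```python
import torch


def photometric_loss(target, synthesized):
    # Simple L1 photometric term: compare the real image with the one
    # reconstructed by warping the other view using the predicted depth.
    # (Real systems usually add an SSIM term; omitted here for brevity.)
    return (target - synthesized).abs().mean()


def edge_aware_smoothness(disp, image):
    # Penalize disparity gradients, but less so where the image itself has
    # strong gradients (edges), so depth discontinuities can survive.
    disp_dx = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
    disp_dy = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()
    img_dx = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
    img_dy = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (disp_dx * torch.exp(-img_dx)).mean() + (disp_dy * torch.exp(-img_dy)).mean()


def self_supervised_loss(target, synthesized, disp, smooth_weight=1e-3):
    # Total objective: photometric view-synthesis error plus regularizer.
    return photometric_loss(target, synthesized) + smooth_weight * edge_aware_smoothness(disp, target)
```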
But here, in the SuperDepth network, we basically just looked at the neighboring field of super-resolution and at how to recover lost details from images, or in this case, feature maps. And there was a set of new interesting operators that had come out at the time called sub-pixel convolutions, which we used, and we extended, basically, the encoder-decoder architecture to try to learn to recover lost details by using these super-resolution techniques for the intermediate disparity maps. That turned out to work really well. So: working at a high resolution and using sub-pixel convolutions to super-resolve intermediate disparity maps and, as you can see at the top, depth maps. One of the benefits is also that, because it's a stereo pair for which you know the baseline in meters, you can actually also reconstruct point clouds, metrically scaled point clouds, with the camera calibration information, the stereo rig calibration information. This is what you see at the bottom. That's great, but stereo is not really ubiquitous. It also has some challenges with respect to calibration and maintaining the calibration over time, if you hit a pothole or something like that. And so one of the major sources of information for us is actually monocular videos. And so the question then naturally came up: can we do the same thing with a generalized type of stereo setup where, instead of having two time-synchronized fronto-parallel cameras, we have one single camera, but we're using frame t minus one and frame t? And the answer is yes. You can use a similar idea, but there are some differences. Namely, the fact that now you also need to predict how the camera has moved from frame t minus one to frame t. And you can use very expensive and precise sensors to do that, as in self-driving cars. But it turns out you really don't need to do that. You can actually just feed the camera pairs, frame t minus one and frame t, into a convolutional network that tries to regress the rotation and translation of the camera, also called the ego motion, and use that for view synthesis. And the equation then is this one, where you're trying to basically synthesize a target frame, say frame t, from a context frame or a set of context frames. And the way you do that is you, again, predict the depth from the image, predict the pose from the pair of images, and then projective geometry tells you that if you have the camera calibration, the intrinsics K, the ego motion T hat, and the depth D hat, you can basically exactly reconstruct frame t from frame t minus one, using this equation, which is a reprojection equation of the pixels using the pinhole camera model. And so we proposed this network called PackNet in our CVPR paper from 2020 called 3D Packing for Self-Supervised Monocular Depth Estimation. This network basically built upon the insight of SuperDepth, which is, same thing, resolution matters. And to quote a very famous saying from Andrew Zisserman, the devil is in the details, especially for the photometric loss, where, if you remember, we're warping the pixels from frame t to frame t minus one and comparing the pixels in this photometric loss. So the higher your resolution, the more detail you capture and the less ambiguous your photometric loss is going to be. For instance, two pixels on my vest right now would look black. So if you are at a low resolution, they will all look black, the same.
But if you are at high resolution, you might see some different shadows, maybe some wrinkles, or some little visual cues that enable you to differentiate pixels and then tell you whether you predicted the right depth and the right ego motion. And so here, as I mentioned, we took this idea of going further into the details. And instead of doing the traditional encoder-decoder where you do pooling operations in the encoder, so max pooling, which is destructive, you lose information, we looked at learning this compression. So we are learning to compress spatial details, and then later learning to decompress them. So it's a learned compression and decompression of intermediate features that is done through these packing and unpacking operations, which replace the max pooling and interpolation that you find in traditional deep networks. Now, what do these packing and unpacking layers look like? It's maybe a bit too technical for this talk, but at a high level, you're basically taking your 2D tensor, H by W, and you're shuffling the pixels to pack them along the channel dimension, so that instead of a C by H by W tensor, you get a 4C by H/2 by W/2 tensor. So you're packing along the channel dimension, or along the depth. This is what's called a space-to-depth operation. And then what you have is basically a weird kind of tensor, because it has pixels and colors mixed along the channel axis. And so one way we found to actually make sense of that feature space, of that tensor shape, is to use 3D convolutions to basically learn to pack this into a compact 2D tensor representation of features. The resolution is still divided by two, but these features are learned with some spatial details preserved. And then unpacking is basically the reverse operation. One toy example you can use to see whether this makes sense is to take an input image and a kind of two-layer encoder-decoder using the traditional operations of max pooling and bilinear upsampling, and try to learn to reconstruct the details. As max pooling is destructive and bilinear upsampling is just simple interpolation, you see that the reconstructed image is very blurry. It's a very poor, lossy encoding. But if you use a single packing and unpacking operation, you can get almost lossless reconstruction. I won't go into the details of this monster table, but this is a very active field of research. We've had a lot of people working on this very interesting challenge. One of the interesting things is not just that PackNet was state of the art, but also that it was the first method where a self-supervised network outperformed the supervised state of the art, meaning methods that take an image as input and try to regress depth from ground truth depth measurements from LiDAR. And in particular, another strength of PackNet was the fact that it scaled much better, which is, again, something very important for us. We're very much thinking about very large scale applications. And so we found that it scaled in two ways. One is it scaled better with parameters. If you look at the plot, the y-axis is absolute relative error, so lower is better, and the x-axis is the number of parameters. And what we found is that ResNet-type architectures scale with parameters, but they kind of taper off, they plateau, whereas our models really scale very well with the number of parameters.
So it depends on your computational budget. And the other thing is that with resolution, the higher your resolution, the better your performance, as I mentioned. ResNet improved a little bit from the orange curve to the yellow curve, but PackNet improved a lot from the blue curve to the green curve, because that's what it was made to do: to learn to preserve these details. And again, the reason it doesn't overfit there, in spite of the number of parameters, is because of the strong inductive biases, especially in the 3D convolutions. By the way, if you are scared by parameter counts in the hundreds of millions, that's normal. But one of the cool things is that this can run in real time on GPUs, and this actually runs in the car. We did a lot of different ablations; I'll skip over that. One interesting thing also is, I mentioned overfitting risks with big networks like this, but it turns out that, as is often the case in deep learning, over-parameterization is a good thing. And we found out that our network also generalizes much better to other domains, for instance, trained on one driving dataset and evaluated on another. Okay, enough of the blah, blah. Now some results. So here what you see is: upper left, the image inputs; lower left, the depth map prediction, color coded. And on the right, what you see is a point cloud that we reconstruct from that, because again, we have the intrinsic parameters, so we can basically reconstruct a point cloud. And so you see this is data from our fleet in Japan, in an area close to Tokyo called Odaiba. Here you see it works on the pedestrians, on the cones, on the cars, on the buses, on the road, quite far away. You also see it works really well on wet surfaces, where LiDAR struggles quite a bit, including very wet surfaces such as this one. So here are some more qualitative results to show that it captures a lot of details. And in particular, if you look at this one on the lower left, you can see through the fence, which I find very cool, to be able to be that precise. All the code for that model that I talked about, PackNet-SfM, is available on our GitHub. The dataset that I also talked about, which is called DDAD, for Dense Depth for Autonomous Driving, is also available publicly. And we're actually organizing a competition at a workshop that we're organizing at CVPR, which is called Frontiers of Monocular 3D Perception. That competition is ongoing, and we encourage you to participate. You still have about one month to participate. So if you want to start from our code and build upon it with your own ideas, or if you have your own models that you want to try, feel free to give it a go. Cool. One thing you might have noticed is that this is only the front camera, and obviously in driving we need 360-degree 3D perception, also for certain robotics applications. And to minimize costs, as I mentioned, you typically don't have 50 cameras all around the car. Instead you have a minimal sensor suite, like in DDAD's case, for instance, six cameras with minimal overlap. So how do you reconstruct a full 360-degree point cloud from these six cameras? That's a problem that we call full surround monodepth. And we've just recently put a paper on arXiv called Full Surround Monodepth from Multiple Cameras. So FSM is a play on words on structure from motion, SfM. And the problem and the result look something like this.
So this is actually from the nuScenes dataset, where you see that for nuScenes the overlap is even worse than for DDAD; it's extremely small. You see these blue points here, which represent the field-of-view overlap between cameras. So most of the visual field is actually single camera, which is another reason to work on monodepth. And on the right, you see what we are able to obtain in terms of 3D point cloud reconstruction from a single model. And the way we do this is by leveraging, again, geometry and prior knowledge, but in this case we have even more that we can leverage, which is the relationship across cameras in space, but also in time. And that's really the key to making this system work: leveraging spatio-temporal transformation matrices between cameras. So we don't just leverage the relationship between cameras at the same time, but also across time. And these constraints matter because, even if two cameras don't overlap a lot in their field of view at a given time, the platform moves, so the overlap increases, right? Different cameras can see the same thing over different time steps. And if you leverage those constraints together with a couple of other important design factors, like photometric loss masking and things like that, we manage to get a really good 3D point cloud reconstruction. And in particular, we're much stronger than approaches that try to reason explicitly in terms of matching, things like COLMAP, for instance. So how does it look in practice? Here you see the six images on the left and the six depth maps; the color code of the border of the images corresponds to the color code of the cameras that you see on the right. And then you see the point cloud. Obviously it's not perfect. I wouldn't drive off of this, but it's actually fairly decent at close range. What you see is some bleeding artifacts around the boundaries of objects. That means we should go to even higher resolution and try to be even more precise. But we get one scale-consistent point cloud from all these cameras with a single model. And that's, again, only self-supervised. That's great. Now, if you remember one important little thing that I mentioned earlier: this projection equation assumes a pinhole camera model. In the standard pinhole camera model, projection is a simple matrix-vector product, as you see here. But it's just an approximation when you have distortion, such as in these examples here, like pincushion or barrel distortion. And it's not a very good model at all when you have wide-angle cameras, like fisheye cameras or catadioptric cameras, when you have panoramas, when you have a dash cam behind the windshield, because the windshield creates some distortions, even worse if you have rain. And an even crazier application would be a camera underwater, because the water acts like a very complicated type of lens. So we can't use projective geometry with the pinhole camera model in those instances. I mean, we can, but the results are going to be really bad. And remember when I said that in self-supervised learning you compare the pixels after reprojection, and if the reprojection is wrong, it's not because geometry is wrong, it's because the depth is wrong? Well, in this case, it might be geometry that is wrong, or rather your assumption about geometry.
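For reference, this is roughly what that pinhole-based view synthesis step looks like in code: unproject pixels with the predicted depth, move them with the predicted ego motion, project them back with the intrinsics K, and sample the source image. This is a simplified sketch under the standard pinhole assumption, with illustrative argument names and shapes, not the actual PackNet-SfM implementation.

```python
import torch
import torch.nn.functional as F


def warp_with_pinhole(src_img, depth, K, T_src_from_tgt):
    """Reconstruct the target view by sampling the source image.

    src_img:        (B, 3, H, W) context frame (e.g. frame t-1)
    depth:          (B, 1, H, W) predicted depth for the target frame t
    K:              (B, 3, 3)    camera intrinsics (pinhole assumption)
    T_src_from_tgt: (B, 4, 4)    predicted ego motion, target -> source
    """
    B, _, H, W = depth.shape
    # Pixel grid in homogeneous coordinates.
    ys, xs = torch.meshgrid(torch.arange(H, device=depth.device),
                            torch.arange(W, device=depth.device), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).float().view(1, 3, -1).expand(B, -1, -1)

    # Unproject to 3D with the predicted depth: X = D * K^-1 * p.
    cam_points = torch.linalg.inv(K) @ pix * depth.view(B, 1, -1)
    cam_points = torch.cat([cam_points, torch.ones_like(cam_points[:, :1])], dim=1)

    # Move points into the source camera frame and project back with K.
    src_points = (T_src_from_tgt @ cam_points)[:, :3]
    proj = K @ src_points
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)

    # Normalize to [-1, 1] and bilinearly sample the source image.
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)
```

Everything upstream of this warp, the depth and pose networks, is differentiable, so the photometric error between the warped image and the real one can be backpropagated all the way through.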
So to go beyond pinhole cameras, we proposed something called neural ray surfaces. We leverage a generalized camera model, inspired by Grossberg and Nayar, which just says that instead of having a global linear projection operator, we're going to assume per-pixel viewing rays. And this neural ray surface is something we're trying to predict from a single image, where we use a deep network to predict these per-pixel rays. And that's what we did in the paper mentioned here. So here you see the pinhole model, the classical picture of the pinhole camera model, where again your equation is a simple linear equation. But in the case of this generic camera model, where you have neural ray surfaces, you have a much more complicated model. For the proposed rays, for each 3D point Pj, basically, we must find the corresponding pixel pi belonging to the image Ic, with ray surface vector qi. So these q's are the ray surfaces, and the set of q's is this neural ray surface. And we have to find, for each 3D point, the pixel whose ray surface vector is the closest to the direction between Pj, that 3D point, and the camera center Sc. So here you have, for instance, three q's, q1, q2, qi, and you're trying to find the q that's closest, which in this case would be q1. So the problem is that here, you see, you're trying to find the best ray surface, the closest ray surface, and you have an argmax over there. And we want to basically find a neural network that can predict those q's and optimize them via backpropagation, so with gradient-based optimization. So we approximate this non-differentiable argmax with a softmax over a patch-based residual surface, which also gives us some local regularization, and that enables us to get a really fast, end-to-end differentiable model. So the picture overall looks almost the same as before. You have an image encoder and a depth decoder that predicts depth, you also have a smoothness loss and other losses, regularizers. You take two frames, you feed them into the pose network, you get the ego motion estimation, again a rotation and translation; that doesn't change. But what changes is that you have this ray surface decoder that is trying to predict these per-pixel rays, which you see here as color-coded arrows. And then you use the equations of the generic camera model to do view synthesis, to warp the pixels from one frame to the previous frame, and then pay the photometric loss. So does it work? The answer is yes. And actually, surprisingly, the same model, the same approach, works really well for crazily different configurations. Here you see a catadioptric camera from the OmniCam dataset. We are able to get a really good depth map and an accurate point cloud over a short range, because OmniCam, coming from catadioptric cameras, doesn't have high resolution for faraway objects. This is for fisheye, same thing. We get very heavy distortion and we're able to predict the depth really well, on the poles, on everything, and reconstruct really good point clouds. And this also worked in practice for the dash cam setup that I mentioned before. Same thing, the code is available online if you want to try it out. And yeah, of course it also works in pinhole setups. It can approximate the pinhole camera really well too. Last, but not least, I'll really briefly mention ego motion estimation and this pose network part of the architecture.
We've also done quite a bit of work on that, whether end-to-end with deep networks, or with more structured approaches, for instance using 3D keypoints and a somewhat more geometric approach than just regressing rotation and translation from pairs of images. And again, this is available online, including the code. All right, so I hope I convinced you that there's significant progress being made in depth estimation, but there are still some challenges. In particular, you have some issues with self-supervised learning, but in practice you very often do have some supervision. And this is where, being a research institute inside a company, practicality beats purity, as some of you know from the Zen of Python. So we're trying to leverage whatever supervision we have, which in practice is commonly some supervision from cheap range sensors. For instance, more and more cars are getting LiDARs, but not the LiDARs you're thinking about for self-driving cars; rather, very sparse, cheap range sensors with a low number of beams, say four beams, that are there mainly for frontal collision avoidance, for safety applications. But they still return some 3D points, maybe noisy ones. So we could use that. Also, very often, as I mentioned, you don't just care about predicting depth; you want to know what the objects are, where they are, et cetera. So you often have semantic labels and you do other tasks like semantic segmentation. And another big source of information and supervision is synthetic data from simulation. What we found in all these cases is that we should leverage supervision as much as possible, but what also really motivated us to work on self-supervised learning is that we found that even when you have this little bit of supervision, what is key to unlocking its use is self-supervised learning. Because when you have insufficient supervision, you need to compensate with this prior knowledge, these inductive priors that are encoded in the self-supervised learning approaches. I won't go into too much detail about the semi-supervised learning, because it's exactly what you think it is. In addition to your self-supervised loss, if you know some ground truth 3D points in the world, you can re-project them onto the camera plane and pay an additional loss for those pixels, a supervised loss, just a pure depth regression loss. We wrote a paper on that, because what we found is that these losses are very different in nature. The photometric loss is a loss between colors, in RGB, in the 2D image plane. And the 3D loss is typically a loss in meters, in 3D. And so you're basically comparing pixels to 3D points and mixing them into the same objective, which does not work very well. So for optimization purposes, you actually need to re-project the 3D loss into the 2D plane, using again some basic geometry. That turns out not to be done in practice by most people, but it turned out to be really important. And so that improves performance a lot, and in particular that can improve performance even when you have very few points, like a hundred points per scene. And that's interesting, because you can assume you have a few points available to you at training time, but what about inference time? If you're assuming that some cars have four-beam LiDARs, for instance, well, you should be able to use them not just for training, but also for inference.
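As a rough illustration of that semi-supervised setup, and not the exact formulation from the paper, the supervised term can be evaluated only at the pixels onto which a (possibly noisy) range measurement projects, and then simply added to the self-supervised objective; the names and the weighting below are made up for the sketch.

```python
import torch


def sparse_depth_loss(pred_depth, sparse_depth):
    """Supervised term on the image plane, using only the pixels that have a
    (possibly very sparse) range measurement projected onto them.

    pred_depth:   (B, 1, H, W) depth predicted by the network
    sparse_depth: (B, 1, H, W) projected range returns, 0 where no point hit
    """
    valid = sparse_depth > 0
    if valid.sum() == 0:
        return pred_depth.new_zeros(())
    return (pred_depth[valid] - sparse_depth[valid]).abs().mean()


def total_loss(self_sup_loss, pred_depth, sparse_depth, sup_weight=1.0):
    # Combine the self-supervised photometric objective with the sparse
    # supervised term; both now live on the 2D image plane.
    return self_sup_loss + sup_weight * sparse_depth_loss(pred_depth, sparse_depth)
```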
And using those sparse points at inference time as well is exactly what we've done in our most recent CVPR paper, called Sparse Auxiliary Networks for Unified Monocular Depth Prediction and Completion. The idea is the following. At the top, what you see is the typical setup of depth prediction, where you have an image, you feed it to a network like PackNet, and you get a predicted depth. But below, what you see is that if you have a few points in the environment, not enough to really drive on, you still want to leverage them to improve your depth estimates. That problem is called depth completion: going from a few points to a completed depth map. And so here, what we do is we use the same network, the depth prediction network, but we are basically imbuing the decoder features with some information about the sparse depth, which we encode via these sparse auxiliary networks. Now, the sparse auxiliary networks are basically sparse convolutions, with a bunch of convolutional layers, that then get transformed and mixed with the decoder features of the depth decoder. And the beauty of that type of approach is that when you have some points, you can just feed them into your decoder this way. But when you don't have these points, you can still use the encoder-decoder straight up and predict a depth map. So this doesn't just provide you with the flexibility to handle different cars or different robots that might or might not have this sparse range sensor; it also works at runtime, where if, for some reason, dropped packets or a sensor failure mean you don't get the depth readings at a given point in time, you can still predict a depth map. And this provides you with the redundancy and robustness that we need. I won't go into the details of the sparse residual blocks, which are these blocks over there, but essentially they rely on sparse convolutions, which are really important for efficiency. How does it look? So again, here, with the same network, we can basically compare the outputs of prediction, meaning no sparse input points, and completion. And what you see is that the results are pretty good with prediction, just like before, but for completion, we get a lot more details, like, for instance, the bars on this truck. This also works, so, as I mentioned, I talk a lot about driving because that's one of the major applications that we're thinking about at Toyota, but robotics is also a very big application for us. And so we've also validated that this approach works in indoor home scenarios, which are typically more cluttered and less structured, but in a sense a bit easier because the range is shorter. So it works pretty well there too. We have many more results in the paper, where we show that we can improve the state of the art on depth prediction and get quite competitive with specialized completion approaches. And we also showed that you can use this depth for 3D detection, and it works quite well, actually better than the traditional supervised depth. I'll talk a bit more about this later. So how does it look? Same thing: bottom left, the input image; bottom right, the depth map; upper section, the predicted depth. And when you see some colored points flashing in and out, that's when we're switching the sparse input depth on and off. So now we're doing completion, now we're turning off completion and falling back to just prediction.
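In spirit, the switch between prediction and completion can be as simple as making the sparse-depth branch optional in the forward pass. The toy module below only illustrates that idea; it is not the actual SAN block (which relies on sparse convolutions for efficiency), and all layer sizes and names are made up.

```python
import torch
import torch.nn as nn


class DialableDepthHead(nn.Module):
    """Toy illustration of the 'dialable' idea: the decoder can optionally mix
    in features computed from a sparse depth input."""

    def __init__(self, feat_channels=64):
        super().__init__()
        self.sparse_branch = nn.Sequential(
            nn.Conv2d(1, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, kernel_size=3, padding=1),
        )
        self.to_depth = nn.Conv2d(feat_channels, 1, kernel_size=3, padding=1)

    def forward(self, decoder_feats, sparse_depth=None):
        # Prediction mode: no range sensor available, use image features only.
        # Completion mode: add features encoded from the sparse depth map.
        if sparse_depth is not None:
            decoder_feats = decoder_feats + self.sparse_branch(sparse_depth)
        return self.to_depth(decoder_feats).relu()  # depth is non-negative
```

Calling it with `sparse_depth=None` gives pure prediction; passing a projected sparse range map gives completion, which mirrors the switching shown in the video.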
And so that enables you to be very robust to these kinds of sensor failures. So I mentioned that to go beyond self-supervised learning, you can use some supervision, right? Like semi-supervised learning, or even completion, what we call dialable perception: you can dial between the image input and the image-plus-depth input. Semantics is a big one. And if you remember your basics of computer vision, or even human vision, in theory, geometrically, it's impossible to invert the image to get the depth, right? You get the 3D world projected into the 2D plane, and there are actually many un-projections that are valid, for instance up to a scale parameter. And so depth prediction from a single image is what we call an ill-posed inverse problem. So why does it even work? Why is it even possible for us to do it? Why can you do it when you look at a photograph, or when you look with just one eye? It's because there are obviously patterns that relate appearance to category-level geometry. We all have faces that are different, but they are roughly the same dimensions. And so they re-project into the 2D plane in a way that is consistent with how far we are from the camera. And so obviously, because it works based on appearance, relating appearance to geometry, one natural question is to wonder whether monodepth can benefit from pre-trained semantic features. And the answer is yes. We showed that in a paper called Semantically-Guided Representation Learning for Self-Supervised Monocular Depth. I won't go into the details, but it's the same idea. You have a pre-trained semantic network, you have a depth network, and, same idea as PackNet-SfM, you can actually guide the representation learning of the decoder to be more semantically aware. In this case, we use pixel-adaptive convolutions. And so the results are actually nicer, and we improved upon the state of the art again. What you can see here in particular is that there are certain structures that are really hard: this pole here is really dark, and it's hard to tell what depth it is at and whether it's a pole or not. But because we have more semantically aware features, we can actually recover the depth much better. Same for other types of thin structures that have a weird geometric structure, but one that is very consistent across categories, like signs, people, trucks, things like that. And I mentioned we use a pre-trained semantic network to help us guide the learning of the depth representation, but where do we get this pre-trained network? You typically get it from manual labels, but that's very expensive. So one question that we've been exploring is: can we use simulation as a source of supervision? And the answer, again, is yes to all of these questions. Can we get synthetic data that is useful? Can synthetic semantic segmentation benefit depth, and can depth help sim-to-real transfer of semantic segmentation? So it goes both ways, right? Can semantics help depth, and can depth help semantics? Can they work together in a multi-task framework? And the answer is yes. That's this very recent paper we put on arXiv called GUDA, which is about geometric unsupervised domain adaptation for semantic segmentation. And we've shown across multiple datasets how this can work. In a nutshell, you have real data, which is just video as usual. And same thing as before: it goes to a pose encoder to get the ego motion, and to an image encoder and a depth decoder to get a depth map.
Now, this time, in addition, it goes to a semantic decoder to predict the semantic segmentation information. And on the simulation side, you have the video, but you also have ground truth semantic segmentation, which comes for free because it's in simulation. You know everything that's happening in the world because you created it, so you also have depth ground truth and ego motion ground truth. And then what you can do is basically feed that through the same network architecture and pay the supervised losses in addition to the self-supervised losses, whereas in the real world we only pay the self-supervised losses, which is the view synthesis loss. And in addition, there are a couple of different regularization terms that you can compute in simulation and that help stabilize the training. Again, it works really well. So what you're seeing here are examples of predictions of our network, which predicts, not independently but jointly, semantic segmentation and depth. And we're just superimposing the semantic labels onto the depth point cloud that we're getting. And all of these predictions are obtained without ever having seen any real-world semantic label. So it transfers really well in this form, called sim-to-real transfer, without any supervision in the real world. I'll skip over this; we have plenty of results in the paper. One interesting thing we're showing is also how it scales: if you feed in more simulated data, if you feed in higher-quality simulation, does it get better and better? And the answer is, again, yes. Lots of results in the paper; I encourage you to check it out. Because I'm running a bit out of time, I'll go quickly over this, but same thing: all these works are available online, and feel free to just ping me if you have any questions. So, as I mentioned before, we can predict depth, and we can predict better depth thanks to going a bit beyond self-supervised learning. One natural question is: what is it good for? I briefly mentioned 3D detection. So here is another example of work where we've done 3D detection, monocular 3D detection, in a paper called ROI-10D, where you really get an RGB image as input, you predict the depth, and then from the depth plus the image features, you can get 3D bounding boxes. And this is actually an old paper in deep learning terms, from 2019; since then results have improved quite a lot, but the approach remains very similar: you predict depth as an intermediate step. One of the cool things about that work is that we didn't just predict 3D boxes, we also reconstructed 3D shapes. And the way we did that is by learning some priors over CAD models of the 3D shapes. And then, in addition to the 3D bounding box, we regress the latent shape encodings to get shape reconstructions, as you see here on the bottom right. And one of the cool things you can do then is data augmentation, augmented-reality style, where you're pasting into the scene fake instances of cars that you reconstructed from other sequences, and that enables you to improve your training data with a 3D data augmentation technique. And we took that further and looked at a fully automatic labeling scheme, because the previous approach still needs 3D bounding box labels, the ground truth; it's supervised. So we've looked at auto-labeling via differentiable rendering.
So it's the same idea as before, where you're trying to predict parts of the scene, but instead of comparing those parts, like object shapes, et cetera, to a ground truth, you're instead using a self-supervised approach based on differentiable rendering. Your image, if you look at the bottom, is fed through a neural network that predicts intermediate scene parameters: objects, their shapes, et cetera. And then you are re-rendering the scene, saying, okay, if I'm right, I'm able to regenerate the pixel values of my input image, because I can re-render the objects the way I think they're placed. And if I'm right, the pixels of my re-rendered image should be close to my input image. So this is very similar to the idea of view synthesis from before. Instead of copying pixels from one image to the other, we're actively deconstructing the image in this inverse rendering process and reconstructing it with the rendering process. Now, this de-rendering process, because it goes through a neural network, is obviously differentiable. So the hard part is the renderer, which needs to be differentiable itself, so you can backpropagate through, again, this photometric loss. Luckily there's a lot of research on differentiable rendering. And so we've been able to leverage recent differentiable renderers and actually came up with our own 3D differentiable renderer. And this is what you're seeing here in this little animation, which is the whole inverse rendering, re-rendering and optimization process for the three car shapes. Now, the cool thing about that is that this de-rendering of the scene gives you labels you can use for downstream training of 3D detection networks. And what we found is that these labels are almost just as good as the manual labels you would get from LiDAR. We did follow-up work at ECCV. So that previous work, called SDFLabel, was an oral at CVPR, and we did follow-up work at ECCV where we didn't use any LiDAR; this time we really did it just from images. So, to conclude, I've talked about self-supervised 3D vision and our research in that area. I've explained why we are doing it: to get robust 3D perception from vision. How do we do it? We use geometry for scalable 3D self-supervision. I talked about SuperDepth, which was the stereo case, and PackNet, which was the monocular structure-from-motion case. I showed that it works for one camera, for multiple cameras, like the full surround monodepth work, and even for weird cameras, for which you need to use the neural ray surfaces. In the second part, I talked about going beyond self-supervised learning, using semi-supervision from maybe partial point clouds, which you can also use not just at training but also at inference time; semantic guidance, where you can use pre-trained networks for semantic segmentation; but also simulation, and that is the GUDA work that we talked about. And finally, I very briefly touched on how we use monocular depth, mainly for 3D detection, but also for auto-labeling thanks to differentiable rendering. That's a lot of work. This is basically the bibliography. I'll share the slides so people can have all the references. And just as a reminder, the code is available, the datasets are available, and we're organizing this workshop and competition at CVPR, which I encourage you to participate in. You can find more information on the TRI website or on my website.
Obviously, that's a lot of work. It takes a village to do that. So thanks to my many collaborators, and thank you for your attention. And yeah, if you have any questions, feel free to ask. Cool, fantastic. Thanks a lot, Adrien. That was really interesting. I actually do have a couple of questions, but maybe let me ask first if somebody else wants to go. Can I ask if... Of course. In this very beautiful sketch you've given here of the depth estimation, does it make sense to use 3D information? So as opposed to working from an image, could you gain a lot of extra learning capability from saying this is a moving scene with other objects that are maybe stationary or have some limited speed? Are there other opportunities in that direction? Yeah, absolutely. That's a great question, and it's a fairly deep question. So we're doing some of that, but not enough of it. In the sense that in semi-supervised learning, when you have this multi-sensor setup, we are reasoning in 3D, but just for optimization purposes we're reprojecting into the 2D image plane. For the neural ray surfaces, we also reason in 3D. But I think the case where we use 3D information the most, in this 3D decomposition of the scene, is really the auto-labeling work. Because in the auto-labeling work, the scene parameters are 3D scene parameters: it's the full 3D scene that we're trying to reconstruct. In this work we're focusing just on objects, but one interesting research direction that we're actively working on is reconstructing everything in the scene in 3D, to be able to do the reasoning exactly in the way you describe, Peter. Thanks. Yeah, maybe to continue along that direction. I mean, in principle, if I understand this correctly, using this differentiable renderer, you can turn this essentially into a totally generic unfolding problem, right? It doesn't need to be... you can replace the renderer, which of course does some ray tracing through the atmosphere and stuff, with something that doesn't even have optics, right? In principle, you can have some arbitrary unfolding of some arbitrary physical system, as long as you know, somehow, how to go from the scene parameters to the output which you have accessible, right? With your sensors. Exactly, exactly. And this is actually what we've done, because if you look at this optimization, the second work that I mentioned uses only images for self-supervised differentiable rendering, but here we're using the LiDAR point cloud. So going back to Peter's question also, here we're reasoning in 3D, because we have a 2D loss, but we also have a 3D loss, which you see being optimized at the bottom of the animation. And so we actually made a differentiable renderer, which is a 3D differentiable renderer of signed distance fields, of SDFs. And so here we leverage the properties of that SDF space to basically optimize the auto-labels and the shapes. So it's not a renderer in the game engine sense. It's actually a renderer of the output we're trying to optimize over. Yeah, I see, I see, nice. And maybe you hinted a bit at the generalization, but I also think, correct me if I'm wrong, that with autonomous driving, what you would really need to get correct is somehow the generalization to the very extreme tails of the events that can occur on a road, right? Usually accidents happen for a reason, right? It's just a long chain of unfortunate coincidences or whatever. How stable is all of this?
Or some of these techniques, to these very strange, unexpected situations or unexpected inputs? Yes, so that's a very good question. And that's what animates us. And that's why, when we're looking at examples like the ones I showed before, we're looking at the very little details to see if it works, right? So, yes, how does it generalize to the long tail? For us, in terms of computer vision first, the long tail can be very fine-grained details. So if we look at things like this, for instance: construction cones. You don't see them very often, you don't see a lot of them. Or things like thin structures; here you don't care too much about that, that's okay. But there are other instances, like a child: a child has a weird size distribution compared to most of your pedestrians. So I think the challenge is that machine learning is pattern recognition, and the long tail means you might not have enough samples to generalize to those rare events. And there's really no way around that besides two things. One is, well, you need to replace the lack of evidence, of experience, for that long tail with prior knowledge. You need enough information to be able to make predictions. So if you don't observe it, then you need to inject it, by using stronger priors, using more structure, et cetera. The issue with that is that you need to be right, as a human, when you decide to implement those priors. Because if those priors turn out to be wrong or biased, as we know in machine learning... this is why there were racist computer vision algorithms, right? Not because people were necessarily racist, but because people didn't think about darker skin colors, for instance, because they were themselves white. And so it's very dangerous not to be careful when you design these priors. And the other solution is, for instance, data. So, okay, I have the long tail, I have these rare events, I have only 10 examples of that, I need 100 more. So you need to go and find those examples. And there are ways to do that, but it has the same issue as what I mentioned before about human bias, which is that you need to be aware of which long-tail events you want to boost in your signal. And so that's also why I like self-supervised learning, because self-supervised learning takes kind of an extreme stance on this, which is, if you know the Hard Rock Cafe motto: love all, serve all, you know? Well, this is what self-supervised learning does. It tries to learn from all the data. And hopefully you have enough of everything, if you collected enough data. And because you can train on all the data, maybe that's going to work. Now, the problem is that's not a perfect picture either, because all the biases in your data, like the dominant modes, for instance, will dominate your learning process too. So you have to also account for that, but it gets more complex in self-supervised learning, because self-supervised means less human oversight. So this is really hard. I mean, for these problems, I don't have a solution, right? This is why it's a research challenge. So the long tail is, as you said yourself, a long-tail problem in itself, because if you start to unroll the issues, it never ends. So I think we need some fundamental scientific progress to tackle that issue. It's not easy. Yeah, that's a very nice answer. Thanks a lot. It's a long answer because it's a core problem for us.
Yeah, right, as you say, right? Exactly, yeah. And also one that affects many other people, I suppose. I mean, this is not specific to that particular problem, right? I mean... Yeah, right, right. And when the stakes are high, of course it becomes important to get it right, or to try to, you know... Yep, absolutely, absolutely. Here's one more question, which I apologize for because it's very vague. Having thought a lot about this, with your colleagues and so on, for a long time, is there anything to be said about why evolution didn't come up with this? So why do we have two eyes? Why do most species on this planet have two eyes instead of a single one? I mean, we have two eyes, but actually we have very strong monocular depth cues, right? I was rereading recently the Wikipedia article on depth perception, which is really good. And it covers a lot about this, including the monocular depth cues. And it's quite a fun article to read, because you can close one eye and do these experiments and check for yourself, like experiments on yourself, you know, which every scientist should do, you know, responsibly. But here it's easy. You don't have to poke your eye out. Just hide one eye and try a couple of different things. And you'll see that, even though stereo, parallax and all these kinds of things, binocular vision, are what enable precise depth perception, you can do pretty good depth perception with just one eye. Right, evolution did come up with the ability to do that. Yeah, without training, evolution did come up with it. So you could try it, you know: I can grab my glass and I can drink, you know, I can touch my nose. Isn't that just because your visual cortex has already learned how to deal with the information from two eyes and can then interpolate back if one of them goes missing? Like, if the only thing you ever had from your birth was just a single eye. Yeah, I don't know if there was ever an experiment or someone that actually was in this situation, so I don't know. But from first principles, it would seem like it would still be okay, because you again have motion. Right, exactly. So you can get parallax from motion, et cetera. Right. Okay, nice. Thanks so much. Peter, anything else? Because you're on mute. I think we're good. Yeah. Okay, wonderful. Thank you very much. Thank you. It was a wonderful talk. Yeah. Thank you for having me. Bye-bye. Bye-bye.