Okay. Hi everyone, good morning, and thank you for coming. I'm Kabir, and today I'll be presenting on visual navigation for flying robots. This is our latest-generation system, which uses an embedded GPU for navigation. I'll go through why a GPU is advantageous for this application, as opposed to, say, an x86-based system without any GPU, and I'll also cover the basics of visual navigation, obstacle avoidance, and related topics. So let's start. With the proliferation of consumer drones, we need better ways to navigate. Most consumer drones on the market are highly reliant on GPS-based navigation methods: they use GPS to localize themselves and to hover in position. As you know, GPS is not guaranteed in situations like urban canyons, and we don't have GPS indoors, so we can't fly indoors without some other external assistance. As you can see, these drones are highly dependent on GPS in their assisted modes, and without GPS assistance it's really hard for a new pilot to fly one, although it can be made easier with extra navigational aids. There is also a risk of flyaways due to bad GPS reception: in urban canyons and places with GPS interference, a bad GPS signal can tell the drone it is somewhere it is not, and then it flies off, and you've lost your Christmas gift. So there is an immediate need for GPS-agnostic navigation methods, especially now, because Christmas is coming, there will be a huge boom, everyone will be buying drones, all of these drones rely on GPS, and lots of things can go wrong; you only have to look at YouTube for crash videos. As for obstacle avoidance on consumer drones, there are only three drones that actually have it: two variants from DJI and one from Yuneec. And obstacle avoidance in these systems is only marginally effective: they slow the vehicle down so that you can't actually speed into walls and things like that. They work, of course, but only marginally, so we need better ways to do this. Our drones are not truly smart if they can't fly on their own, and they need to fly on their own everywhere, not just outdoors; they should not have to rely on an external navigation aid just to fly. Here is a short history of the project. I started Project Artemis in 2014 as a small research project to provide indoor navigation solutions. Because flying drones is banned in India, where I'm from, I needed a way to fly indoors, and I needed a way to do it safely. The very first variants of the vehicle, which I call Artemis MAVs (micro aerial vehicles), used a simple passive method called optical flow: a single downward-facing camera tracked natural features on the ground to stabilize the vehicle. Optical flow has its own limitations. It requires strong features on the ground to track, and there is a limit on the highest speed we can achieve, depending on the focal length of the lens and the distance from the ground, so it's limited. Optical flow is also a relative positioning method: we only get relative velocities of the vehicle, so as we integrate those to position over time, drift builds up.
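To make that drift point concrete, here is a minimal sketch of what integrating a relative velocity estimate does over time; the rates and noise level are made-up numbers for illustration, not flight data:

```python
import numpy as np

rng = np.random.default_rng(0)

dt = 1.0 / 30.0            # 30 Hz optical-flow velocity updates (assumed)
duration_s = 60.0
steps = int(duration_s / dt)

true_velocity = 0.0        # the vehicle is actually hovering in place
noise_std = 0.02           # assumed noise (m/s) on each flow-derived velocity sample

position = 0.0
for _ in range(steps):
    measured_velocity = true_velocity + rng.normal(0.0, noise_std)
    position += measured_velocity * dt   # integrating velocity to position accumulates error

print(f"Drift after {duration_s:.0f} s of hover: {position:.3f} m")
```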
So if I were flying an optical-flow-based vehicle here, over time it would start drifting and eventually it would end up somewhere else. This drift can be mitigated, but it's still a limitation. In late 2014 I started research into active positioning methods. Active positioning differs from something like optical flow because we actually track landmarks in the environment, project those landmarks into a global frame, and then localize ourselves against them. It's not relative anymore; we have a real position, so we don't drift. In early 2015 we had the first flights with monocular visual odometry: again a single bottom-facing camera, but now using active methods to track our position, which improves on the drift we observed with optical flow. But with monocular visual odometry we are not using any other sensors; we are tracking pure image motion, the movement of features on the ground, and that has limitations too. Tracking image motion alone brings several challenges: you require sufficient texture on the ground, and then there are illumination changes and featureless terrain, and all of those things add to the error. And since this is a monocular approach, there is no way to scale the measurements: what we get from tracking image motion directly is an unscaled measurement. So what we do is scale that measurement using inertial data from the accelerometers, which gives us a metric position estimate, and we close the feedback loop using that estimate to fly the vehicle. But because it's a monocular approach, there is no direct way to observe that scale factor, which has to be estimated in flight. So flying with a monocular vision system requires an extra initialization step: I need to take the drone and wave it around until the scale converges, and only then can it fly on its own. That's not really practical, because it's not autonomous if you need to wave it over the ground first. In late 2015 we completed MAV2, the second-generation prototype. Flights were successful with monocular odometry only, and we started more work on using two or three cameras to eliminate that initialization problem. In early 2016, MAV3 was completed with stereo inertial odometry. We switched from one bottom-looking camera to two forward-facing cameras. Having two cameras allows us to triangulate a landmark, and thus we do not need that extra initialization step: we can observe the same landmark from two cameras, triangulate it, and know the exact depth to that feature, and we can use this depth to get a metric position estimate directly. The triangulation itself still uses pure visual features, without assistance from any other sensors. But when I say stereo inertial odometry, what we do is track features with the assistance of the inertial measurement unit: it allows us to predict where a feature will be in the future, so we can constrain our search to that area, which also reduces the computational demand. I'll go into more detail in the coming slides.
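A minimal sketch of why the second camera removes the need for scale initialization: with a calibrated, rectified stereo pair, the depth of a matched landmark follows directly from its disparity. The focal length and baseline below are made-up values, not the vehicle's calibration:

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Depth of a landmark from its disparity in a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a visible landmark")
    return focal_px * baseline_m / disparity_px

# Hypothetical calibration: 420 px focal length, 12 cm baseline.
print(depth_from_disparity(disparity_px=25.0, focal_px=420.0, baseline_m=0.12))  # ~2.0 m
```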
And what we're working on now is the fifth-generation vehicle, the one on the table here. It uses a GPU, as I said, and it does obstacle avoidance and GPS-denied navigation. It does everything the previous-generation vehicle could do, and it does it much better, with lower power consumption, and there are other advantages as well. When it comes to using multirotors as a development platform, they are not the easiest thing to work with, because they are inherently unstable and require active control for stable flight. So there is this notion of a feedback loop that we need to close; without it we cannot fly stably. They are also limited in the payload they can carry, and the payload limits our flight time, so we can only lift so much, and our sensors and computing need to be optimized for a particular use case so that we can maximize flight time. Since this vehicle uses a GPU, we really need to plan carefully what we run on the GPU and what we run on the CPU, primarily because GPUs only excel at very specific compute loads: they need parallelizable tasks. If you run an unoptimized version of a standard computer vision algorithm on the GPU, it's not going to perform; you would probably get results an order of magnitude worse than on a CPU. So careful time investment is required to make sure we can take full advantage of the GPU. These are some of our previous vehicles. Clockwise, the one on top is MAV1; it used monocular visual odometry and also originally served as a test bed for optical flow. This is MAV2, which used stereo cameras, and that's MAV3, which also used stereo cameras but with inertial assistance. As for design goals: this vehicle should be capable of all the features of the previous generation (GPS-denied navigation, obstacle avoidance, and so on), and it should be small. That was the primary design goal, because we want to fly in real-world situations where a bigger drone cannot, or where a person cannot go. We're targeting use cases like storm drains and thermal boiler plants, where it can fly autonomously and inspect those areas. It should be capable of real-time, high-speed reactive obstacle avoidance: it shouldn't bump into things, and it shouldn't be slow. It should also be able to transition between indoor and outdoor situations. This is especially important for real-world use cases, because we could be flying inside a room and then fly straight out of a window. You can't do that with consumer drones; the estimator would get confused, and bad things would happen. And of course, we want to increase the efficiency of algorithms wherever possible by using the GPU. To put that into context: we were doing stereo matching on the two stereo cameras to obtain a depth image on the last three vehicles, and the last two used an Intel system with an i7-6700K Skylake-series processor. That's a really powerful processor, and the stereo matching algorithm would take around 200% of the four cores, which is quite a lot, and the results still weren't at frame rate. That's not very convincing. With the Tegra, we can now run stereo matching at frame rate, on higher-resolution images, at around a tenth of the power consumption. This is the vehicle here; you can see it and its various components.
There is a Jetson Tegra X1 embedded compute platform sitting on a small carrier board, which is the same size as the SOM. There is a Pixhawk autopilot running an RTOS and handling all the flight-critical tasks. You have the stereo cameras on the front here, and a GPS receiver on top, so that we can transition between indoor and outdoor situations and so that GPS can assist when it can be used properly. And we have CAN-bus ESCs between the frame plates, which provide feedback to the control loop so that the controllers can perform better. Running through the specs: it uses the Jetson Tegra X1 computer on a small carrier board from Auvidea, a J120 if you want to get one. It has a Pixhawk autopilot and a Ubiquiti Rocket M5 data link, which is good up to several kilometers. There are two IDS uEye SE cameras, which are synchronized in hardware; I'll come to why hardware synchronization is important. There is a Zubax GNSS v2 interfaced via CAN, and the Orel 20 motor controllers also interface via CAN. In this configuration the vehicle can fly for 20 minutes. Coming to the Tegra X1 compute platform, these are the specs. It's a really nice embedded GPU platform with a quad-core ARM Cortex-A57 CPU, and it's really small, almost the size of my phone. You can put it on a carrier, which makes it slightly bigger, but that's sufficient for our case. As I said, it outperforms the i7 Skylake-series processor in terms of performance per watt; I'm not comparing raw performance here, just performance per watt. And it supports the CUDA API (Compute Unified Device Architecture) for really fast algorithm implementation on the GPU; it's an easy way to do it. There are several other options for GPU programming, like OpenCL for Intel platforms, but among the ones I've used I found CUDA the easiest to work with. Coming to the overview of the navigation problem: whenever we want to fly or navigate, there are several problems that must be solved simultaneously. At the very beginning we have perception, which is observing the environment; localization, which is knowing where we are; planning, which is planning our movement so that we don't crash or do bad things; and control, which is actually getting the vehicle to perform the planned maneuvers. And then we have the operator interface, which is how I, or whoever the operator is, control it; they should have a clean interface to control it with. For perception we have forward stereo cameras, and the cameras and inertial measurement unit are time-synchronized. Time synchronization is important because we need accurate timestamps for the inertial measurements and the camera frames: when we have a tuple composed of the two camera frames and an inertial measurement, we can use it to estimate vehicle motion using odometry methods, and this timestamping is critical. The stereo pairs are also used to compute depth images on the GPU (you can see a depth image up there) by comparing the disparity between the two images. These depth maps are used to create a small local map around the vehicle so that we can navigate in that frame and don't bump into obstacles that are close to us; it's just a local map within a small radius around the vehicle.
The choice of a sensing suite is highly important for us because we have really limited payload capacity. On the y-axis you have the frame rate of the sensor, how fast it updates, and on the x-axis you have the drift speed, that is, how fast the estimate will diverge if that is the only sensor providing the estimate. On the bottom left we have GPS. GPS is not fast; current consumer L1 receivers can only do around 18 hertz or so. We also have stereo cameras, which are slightly faster, something like 30 to 40 hertz depending on the interface you're using, and they have a medium drift speed, so you can use them alone for navigation, like we did on the previous-generation vehicles. And then you have the inertial measurement unit up there, which drifts very fast. You cannot use it alone for navigation: although the IMU provides body rates, accelerations, and so on, we cannot directly integrate that acceleration to position. There are several reasons for this, primarily that our sensors are really cheap (they cost around a dollar or two in quantity), so when we integrate the acceleration from them we get really noisy results. Another point is that to integrate accelerations we first need to subtract gravity, and for that we need to figure out the direction of gravity. Figuring out that direction is definitely possible using the gyroscopes and a sensor fusion method, but even a degree of error in the estimated gravity direction would lead to kilometers of drift in the position estimate within minutes, so that's not practical. So what we do is fuse all these sensors together to get our final position estimate. All of these sensors are used depending on availability and quality; the variances of the sensors are individually tracked, and if a sensor is misbehaving, it is shut out of the estimate. Coming to state estimation: the system localizes itself using a combination of GPS and vision, whichever is available and of higher quality, and the inertial measurements are used to propagate the state of the Kalman filter, which we use for sensor fusion. The visual information is taken into account during the filter update step, which runs at around 30 hertz, and the GPS is also taken into account in the update steps, depending on the accuracy of the measurements. We then use the inertial measurements to propagate the state forward at 200 hertz, which is the IMU rate. The fusion of these complementary data sources gives us a really robust position estimate, no matter which sensor is or isn't available in a particular situation. Indoors we do not have GPS, but vision is really accurate; outdoors we do have GPS, but the natural features the vision system observes are really far from us, which degrades the accuracy of vision, so in that case we want to trust GPS more. This weighting is done dynamically as the system flies. You can see a couple of plots there showing the smoothness of the position estimates in an urban canyon; it's actually flying in a small area with walls around it, using both GPS and vision. There's a video later, so you can see it then.
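As a rough illustration of that propagate-at-200-Hz, update-at-30-Hz structure, here is a toy one-dimensional Kalman filter; the state, noise values, and rates are simplified stand-ins, not the vehicle's estimator:

```python
import numpy as np

# Toy 1D Kalman filter: state = [position, velocity].
# IMU accelerations propagate the state at 200 Hz; a position fix
# (vision or GPS) corrects it at roughly 30 Hz. All noise values are made up.

dt_imu = 1.0 / 200.0
F = np.array([[1.0, dt_imu],
              [0.0, 1.0]])
B = np.array([0.5 * dt_imu**2, dt_imu])      # how acceleration enters the state
Q = np.diag([1e-6, 1e-4])                    # process noise (cheap accelerometer)
H = np.array([[1.0, 0.0]])                   # we only measure position
R_vision = np.array([[1e-3]])                # assumed vision fix variance (m^2)

x = np.zeros(2)                              # state estimate
P = np.eye(2)                                # state covariance

def propagate(accel: float) -> None:
    """200 Hz prediction step driven by the IMU."""
    global x, P
    x = F @ x + B * accel
    P = F @ P @ F.T + Q

def update(position_fix: float, R: np.ndarray) -> None:
    """~30 Hz correction step from vision (or GPS, with its own R)."""
    global x, P
    y = position_fix - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + (K @ y).ravel()
    P = (np.eye(2) - K @ H) @ P

# Roughly 200/30 ≈ 7 IMU samples between consecutive vision updates.
for step in range(200):
    propagate(accel=0.0)
    if step % 7 == 0:
        update(position_fix=0.0, R=R_vision)
```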
Coming to the core of the system, the visual inertial odometry method. We have a stereo camera, so we have two image frames, and we can observe a particular landmark in the environment; that's a prior. We can then use a camera model to know exactly where we are with respect to the landmark, and we can use that to know our position in the world frame. So the basic task of visual inertial odometry here is to estimate the transform from the world frame to the IMU frame; because the IMU is rigidly attached to the vehicle, if we know where the IMU is, we know where we are. We use tight IMU-camera synchronization, and we track the landmarks directly as states in the Kalman filter. Each landmark has several characteristics: a bearing vector, a depth, and the uncertainty about its position. So we track the uncertainty of everything, of all the data sources we know about, including each feature, so we can estimate better. The feature parameterization is fully robot-centric, which means we track the features in the robot's own frame, and that allows a full power-up-and-go system without the extra initialization step I told you about. I could just power it on and press takeoff, and it would hover here, which we're not going to do because it's not safe. We initially use a very basic feature detector called FAST, and once we have a few features in a particular frame, we choose the features with the best response: the ones with strong contrast, proper corners, and so on; there are several criteria for choosing good features. Then, once we have a feature, we use the inertial measurements to predict where it will be in the future. So we know it has to be in that area, plus or minus some threshold, and we can restrict our search for it to that area, which reduces the computational demand; we do not need to search all over the image for a feature we already know will be here. This is a video of the visual inertial state estimation. There is no GPS here; it's using only visual features to track. You can see the two camera views there; the bottom one shows the tracking, and the green squares are the feature patches we're tracking. You can see certain features come into view and others being lost, and on the right the colored dots are the tracked features. Here we're just doing some aggressive maneuvers to show the robustness of the tracking: a pure visual algorithm would not have been able to track under these conditions, and the inertial assistance allows us to maintain a robust state estimate. No, the features are discovered as we're flying; we detect features and start tracking them instantly, so detection and tracking are done together in one step. Yes, we had to accelerate a lot of things to get it to run at frame rate on the Tegra, because the Tegra is not as powerful as an i7 when it comes to pure CPU loads, so we had to accelerate the feature tracker and other pieces to get it to run in real time. The cameras are actually capable of 1280×720, but for the visual inertial tracking we downsample that to VGA and convert to monochrome, because the algorithm only needs monochrome for tracking and that speeds things up; VGA also helps with speed. Oh, yes, that's because I was manually flying it; we are not closing the feedback loop here. There is another video that shows it controlling its own position.
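Here is a small sketch of the IMU-assisted feature search idea described above: predict where a tracked feature should reappear, then run the FAST detector only inside a window around that prediction. The prediction helper and the synthetic image are made up for illustration and stand in for the real camera/IMU model:

```python
import numpy as np
import cv2

def predict_pixel_motion(prev_px, imu_translation_px):
    """Hypothetical stand-in: a real version would integrate the IMU and re-project."""
    return prev_px[0] + imu_translation_px[0], prev_px[1] + imu_translation_px[1]

def search_window(image, predicted_px, half_size=24):
    """Crop the small region the feature must fall into (plus a margin)."""
    x, y = int(predicted_px[0]), int(predicted_px[1])
    x0, y0 = max(x - half_size, 0), max(y - half_size, 0)
    return image[y0:y0 + 2 * half_size, x0:x0 + 2 * half_size], (x0, y0)

# Synthetic frame with a single bright block whose corners FAST can find.
frame = np.zeros((480, 640), np.uint8)
cv2.rectangle(frame, (300, 200), (330, 230), 255, -1)

predicted = predict_pixel_motion(prev_px=(290, 195), imu_translation_px=(12, 8))
roi, offset = search_window(frame, predicted)

fast = cv2.FastFeatureDetector_create(threshold=20)
keypoints = fast.detect(roi, None)                 # FAST only runs on the small window
found = [(kp.pt[0] + offset[0], kp.pt[1] + offset[1]) for kp in keypoints]
print(f"{len(found)} candidate corners inside the predicted window")
```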
Is it global shutter? Yes, the cameras are global shutter, so we do not get image tearing when the vehicle is moving fast; the entire frame is exposed at once. And they're hardware-synchronized using a precision pulse from the Pixhawk unit, so we know exactly when the image was taken and we have a corresponding inertial measurement for it. The optics are just consumer optics; the lens is a normal wide-angle lens. These cameras are very popular in computer vision; they're standard global-shutter machine vision cameras that can be synchronized in hardware. Any cameras would work here, as long as they're global shutter and you can synchronize them externally, and a lens with more than 90 to 100 degrees of field of view is sufficient to run the algorithm. We need a wide field of view because we need to track as many features as possible, and the wider field of view allows us not to lose track during fast motions, especially during rotations. Indoors, in a situation like this, we need to expose for around 10 milliseconds, so that limits our maximum frame rate. Coming to GPU-accelerated stereo: this is the core of obstacle avoidance. As soon as the images are acquired, we push them from the CPU down to the GPU; that's the green part. On the GPU they're rectified using a pre-calibrated camera model, so we remove the distortion in the images, because these are wide-angle lenses. Then we perform a local matching cost calculation. The algorithm that goes from the local matching cost calculation to the disparity computation is called semi-global matching; it's a standard stereo matching algorithm that can be highly parallelized for a GPU, so it was a really good fit for our case. You can see more at the link down there; that's the paper describing the implementation of the algorithm on the GPU. Basically, we calculate the matching cost, aggregate it along four path directions, and merge them to get a smoothed cost, and at the end we get a disparity image on the GPU, which we copy back to the CPU and perform the rest of the obstacle avoidance there; the detection of obstacles and so on is done on the CPU. The time we have from capturing an image to reacting to obstacles in the environment is only around 30 milliseconds, so the pipeline needs to be really fast: we need to react to obstacles as soon as we see them, and there cannot be any lag between detection and avoidance. What's the precision? For the stereo matching, the precision depends on the depth: it gets worse as you go further out and is more accurate up close. This can be shown mathematically if you have the camera model and the distance between the cameras. So it depends on the distance from the camera, and closer objects are detected more accurately. We have centimeter-level precision from about one meter out to around three or four meters, and that changes with the camera baseline, of course. It also changes depending on the nature of the obstacles: if they're highly textured, it's easier to detect them and easier for the stereo matching to work. So it's not deterministic; it depends on a lot of factors, and they keep changing.
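To make the rectify, match, and disparity stage concrete, here is a sketch that uses OpenCV's CPU semi-global block matcher as a stand-in for the CUDA SGM implementation that runs on the Tegra; the synthetic images, matcher parameters, and calibration numbers are all made up:

```python
import numpy as np
import cv2

# Fabricate a textured "left" image and a "right" image shifted by a constant
# disparity of 16 px, as if looking at a flat, fronto-parallel surface.
rng = np.random.default_rng(1)
left = (rng.random((480, 640)) * 255).astype(np.uint8)
left = cv2.GaussianBlur(left, (5, 5), 0)
right = np.roll(left, -16, axis=1)

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,        # must be a multiple of 16
    blockSize=5,
    P1=8 * 5 * 5,             # smoothness penalties of the semi-global cost
    P2=32 * 5 * 5,
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2,
)

# The matcher returns fixed-point disparities scaled by 16.
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

# Depth then follows from disparity with the calibrated focal length and
# baseline (hypothetical values), as in the earlier triangulation sketch.
focal_px, baseline_m = 420.0, 0.12
valid = disparity > 0
depth_m = np.zeros_like(disparity)
depth_m[valid] = focal_px * baseline_m / disparity[valid]
print(f"median disparity: {np.median(disparity[valid]):.1f} px")
```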
You can stand around a meter from it; we've set the threshold to one meter, so it will avoid obstacles up to one meter away from it and won't get closer than that. We maintain that threshold because stereo matching also fails when something is too close to the camera, so we do not want to get too close. Basically, we project a buffer in front of us and try to let nothing enter that buffer, and if something does enter the buffer, we back off. Ultrasound? Yes, there are several methods for obstacle avoidance, and we could have used ultrasound, but it only senses in one direction and the resolution is really bad. Furthermore, we have the motors spinning directly next to it, and the air currents mess with the ultrasound, so it gets really messy, and the range is really limited: the ultrasound would only work to around four meters maximum. An ultrasound sensor also has a fixed beam whose direction we cannot change, so it's not really useful for real-time, fast obstacle avoidance. How do we ensure that the calibration stays the same? A very good question. Well, we don't; we try to estimate changes in the calibration as we are flying, so the system self-calibrates in flight as well. And it's very important for the stereo mount to be well designed, because we need to maintain that fixed separation between the cameras, the rotation and the translation; that's something we depend on to do the stereo matching. You can see here that it's built on a carbon fiber rod and fixed pretty rigidly. We don't want that to change, but small changes are estimated over time in flight: we estimate the extrinsics in flight. This is our reactive avoidance system. Traditional approaches to obstacle avoidance usually involve capturing an image from the camera, computing a depth map from it, pushing that depth map into a global map, and then performing obstacle avoidance on the global map. That's what we did on the Intel system, and while it works well if you have a lot of CPU, it's also slow; it's not real-time, it's not reactive, and we are limited on resources on the Tegra platform. So our new algorithm operates directly in disparity space: it takes the depth image, detects obstacles in it directly, and then performs the reactive maneuvers. We segment obstacles using a method called U-V disparity segmentation; you can see the paper there as well. It's a standard method in autonomous driving; it's an older method, but it still works really well, and it's really fast, which is important for us. And since it acts directly on the disparity image, we can cut out that extra mapping step: we do not need to create a global map, we just maintain a small local map around the vehicle so that we have the required environmental perception. It's much faster, and it can react to sudden changes, because we are not creating a big map that we need to update and insert into. Thanks to this, it can avoid dynamic obstacles, as is required in real-world situations. Yes; as I said, we only maintain a local map of around a five-meter radius around the vehicle. So if we have observed something and it's within five meters, we can keep track of obstacles behind us. Well, not track them exactly, but we know that there's something behind us: if we have observed it, it is saved, though only within about five meters of the vehicle.
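A minimal sketch of the U-disparity idea behind that segmentation: a vertical obstacle occupies many rows of the same image column at nearly the same disparity, so it shows up as a strong bin in a per-column disparity histogram, while the sloped ground spreads over many disparities. The disparity map and thresholds below are fabricated for illustration:

```python
import numpy as np

H, W, D_MAX = 240, 320, 64

disparity = np.zeros((H, W), np.int32)
disparity[:, :] = np.linspace(4, 20, H).astype(np.int32)[:, None]   # sloped ground plane
disparity[60:180, 140:170] = 40                                      # a nearby vertical obstacle

# U-disparity: for every image column, a histogram of the disparities in it.
u_disp = np.zeros((D_MAX, W), np.int32)
for u in range(W):
    u_disp[:, u] = np.bincount(disparity[:, u], minlength=D_MAX)[:D_MAX]

MIN_PIXELS = 30                      # a column needs this many pixels at one disparity
obstacle_mask = u_disp > MIN_PIXELS
near_rows = obstacle_mask[26:, :]    # keep only large disparities, i.e. close to the vehicle
cols = np.unique(np.nonzero(near_rows)[1])
print(f"obstacle candidate columns: {cols.min()}..{cols.max()}")
```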
In a global mapping approach, you maintain an entire map; you map whatever you see. In this case, we just maintain a smaller local map that we can use for avoidance. Global planning, now, obviously involves the use of a global map, and this is something we are working on currently. It's not very easy to run a fully-fledged mapping approach on the Tegra because of the CPU limitations, obviously, so we need to port things over to the GPU as far as possible; it's something we're currently working on. The standard for global mapping these days, in the robotics community and academia in general, is octrees, which are a probabilistic free-space representation; that image is actually an OctoMap, a map composed of octrees. But we can't build that in real time on the Tegra, so we need a better approach. I'm currently working on this; it's a work in progress, active research, and hopefully we'll have something out within a few months. Once we have a global map, we can plan directly in that map, and that's really useful for things like: okay, I've flown around this room, now I'm running out of battery, and I need to get back to the original position without bumping into things. The trajectory needs to be optimal as well, because I want to save battery; it should be the trajectory with the least cost. We cannot do that with the local map, obviously, so we need a global environmental representation for it. As far as the planning step goes, on the previous system we used RRTs, rapidly-exploring random trees. An RRT expands a tree in the direction of our goal while trying to minimize the cost, and we get a more or less optimal path depending on the number of iterations. RRTs are really well suited to implementation on the GPU; they are really, really good for parallelizing, which is exactly what GPUs excel at, so we'll be working on that as well once we have a proper mapping approach up. We also used to do autonomous exploration on the previous-generation vehicle, where you could just tell it to explore and it would map out the room on its own. Again, we can't do this yet, because we can't build a global map yet; that's something we're working on, and hopefully we'll have some good results soon.
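For reference, here is a minimal, purely illustrative CPU sketch of the basic RRT loop, expanding a tree toward random samples while rejecting edges that hit an obstacle; it is not the vehicle's planner or the planned GPU port, and the world, obstacle, and parameters are made up:

```python
import numpy as np

rng = np.random.default_rng(2)

START, GOAL = np.array([0.0, 0.0]), np.array([9.0, 9.0])
OBSTACLE_C, OBSTACLE_R = np.array([5.0, 5.0]), 1.5
STEP, GOAL_TOL, MAX_ITERS = 0.5, 0.5, 5000

def collides(a, b, n=10):
    """Check a few points along the segment a->b against the circular obstacle."""
    for t in np.linspace(0.0, 1.0, n):
        if np.linalg.norm(a + t * (b - a) - OBSTACLE_C) < OBSTACLE_R:
            return True
    return False

nodes, parents = [START], {0: None}
for _ in range(MAX_ITERS):
    sample = GOAL if rng.random() < 0.1 else rng.uniform(0.0, 10.0, 2)   # goal bias
    nearest = min(range(len(nodes)), key=lambda i: np.linalg.norm(nodes[i] - sample))
    direction = sample - nodes[nearest]
    new = nodes[nearest] + STEP * direction / (np.linalg.norm(direction) + 1e-9)
    if collides(nodes[nearest], new):
        continue
    nodes.append(new)
    parents[len(nodes) - 1] = nearest
    if np.linalg.norm(new - GOAL) < GOAL_TOL:
        # Walk back up the tree to recover the path.
        path, i = [], len(nodes) - 1
        while i is not None:
            path.append(nodes[i])
            i = parents[i]
        print(f"path found with {len(path)} waypoints")
        break
```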
Okay, coming to control. We designed a non-linear controller for the vehicle, which is not standard. The standard controllers are proportional-integral-derivative controllers, which have been in use forever; that's a standard feedback loop. In contrast, our controllers do the trajectory tracking in a model-predictive fashion, so they can react to things like changing winds and failing motors, as far as possible. The way it works is that the controller receives a high-level path from the Tegra, the companion computer, and then, using the obstacle information in the local map, it deforms these paths; the deformed path is the obstacle-free path. The vehicle executes those maneuvers, so we don't bump into things and we can keep moving forward. Thus the system can react to sudden changes without deviating from the original goal; it tries to achieve the original goal as far as possible. Even though the resulting trajectory might not be optimal, since the navigator module has no planning step in it, it will still not bump into things. Finally, the operator interface is really simple. I can use a single tablet to fly the vehicle, and it will be safe: I don't need to control it manually. I can just press a button and have it take off, and it will avoid obstacles; even if I'm controlling it manually, it still won't bump into things. As you can see there, the forward view of the vehicle is visualized as four quadrants, and as obstacles come into view, the quadrants light up. If you get too close, the vehicle will brake and stop, or, if it is above the speed threshold where it can no longer brake in time, it will apply path corrections so that it doesn't hit the obstacle but passes by it instead. The decision to stop or to keep moving in a different direction depends on the vehicle's current forward velocity. Coming to our software framework, we use a high-level/low-level split. The critical flight tasks, like actuator control and attitude estimation, need to run at a much higher rate, around 200 to 400 hertz being the norm, and we do those on an RTOS, NuttX, running on the embedded controller, the Pixhawk; we run the PX4 flight stack on top of NuttX. The higher-level tasks like mapping and planning, which run slowly compared to the vehicle dynamics, we do on the Jetson Tegra X1 companion computer. The Tegra X1 runs ROS, the Robot Operating System, which is not really an operating system but a collection of libraries and tools widely used in the robotics community for things like obstacle avoidance. It provides a really nice IPC interface for inter-process data transfer, and we take advantage of that because we don't want to reinvent the wheel everywhere. And this is our final navigation pipeline, from image capture all the way to reaction. The pipeline runs at 30 hertz overall because that's the frame rate of our cameras; we are using USB 2.0 cameras, which can only run at around 30 hertz when externally triggered. We could go faster, around 60 hertz or so given the limits of the rest of the system, but that would require better cameras, and better cameras are expensive and we don't have a lot of funding. The GPS receiver feeds directly into the yellow box here, which is the embedded controller, and the one on top is the Tegra X1; I should have labeled them, sorry. The images feed into a frame synchronizer, which takes the inertial measurement and the two image frames and synchronizes them, so we have a tuple of three types of data that is then used for state estimation. We have the image undistortion step, which removes the wide-angle distortion from the image, and the rectification step, which makes straight lines straight; that runs on the Tegra and feeds into the rest of the pipeline, which is stereo matching and state estimation. These processes operate in parallel: stereo matching is done completely on the GPU, while state estimation is only partially accelerated. Both of those feed into the local map, and that is fed to the local planner, which deforms the paths to get a smooth, collision-free trajectory. That trajectory is fed back to the embedded controller, which runs a trajectory controller that drives the vehicle's motors and actuators, and we can fly.
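As a sketch of the frame-synchronizer idea in the pipeline above, here is what a software-side version could look like as a ROS node using message_filters; the topic names are hypothetical, and on the real vehicle the hardware trigger already lines the stamps up:

```python
#!/usr/bin/env python
import rospy
import message_filters
from sensor_msgs.msg import Image, Imu

# Bundle the left image, right image and IMU sample that share (nearly) the
# same timestamp into one tuple for the odometry front end.

def synced_callback(left_img, right_img, imu):
    rospy.loginfo("synced tuple at t=%.6f", left_img.header.stamp.to_sec())
    # ...hand the (left, right, imu) tuple to stereo matching / state estimation

def main():
    rospy.init_node("frame_synchronizer_sketch")
    left = message_filters.Subscriber("/stereo/left/image_raw", Image)
    right = message_filters.Subscriber("/stereo/right/image_raw", Image)
    imu = message_filters.Subscriber("/imu/data", Imu)
    sync = message_filters.ApproximateTimeSynchronizer(
        [left, right, imu], queue_size=10, slop=0.005)   # 5 ms tolerance
    sync.registerCallback(synced_callback)
    rospy.spin()

if __name__ == "__main__":
    main()
```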
I have a small video after this that shows how well it hovers in an urban canyon, a situation where multipath interference makes it really hard to fly otherwise. Okay, this is autonomous; I'm not controlling it anymore. You can see it's hovering stably, using both vision and GNSS/GPS to track its position, and this multi-sensor fusion allows us to fly in a situation where it would have been impossible otherwise. Over the long term the GPS is also used, because we are correcting the vision drift here: with pure vision it would have drifted slightly, but the GPS corrections allow it to stay in the same position over a long time. Thank you, that's it. If you have any questions, please. Yes, please. Yes, I was actually trying to fly badly on purpose there, to show the accuracy of the VIO, the visual inertial odometry; the autopilot flies better than I do. Yes, we can actually fly at night, but not using infrared. We have a setup with synchronized computer-vision strobes, really powerful LEDs actually, which are synchronized with the camera shutter, so you can fly inside things like boiler plants. That's something we're working on currently, and that's in complete darkness: you're flying inside a boiler, we track the features on the walls of the boiler, the strobes keep them lit, and it does automatic exposure control on the fly. Yes; because there is some uncertainty in the estimate of each feature's position, that uncertainty feeds into the vehicle's position estimate as well. We cannot always know exactly where a feature is; there is some uncertainty, and that uncertainty slowly builds up, which means that if I were hovering using vision only for, say, half an hour, I might have something like five centimeters of drift. Yes, we do know how far a feature is, but then the factor he mentioned comes into play: as it gets further away, the uncertainty increases, and we have to track that uncertainty as well so that it does not ruin our estimate. That uncertainty adds up in the end, which is why we use the GPS to correct for it, because GPS is an absolute position estimate; it's not very good in all situations, but it is a global estimate. Yes, we can improve that with better optics, better cameras, and better algorithms, because the feature tracking is not deterministic at all; it depends on various characteristics of the feature, like contrast and cornerness. So we need to track that uncertainty as well, because there is a lot of uncertainty in the feature characteristics, the bearing vector, and so on; we use all the features together, and we get a robust final estimate. Yes, I actually wanted to show that, but I think we're out of time; I can show it to you afterwards, I had a demo ready, although it won't fly. Yes; the really good thing about the Tegra X1 platform is that we can use shared memory: it's all the same physical memory, so we can share it between the CPU and the GPU. There are several cases where shared-memory performance is actually worse than copying onto the GPU; it depends on the kind of kernel you're using and on the ordering of data in memory. But in most cases, what we've seen is that using shared memory eliminates copying completely.
And where we do need to copy, we only need to hit around 30 hertz, so within that timeframe the copy overhead does not really come into play. Particle filtering? This visual inertial odometry approach is quite standard in academia, and because we're tracking each feature as a filter state, we are pretty locked into the Kalman filter paradigm. We also need to track the uncertainty of each feature and of the inertial measurements, and that is exactly what a Kalman filter is made for. In any case, the filter iterations are done on the CPU and they are not really computationally expensive, so there is no point accelerating the filter update on the GPU. Using a large set of particles, that would move toward simultaneous localization and mapping, SLAM, and SLAM is not something we are interested in at this moment, because SLAM again involves building a global map, and odometry is much faster. In real-world situations, all you really need is odometry. That's true, yes; we could of course use a SLAM approach here, but for real-world use cases we can only fly this for 20 minutes maximum, so how big a map can we make? And I'm targeting use cases like bridge inspection, where we're flying under a metal structure without good GPS, but there is a really nice wall or pillar that the vision system can look at and localize itself against. In that case, odometry makes a lot more sense for us. The synchronization between the camera and the inertial measurement unit, right? How it works is that there is a GPIO output on the Pixhawk, and this output is connected to both cameras, which have opto-isolated trigger inputs; the same signal goes to both. When we want to capture a new image, we trigger acquisition using the GPIO, so we know both images were exposed at exactly the same time. We also know the timestamp of the inertial measurement corresponding to the trigger, and when the image comes back to us, we just timestamp it with that inertial timestamp. So all three have the same timestamp against the same clock, and we exploit this for visual inertial odometry.
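A small sketch of that trigger-based timestamping: record the trigger time when the GPIO pulse is fired, then stamp the frame with that time when it arrives later over USB. The class name and the simple queue discipline are assumptions for illustration:

```python
from collections import deque

class TriggerTimestamper:
    """Pair each hardware-triggered frame with the time its trigger pulse was sent."""

    def __init__(self):
        self._pending = deque()          # trigger times still waiting for their frames

    def on_trigger(self, trigger_time_s):
        """Called when the GPIO pulse is sent to both cameras."""
        self._pending.append(trigger_time_s)

    def on_frame(self, frame):
        """Called when a frame arrives; assumes frames come back in trigger order."""
        stamp = self._pending.popleft()
        return stamp, frame              # the IMU sample at `stamp` pairs with this frame

stamper = TriggerTimestamper()
stamper.on_trigger(12.000)                       # pulse fired at t = 12.000 s
stamp, frame = stamper.on_frame(frame=b"...")    # USB transfer finishes some time later
print(f"frame stamped with trigger time {stamp:.3f} s")
```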
On exposure: both sensors run in manual exposure and in sync, but we actually do automatic corrections for exposure; there is an algorithm running on the GPU that analyzes the image and adjusts the exposure. So we can fly from indoors to outdoors: there will be a sudden jump in the exposure, and then it stabilizes again. We perform the acquisition at a fixed 30 hertz, so there is an upper limit on the exposure time, set by the frame period, and we can go lower than that; we can go as low as around 0.0009 seconds or so, limited by the camera. So the frame rate is fixed, but the shutter time can go shorter than the frame period. Obstacle avoidance on birds flying towards us? Okay, that's really interesting. I haven't tried it, but it would probably work given enough time, because birds are moving around and the system needs some time to react; given enough suicidal birds, we'd find out. I think we're out of time. Thank you, everybody. You can meet me for questions afterwards if you want to, and I can also give you a small demo if you'd like.