Hi, I'm Saad Nasser. I'm a co-founder of Ati Motors. We are creating an autonomous cargo vehicle designed from the ground up for computer control. It's a three-wheel vehicle which carries a hundred kilograms of cargo. The two front wheels are at the front of the vehicle, and the rear wheel has an interesting arrangement where it can turn up to 90 degrees, so the vehicle can turn in place. This is a vehicle designed specifically for machine control, and we use machine learning and deep learning heavily in our autonomy stack; this talk will cover where we use it. For example, we use CNNs for vision, and we use reinforcement learning for driving policy. The talk will also go over the various challenges we faced around data collection, how we use simulators and transfer learning, and the engineering challenges of actually getting this whole thing to run on the vehicle.

There are a few definitions of what autonomy is; in our case it's getting from point A to point B in some environment, like a large private campus. To navigate such environments, firstly you need to know where you are and how to get to your destination: that is, you need localization and a map. Localization is crucial because if you don't know where you are, you can't really get anywhere, and you need a map because that tells you how to get there. Once you've decided what you're going to do on this map, like going down a road or taking a turn, you need to actually perceive the obstacles around you. You need to figure out: this is my vehicle, there's the road I have to stay on, and there are these various obstacles I should not hit. After you've perceived the environment around you, you have to have some policy for what you're going to do.
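As a caricature of the map layer just described: once you have a map and know where you are on it, getting from point A to point B is a graph search. A minimal sketch using breadth-first search over a toy occupancy grid follows; the grid, cell coordinates, and function name are illustrative assumptions, not part of the actual stack described in the talk.

```python
from collections import deque

def shortest_path(grid, start, goal):
    """BFS over a 2D occupancy grid: 0 = free cell, 1 = obstacle.
    Returns the list of (row, col) cells from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    came_from = {start: None}          # also doubles as the visited set
    while queue:
        cell = queue.popleft()
        if cell == goal:               # walk parent links back to start
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and (nr, nc) not in came_from:
                came_from[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

grid = [
    [0, 0, 0],
    [1, 1, 0],   # a wall forces a detour around the right side
    [0, 0, 0],
]
print(shortest_path(grid, (0, 0), (2, 0)))
```

A real campus map is of course a road graph rather than a grid, but the planning step has the same shape: search the map, then hand the chosen route to perception and policy.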
For example, what do you do at a stop light? That is a perfect example of policy: you need to decide when you are going to stop. What do you do with the amber light? You may do different things based on whether you've already crossed the line. And after your policy gives you some action, there's a controller at the bottom that actually controls the vehicle.

Now, there are a lot of existing technologies for vision. Listed here are some very popular networks and benchmarks, like ImageNet-trained classifiers and YOLO, and these are good at taking in an image and outputting segmentation maps, bounding boxes, and various other outputs. There have been a lot of attempts to take one of these networks as-is and just use it, but that doesn't work, because many of these networks are not really designed for autonomous driving. ImageNet, for example, has a lot of classes we don't care about: if you detect a cat, we probably don't care what type of cat it is; what matters is that it's a cat, or more generally that it's an animal. On top of this we also use non-deep-learning methods. For example, if you're on a highway and you want to detect lanes, Hough transforms are still the easiest way to figure out where the road area is. For our own vision we are building our own approach using a mix of all of these networks, and we are now evaluating how they will fit into our autonomy stack. One of the most important things for an autonomous vehicle is to figure out where the road is; if you don't know where the road is, you can't drive. Some of you may have heard of segmentation. Segmentation is usually done using an autoencoder-based network: you have a network that takes an image and outputs something like this.
Green, for example, is the road, and you can see it's done a pretty good job. This is an image taken outside our office. The network did a good job of estimating the complete road area. It didn't do well in certain spots: near the bottom right of the image it hasn't realized that some road is actually road, and there are a few other mistakes. Segmentation is important because a lot of networks will happily give you a bounding box, but for things like roads there isn't really any bounding box. There's no lane line per se on this road, and there's a scooter, which is definitely not drivable area, so if you just drew two lines and connected them, that wouldn't work.

We also need to detect obstacles. This is YOLO running on the same image. You can see these networks sometimes make errors; look at the purple box. These networks usually do need further training. This particular network, if you show it an image of a car, may sometimes detect the people inside the car. That is impressive, but we probably don't care much about things inside the car, so we need to train the network so that these cases are handled. You encounter a lot of errors when you apply these off-the-shelf networks to our specific problems. We will fit them to our needs, but you can't start from scratch, because these datasets are huge and training takes a long time. Another interesting thing: here you get 2D bounding boxes, but ultimately you want to drive a vehicle, so you really need to know what the obstacles look like in 3D space. You need to figure out that this car has depth; it occupies some volume in 3D space. Finally, once you put all these things together, you get a map of the area around you, and once you have it, you need to apply policy. So for example, take this intersection.
If you think about all the possible things you can do, there are a few hundred options. But not all of them are legal, and in some cases even if something is legal you may not want to do it. So you need to decide what to do. For example, if you want to take a U-turn you will go into the turning lane, or if you want to take a right turn you will stick to the right lane. This is the sort of thing reinforcement learning is very good at, because in a lot of cases making errors is not really dangerous; it may just mean you slow down, or things like that. The problem with reinforcement learning is that even very simple things, like stopping at a red light, need to be trained, and before training converges the policy will sometimes just run a red light at random. So you need some way of generating scenarios for reinforcement learning, and in general for any other vision work. If we go out and collect a lot of data, we will get a lot of data, but we won't get data about edge cases. If you wanted a very specific type of dangerous event, a crash or something, you can't really reproduce it or get data on it. For that we have developed our own simulator. Here in the simulator you can see our vehicle: you can run a script which provides inputs to the vehicle, drive it around, even add other vehicles, and test various edge-case scenarios. This is very important because the difficult part of driving is not the generic scenarios but the very specific scenarios that you rarely encounter, like another driver behaving erratically, or crashes, which you cannot create even in a controlled environment. Making errors in a simulator is a lot less costly, and simulators can be parallelized like any other program.
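To make the red-light policy example concrete, here is a toy tabular Q-learning sketch in a three-state "signal world". The states, reward numbers, and hyperparameters are all invented for illustration; a real driving policy would be trained on far richer state, inside the simulator described above.

```python
import random

random.seed(0)

ACTIONS = ("go", "stop")
REWARD = {  # one-step reward model for the toy signal world (assumed numbers)
    ("green", "go"): 1.0,   ("green", "stop"): -1.0,
    ("amber", "go"): -5.0,  ("amber", "stop"): 0.0,
    ("red",   "go"): -100.0, ("red",  "stop"): 0.0,
}

Q = {key: 0.0 for key in REWARD}       # action-value table, all zeros at start
alpha, epsilon = 0.1, 0.2              # learning rate, exploration rate

for _ in range(5000):
    state = random.choice(("green", "amber", "red"))
    if random.random() < epsilon:      # explore: occasionally run the light!
        action = random.choice(ACTIONS)
    else:                              # exploit: pick current best action
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    r = REWARD[(state, action)]
    Q[(state, action)] += alpha * (r - Q[(state, action)])  # bandit-style update

policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)])
          for s in ("green", "amber", "red")}
print(policy)
```

Note how, early in training, the agent really does run red lights until the -100 reward pushes `Q[("red", "go")]` down; that is exactly why exploration belongs in a simulator rather than on a road.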
We faced quite a few challenges actually putting all of this on the vehicle, because it all has to run on the vehicle. On the vehicle you first have basic hardware issues: you have to synchronize all your sensors. We have a lot of sensors, a LiDAR, cameras, radars, and other sensors like IMUs, and you need to sync all of them, because timing is very important for driving; for example, you can't predict how fast a person is moving unless you have an accurate time base to calculate that. Another interesting fact some of you may have heard of: most cameras are rolling shutter, so if I take a photo, the topmost line of the sensor is captured at time T, and the lines successively below that are captured at later times. If you are moving and you take a photo, you get really bad distortion, because the image is not one instant in time; it's a rolling capture. So you need a global shutter, for example, if you want to run CNNs and other things on a moving vehicle. Another important thing is mounting. Once you have a camera, you want to calculate where a person is, and one way to do this is with heuristics: for example, you know a person is roughly so wide, you know where your camera is mounted, and then you can calculate distance using various approaches, also because you have multiple cameras.
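The person-width heuristic at the end of that passage is just the pinhole camera model: distance is focal length times real width over observed width. A minimal sketch, where the focal length and the assumed shoulder width are made-up illustrative numbers:

```python
def distance_from_width(f_px, known_width_m, observed_width_px):
    """Pinhole-camera heuristic: Z = f * W / w, with the focal length f
    in pixels, the assumed real-world width W in metres, and the width w
    the object appears to have in the image, in pixels."""
    return f_px * known_width_m / observed_width_px

# Assumed numbers: 800 px focal length, ~0.5 m shoulder width,
# person appears 100 px wide in the frame.
print(distance_from_width(800, 0.5, 100))  # 4.0 metres
```

The same relation also shows why calibration and mounting accuracy matter: an error in the effective focal length or width prior scales directly into the distance estimate.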
But mounting is rather important, because you need to mount the camera very accurately: even slight variations in where it is mounted can result in large variations in your estimates, since small angular errors grow with distance. You also need to calibrate your cameras, because even two cameras of exactly the same model can differ significantly in their performance and characteristics. Another important thing is detection. Since we are a vehicle, we require real-time detection: 30 FPS is roughly 33 milliseconds, so you need to run your whole loop in 33 milliseconds. Most networks take a lot longer; ImageNet-scale network execution times can run into seconds on high-end GPUs like the ones we use in our servers. There are various ways people do detection at higher FPS. For example, one thing off-the-shelf networks often do not exploit is that you have temporal data: you don't have to treat each frame as a completely independent image, because it's a video. If something was a cat in the previous frame, it's very likely to still be a cat in the next frame. So you can separate detection from recognition: you detect moving boxes, and then simply re-recognize objects only when they first come in. This way you can get significantly higher performance. Another thing is GPUs. GPUs are usually better for training, because they have really long pipelines and are basically designed for throughput. The main purpose of a GPU is to render graphics, and that is more a throughput game than a latency game, because you have to push millions of pixels to the screen; there is of course a latency constraint, but usually it's soft.
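One way to sketch the temporal-reuse trick mentioned above: carry class labels across frames when bounding boxes overlap strongly, and only run the expensive classifier on boxes that match nothing. The boxes, labels, and 0.5 overlap threshold here are illustrative assumptions, not the actual pipeline.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def carry_labels(prev, current, thresh=0.5):
    """Reuse last frame's class labels for boxes that overlap strongly;
    boxes labelled None would go to the full (slow) classifier."""
    labelled = []
    for box in current:
        best_label, best_iou = None, 0.0
        for pbox, plabel in prev:
            o = iou(pbox, box)
            if o > best_iou:
                best_label, best_iou = plabel, o
        labelled.append((box, best_label if best_iou >= thresh else None))
    return labelled

prev = [((10, 10, 50, 50), "cat")]                  # labelled last frame
curr = [(12, 11, 52, 51), (200, 200, 240, 240)]     # boxes detected this frame
print(carry_labels(prev, curr))
```

The first box moved only a couple of pixels, so the "cat" label is carried over; the second box is new and must be classified from scratch.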
For example, nothing really bad happens when a frame is skipped. There are other approaches to running neural networks in production; Google made the TPU for this exact reason. The TPU is a specialized unit for deep networks: it's essentially a matrix multiplier, you load your weights and it multiplies things, and it has a rather short pipeline, so latency-wise it's a lot better than a comparable GPU. Another constraint is that the vehicle simply has limited compute. Because we are a small vehicle, we have power constraints, most importantly, as well as cost constraints. For example, we have a Titan X in our training machine, and that can run even rather complicated networks really fast, but we will need a much smaller GPU on the vehicle, for both power and cost reasons. There are a few things you have to optimize when you move to a smaller machine: basics like memory become scarcer, and the more complicated issue is that we actually have six cameras, so you need to merge them together and process them in real time. You also quickly realize that data transfer is a rather large part of the problem; you spend a lot of time just moving data, because uncompressed frames are pretty large, and just on our first prototype we will be collecting 150 gigabytes an hour using only two cameras. This also leads to various engineering challenges offline in our data center. For example, you will require petabyte-scale storage and infrastructure, because at 150 GB an hour you can easily run into petabytes of data over a reasonable time period. Another important capability you need is training across multiple GPUs and machines. Most of the stock libraries you get are focused on training on one GPU, or multiple GPUs inside one machine, but at our scale you can't really use just one GPU or one machine, because
you need a lot of GPUs and a lot of machines, and there are ways to push data through multiple GPUs and multiple machines. We are going to collect over 10,000 kilometers of Indian roads to supplement our training with other datasets. The important thing is that you have to supplement your training, because if you train on just one dataset your network will be really tuned to that particular area, and it will not generalize well enough that you can put it on some other road and have it work. One thing we found early on is that you can find a lot of driving data very easily, on YouTube for example, but there are a few issues with that kind of dataset. One important one is quality: there are so many different cameras capturing data that each video is almost a completely different setup. Another is that even if you get that data, it may be noisy, so you need a lot more of it to reach the same quality of network you would get with high-quality data. And you also need to make sure you get edge cases, because that's where your network will usually fail. We also have a harder problem with our specific use of deep networks: what's bad for us is failing to detect an obstacle. So your network also needs to be self-aware, in the sense that if it thinks a particular scenario is too confusing, something it hasn't been trained for, then instead of just reporting nothing it should say, "I don't think this scenario is safe given the training I have," and be able to transfer control to someone else. Once you put all of this together, you get a robust autonomy stack. Thank you.

We'd like to open the floor to questions. If anyone has one, please raise your hand and we'll bring you a microphone.

Hi, one question I
have is specifically about this network. While driving, in the training set you'll find a lot of cases where there are shadows and an obstacle is not clearly distinguishable. How are you trying to tackle those kinds of problems?

The important thing is that your dataset should be really diverse: you should have a lot of cases where you have labeled objects that are clearly in shadow. The problem exists even for the simpler computer-vision-based road detection methods, which often get very confused when they see a shadow. So it's important that your data includes scenarios like shadow, and even rain and other edge cases.

Hello, you mentioned LiDAR, so are you planning on implementing that?

We actually use a LiDAR right now.

You need diverse data to make your model robust, right? But what about cases where, say, a new model of car comes out, or any vehicle or anything your model has not seen before? How do you detect those kinds of cases?
There are two things in that case. One, because most cars are extremely similar, the model should be robust enough that it recognizes this as a car even though it's slightly different from most of the cars in the dataset.

For example, Google has this car that does mapping of streets, and it has this weird-looking device on top of it; a normal car won't look like that. Can't certain cases like that arise? So my question is basically whether the model can be robust just by looking at the other cars already available in the dataset.

Yes, and that device is actually a LiDAR. There are two things in that particular case: one, your model should be robust enough that it still detects it as a car; and if not, you also have multiple sensors for exactly that reason, so for example your LiDAR will still detect it as a car, or at least as some obstacle.

A question I had was: you've spoken a lot about detecting objects on the road. What have you worked on for your navigation algorithms? Presumably the car will be autonomously navigating from point A to point B, and it has to make some decisions along the way, so how do you train it to make those decisions, or is that more deterministically programmed?

You mean the lowest level, the control level?

I think I may have stepped in a couple of minutes late into your talk, so I may have missed that; if you've gone over it, I can ask you offline.

Okay, so the basic thing is there are a few layers. You have your map, and from the map you have a policy, and once you have decided what you want to do immediately, for example to stay in this lane, shift into the next lane, or take a right turn, then yes, there is a deterministic algorithm below that layer that actually drives the vehicle.

Hi, my question is: can you tell us if this is being manufactured for the Indian market or abroad? Because the
problems that you face, especially in controlling the driving, right? Driving behavior is very different in India and abroad, and the infrastructure as well, and existing autonomous vehicles depend a lot on very specific infrastructure, like lane markings, turn signs, stop signs.

Right. The thing is, erratic driving is actually not that bad for an autonomous vehicle, because usually it can predict behavior better, so that is usually not the problem. The problem is that because you don't have lane markings and other infrastructure, the vehicle can't really decide what to do legally. For example, take a simple case: sometimes you may have a policeman manning an intersection, so how do you train for that sort of scenario? For that reason we are not currently planning to operate on public roads, and we will also be selling overseas.

So you are first tackling the easier problem, that's what you are saying? Because I think just doing this on Indian roads with traffic is very hard, right?

Yeah, but the problem here is not the driving part of it; it's the perception issues. For example, you may have a stop sign manned by a policeman, or a lot of implicit conventions: often there is no lane, and people just stick to the right side and the left side.

Another question. Okay, so it looks like you are working on some Indian data as well, like detection and segmentation, since most public datasets are not Indian, right? Like KITTI and those autonomous-vehicle datasets. Annotating these datasets is very time-consuming, so do you put your efforts in that direction, or do you set that aside?

Yeah, we are collecting our own Indian dataset. Labeling is hard, but we use a lot of tricks; for example, you can use a really high-quality network to pre-tag the data and then have a human review it.

Hey man, I have a bunch of questions. First question: is your vehicle adaptable to all terrains? Because you said
it's a cargo vehicle, so I was wondering if it can be adapted to uneven roads, maybe some kind of slope or something.

Yeah, it was actually meant for such scenarios. It has very good off-road support, so it can handle very steep inclines.

That's awesome. Okay, and you mentioned the Titan X; I believe that's an extremely power-hungry computer system. How do you handle that?

We don't actually run that on the vehicle; it's only in our server.

Oh right. And is your vehicle fossil-fuel based or an electric vehicle?

It's an electric vehicle.

Okay, and one last question. You will be tackling a bunch of problems, so for all of those problems are you exclusively dependent on deep learning, or have you explored other options as well?

We actually use a mix of all options. Just in vision, for example, we have deep learning, we also have some classical computer-vision techniques, and various other methods as well, so it's a mix of those.

Thanks, man.

How does your learning transfer from the simulator to the real world? Do you only use it for policy, or do you use it for other things?

We can actually use it for both. The simulator has photorealistic graphics, so we can use it to train networks as well, and we also use it for driving policy. It's good for creating a lot of scenarios: we can generate scenarios automatically, including random scenarios, so you can possibly get a wider range of data than you can get in real life.

Hi there. There are multiple players in this market already, Google, Drive.ai, and many of them are coming up with their own stacks. What do you think is the unique selling point of your company? As a co-founder, what do you plan to do differently to be competitive in this space?

Our vehicle is designed from the ground up. We haven't taken an existing vehicle and made it autonomous, because that makes a driverless car a bit like a horseless carriage. If you think about it, after you add autonomy to a vehicle, it
completely changes everything. For example, our vehicle can do a zero turning radius, it's completely electric, and it can handle a lot of interesting terrain.

Okay, so two years down the line, what do you think will make you really successful compared to Google and others, let's say Uber and others in this space?

Mostly that we are a unique vehicle.

The inputs will be coming from multiple cameras, I believe, so how do you feed them to the network as a synchronized input, without a time lag, and make a decision?

There are a few things. We use hardware synchronization between all the cameras, so we make sure all the frames come in at exactly the same time. We have a few cameras, so most likely all the camera frames are fed in as one batch to the network, and the network does take some latency, but we have time budgeted for the update to happen.

Hey, what are the sensors and the computing hardware on the vehicle?

We have a LiDAR, a radar, a few cameras, and a lot of other smaller sensors, and our compute is mostly a GPU; on the vehicle right now it's a GPU.

I see an arm up in the back. Oh no, that's not me; there's a question behind. In conjunction with your simulator, don't you think it might be a good idea to look at photorealistic multiplayer games which are out there, let's say Forza Horizon or something like that, so you can get somewhat realistic human behavior?

Yeah, there are actually a lot of people who do that; for example, somebody wrote a deep neural network to drive a car in GTA.

Do you have a name for your vehicle yet?

I'm sorry?

Do you have a name for your vehicle yet?

Yeah, we internally call it the fuel.

Are there any other questions? Saad, thank you so much. I think the future is going to be very interesting.