Welcome to the SC4x Hangout. I'm here today with Lex. My name is Connor Mikowski. I'm the digital learning lead here at the Center for Transportation and Logistics, so a lot of the things that you'll see in the course, for example the different sandboxes that you're now seeing put into the courses, are things I'm currently working on. I'm here to introduce Lex. Lex is currently working with the AgeLab to help develop autonomous and semi-autonomous vehicles. He works in machine learning and artificial intelligence, he's one of the experts in the field, and he's going to be teaching a class here at MIT on that subject soon. A lot of what he focuses on is the interaction between people and machines, as well as how the machines interact with the road. So Lex, I'm just going to give you the floor here and let you get going.

Great. Thank you for having me here. I love talking about this stuff, and hopefully you'll love learning more about it. We're mostly using learning-based methods, machine learning methods that use data, to figure out how to create artificial intelligence systems that operate in the real world. For us, that's cars: we want to build autonomous and semi-autonomous cars that use data to learn how to perform better and better. Safety is the most important thing in driving; providing an enjoyable experience is secondary, but also extremely important. So I'd like to talk today about some of the machine learning methods we use, how they apply to autonomous driving, and then how they carry over to other domains, including supply chain. Really, every domain that has numbers and data.

But let's start with data. One thing that's essential to all machine learning, and a problem for all machine learning approaches, is large-scale data. Hopefully you're seeing the screen now: this is the set of vehicles we're collecting data from. If you're familiar with them, there are cars called the Tesla Model S and Model X. Tesla is a company that produces cars with some degree of autonomy. These cars are able to control themselves on the highway and, to a degree, even in urban environments. They steer, they take in video of the forward roadway, and they're able to control themselves.

And you're collecting data on this, correct?

Yes.

So how are you collecting that data?

We've instrumented the vehicles. We have 24 Tesla vehicles; these are $100,000-plus vehicles that we have grad students tear apart to instrument. We put cameras inside the car that record everything about the driver and everything about the external environment. Those are raw pixels that we're collecting and synchronizing, and all those pixels put together form a data set that's among the largest at MIT in terms of size.

How big is it?

It's now, I want to say, about a terabyte of video data, and that's compressed. Uncompressed, that's about 100 to 150 petabytes of data, which would cost multiple millions of dollars just to store. So that's a lot; that's truly big data. But it's useless as raw pixels. We have to have machine learning and computer vision algorithms that extract knowledge from that data. Every day, as shown here, that's 500-plus miles; we have cars driving all over, all this data coming in. We offload it from hard drives, put it on the cluster, and that forms the data set on which we train all our algorithms.
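To make "raw pixels as data" concrete, here is a minimal sketch of reading one camera frame as an array of numbers. This assumes the OpenCV library is installed, and the file name is hypothetical; it is only meant to show that a video is nothing more than grids of numeric pixel values.

```python
import cv2  # OpenCV; assumes `pip install opencv-python`

cap = cv2.VideoCapture("drive_recording.mp4")  # hypothetical clip from a car camera
ret, frame = cap.read()                        # grab a single frame
if ret:
    # frame is a height x width x 3 array of 8-bit color values
    print(frame.shape, frame.dtype)  # e.g. (720, 1280, 3) uint8
    print(frame[0, 0])               # the top-left pixel's blue, green, red values
cap.release()
```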
So I'd like to talk a little about the big vision of the subset of machine learning we've really focused on, which is called deep learning: using neural networks to achieve some of the things I think students have already been thinking about and learning about. Neural networks are a very special kind of machine learning. As I'll talk about, they're able to automatically remove some of the effort that humans otherwise have to put in up front to extract features from the data. Neural networks can deal with raw data a little better than other approaches; that's their unique quality. As opposed to bringing in experts in the field to figure out what the interesting features in the data are, neural networks are able to take just the raw data and extract useful features.

So people used to have to code these exact features into the system, and now the network learns the features and builds from that, and no specific coding is necessary?

That's right. For facial recognition, that means previously you had to encode the concept of a nose, ears, eyes, eyebrows, a mouth. Neural networks instead take in hundreds of thousands of features about the face, not encoded by human beings, that are useful for performing the face recognition task. In the same way, for any other kind of data, it's nice to be able to remove the human expert, to some degree, from that up-front step where the features need to be coded. The question is to what degree the human, quote-unquote, can be removed. In other words, how smart can we make these machine learning approaches? Are they limited? Are they limitless? So far we've been able to do very specific tasks with machine learning: everything from playing formalized games like chess, to medical diagnosis based on images, to interpreting external scenes and detecting pedestrians when driving, to extracting patterns in numerical data of all kinds, including supply chain applications. But are we able to take the next step and create a more general-purpose system that can figure out higher-order ideas in the data, insights in the data? So far it has taken grad students and brilliant people with PhDs to extract the true insights. That's the question a lot of folks working on machine learning and deep learning are looking at.

Shown here on the slide, on the left are the, quote-unquote, teachers, and on the right are the students. So who is teaching the machine how to interpret patterns in the data? The most successful approach, and the stuff you've been learning about so far, I think, is called supervised learning, where the human being is the teacher. Meaning that for the machine learning algorithm to operate successfully, you have to give it ground-truth data in the training process. You have to teach it what a cat looks like and what a dog looks like. For a classification task, every piece of data has to be matched with a human-attached label, which allows the machine to learn how to map the data to the label.

Do you do any of this in the AgeLab right now?

Yes. Everything that's successful in machine learning today primarily uses supervised learning. If you want to successfully detect pedestrians, which is what we're doing at the AgeLab with autonomous vehicles, detect pedestrians, detect lane lines, detect other vehicles in the forward roadway, you have to provide the system a lot of human-annotated labels of what a vehicle looks like, what a pedestrian looks like, what lanes look like.
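To make "learning a mapping from data to labels" concrete, here is a minimal supervised-learning sketch, assuming only numpy. The feature vectors and labels are invented toy numbers standing in for real images; the model is a simple logistic regression trained by gradient descent.

```python
import numpy as np

# Toy supervised learning: each row is a (made-up) 2-number description of an
# example, and each label is the ground truth a human annotator would provide.
X = np.array([[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.7]])  # features
y = np.array([1, 1, 0, 0])  # human-attached labels: 1 = "cat", 0 = "dog"

w = np.zeros(2)  # weights the learner adjusts during training
b = 0.0

for _ in range(1000):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # predicted probability of "cat"
    # Gradient of the cross-entropy loss with respect to w and b:
    w -= 0.1 * X.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)

# The trained mapping can now label data it has never seen.
new_example = np.array([0.85, 0.2])
print(1 / (1 + np.exp(-(new_example @ w + b))))  # close to 1, i.e. "cat"
```

The key supervised-learning ingredient is the `y` array: without those human-provided labels, the loop has nothing to compare its predictions against.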
So there's someone sitting at a computer, looking at this and identifying these things for you, which the machine can then learn from, correct?

Exactly. A lot of the investment from companies, and a lot of the work we're doing, is step one, collecting the data, and step two, investing hundreds of thousands to millions of dollars in the actual annotation process. A lot of human beings sit and click on vehicles, click on pedestrians, and so on. The sometimes unspoken, messy detail of machine learning is that currently our systems are not able to learn from data as efficiently as humans do. The efficiency of learning algorithms is orders of magnitude worse. For us as human beings, the amazing thing about our brain is that all we need is one or two examples of a particular idea or object, of the visual characteristics of an object, to learn the concept of that object. You only need to see a cat a few times to be able to generalize the concept of "cat" and detect it in future images. That's currently something machines can't do. They need thousands of examples of cats to figure out how to detect cats in future images.

So in the future you see this human input going down, as the machines become more able to identify these things and generalize from less and less data?

Yes, exactly. I see the future, and the excitement of the machine learning community, in letting the machine, the computer, the algorithm, do a large percentage of the work of figuring out what's going on in the image. This is where deep learning comes in. It's called representation learning: learning the fundamentals of the visual characteristics, the physics of the world underlying the image, without having a human encode that physics. The human only helps, in the same way parents help teach a baby basic concepts as it grows up. We see that as the role the human being should have in these kinds of systems. Otherwise it's too costly, financially and effort-wise, to involve the human in the annotation process. And that's where reinforcement learning and unsupervised learning really come in. These are different branches of machine learning where most of the work is done by the machine.

Just to give you a brief view of the world of machine learning and what you can do with it: there's a somewhat philosophical but also practical question of what kinds of things a machine learning system can take in and what kinds of outputs it can produce. Really, anything that can be converted to numbers, or vectors of numbers, or sequences of numbers, is something a neural network or a machine learning system can operate on, and it can produce, in the same way, numbers, vectors of numbers, and sequences of numbers. I'd like to give a few examples of what that looks like in various domains.
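As a small illustration of "anything that can be converted to numbers, vectors, or sequences of numbers," here is a sketch, assuming only numpy, with a made-up tiny image and a made-up word:

```python
import numpy as np

# An "image" is already numbers: a grid of pixel intensities.
image = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)  # tiny 4x4 image
image_vector = image.flatten() / 255.0  # one vector of 16 numbers in [0, 1]

# Text becomes a sequence of numbers: one integer per character
# (this simple scheme only handles lowercase letters).
word = "cat"
codes = [ord(c) - ord("a") for c in word]  # [2, 0, 19]

# ...or a sequence of one-hot vectors over a 26-letter alphabet.
one_hot = np.zeros((len(codes), 26))
one_hot[np.arange(len(codes)), codes] = 1.0

print(image_vector.shape, one_hot.shape)  # (16,) and (3, 26)
```

Once data is in this numeric form, the same machinery applies whether the source was pixels, characters, or shipment records.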
But first, a quick comment about the deep learning methodology and what it is. It's inspired by artificial neural networks, which are themselves inspired by real biological neural networks. Our human brain has about one hundred trillion to one thousand trillion synapses, the connections between neurons, and our state-of-the-art neural networks currently have about a thousand to ten thousand times fewer connections than that. But the idea is the same. The basic computational unit of a neural network is modeled loosely on, I should say inspired by, our biological neuron. The computational unit is beautiful in the sense that it's really simple. The individual unit is extremely simple, but when we combine units together, the result can be arbitrarily powerful.

Hey, do you mind hiding the box at the bottom of your screen there, so the students can see the full slide?

Sure. The full slide. Yeah, absolutely.

Thank you very much.

So this is inspired by actual neural networks. And what ends up happening is it's choosing whether or not to fire some signal. Is that the idea here, just like a brain?

Exactly. The idea, as you see here in the little video, is that a neuron has a set of inputs and a single output. It assigns weights to those inputs, sums the weighted inputs together, and puts the result through a nonlinear function. That nonlinearity is where the magic happens. The weights are the things the neuron adjusts, and the task for the neuron is to learn to get excited by certain concepts and not excited by certain other concepts: to disambiguate, to separate concepts. So, sorry, I keep going back to cats and dogs, it's the easiest example, but you can think of a neuron whose task is to determine the difference between a cat and a dog. That neuron assigns weights to its inputs so that it produces a very high output signal for features associated with cats and a very low output signal for features associated with dogs. When neurons are combined together, you can start to learn arbitrary concepts in this way.

And so it's learning how to fire those synapses, what to look for in the image, what to look for in the system. One of our students is asking: do you mean this is completely untrained and learned, in this sense, or what's happening here?

The internals are not human-supervised. This adjusting of individual weights is not done by human beings. That's what I mean by representation learning: the internals of the network figuring out which features are important for classifying a dog or a cat is done by the machine. It's automatically optimized; the weights are adjusted. The high-level concepts are where the human steps in and provides examples, for instance images of different objects, and the network figures out the weights that tell those examples apart. That's representation learning.

So there's something inside this network that's deciding whether to activate a unit, to turn it on, and you're not programming that in. It's saying: if this output was correct, make this unit more activated, and if it was incorrect, make it less activated. Less of a neuron firing, so to speak.

Yes. And as an example, and again I apologize for all the pictures of cats, they're the most popular subject in computer vision for these examples, here is how a neural network learns. First, there's a forward pass: we provide an image, and the network knows nothing about it; it's just a set of raw pixels.
The network is tasked with classifying cat versus dog. In the very beginning, the network knows nothing, and based on the weights it currently has, which are random, it produces a classification, say, "cat." Then there's the backpropagation step. This is how the network learns: the backward pass through the network. This is where we need the human, to say whether the classification of "cat" is correct or not. If it's correct, there's a positive backpropagation signal: all the weights that contributed to saying this image is indeed a cat are increased, and all the weights that said it's not a cat are decreased.

And so the machine is saying, okay, the things that say this is a cat get extra value, and the things that say it's not a cat get, almost, negative value, pushing in the opposite direction.

Exactly. Computationally, it's an extremely simple step. But when nodes are connected together, there's a beautiful theorem that says a network with a single hidden layer, so a network with three layers, can approximate essentially any mathematical function. Meaning, in principle, that all of us in this room, and all of the people listening, could be modeled by a neural network with a single hidden layer. It's a fascinating look into the representational power of these things. The individual unit is extremely simple, but put together, it's really powerful.

Another example: here's a picture of a dog. The network doesn't know that; based on its current weights, it still says "cat." So there's a backpropagation signal saying no: every weight that said this is a cat is decreased, and every weight that said it's not a cat is increased.

And so it's telling the system to adjust, and if it does this over many iterations, it begins to actually learn how to classify. Is that correct?

Exactly. And one of the exciting things is that current neural network approaches are achieving state-of-the-art results. They're breaking all kinds of records: orders-of-magnitude improvements in performance on image classification, voice recognition. If you talk to your smartphone, the voice that goes into it is processed by a neural network to figure out what you're saying. So some problems in machine learning have seen incredible improvements because of neural networks. But they're not right for everything, and it's important to understand that. The thing neural networks are really good at, as I mentioned, is representation learning. That's the key step, I think, that separates them from other machine learning methods. As shown here: if your task is drawing a line that separates the blue circles from the green triangles, it's very easy or very hard depending on the representation. If you use Cartesian coordinates, it's impossible to separate the two with a straight line. If you map that same data into polar coordinates, a simple, trivial transformation, you can draw the line very easily. The thing is, with other machine learning methods, a human expert has to step in and do that remapping, that feature extraction. Neural networks are able to help with that step: to figure out whether there's an efficient way to represent the data so that you can classify it.

And that would be one of those hidden layers, so to speak: re-representing the data in a form that can be better understood by the next step, the stage after it?

Exactly, exactly. The earlier layers figure out the representation; the later layers use that representation to make the correct classification, estimation, or prediction.
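Here is a minimal sketch of the forward pass / backward pass loop just described, assuming only numpy. The "images" are toy two-number inputs rather than real cat and dog photos, and the layer sizes are invented for illustration; the task is chosen so that a straight line cannot separate the classes, which is exactly what the hidden layer's learned representation fixes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2 numbers per example; label 1 = "cat", 0 = "dog".
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)  # not linearly separable

# One hidden layer: 2 inputs -> 8 hidden units -> 1 output.
W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)

sigmoid = lambda z: 1 / (1 + np.exp(-z))

for step in range(5000):
    # Forward pass: raw numbers in, a cat-probability out.
    h = np.tanh(X @ W1 + b1)           # hidden layer (the nonlinearity)
    p = sigmoid(h @ W2 + b2).ravel()   # predicted probability of "cat"

    # Backward pass: nudge every weight up or down depending on whether
    # it pushed the answer toward or away from the human-provided label.
    dz2 = (p - y).reshape(-1, 1) / len(y)
    dW2 = h.T @ dz2; db2 = dz2.sum(0)
    dh = dz2 @ W2.T * (1 - h**2)       # derivative through tanh
    dW1 = X.T @ dh; db1 = dh.sum(0)

    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= 1.0 * grad            # gradient descent step

print("training accuracy:", ((p > 0.5) == y).mean())  # typically well above 0.9
```

The `(p - y)` term is exactly the positive or negative signal from the explanation above: weights that pushed toward the correct label are strengthened, and the rest are weakened.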
Just to give you some examples of what we can do with this. Broadly, what this toolbox allows you to do very well is arbitrary image classification: I give you an image, and you figure out what is going on in that image. We're able to achieve better-than-human-level performance on some benchmarks, which means machines can now do better than people at figuring out what's going on inside an image given to them. Shown here is the famous ImageNet data set, really one of the early breakthrough sources of excitement for this particular branch of machine learning: figuring out, in this large data set with thousands of categories, what's going on in the images. A leopard, a motor scooter, a container ship. We take for granted how amazing it is that human beings can do this task; we take our vision system for granted. It's actually an incredibly difficult machine learning task. Computer vision is very difficult.

So how big is this neural network hierarchy?

Size matters in neural networks. The bigger the network, the longer it takes to train, and the better the resulting performance. Some of these networks take months to train. Once trained, though, passing in an image and producing a result can be achieved in hundreds of milliseconds, if not single milliseconds. So once it's trained, the ability to classify in the future is really fast.

And that ties into the real-time needs of driving: you can't necessarily learn in real time, but once you have learned, you can apply that learning in real time very easily.

The real-time application is key. The training takes a long time; the real-time application, which is called inference, running the neural network, is really fast.

So that's one of the main benefits of doing a lot of the training up front?

Yes. You make all the investment in training up front, and it pays off later by being fast in operation. And quick examples of other things: as opposed to classifying the entire image, you can separate the different objects inside that image. That's called image segmentation. You can do the same thing in video. Shown here is something we do a lot of in the driving scene: up top is the raw video, and on the bottom is full-scene image segmentation, where the pixels are colored based on the prediction of what each item is. Trees, pavement, roads, road markings, buildings, sky: the things that are important for a car to correctly identify in order to plan a safe trajectory.
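As a sketch of the "train once for months, then infer in milliseconds" point, here is what inference with an off-the-shelf pretrained image classifier might look like. It assumes PyTorch with a recent torchvision (0.13 or later, where the weights API below exists), and the image file name is hypothetical:

```python
import time
import torch
from torchvision import models
from PIL import Image

# Load a network someone else already spent the long training time on.
weights = models.ResNet18_Weights.IMAGENET1K_V1
model = models.resnet18(weights=weights)
model.eval()

img = Image.open("street_scene.jpg")            # hypothetical image
batch = weights.transforms()(img).unsqueeze(0)  # preprocess into a tensor

start = time.perf_counter()
with torch.no_grad():                           # inference only, no learning
    probs = model(batch).softmax(dim=1)
elapsed_ms = (time.perf_counter() - start) * 1000

top = probs.argmax().item()
print(weights.meta["categories"][top], f"({elapsed_ms:.0f} ms)")
```

All the expensive representation learning is frozen inside the downloaded weights; the forward pass at the end is the cheap part that can run in real time.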
Then there's object detection, or object localization. It again sounds sort of trivial for us human beings, because we're so good at it, but as opposed to just figuring out what's in the image, this is exactly localizing where the object is in the image. Doing that efficiently, so you can do it faster than real time, is really important. With that, you can do things like real-time video-to-text translation on your phone, and then video-to-video translation, where you point your phone at something. The input here is a video of, for example, a box of cookies, where the original language is, I think, German, I'm not sure, I apologize. It takes in the raw video, extracts the different letters (that's the classification task), converts the letters to text, then translates the text, which is again machine learning, called machine translation, and then maps it back onto the video in English, converting it from whatever the source language is, here to "dark chocolate." So you can see the world differently: see the world in the language you're most comfortable seeing it in.

Do you see anything happening with virtual reality here in the future, where you could have audio, where you could actually go to a different country and just hear that language in your own natural language?

That's a great application, and it's also being done: real-time translation in virtual reality and augmented reality. There are a lot of exciting applications in this domain. The trick, the big challenge, is the thing we mentioned before: real-time operation, making sure the latency between the raw sensory input and the actual output is so short that when you take that information in, you don't get disoriented by the lag. That's a technical challenge, but one that people are solving very successfully.

And you can start generating content. Because you form a representation of different images, you can start generating other content based on a source image. Here's a popular application: taking old movies and recoloring them. Recoloring images, recoloring video, is something these networks can do quite easily. It's the same kind of structure, except as opposed to mapping to a classification, you're mapping to a set of raw pixels. And you can generate text. Shown here is a popular methodology of character-level text generation; from an AI perspective, a beautiful and fundamental set of results, where the input to the network is just a sequence of characters. The network itself has no concept of language syntax. There's a whole field called natural language processing, and as opposed to modeling the grammar and the syntax of the language, this network just takes in the raw sequence of characters, reads Shakespeare, and learns how to generate new characters in the style of Shakespeare. The beautiful thing is that it generates text that is, first of all, readable, and makes sense syntactically and sometimes semantically, and it sometimes produces quite interesting results. For example, if you type in "the meaning of life is," that's the human input, and have the machine continue, what the machine completes is quite clever: "the meaning of life is literary recognition," and another completion, "the meaning of life is the tradition of ancient human reproduction." So there's a lot of humor and there are interesting results, but the most important thing is that the machine is able to automatically find the patterns of the language and use them to generate that language.
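The example above uses a recurrent neural network. As a much simpler stand-in that still shows the characters-in, characters-out idea, here is a count-based character model, just n-gram statistics with no neural network, trained on whatever text file you point it at (the file name is hypothetical):

```python
import random
from collections import Counter, defaultdict

def train(text, order=4):
    """Count which character tends to follow each `order`-length context."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model

def generate(model, seed, length=80, order=4):
    """Extend `seed` one character at a time by sampling from the counts."""
    out = seed
    for _ in range(length):
        counts = model.get(out[-order:])
        if not counts:
            break  # context never seen in training text
        chars, weights = zip(*counts.items())
        out += random.choices(chars, weights=weights)[0]
    return out

corpus = open("shakespeare.txt").read()  # hypothetical local text file
model = train(corpus)
print(generate(model, "the meaning of life is "))
```

A neural version replaces this lookup table with a recurrent network that compresses the context into a learned vector, which is what lets it generalize to contexts it has never seen verbatim.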
So this would then be a representation of the language that it learned from. I'm just trying to say that this is what it's learning from, essentially, and it's almost mimicking that, using it as the set of examples it has to move forward with. It's not as if it's synthesizing something completely new; there's a basis for what it's saying, correct?

Yes. But this is also the great philosophical open question of all of artificial intelligence: whether the mimicking of intelligence is not itself intelligence. This is the famous Turing test: if the machine appears to be intelligent, is it truly intelligent? And we tend to discriminate against machines by saying that the moment we understand how the machine works, it's not intelligent. But if it exhibits certain interesting qualities, it's difficult to say, on a philosophical level, that the machine doesn't understand the meaning of life. It's a very fascinating area we're exploring: the better these machine learning algorithms get, the more we're able to think about the essence of what intelligence is.

Absolutely. I was expecting the answer to be 42.

Yeah, exactly; pop culture. If you google "the meaning of life," 42 comes up; it's a reference to The Hitchhiker's Guide to the Galaxy. But I think it's very interesting to see how these different kinds of concepts get pulled out of our actual natural language, out of what's written.

One thing to mention about 42: you're exhibiting the usual human characteristic of humor, which is one of the hardest open problems for AI systems. That really is the frontier: humor, expressing and understanding emotion, being able to convey it. That's something machines are currently not very good at. They're able to translate language, but exhibiting cleverness, wit, and even an understanding of common sense, that's the open problem for a lot of these machines. But that doesn't mean they can't have a tremendous impact on society, and that's what we're driving at.

Here's an example of us taking in raw video. It's called end-to-end learning, where the mapping is from the raw video input directly to the steering of the car. We're able to control a vehicle over hundreds of miles this way; we have a lot of demonstrations of this you can check out on YouTube, of the vehicle being controlled either by the neural network or by the internal systems of the vehicle. (A rough sketch of what such a network looks like follows below.) This is an example of that, with myself inside a Tesla Model S, where the pink line shows the control of the vehicle based on the neural network and the cyan line shows the control of the vehicle based on its internal system, and we can switch between the two and compare. We call this multiple AI systems in the same car arguing against each other, with the human as the supervisor who says which one gets control.

So if the human grabs the wheel, the automation automatically lets go?

In all current driving systems, if the human wants to take control, they get control. There are very few situations in which an AI system will take control from the human without consent.

With automatic braking, for example?

Exactly. When the danger of a crash is imminent, in the split second where a human being can't possibly react in time, that's the only time a car will take control. But in terms of the car saying, for example, "you're too sleepy," or more importantly, "you're too drunk to drive," currently, as a society, we're not ready for an AI to say, "I will take control from here."
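The transcript doesn't give the actual architecture used, so here is a hedged sketch of what a small end-to-end steering network could look like in PyTorch, loosely in the style of NVIDIA's published end-to-end driving idea: a stack of convolutions over the camera frame ending in a single steering-angle output. The layer sizes are illustrative, not the AgeLab's actual model.

```python
import torch
import torch.nn as nn

class SteeringNet(nn.Module):
    """Raw camera frame in, one steering-angle number out."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(       # convolutions extract road features
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, 3), nn.ReLU(),
        )
        self.head = nn.Sequential(           # fully connected layers map those
            nn.Flatten(),                    # features to a steering command
            nn.LazyLinear(100), nn.ReLU(),
            nn.Linear(100, 1),               # e.g. a steering angle in radians
        )

    def forward(self, frame):
        return self.head(self.features(frame))

net = SteeringNet()
frame = torch.rand(1, 3, 66, 200)  # one dummy 66x200 RGB frame
print(net(frame).shape)            # torch.Size([1, 1]): a single angle
```

Training such a network is supervised learning again: the labels are the steering angles a human driver actually applied, frame by frame.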
That's an ethical point, and at that point the questions of ethics are fascinating and interesting. I think it's for the public, for students, for everybody to discuss how to approach these questions of ethics as these systems get smarter and smarter.

So let's try to wrap up in the next three minutes or so. We'll let the students answer some questions in a breakout room, and then we'll come back and try to have about five to ten minutes of Q&A.

Another example is toy cars. We have tens of thousands of runs like this. Here it's a little black car, we put an MIT logo on it, with blue lights on the front. You see there's a single camera taking in the world outside. The car knows nothing in the beginning, and it's tasked with learning how to avoid collisions. When it collides, it gets a feedback signal that says "you collided," and it adjusts its weights in a way that will hopefully prevent it from colliding again. It's a binary classification problem: did it collide or not? And over time, and we're currently in the learning stage, it's very early work, it takes tens of thousands to hundreds of thousands of crashes to learn how to avoid those crashes. That perhaps sounds like a large number, but given that there are, unfortunately, 1.3 million road fatalities every year in the world, and about 100 times that many crashes on public roads, there's plenty of data to train algorithms to learn how to avoid those crashes. Of course, we're doing it in the fun, safe setting of an MIT gym, with a vehicle traveling at 30 miles an hour crashing into toy cars, where no humans get hurt. But these are the kinds of questions we're struggling with in the context of driving. Our goal is to save lives. 38,000 people die in the United States every year in traffic accidents, and we want to decrease that number as close to zero as possible. That's where AI can really help.

And a quick comment, I think we could talk about this a lot, on where deep learning and machine learning methods can help in the supply chain. The methodology itself is open to all kinds of data. As I mentioned before, the task is to collect the data, clean it up to a degree that machines can interpret it, and then, whether you're using machine learning generally or neural network methods, extract the features that are relevant to whatever the task is. You can detect anomalies and disruptions: you figure out the patterns in the data and use them to detect the weird parts, the places where something seems wrong. That's anomaly detection. You can extend that, with the same kind of sequence generation we saw for natural language, to forecast the future. You can plan, the operations-research kind of optimization, where you plan based on previous data. And you can understand and visualize the role of the human inside the entire supply chain: how does the human make everything more complex, how do they screw it up, and how do we, how does the machine, fix it?

So an interesting tie back into the supply chain: when we have autonomous vehicles, I think they'll have a big impact on the supply chain, at least on the logistics side, and hopefully we'll tag back to that in the Q&A section.
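To connect the anomaly-detection idea to supply chain data concretely, here is one very simple, non-neural baseline: a rolling z-score on shipment lead times. The numbers are invented, and a real system would learn far richer patterns, but the task, flagging points that break the recent pattern, is the same.

```python
import numpy as np

# Hypothetical daily shipment lead times in days; day 20 is an injected disruption.
rng = np.random.default_rng(1)
lead_times = rng.normal(5.0, 0.5, 30)
lead_times[20] = 11.0

window = 10
for t in range(window, len(lead_times)):
    history = lead_times[t - window:t]                # recent "normal" behavior
    z = (lead_times[t] - history.mean()) / history.std()
    if abs(z) > 3:                                    # far outside the usual pattern
        print(f"day {t}: {lead_times[t]:.1f} days looks anomalous (z = {z:.1f})")
```

A learned model would replace the fixed window-and-threshold rule with patterns extracted from historical data, which is exactly the feature-extraction step described above.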
And then I think we're going to let the students go to their breakout rooms. We have a series of questions, so I think Arthur can put those up on the screen for you to discuss in your rooms. We'll try to spend roughly ten minutes there, so we'll come back at about 10:52 Eastern time, in about ten minutes. Great.