All right, let's get started — there are a lot of slides; I think I counted a hundred slides in this lecture, so we'll burn through it pretty quick. Before we get there, some logistics. The PyTorch recitation is tomorrow — oh no, it's tonight, I'm sorry. The slides for that will be released. On the recording we're still going back and forth; we'll try to get a recording out if possible, because I know not everyone can make it, but at least the slides will be up, and there are a lot of PyTorch resources online, so hopefully it shouldn't be too difficult to learn PyTorch for this homework assignment — and for the next project, if you wish to use a neural network for it.

Let's see, what else. Regrades for homework one are closing. They were supposed to close yesterday, but we were finishing some things last minute, so we're giving it an extra day and closing them tonight, and then your homework one grades will be locked in. Homework two grades we're aiming to finish by the end of the week. It seems to take roughly two weeks to finish grading each homework set, and we have to go through about a hundred eighty submissions, so it takes a while — please be patient. We're going to try to speed that up as we get toward the end of the semester, because we also have to grade your finals and assign course grades.

What else? I think that was about it. If you haven't looked at the homework yet, please start now, especially because we're not just implementing equations anymore — we're using a whole other software package. There's a bit more overhead in learning to use software packages, especially if you're not a CS major and not familiar with reading documentation and following tutorials for a whole package. Start earlier rather than later and you'll thank yourself. Just read through it to see — actually, the first question on that homework doesn't involve any coding: you go to a website that already has a neural network implemented and run some simulations with different parameters, so you can knock that one out pretty quickly. Okay, cool.

All right. I realized after Tuesday's lecture that the slides and the homework didn't quite align — there were a couple of things missing that you need for the homework, so I'm going to cover those right now. Also, there was a question in Tuesday's lecture about activation functions having to be monotonic, because I said that during lecture. You might have seen this on Piazza already: an activation function does not have to be monotonic — it just has to be nonlinear. There are certain functions that still don't make sense as activation functions, but if you scroll down on Piazza you'll see a post about monotonic activation functions, and you'll see they don't have to be monotonic. In fact Swish — which I mistakenly called "swoosh", because the names are similar — is a non-monotonic activation: it goes like this.
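For reference, a minimal sketch of Swish, assuming the common β = 1 variant, f(x) = x · σ(βx):

```python
import torch

def swish(x, beta=1.0):
    # Swish: f(x) = x * sigmoid(beta * x); beta = 1 is the common default
    return x * torch.sigmoid(beta * x)

x = torch.linspace(-5, 1, 7)
print(swish(x))  # dips below zero for negative x, then rises: non-monotonic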
It kind of dips down and comes back up — it "swishes" — so there's a region on the negative x-axis where it's not monotonic. Allegedly that helps with more stable optimization and gradient flow, but who knows. That paper is also on Piazza if you want to read it; it actually does a really good job of going over the history of ReLU, why we use it, and the kinds of things we look for in activation functions. If you're interested in that side of neural networks, I strongly recommend the Swish paper, and you can go down that rabbit hole whenever you choose.

Okay, the other couple of things I want to cover. First — and I'll talk louder since I'm not at the microphone — I realized we talked a lot about what a neural network is, but I didn't really explain how to build one for a given problem. I guess it was implied, but I want to address it explicitly, specifically for this homework. Say you have some input of feature length five — indexed zero to four. You'll have to specify in PyTorch that you have an input of dimension five, and then your first hidden layer. When we say a one-layer, single-layer network, it means we have one hidden layer in the middle, and you can have as many nodes as you want there: five, three, two hundred — because this is going to be fully connected (I don't think I missed a line here — yeah, fully connected). You can expand these dimensions as much as you want, or in certain cases shrink them, so this layer can be longer or shorter than the input; there's no rule, it depends on the task.

Finally, at the end, there's the output node — this is the classification task you're trying to do. For binary classification this can be one node or two. If there's one output node, you activate it with a sigmoid (for two classes, softmax reduces to a sigmoid — that's on Piazza), which returns a value between zero and one: one means true, zero means false, and that gives you something like a posterior probability as well — a confidence value. At 0.5 the network is saying "I have no idea what this is"; at zero or one it's saying "I'm confident this is the negative class" or "I'm confident this is the positive class" (you have to define which class is positive and which is negative). The other way to do this — I'm going up instead of down here — is to specify a second output node, and then softmax across the two. That's what softmax does: it normalizes so the two outputs add up to one, and then this one means class one and that one means class two. It's still binary classification, and that's fine as well; theoretically they're equivalent. If you're doing multi-class classification — three classes, four classes, K classes — you just add more output nodes: y2, y3, etc. (I won't draw them all). With three classes you have three output nodes; four, five, six; and then you softmax across them and get a probability distribution over the likelihood of each class.
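Here's a sketch of that single-hidden-layer setup in PyTorch, assuming the five-dimensional input from the example and a hidden width of 16 (the width is our choice, as discussed above):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(5, 16),   # input (dim 5) -> hidden layer, fully connected
    nn.ReLU(),          # nonlinear activation
    nn.Linear(16, 1),   # hidden layer -> single output node
    nn.Sigmoid(),       # squash to (0, 1): confidence for the positive class
)

# Two-output alternative: nn.Linear(16, 2) followed by a softmax over the
# logits so the outputs sum to one; for K classes, use nn.Linear(16, K).
```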
Then there's a whole question of what the decision threshold should be: should it be more confident than 50%? 70%? What's the right threshold? There's a whole literature there. There's also the issue of model calibration — relevant to previous methods as well — where a model might be overconfident or underconfident with respect to the probabilities it outputs. For example, if you collected all the outputs with 70% confidence, you'd expect 70% of them to be correct — that's calibration. It's a whole field that we won't cover in this class (it might come up later), but know that it exists.

The other thing I want to cover is one-hot encoding. I hadn't looked at the homework dataset before we released it — otherwise I would have covered this on Tuesday — but I noticed there are some categorical inputs among the features. All the inputs we've worked with so far have been numerical: floats, integers. But when you have features like red, green, and blue, we need to convert them into some numerical representation so the machine learning algorithm can learn from them. One way is to assign a number to each category: red is 0, blue is 1, green is 2. Can anyone tell me why that might not work well? (I posted this on Piazza already, so if you read it there, don't cheat.)

Right, exactly — not quite correlation, but similar: it implies red is closer to blue than it is to green, which isn't necessarily true. So instead we do one-hot encoding, where we expand this one column of colors into a column for red, a column for blue, and a column for green. For a red example we put a 1 in the red column and 0 in the others; for blue, a 1 in the blue column; and so on. We break it out into Boolean columns and feed those into the classifier, and that helps quite a bit. It also depends on the method: certain methods are more resilient, and there's some research suggesting neural networks handle the 0/1/2 assignment a bit better, but just to be safe — and definitely with things like decision trees — you want to break it up. We still haven't settled on a dataset for the project, but if this comes up there, you'll know what to do.
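A quick sketch of the difference, using pandas and a hypothetical color column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# Naive label encoding implies an ordering (red < blue < green) that isn't real
df["color_code"] = df["color"].map({"red": 0, "blue": 1, "green": 2})

# One-hot encoding breaks the column into independent Boolean columns instead
one_hot = pd.get_dummies(df["color"], prefix="color")
print(one_hot)  # columns: color_blue, color_green, color_red
```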
Okay. All right, I already burned ten minutes, so we have to get through the rest of the slides. We'll go through a quick summary — and I have a laser pointer now, which doesn't show up on this video monitor, and I couldn't get the projector working, so never mind, I'll walk up to the screen. Today's lecture is on convolutional neural networks and recurrent neural networks. These are modifications to the network architectures we've seen so far, specialized for certain tasks — these methods were developed to help the network do certain tasks, and we'll talk about how that works. But first, let's review the previous lecture; I'll get through this quickly.

The motivation: we want to learn non-linear decision boundaries. Not everything is linearly separable — in fact, most things are not. There are existing ways of handling this, like adding features. I also realize this curriculum doesn't cover the kernel trick, which is very relevant to SVMs and the like, but this course moves so fast we don't really have time — I think we'd need two additional lectures to cover it fully. So write it down and Google it if you're interested; it's pretty critical, and it's also about learning non-linear decision boundaries, specifically with SVMs — it was a huge thing in the late '90s and early 2000s.

Anyway: we want to learn non-linear decision boundaries, and we do that by composing linear ones — we draw a bunch of linear boundaries and combine them. We covered the XOR example previously: there's a whole story about perceptrons not being able to learn XOR because it's not linearly separable, and in fact you can solve it by composing linear functions — you can build an XOR gate from AND and OR gates, like in electrical engineering. With neural networks we formalized the method for building these composed linear functions, with the non-linearity added for expressivity. We saw how individual nodes are perceptrons with activation functions — non-linear functions at the end — and how we collect nodes into layers, and stack layers back to back into a network: nodes into layers, layers into a network. (And every time I want to go back a slide I have to click through the whole animation again.)

We covered how deep networks are universal function approximators: given enough width and depth they can approximate any function within some epsilon. That's a theoretical point, but what it's really saying is that these models are very expressive. There's no guarantee you can actually learn that function, obviously, but it provides the framework to be able to. And we covered this very interesting geometric interpretation, which I really like: the dot product between the weight vector and the input vector can be interpreted geometrically as the shortest distance between a point and a plane. If we define a hyperplane with the weights of a single neuron and take the dot product with a data point, we get the length along the line orthogonal to the hyperplane that goes to that data point.
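Concretely, for a hyperplane w·x + b = 0, the pre-activation w·x + b is proportional to the point's distance from the plane. A small sketch with made-up numbers:

```python
import numpy as np

w = np.array([3.0, 4.0])   # hypothetical weight vector defining the hyperplane
b = -2.0                   # bias
x = np.array([1.0, 2.0])   # a data point

# Dividing w.x + b by ||w|| gives the signed distance to the hyperplane;
# the sign tells you which side of the plane the point is on
signed_distance = (w @ x + b) / np.linalg.norm(w)
print(signed_distance)  # (3 + 8 - 2) / 5 = 1.8
```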
The hyperplane acts like a filter for the data points: it stretches the metric space in that direction to push certain points further into their own regions, their own clusters. Done sequentially, this keeps moving and shifting the data points until they become linearly separable. The way we showed that visually: we have a bunch of data points, and in the first layer of the network we have a bunch of hyperplanes. If we take the outputs of three of them and plot those, and then do it again, we keep splitting this n-dimensional space with hyperplanes and keep stretching the points until eventually they're separable by a single hyperplane. That's exactly what's happening here: these three hyperplanes stretch the space, and then — forget this part — if we just have this one final node doing binary classification, it's just wx + b. We're assuming that by the time the data points get to the last layer they're linearly separable, because the final node is a linear classifier with an activation function. That's the intuition: the data becomes linearized by a sequence of these hyperplanes.

Okay. Toward the end of last lecture I burned through these slides, but the point is that we can now do this for a lot of different tasks. Some of the more popular domains: computer vision — neural networks have been hugely transformative for processing images and video; audio — one-dimensional signals over time, time-series data, hugely influential there too; and even robotics — control theory and learned control algorithms for manipulators and other robotic systems. That's what we'll talk about today: how to build architectures better amenable to learning in these domains. Any questions about the summary?

(Question.) Good question — that's today's lecture. Hold on to that. Any other questions? All right, good. So today we're talking about images, about audio and text, and, like we said, about virtual and physical control tasks. That last part shows up a little at the end because the PhD student who set up these lectures — I forgot his name — was a reinforcement learning researcher; it's a good combination of the first two. But the main focus will be on images, audio, and text.
Okay. The big theme of this lecture is that in order to scale deep networks to these domains, which are massively complex — and I'll describe why — we often need inductive biases. We need to inject some a priori knowledge, some assumption, into the architecture of the method itself to make these tasks easier to learn. Keep that in mind as we go through the slides: try to identify what inductive biases we're injecting to arrive at these architectures. We didn't come up with CNNs out of the blue; there's a reason those filters work, and a reason the architecture works specifically for image data (although people use it for other things now).

Okay, inductive biases. As we develop these methods there's obviously a theoretical motivation — proving bounds. If you ever take a computational learning theory class, that's what you'll do: prove, say, that a perceptron can learn within some number of examples if the dataset is linearly separable. But in the end what we really care about is solving tasks, and often these tasks are relevant to our daily lives — things we want to automate, or do faster or better.

Let me go through a few examples. The classic one, which I'm sure everyone's aware of, is object recognition, or image classification. Object recognition is the more specific term — in image classification the classes could be anything, while object recognition is specifically about identifying objects that appear in the image. This can be single-class or multi-class — and by multi-class I mean the detector can identify multiple objects per image, so it can say there's a motor scooter and a person in this image, that kind of thing. One step further is object detection: actually drawing boxes around each object. Not only does the model have to know what objects look like, it has to tell where they are in the image, which is surprisingly non-trivial. One step beyond that is object segmentation, where it can say "this pixel is part of this object" — drawing pixel-wise masks on top of the image, so you get a tight outline of the object instead of just a box, which matters for certain tasks.

Away from the image domain, one thing that has been hugely influential — not just for technology but for society — is text translation. Google Translate used to be awful, but if you've used it recently it's pretty good, fairly decent. Text question answering is one step beyond that: you ask a model a question and it answers — that gets into the Google Assistant space, the ChatGPT space, etc. The model needs to understand not only what you're asking but what kind of response is expected: an object, a time, a place? It needs to be able to parse that.
I'll go through this quicker because, like I said, there are a hundred slides. As we get into the reinforcement learning realm we talk about agents: an AI agent interacting with an environment — that's the field of reinforcement learning. We have AIs that can play Atari, and we don't mean that in the traditional video-game sense of "AI", where it's plugged into the game itself and already knows the internal state. We mean literally giving it a picture of the screen and nothing else, plus the controller inputs, and the computer learns to play the game: it has to figure out what the score is and learn how to score, without being plugged into the game engine at all.

Go — I'm sure everyone's familiar with the Go AI beating the top player in the world a few years ago; the documentary on that is worth watching if you haven't. Object manipulation — this is really hard: you put a camera on a robot and say "move this block from here to here", without giving it the specific motor movements to go from A to B, which is how you'd typically do it. Amazon is hugely interested in this, because they'd like robots to sort and pack your boxes automatically. And then obviously autonomous driving: large amounts of money, VC funding, Elon Musk gets in trouble — that kind of stuff. It's a huge field, and I'm sure you know about it already.

Okay, and this slide's interesting. The point is that if you go one level above all this and look at what OpenAI and DeepMind and organizations like that are trying to do, they're trying to abstract out from specific, defined tasks to more biological, society-level and individual-level tasks: survival and reproduction, cell signaling, organ function — all these biological things we take for granted. These were also the results of optimization — of evolution, which is in and of itself an optimization scheme. The point is that it doesn't have to be a strictly defined task; at the high level, people are thinking about these larger-scale questions too.

So for any task — not just machine learning, any task in science and engineering generally — there are two things we draw on: priors and learning, and you have them in different proportions for different tasks.
For building a bridge, we understand physics to a very, very good degree, so there's very little learning we actually have to do — it's mostly priors: prior knowledge and frameworks from physics, civil engineering, and so on that we build on to do our engineering. Learning happens when we don't have those priors, or when we're unable to define them in the domain we're working with.

Priors, again, are the knowledge assumed beforehand. One example is a process called fine-tuning, which I won't cover much, where you start from an existing model and learn a new task on top of it. Regularization could be considered a prior: we're assuming we don't want the weights to be too big, so we put a restriction on them — that's technically a prior. The architecture itself is a prior, because we're assuming the task is amenable to the architecture we define: if the network were infinite-depth and infinite-width it wouldn't be a prior, but when we say "I think you can solve this problem with five layers of width twenty", that is a prior in and of itself. The inputs and outputs are priors too: the features are a prior (which columns we give you), and the labels are a prior. It's all priors.

Learning, on the other hand, is knowledge extracted directly from data — and by data you can think of examples. This whole gradient descent mechanism is a way of doing learning; we've covered it in extensive detail over the past lectures, so I won't dwell on it. The point is that instead of assuming something, we ask: can we look at the patterns and correlations in the examples we're given and derive knowledge from that, rather than making an assumption or deriving from previously known knowledge?

It's a balance. If you have strong priors, you don't need to learn much. One thing I want to mention: you don't need machine learning for everything. Looking at all the startups coming out recently, you might be tricked into thinking you do, but you really don't — there are things where we genuinely have enough priors not to need it. Strong priors mean it's fast and easy to build and deploy, since there's minimal learning — even when building a bridge you might encounter something you haven't seen before and have to adapt, which could count as learning, but the majority of the process runs on strong priors. That also means it's rigid: if you need to change something, there's already so much prior there that you need a lot of signal to overturn it. On the other hand, if you have weak priors you have to do a lot of learning, which means it's slow — not just computationally slow, but data-slow.
You need a lot of data to extract the same amount of knowledge you'd have had from strong priors in the first place. But that also means it's flexible and adaptable — it's not constrained by whatever biases your priors encoded. So for a desired level of performance on a task, we want to balance priors and learning to attain a model that achieves the best performance in the minimal amount of time. That's the balance.

One example I want to mention is the genetic algorithm. Anyone heard of a GA? Kind of, ish. A GA is an optimization method that simulates biological evolution to find an optimal solution: there are populations, gene crossovers, mutations — it's a genuine simulation of biological evolution. It's what you use when you have absolutely no priors at all, and all the compute and time in the world — and no data either. It's essentially a pure random-search method for finding your optimal solution. That's one end of the spectrum. With neural networks, we still don't have a lot of priors, but we do have a lot of data — that's the assumption we're making — and that speeds us up, so we can find a good solution faster than a GA might. That's the prior–learning balance.

So priors are essential — this slide has some biological examples — and it's effectively impossible to learn a model completely from scratch without any priors. We shouldn't hesitate to introduce priors into our models to help things go faster, as long as we acknowledge that they are priors. We're all initialized from evolutionary priors: although we can learn a lot, we're still starting from the base of what our biology allows us to do.

(In response to a question:) Yeah — it happens less at the individual-neuron level and more at the architecture level, which is what we'll get into with the CNNs. Good question.

Okay — I'll go through this a bit quicker. Up until now, machines have been purely based on priors; now, for the first time, we can create machines that learn. Since the industrial revolution a lot of our technological progress has come from designing machines, but it's been "I want to do this thing and I know exactly how I want to do it." Now we can make machines that also learn on their own. And even though these machines do tasks we couldn't possibly program manually, they're still based on a lot of priors — it's not like we have none at all.

So what kind of priors are we talking about? The priors we take advantage of are known structures in data. Not all data is just a flat vector; some data has inherent structure, and it would be kind of dumb not to exploit the structure that's present — we want as much of a head start as possible. The two examples we'll cover today are array data, or spatial data — a two-dimensional array like an image, which is one structure we can exploit — and sequential data, one-dimensional data, often called time-series data, where something moves along some axis: there's structure along a particular axis.
When we give a one-dimensional vector to the kind of network we've seen so far, we're not assuming adjacent columns are related — but for time-series data we are. This is what really lets us learn models on complex domains, and it's why CNNs and RNNs have been so influential over the past several years — past decade, I guess we should say.

Okay, let's cover convolutional neural networks. Any questions up to here? Great. We briefly mentioned some computer vision tasks, and the one we'll use is object recognition, or image classification, just because it's easy to think about. The example we're going to use is Professor Yisong Yue — it was his PhD student who made these slides — so we'll use his image as our running example.

We want to build a model that looks at this image of Professor Yisong Yue and identifies him as Yisong. Pretty straightforward classification — also called a discriminative mapping from image to object: identifying the image. So: what information contained in this image allows us to identify it as containing Yisong? Whatever that information is, we need to extract it, and then define some model that gives a conditional probability distribution based on whatever relevant information is included in the image. It's an odd-looking equation, with an image on the conditioning side, but that's the point we're making.

So, real briefly, for about thirty seconds: can anyone describe what those relevant factors might be? What are you looking for when you identify this image as Professor Yisong? (Audience: his face.) Good. Can we get more specific? Eyes — okay, nose, mouth: facial features. Anyone else? How do we know this is a person at all? Right now, think like a computer: this is just an array of numbers. How do we even know it's a person? Sure — there's something round on top of something square, ish. Humans have limbs, so maybe you identify limbs when they're present. But even then we're describing these at a super high level; imagine what's necessary for a computer to recognize this. And notice that the image also contains nuisance information — things that aren't relevant. We don't care about the wall.
We don't care about the chair. We don't care about what color clothes he's wearing, or what pose he's in — sitting, standing — it shouldn't matter when we're identifying this as Professor Yisong. We need to be invariant — that's a term we'll use quite a bit — invariant to this type of nuisance information, while still locking onto the information that's relevant for identifying Yisong.

Obviously this mapping is too difficult to define by hand; we couldn't even define it in words just now without resorting to high-level concepts we've built up as a society. So we need to learn it from data. We take a bunch of pictures of Yisong as positive examples, and a bunch of random images from some other collection as "not Yisong", and we train a model, as we've done before, on this binary classification: Yisong versus not Yisong.

(Question: how do you choose the proportions of this dataset?) Good question. Typically you get as much as you humanly can, but the issue is that "not Yisong" is easy while "Yisong" is relatively hard — we can only find so many pictures of Yisong online. So you're usually limited by the harder class to collect, and the other one needs to balance it, because otherwise the "not Yisong" labels overpower the "Yisong" labels: with too many negative examples and too few positives you won't learn anything, because your gradients will be too small. That's a whole field of research — label imbalance is the term — and especially in the sciences it's a huge problem. For example, say I want to build a Mars dust devil detector: there are only so many dust devils, so we can't provide enough positive examples to truly train a model. That's where image augmentation and other strategies come in. The ideal is one-to-one, and depending on the task that ideal changes — I said one-to-one and can already think of five cases where it's wrong — but a good baseline assumption is an equal distribution across classes, because otherwise you're implying that one class is more important than another.

(Another comment.) Yeah, that's a good point: if "not Yisong" doesn't include any images of humans — if we picked the negative dataset incorrectly — you'll just end up with a human detector: it finds people and assumes they're all Yisong. So the composition of the negative dataset is hugely important. And this task is hard: if we set it up this way and tried to train the model we've covered so far on it, it would be fairly difficult, because that model doesn't have enough power. Good question.
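On the label-imbalance point above, one common mitigation in PyTorch is to upweight the rare class in the loss. A minimal sketch, with hypothetical counts:

```python
import torch
import torch.nn as nn

# Suppose the training set has 900 "not Yisong" and 100 "Yisong" examples
n_neg, n_pos = 900, 100

# Weight the positive class more heavily so the rare class's gradients
# aren't drowned out (the 9x ratio here is just illustrative)
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([n_neg / n_pos]))
```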
Yeah, good question I have a question All right going forward so Standard neural networks like I mentioned require fixed input size, but images can be any size, right? And not only that there's three channels, which is a whole nother thing So an image is actually three-dimensional array not just a two-dimensional array So we're going to try and keep things simple just for this toy example and we're going to say Okay, sure so larger the image you have clearer patterns, but more parameters if you have fewer parameters You know, it's easier for the model, but you can't tell that it's e-song anymore on that like postage stamp size, right? So that's kind of a trade-off in the image space size And then with color as well if you have color, you know You might be able to tell this is e-song better than if you didn't have any color at all But for this example, we're going to convert this to grayscale and this actually works for most computer computer vision Paskets sufficient to have grayscale information unless color is like a very important thing for whatever you're trying to do So we're gonna say we're gonna convert to grayscale and we're gonna assume that our images are all hundred by hundred just for this example Okay, but clearly this is not the case so kind of the naive way of doing this with the model that we've already defined is to Take this two-dimensional array of this image and flatten it. So NP dot flatten, right? NumPy dot flatten into a 10,000 dimensional array Right, so that's like a very straightforward way of doing it and people do this Like if you're doing MNest like you can do this and it'll look like pretty well, surprisingly But this is like the easiest way of converting this image into something that fits our architecture, right that we've learned so far So if we do that now, we have a 10,000 space input And then we define nodes and we define layers The other question is how many units do we need for a 10,000 unit input, right? so the number of units times The number of weights that we ultimately need to learn Scale by not only the input, but also the weights that come afterwards. So if you You know if you want to say I want my first hidden layer to be a thousand neurons long Thousand nodes long in my first layer the number of weights that you have to learn is already 10 million Right, so that's a lot of weights that you have to gradient descent That's a lot of weights that you have to learn so you immediately You know come up with a very large model and then you have depth, right? So then it keeps multiplying so the number of ways just keeps multiplying as you specify more and more neurons in each layer Yeah, yeah, yeah, it keeps multiplying, right? Is that right yeah, because it's a fully connected network, right so it would just keep multiplying So because this is multiplicative yet We have to be very careful with the number of neurons that we have to define for this architecture So if we want to recognize even a few basic patterns because the emphasize is so large because it's 10,000 the number of parameters Explodes right so it becomes a really really Difficult model to train. That's the point here So instead we want to reduce the amount of learning we have to do We want to reduce the amount of parameters that we have to learn and we can do that by taking advanced Vantage of inductive biases of taking advantage of priors the priors that we're going to use for this test is The basic fact that an image is an array Okay, and not only is it an array. 
The prior we're going to use for this task is the basic fact that an image is an array — and not just any array, an array with spatial structure. Pixels that are near each other can be assumed to describe a similar thing: they're often within the same object (maybe they're on a boundary), but in general we can assume there's additional information in this spatial structure. There's an advantage to knowing that adjacent pixels contain similar kinds of information, and we can leverage that to learn fewer total weights. We call this locality: there's local structure, local relevance between nearby pixels.

Nearby areas also tend to contain stronger patterns. Look at this picture of Yisong again. Nearby pixels in the background tend to be similar — no hard edges, kind of a smooth gradient. Whereas nearby patches on the edges of shapes have a line this way, or a line that way, or a hard edge — some contrast we can identify. So now we've gone from pixel values to shapes: lines, edges, curves. And if we go up to the scale of Yisong's face, nearby regions have even higher-level patterns: things that look like eyes, noses, mouths. That's the type of information we want to leverage. I realize this sounds obvious — "well, duh, you want an eye detector and a mouth detector" — but being able to learn this is the point, and being able to use inductive biases to learn it quickly is the other point. That's what CNNs are powerful for.

So that's the first thing: locality — we care about patterns in local regions. The second thing is translational invariance: relative positions are relevant, but absolute positions aren't. What does that mean? In an image, regardless of where Yisong's face is, I should still identify it as Yisong. If I flattened an image into a one-dimensional vector, then with his face in the top-left corner versus the bottom-right corner my vector completely changes — but what we really want is for the model to treat all of those inputs equally, because it doesn't matter where the face is; it's still Yisong's face. (I feel bad talking about him like this, but he's the example.) That's what we mean by translational invariance — I have some research on this that I'll mention at the end, too; it's really interesting. Yisong's identity is independent of the absolute location of his pixels — I would hope so. That's the other big inductive bias we're basing our model on. So how can we use these concepts to design our inductive biases?
First, locality. We get this by restricting each neuron's field of view — restricting each neuron's input to a specific region. We say: at any point in the network you can't look at all the pixels at once; you can only look at the pixels in a square region that I define — 3×3, 5×5, 7×7, etc. We're telling the neuron: you can only look at this square region, and that's where you'll do your learning, because I know there's some relevant pattern within this specific patch. That's the first piece. And I should mention the weights here are no longer one-dimensional — the weights are also two-dimensional, so at each location we're multiplying a weight matrix against a patch of the input rather than dotting a weight vector against the whole input.

The second piece is translational invariance — relative positions are what matter. For this, we have multiple of these neurons looking at different parts of the image, but we force their weights to be the same. Different neurons look at different regions, but during learning the weights are forced to be identical, so the network doesn't react differently to different parts of the image — it treats all parts equally, even though each neuron only sees part of the image at a time. (Correct — they're essentially identical neurons; we just formulate it this way so they look at different parts of the image. And yes — it's the learned weights that are shared.)

The prior we're introducing here is in the way the model is architected. The method still learns — it's a neural network, and the whole point is to learn the task — but we're setting up the priors so it does its learning better. (To the question of whether you could do this with priors alone: totally. There's a famous facial detection algorithm — I can't remember the name; Haar filters, someone suggests — that defines little black-and-white filters by hand: sides of faces, noses, eyes, mouths, and says that if these filters fire in a certain arrangement, it's probably a face. So that exists for something like faces, where there's enough prior research to go on — but here we're trying to learn the filters. Great question.)

Any other questions? Okay, moving on. This is really important: I feel like when people learn CNNs, they learn how the mechanics work but not the reason why — and this is why: we're after locality and translational invariance. You could implement all of this by hand — define the regions, fix the weights to be equal, and so on — but it's messy, so let's see how to implement it cleanly in the model. These are the inductive biases of convolutional neural networks; a CNN is just a special case of the standard neural network.
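To see the savings from weight sharing concretely, a sketch comparing parameter counts (sizes are illustrative):

```python
import torch.nn as nn

# Fully connected: every output unit gets its own weight for every input pixel
fc = nn.Linear(100 * 100, 1000)
print(sum(p.numel() for p in fc.parameters()))    # 10,001,000

# Convolutional: one shared 3x3 filter is slid over the whole image, so the
# same 9 weights (+1 bias) are reused at every location, whatever the size
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))  # 10
```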
Comparing this to what we had before: first of all, we save on the number of weights — there are fewer lines in the figure; that's literally what that means. If we did this fully connected, with multiple nodes, you'd have a totally crazy number of weights; but since the neurons look at different regions with the same fixed weights, we save on the number of weights that have to be learned. Overall we're saving a lot of parameters. Another critical consequence is that the number of weights becomes independent of the input size. I don't think that's obvious from this slide, so if it doesn't make sense yet, don't worry — with this kind of architecture you can input images of any size and the model still works; we'll cover why in a bit.

So how do we do this in a smart way? You might have gotten a sense from the name — convolutional neural networks: we do it by convolving filters. Those weights we've been talking about — we'll use the term "filters" from now on, because they're two-dimensional. The weights are two-dimensional matrices, and we define them up here. What we do is take that matrix and convolve it over the input (I love this animation): we scan it across the input image, sliding-window style — "sliding window", "convolve", "scanning" all mean roughly the same thing — and the output is another two-dimensional array, where each square corresponds to one location of the filter over the input.

(Question: is the weight fixed while it slides?) For now, yes — for some given fixed weight; we'll talk about the learning shortly. Any other questions? This is really important — you have to understand how convolution works, even if you've never seen it before. (Question about standardizing.) We're not standardizing the image size, but for image classification you might want to standardize the actual pixel values — mean zero, standard deviation one. With this architecture you don't have to standardize the image size anymore; we need to cover pooling before we fully get there.

So we've defined convolutions. Notice: we have one weight, it looks at different regions of the input array, and it's local — it only sees a certain number of pixels at a time. With this we implement the inductive biases of locality and shift invariance, and we get out the feature map. For learning, it's exactly the same as before: gradient descent. It's a little confusing because we're now working in two dimensions — instead of passing one-dimensional arrays around, we're passing two-dimensional arrays around. You can feed this feature map straight into another convolutional layer that uses, say, 2×2 filters, get a 2×2 output, feed that into another node, and get one number out — and that's our classification confidence.
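Here's the convolution operation itself in code, a sketch assuming PyTorch's functional API and random values:

```python
import torch
import torch.nn.functional as F

image = torch.randn(1, 1, 5, 5)    # (batch, channels, height, width)
kernel = torch.randn(1, 1, 3, 3)   # one 3x3 filter

# Slide the 3x3 filter over every 3x3 patch of the 5x5 input; each output
# value is the dot product of the filter with the patch beneath it
feature_map = F.conv2d(image, kernel)
print(feature_map.shape)  # torch.Size([1, 1, 3, 3]) — shrinks without padding
```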
Then you backpropagate all the gradients using the chain rule, exactly the same way as before, and you learn these weights — except now you're learning a two-dimensional weight matrix rather than a weight vector. (I shouldn't touch this screen — it's like a TV.) It works exactly the same way as everything we've done with one-dimensional networks.

Okay, now a little nuance with CNNs. First, you can use padding to preserve spatial size. You'll notice that when you convolve, the output is smaller than the input. If you want to keep it the same size, you pad the input with zeros around the outside, so the output comes out the same size as the input — it's a dimensionality thing, for when your output needs to match your input. There are different padding strategies: zeros, duplicating the edge values, and there's debate about which is "correct", but the default is zeros. Assuming your data has been normalized to mean zero and standard deviation one, zero-padding means you're effectively filling in the average value.

There's also something called stride: instead of moving over one pixel at a time, you can skip. If you need to reduce dimensionality aggressively — because the thing you're trying to predict is low-dimensional — you skip. Stride 1 is the normal, full convolution over the input; stride 2 skips every other position; stride 3, and so on. (I'm explaining this because I think it's on the homework.)

Then, multiple channels. We said images can be RGB — three channels — so now your filters are three-dimensional, operating on three-dimensional data. We still convolve over only the two spatial dimensions, the x and y axes, but each filter is now 3×3×3 because of the three RGB channels — so each filter learns 27 weights, nine for each of R, G, and B. (Some data products I work on have 26 channels, so the filters are 26×3×3 — you can have any number of channels; it gets harder to learn, but it's possible.) And you can have multiple filters: at a given convolutional layer, the same way we could have five or a hundred nodes, we can have five or a hundred convolutional filters — which means your output now has five or a hundred channels.
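A sketch tying padding, stride, and channels together, using PyTorch layer definitions and the shapes from the running example:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 100, 100)  # an RGB image: 3 input channels

# padding=1 with a 3x3 filter preserves the 100x100 spatial size;
# 24 filters means the output has 24 channels
same = nn.Conv2d(in_channels=3, out_channels=24, kernel_size=3, padding=1)
print(same(x).shape)     # torch.Size([1, 24, 100, 100])

# stride=2 skips every other position, halving the spatial dimensions
strided = nn.Conv2d(3, 24, kernel_size=3, padding=1, stride=2)
print(strided(x).shape)  # torch.Size([1, 24, 50, 50])
```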
A lot of image classification models start with three channels going in, the first layer has something like 24 filters, and now you have a 24×X×Y array that you convolve on top of. That's how channels play into this. And when you add batching it becomes four-dimensional — that's a whole thing; remember, you can batch data points, and it still works, it's just four-dimensional and you can't really think about it geometrically anymore. But know that it all works with channels.

Pooling is something we use to aggregate values in feature maps. This example is max pooling: in each 2×2 region we keep only the maximum value. The reason: if you have a really large image — say 1000×1000 — then getting down to a single classification value at the end would take a lot of convolutions to reduce those dimensions. So at some point we say: we'll assume these four pixels contain roughly the same information, and just keep the maximum activation. Max pooling is very popular; you'll also see average pooling. There's another strategy called global max pooling, where you take every activation in a channel and keep only the maximum, and use that for classification — which is why you can use any input image size: at the end you just take the maximum activation wherever it occurs.

There's a pop quiz on the slides here. We don't have time — I like this topic too much and we've spent too long on it — but look at the slides and try to understand why the size arithmetic works out the way it does, because you'll have to do this on the homework. If you have questions, come to office hours.
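On that size arithmetic: the usual formula for a convolution or pooling output is floor((W − K + 2P) / S) + 1, for input width W, kernel size K, padding P, and stride S. A sketch of pooling with PyTorch:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 100, 100)

# Max pooling: keep the largest activation in each 2x2 window, halving the
# spatial size: floor((100 - 2) / 2) + 1 = 50
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).shape)  # torch.Size([1, 8, 50, 50])

# Global max pooling: one maximum per channel, whatever the input size —
# this is what makes the network indifferent to image dimensions
gmp = nn.AdaptiveMaxPool2d(1)
print(gmp(x).shape)   # torch.Size([1, 8, 1, 1])
```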
Now, the kinds of tasks we do with CNNs. To develop this field there are a lot of natural-image datasets. Caltech itself has a couple of popular ones — the number stands for the number of classes: Caltech 101 is 101 classes with about 9,000 images; Caltech 256 has 30,000 images. The most popular one, which I'm sure you've at least heard of, is CIFAR-10: 10 classes, 60,000 images, and the images are tiny — 32×32 — so it's a really easy toy problem for testing methods. We've kind of grown out of it at this point, but it's there; CIFAR-100 just has more classes. And then ImageNet is the yearly competition — the ImageNet classification challenge, ILSVRC — where all the universities and researchers compete. The competition dataset has 1.2 million images and a thousand classes, and the classes are crazy: there's "dog", and then different breeds of dogs — I think a few hundred of the thousand classes are dog breeds — for fine-grained classification: you can tell it's a dog, but can you tell what kind of dog? Or it's TV, phone, bicycle — all these different classes. The full dataset — these are human-labeled, through Amazon Mechanical Turk, by the way — is 14 million labeled images and 21,000 classes, and the classes are hierarchical: this is a dog, of this breed, with this sub-breed. It gets really complicated, and there are a lot of datasets out there.

These are different convolutional models for classification. LeNet was the original — it didn't really take off because of computational limitations, but it was the first ConvNet. Then AlexNet is where it really took off: it won ILSVRC, beating everyone by something like 10%, which was ridiculous. One interesting thing: the architecture splits into two branches, one going up and one going down. That's because Krizhevsky only had two GPUs, so he trained half the model on each GPU — you don't have to do that anymore, but it's interesting that it was formulated that way. VGG came after; it's really deep, which is the notable part. GoogLeNet has a module that splits into branches so the model gets to decide which filter size is right: we said convolutional filters can be 3×3, 5×5, 7×7, whatever, but it's not obvious which filter size is appropriate at which stage, so here they say "we'll give you all of them, and you, the model, decide which matters more." That's the Inception module. ResNet is just really deep — some of these models have a hundred-plus layers — so they add skip connections, residual connections, to help propagate the gradient. Inception-v4 is the same idea but deeper and longer. ResNeXt — I honestly don't know what it is. And DenseNet has skip connections going everywhere, between every pair of layers. So there's been a lot of innovation in this field, but in my opinion we've largely converged at this point: googling ResNet or Inception will get you 98% of the way there, and any further optimization is specific to your task.

There are also models for other tasks — we've only talked about image classification, but for segmentation and for drawing bounding boxes there's a whole family of methods. R-CNN: the model itself proposes where it thinks the bounding boxes should be, then classifies what's in them. Fast R-CNN optimizes that with some clever tricks so it's faster. Faster R-CNN is, well, just faster. (These are the actual model names, by the way — in the machine learning conference circuit it's a contest of who can come up with the cleverest name.) Mask R-CNN does masking, not just bounding boxes — pixel-wise segmentation. And YOLO, "You Only Look Once", is really fast — it runs in real time.
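On the earlier point that an off-the-shelf ResNet gets you most of the way there: the usual starting move is loading pretrained weights and swapping the final layer. A sketch, assuming a recent torchvision (the `weights="IMAGENET1K_V1"` string is the newer torchvision API):

```python
import torch.nn as nn
import torchvision.models as models

# Load an ImageNet-pretrained ResNet and replace its final fully connected
# layer with one sized for your own task (e.g., a binary classifier)
backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Linear(backbone.fc.in_features, 2)
```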
It runs in real time. The guy who developed this, Joseph Redmon — his website is real interesting. You can go look at his resume; there's a My Little Pony drawn on it and stuff. But he's a genius, so who cares, right? And there are YOLO models running on your phone — I can pretty much guarantee it, for whatever phone you're using. FCNs are fully convolutional networks, so there's no fully connected layer. One thing I forgot to mention is that in the early convolutional networks you typically had fully connected layers at the end to do the classification — not the case anymore. And then U-Net here is interesting, because you make the image smaller, smaller, smaller, and then you make it bigger, bigger, and bigger, and you also go across, and at the end you get a segmentation mask. This is used a lot for medical applications, like X-rays and MRI scans and things like that — here's a tumor, or something like that. That's really sad, I shouldn't have said that, but I mean, it's used a lot for medical applications. I use it a lot for earth science GIS applications as well. (There's a tiny encoder-decoder sketch of the U-Net idea below.) Convolving, pooling — yeah, pretty much. Yeah, these are deep enough that they can do that. There are also a bunch of other strategies — I'm happy to talk in office hours — but yeah, this is a huge field of just coming up with new architectures.

All right, so there's a demo. We're not going to spend too much time on it because, again, I spent so much time already, but this is a Mask R-CNN working on the COCO dataset, which is an autonomous driving dataset of, like, webcams — sorry, driving cameras mounted to cars — so they use this a lot for autonomous driving benchmarks. Of course, for autonomous driving you have to decide what to do afterwards, so this isn't all of it, but this is at least the vision part, where the car knows what objects are around it. I think this one's boring, so I added another one. This is YOLO version 2 — a video that Joseph Redmon posted on YouTube when he was releasing the model. He put it out as a research thing, whatever. This is running in real time — it's a Titan GPU running at 40 frames per second. I guess it's the music that flags it. Yeah, all right, you get the point. Okay, I'm going to stop and grab my water bottle, just a second. All right, let's see if I can get back to my presentation. There we go.

Okay, and then one more that we wanted to show, so this goes even further. This is one example of something called pose estimation, specifically for humans. We're interested in getting poses — where your head is, where your arms are, where your legs are — because this helps for things like rigging for animation and CGI, so this is of interest. And this is just from the video; there's no tracking. Typically, how would you do this? You would pick key points and then use optical flow — where the pixels are moving — to track. This isn't doing any of that. It's just looking at each individual frame and identifying where the limbs and the bodies are, for each person in the video. And this is not real time — it takes forever, or at least this specific instantiation does. But it's pretty impressive that it can do this at this scale and at that precision, considering it's just one camera angle, right? It's not like you have stereo video with 3D depth mapping or anything like that.
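Here's the "smaller, smaller, then bigger, bigger, with a path across" U-Net idea from above as a very reduced PyTorch sketch. The real U-Net has several resolution levels and concatenates its skip features; this toy version uses one level and a simple additive skip, just to show the three pieces (downsample, upsample, across).

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Very reduced U-Net-style sketch: shrink, grow, and a skip across."""
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                  nn.MaxPool2d(2))         # half the resolution
        self.up = nn.ConvTranspose2d(16, 16, 2, stride=2)  # back to full size
        self.skip = nn.Conv2d(3, 16, 3, padding=1)         # the "across" path
        self.head = nn.Conv2d(16, 1, 1)                    # per-pixel prediction

    def forward(self, x):
        return self.head(self.up(self.down(x)) + self.skip(x))

x = torch.randn(1, 3, 64, 64)
print(TinyUNet()(x).shape)   # (1, 1, 64, 64) — a segmentation mask, per pixel
```

The skip path is what lets fine spatial detail from the high-resolution input survive into the output mask, even though the middle of the network works at low resolution.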
We're just retrieving poses directly from this. Okay, all right, continuing — going a little bit faster, because I'm halfway through these slides and we have 25 minutes left. All right. So we also have models for image generation. I'm sure you've heard of GANs — deepfakes have been in the news for a while — so you kind of know that this is possible. But basically it's the same CNN technology; we just arrange the models in a special way so that they learn to generate images. And the one big thing that came out of this is the adversarial network, where we have a discriminator and a generator. The generator generates an image, and then the discriminator model tries to figure out whether that image is real or not, and the whole stack learns at the same time. So as the generator improves, the discriminator improves, and the generator is therefore encouraged to make more and more realistic images. (There's a minimal sketch of that training loop below.) That's what this example is. So we have a lot of celebrity faces as a dataset — because there are lots of pictures of celebrities — and we can feed that into a neural network and ask it to generate generic celebrity faces for us. It's interesting, because it's able to do these faces, but you can also tell it tries to get the background as well, because a lot of times the celebrities are against walls with logos on them, so it tries to come up with logos too. Anyway, you can do this — I'm going to keep moving on. Oh, that's cool. It's pretty creepy. All right, moving on.

This is a favorite project of mine, so I threw it in here. This is called Everybody Dance Now, I think was the name of the conference paper. It might not be the best for everyone — there we go. All right, so this is out of UC Berkeley, out of Alyosha Efros's group. The idea is: if you have a video of a dance, and then a completely random video of someone else, it'll transfer the dance onto that video template. So it does the pose detection first, and then it generates the video from the pose plus the source video that you provided. That's fun — and it's pretty good, right, considering? Okay, where are my slides... okay.

All right, I'm going to burn through these a little bit faster again. So these filters that we learn — we can visualize them directly, which is useful for figuring out what the model has learned. It turns out the model is kind of looking for color patterns; it's looking for edges. These are the filters that are convolving over your input. Okay, so that's pretty cool, and from an interpretability and explainability sense, that's pretty interesting. Yeah, sorry — oh, forever, yeah. Lots of GPUs, yeah, weeks. I think these are research scale; anything made by Google or Microsoft is unobtainable by anyone else. That's one of the equity issues in AI research: do you have 10,000 GPUs? No? Okay, well, then you can't really do anything. So there's a lot of research into resource-scarce deep learning as well. Yeah, okay. Anyway — so we're taking each of these filters and convolving it over the entire image, and so it's able to say, for example, for this one: there are parts of the image that have this pattern. And the cool thing is, as we go deeper in the network, these things turn into shapes, right?
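Here's the generator-versus-discriminator loop described above as a minimal PyTorch sketch. The networks are toy linear models over flat 64-dimensional "images" and the data is random — purely illustrative stand-ins — but the alternating update is the actual GAN training pattern.

```python
import torch
import torch.nn as nn

# Toy generator (noise -> fake "image") and discriminator (image -> real prob)
G = nn.Sequential(nn.Linear(16, 64), nn.Tanh())
D = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(32, 64)    # stand-in for a batch of real images
noise = torch.randn(32, 16)

# Discriminator step: push real toward 1 and fakes toward 0
opt_d.zero_grad()
loss_d = (bce(D(real), torch.ones(32, 1)) +
          bce(D(G(noise).detach()), torch.zeros(32, 1)))
loss_d.backward()
opt_d.step()

# Generator step: try to make the discriminator say 1 on fakes
opt_g.zero_grad()
loss_g = bce(D(G(noise)), torch.ones(32, 1))
loss_g.backward()
opt_g.step()
```

The `.detach()` on the generator's output during the discriminator step is the key detail: it stops the discriminator's loss from updating the generator, so the two models really do learn against each other in alternation.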
So you combine these basic building blocks into shapes, like circles and blocks and things like that, and eventually, as you get deeper into the model, they seem to resemble common objects or common patterns. And so the network is combining these and saying: okay, if there are two wheels like this, and then a block-like shape here, that's probably a car — that kind of thing. Does that make sense? Yeah — thanks for the segue into the slides. So these images that are interspersed in between here, these are the images that best activate each filter. That's also a way of interrogating what the model has learned: you provide it a set of images and say, okay, this neuron really likes this image, so it must have learned the features in this image. So this goes into interpretability and explainability. Okay.

Okay, so for sequences — which we have 20 minutes to cover — we could convolve over them as well, and we do do this: we can treat it as one-dimensional convolution instead of two-dimensional convolution (there's a one-line Conv1d example below). And we can also convolve in non-Euclidean spaces. We've been talking about arrays so far, but graph convolution is a thing, right? So if you have graph data — and graph machine learning has been taking off fairly recently; I don't know if it's going anywhere, but it's certainly taking off — you can convolve over graphs to do this kind of thing. And that's very relevant for animation. I'll show this really briefly: companies like EA have an interest in taking a voice actor's recording and converting it into a character animation. So, given just the audio file, it's able to generate these mouth shapes and animate your characters. That's why very recent triple-A video games look like this — that's the technology they're using. Okay.

Okay, so a quick recap on CNNs: we're taking a standard neural network, and we're using these concepts of locality and translation invariance as prior knowledge of structure that we can leverage to do image tasks. This drastically limits the number of weights we have to learn, and it encourages the network to do a lot better on this type of data. And then, using that backbone, we can scale it to a whole bunch of tasks based on image and video data. All right, any questions on CNNs before I move on to RNNs and blow through it? All right, great.

Okay, so recurrent neural networks: in the same way that CNNs work for two-dimensional arrays, recurrent neural networks are for sequential data. So for something like speech recognition — where right now my phone is taking my recording and transcribing everything I'm saying into words — you have to take a waveform, which is how loud a certain frequency was at a certain point in time, and convert that into syllables, or into words. This is the same thing we did with the image example: there are relevant sounds in audio and there are irrelevant sounds in audio. If I cough, that's not a word, right? So it needs to lock onto the important signals while ignoring the irrelevant ones — and also the volume: just because it's louder doesn't mean it's a different word, right?
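The 1D convolution mentioned above is a one-liner in PyTorch. The clip length and channel counts here are illustrative, not from the slides; the point is that the kernel slides along the time axis instead of over a 2D image.

```python
import torch
import torch.nn as nn

# Hypothetical audio-like input: batch of 4 clips, 1 channel, 16000 samples
x = torch.randn(4, 1, 16000)

# 1D convolution: a length-5 kernel sliding along time
conv1d = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=5, padding=2)
print(conv1d(x).shape)   # (4, 8, 16000): 8 learned temporal feature maps
```

Note the fixed-window limitation discussed next in the lecture: each output value here only ever sees 5 time steps of input, no matter how long the clip is.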
It needs to ignore volume and pay attention to everything else that defines which words are being spoken. So again, same structure as with the images example: the mapping is really difficult to define by hand. I'm not going to say any of those wake words and trigger all of your phones, but for Google and Siri and things like that, it needs to take just that audio and identify the words. So how do we define the network architecture for this kind of sequential data? The main problem with sequential data is that inputs can be variable size. We mentioned that images can be variable size as well, but this is especially the case for sequences: audio clips are different lengths almost all the time. So we need an architecture that's amenable to that. And the point being made here is that you can use convolutional neural networks, but CNNs have a fixed input window size. You know how we said three by three and five by five? If we do one-dimensional convolution, the window we're convolving over is likewise only three or five time steps wide, or so. We can't take global information from the entire sequence into account when we're doing our modeling.

So it turns out sequential data has its own inductive biases that we can leverage. The first one here, locality, is that nearby regions are usually related: data points at nearby time steps are generally related to each other — nearby audio sounds are part of a single syllable, et cetera. So that's locality. And the other one is translational invariance: no matter when I say the word "apple," at the beginning or the end of a recording, it still needs to be recognized as "apple." So on a different axis — time — this problem structure has the same kind of inductive biases we considered for CNNs.

So, to mirror the sequential structure of the data, we can also set up our model to process data sequentially. What we're doing here is: we have the input at each time step, and we're applying the same weights at each time step, but we're keeping a memory — we're remembering what we saw before, through this green arrow — so that we can take it into account when we predict the next step. Okay, so imagine this is audio at the bottom, and at the top is a transcription of the words. We're generating words from the audio: if these are syllables, combinations of syllables form words, right? That's what this green arrow is: remembering what came before in the sequence to make predictions after. Oh — that's a whole thing, come to office hours. Yeah — attention you would add on top of this; LSTMs aren't attention exactly, but you would have attention layers on top of this.

Okay, so the way we're going to formalize a recurrent neural network is: we have some input, and then we have this single node here, which is just W times x, but we're also taking in another input, and that's the previous node. So that updates the hidden state: we have some state, and we're updating that state with the previous state and the new input. That gives us a new state — a new memory — and then from that memory we predict the output, right?
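Here's that recurrence as a minimal PyTorch sketch. The names and dimensions (`in2hid`, a 32-unit hidden state, and so on) are illustrative choices; the structure — new state from previous state plus new input, output from the new state, same weights every step — is exactly the update just described.

```python
import torch
import torch.nn as nn

class TinyRNNCell(nn.Module):
    """One recurrent step: update the memory from (previous state, new
    input), then predict an output from the updated memory. The same
    weights are reused at every time step."""
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.in2hid = nn.Linear(in_dim, hid_dim)
        self.hid2hid = nn.Linear(hid_dim, hid_dim)
        self.hid2out = nn.Linear(hid_dim, out_dim)

    def forward(self, x_t, h_prev):
        h_t = torch.tanh(self.in2hid(x_t) + self.hid2hid(h_prev))  # new memory
        y_t = self.hid2out(h_t)                                    # new output
        return y_t, h_t

cell = TinyRNNCell(10, 32, 5)
h = torch.zeros(1, 32)
for x_t in torch.randn(7, 1, 10):   # a 7-step sequence, one cell reused 7 times
    y, h = cell(x_t, h)
```

The loop is the "green arrow": `h` carries what the network saw before into the next step.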
So we do the weights W times the previous memory and the new input, and then from that updated memory we predict the new output with another layer. Okay, and obviously you can do this as many times as you want. You can also unfurl it like this, if you want to envision it in a standard neural network sense. I think this is more confusing — I find the recurrent picture easier to understand, personally — but if you really need to put it in that kind of format, that's what it looks like, okay?

And the point being made here is that because we can unfurl it, we can do backprop exactly the same way. Now remember: at each of these orange nodes, these are all shared weights. It's not that they have different weights — it's the same W. But the way we update this W is that we now need to backpropagate over time: this W gets updated on this step of the sequence, then on this step, and the same W gets updated again, and so on. At the end, whatever W you end up with, you use going forward — the same one at each step. We're not learning different W's for each time step; it's the same one. That's the shift invariance we're talking about.

The primary difficulty of doing this is that the longer your sequence gets, the further you have to backpropagate over time. And remember we were talking about the vanishing gradients problem with deep CNNs? Same exact thing with RNNs: if your sequence is too long, you lose your gradients and you can't learn as well. Yeah — not necessarily, it's not weighted. It's just that every time we take a derivative, because of the way ReLU works — it's zeroed out below zero — and because of how the various nonlinearities behave, the gradient shrinks until there's none left, until the derivative literally becomes zero.

So how do we solve this? You could do it by adding skip connections over a huge number of steps, which is not a good idea, because it gets very complex. What we did instead is we added memory — watch this. This is something called an LSTM. I'm going to cover this very briefly because we're out of time — we have 10 minutes left — but this is real fun, this is real interesting. They added what is basically RAM to an RNN. So the input informs the following: first, it informs whether we should forget what we knew before — this is saying how much of the previous state we should remember, and it's also informed by the immediately preceding step. But this is like a continuous memory — RAM that's maintained throughout the entire network. Okay, so that's cool. We can forget it if we need to; if we don't want to forget it, it instead goes into the input, and then it updates that internal memory. Did I get that right? Yeah, okay. So, combining with whatever we decided to forget or not forget, we update that internal memory. And I'm not exactly sure why there's — oh, thank you. Appreciate it. Okay, good, thank you. So this is the actual input itself, and this decides how much of the input to use when updating the memory. Thank you. And then this gate — you use that information, as well as the input again, to generate the output that gets produced. So it's just a bunch of latches and gates deciding how much of the memory to use, how much of the input to use, et cetera.
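For reference, here are those gates written out in one standard textbook formulation (the slide's notation may differ slightly; $\sigma$ is the sigmoid and $\odot$ is elementwise multiplication):

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate: how much old memory to keep}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate: how much new input to write}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate memory}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{updated internal memory (the ``RAM'')}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate}\\
h_t &= o_t \odot \tanh(c_t) && \text{new hidden state / output}
\end{aligned}
```

The memory update for $c_t$ is mostly additive, which is what lets gradients survive over long sequences much better than in a plain RNN.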
That's all you need to know. Yeah — it's an advanced version of the RNN that does a better job of deciding how much memory to use, how much memory to keep, et cetera. It is more complicated, with more weights, but this was a big revolution at the time as well: it performed significantly better than plain RNNs on specific tasks, especially text, NLP, that kind of stuff. Yeah — no, because, as we mentioned, this is the same W. The longer the sequence gets, you just end up doing longer backpropagation: you use the same W as many times as you need to, for as long as your sequences are, right? Yeah. The full LSTM walkthrough — I'm so not going to walk through that. Okay.

All right, and then we decided to get even fancier and generate all kinds of memory networks. Hopfield networks were like the OG; GRUs are simplified versions of LSTMs, appropriate for different things; and then we got neural Turing machines, which are fun because they use tapes, like the Turing machine from theory of computation. And it gets way more complicated from here — I'll leave it at that — but they start to have their own internal registers and RAM, and memory transfers, and you shift things in and out of memory.

Okay, so up till now we've considered the output of the network to be only a function of what precedes it — we only go in one direction. But think about language translation. If you're bilingual, you know that the order of nouns and verbs is not always the same across languages. If I say something in English and I want to translate it to Korean, I can't just go word by word, because something that shows up later in one might need to show up first in the other. This is pretty famous for Latin languages too, like Spanish versus English — the order of phrases and such is totally different. So what we can do, instead of just going in one direction — and this helps for other tasks as well, because it helps to know the global sequence before making your decision — is go in both directions at the same time. And that gets real fun. So you have two layers of hiddens, each going in a different direction, and both of them help influence what the output should be. (There's a short bidirectional example below.) Okay, and of course there's an infinite number of variations on that as well.

So, some different tasks that use this kind of structure: audio classification, obviously — we talked about that. Handwriting classification — you know how you sign on credit card readers? That's a vector: instead of an image of a signature, you actually get the order of the pixels drawn over time, so you can do handwriting recognition based on the angle and the speed at which you're writing. And then text classification, and text tagging if you're into NLP.
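The bidirectional setup above is a single flag in PyTorch. The sizes here are illustrative; the point is that the output at every step concatenates a left-to-right hidden state and a right-to-left one, so each position sees the whole sequence.

```python
import torch
import torch.nn as nn

# Bidirectional LSTM: one pass left-to-right, one right-to-left,
# and both directions' hidden states feed the output at each step
rnn = nn.LSTM(input_size=10, hidden_size=32,
              bidirectional=True, batch_first=True)
x = torch.randn(4, 50, 10)   # batch of 4 sequences, 50 time steps each
out, _ = rnn(x)
print(out.shape)             # (4, 50, 64): 32 per direction, concatenated
```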
There are some problems where you would like to know whether a word is a noun or a verb, because it can be either sometimes — and bidirectional means you can get the full context of the sentence in order to determine which it is. Yeah, okay. And again, there are tons of options you have here. You can do one-to-one, one-to-many, many-to-one; you can delay it, shift it a little so it waits a bit before it starts making its decisions; many-to-many — there are a lot of variations on this kind of architecture. And you can layer them: you can have multiple layers of RNNs on top of each other, if your task is so complex that you need multiple layers of memory. And then it gets bigger and bigger and bigger, and now you need a server farm to train your model.

And then, this was initially how text generation was done. So like ChatGPT — those transformers are way past this now — but initially, when we were doing that kind of research, you would use the previous output as part of your input during generation, and that lets you write stories and things like that. That was the technique we used for text generation initially. (There's a tiny sketch of that feedback loop below.) You can also do this with pixels, with images, if you treat an image like a vector. I'm not a big fan, but this is back, because of transformers — we're not going to cover those in this class, since it's not a deep learning course, but transformers have been hugely influential for NLP, and now people are trying to use transformers for images, and you kind of need to do this kind of thing: you consider each row of pixels as a sequence and go off of that. And you can do MIDI music generation, which is cute. This is Google Magenta — you can Google that; they have a lot of different AI art and music projects that are open source, so you can go play around, and it will generate music for you, things like that. But yeah, we fit it in in time.

So just to quickly recap RNNs: we exploit sequential structure. Again, locality — points at nearby time steps are related — and shift invariance: structure at any point in the sequence needs to be treated the same. We get that by coming up with an architecture where we take the same weight node and use it across the entire sequence. And again, the point is that we're reducing flexibility. This is less flexible than a fully connected network, 100% — that has more weights and can express more functions — but for the task of working with sequential data, this learns much faster. So we're actually able to train it with a reasonable amount of data and a reasonable amount of compute, and then we can do a lot more stuff with it. Same thing with CNNs.

So, to briefly recap before you start packing: we've used priors and inductive biases to scale networks from this to other tasks on images and sequences, by taking advantage of those priors. The world we live in is spatiotemporal — those are the axes we work in — so we're constantly getting sequences of spatial sensory inputs. Actually, when Dr. Lukas Mandrake was here, he talked about this as well: spatiotemporal data is everything, even satellite data — it's satellite images over time, as the satellite orbits the Earth, right?
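Here's the "previous output becomes the next input" generation loop as a minimal PyTorch sketch. This is a toy character-level setup — the vocabulary size, the start token, and the greedy argmax decoding are all illustrative assumptions, not the lecture's specific model.

```python
import torch
import torch.nn as nn

# Toy autoregressive generator: feed each output token back in as input
vocab, hid = 27, 64
embed = nn.Embedding(vocab, hid)   # token id -> vector
cell = nn.GRUCell(hid, hid)        # recurrent state update
head = nn.Linear(hid, vocab)       # state -> scores over next token

token = torch.tensor([0])          # some hypothetical start token
h = torch.zeros(1, hid)
generated = []
for _ in range(20):
    h = cell(embed(token), h)
    token = head(h).argmax(dim=-1)     # previous output becomes next input
    generated.append(token.item())
print(generated)
```

In practice you'd sample from the score distribution rather than always taking the argmax — greedy decoding tends to loop — but the feedback structure is the same.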
So if we're going to have robots living in our homes doing tasks for us, we're going to have to design machines that can take advantage of those spatial and temporal structures, and CNNs and RNNs are the building blocks that allow us to get there. And when we combine them, we can end up with a model that can serve as an agent for interacting with an environment. CNNs give us the vision portion, and RNNs give us the over-time portion, and then you can train something like this — this is by DeepMind, 2016, so it's pretty old — where it learns how to play an FPS kind of game where it has to pick up all the apples. So again, the only inputs to this model are the image of each frame and the reward function — those numbers are given back because the model has to know whether or not it's doing well — but the input is just the image, and yet the model is able to learn something this complex. (There's a small CNN-plus-RNN agent sketch below.) And if you Google reinforcement learning — I think DeepMind is the one trying to learn StarCraft II, and OpenAI is trying to learn Dota — again, they're not plugged into the game engines; they're just using the screen images and trying to get the model to learn. I think that StarCraft model beat a couple of pro players recently, but people were saying it presses more buttons than a human possibly can, so there's a little bit of a cheating thing going on. But there's a lot of work here.

Is this — that okay? All right. So we covered a lot of different topics. If you want to come to my office hours and talk about specific things I'm interested in: when I was in college I had some interest in interpretability, explainability, and robustness. So: can you explain what a model learned by actually looking at the weights and the different characteristics of that model? There's a huge field around that, because right now neural networks are black boxes. They learn, but we don't know why they're making the decisions they're making — we just know how well they're doing — and that's dangerous for certain applications. So there's a considerable amount of interest in going into the network, surgically tearing it apart, and figuring out what it actually learned and why it's making the decisions it's making. Also explainability — interpretability and explainability are slightly different; some models can be designed to be explainable from the start. For example, decision trees are extremely explainable, because you can follow down the nodes of the tree and figure out why it made the decision it did — but neural networks are a lot harder to do that with. And then robustness: how can we design neural networks — not just as a model but as a system — to be robust to our world, and not error out on common edge cases? The example I added here: you can slap a sticker on a stop sign and trick a neural network model into detecting it as a 45-miles-per-hour speed limit sign, from any angle, even if you move the camera. So that's an interesting question: why does it do that? Obviously, to us it looks like a stop sign — so why doesn't it to a neural network model?
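Here's the CNN-plus-RNN combination as a minimal PyTorch sketch — not DeepMind's actual agent, just an illustration of the wiring: a small CNN encodes each frame (the vision part) and an LSTM carries state across frames (the over-time part). All sizes and the four-action output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrameAgent(nn.Module):
    """Sketch: CNN per frame for vision, LSTM across frames for memory."""
    def __init__(self, n_actions=4):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
                                 nn.Flatten())
        self.rnn = nn.LSTM(input_size=16 * 15 * 15, hidden_size=128,
                           batch_first=True)
        self.policy = nn.Linear(128, n_actions)  # scores over possible actions

    def forward(self, frames):                   # (batch, time, 3, 64, 64)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)  # encode frames
        out, _ = self.rnn(feats)                 # integrate over time
        return self.policy(out)                  # action scores at each step

agent = FrameAgent()
print(agent(torch.randn(2, 5, 3, 64, 64)).shape)   # (2, 5, 4)
```

Training such a thing from reward alone is the reinforcement learning part, which the lecture only gestures at; this sketch covers just the perception-plus-memory architecture.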
That's a pretty cool question. I also do a lot of deep learning for Earth and space science: deep learning models for methane plume detection using imagery from imaging spectrometers; models for global vegetation structure — how tall trees are around the world — which is important for carbon stock, measuring how much carbon is stored in forests and how much we're losing to deforestation; models for biosignature detection, for potential missions to Europa, the ocean world — there's a lot of water there, and if there are things alive in it, we want to train models to detect that there are things alive in it; and then we do some Mars rover and orbiter image classification, which is fairly straightforward — we have millions of images from Curiosity and Perseverance, and we want to help the scientists find what they're looking for, so we run them through image classification models. So if you want to talk about any of that, I have office hours after this lecture, for one hour, in Annenberg — we're just going to walk over right now, come by. My next office hours after that are tomorrow morning, 10 a.m. to 12 p.m.; I'll just be working, and you can come by and ask questions about anything, like your homework. All right, we got through it. Thank you. Four minutes over again — we'll work on that.