Yes, and let me welcome Prateek Chaudhary from the University of Pennsylvania, who will be delivering three lectures on the principles of deep learning. There is teaching material already uploaded on the website and there will be more over the next lectures. Thank you Prateek, please. Good afternoon everyone. Nice to meet you. I'm sorry I cannot be there in person; I cannot travel at the moment. But over the next three days, I would like to tell you about some ideas in what is called deep learning. Deep learning is a subfield of machine learning. For those of you who may not have heard about this, the very zoomed-out version is this: people began to think about how to make machines learn in the sense that we know the biological brain learns. It looks at patterns of things around us and then makes inferences when it sees new data that is similar to the data it has seen in the past. That is what one would define as learning. Deep neural networks, which is what people call them, are machine learning models, artificial learning models, that are inspired by how the human brain learns. They consist of things like neurons, or rather mathematical abstractions of neurons. There is a long history of studying these kinds of learning machines, so we'll talk a little bit about the history first. I will tell you about a lot of concepts and ideas in how people think about deep learning today; this is the content of the first lecture. The next two lectures will be more like research lectures, where we'll talk about some more modern or cutting-edge ideas on how deep networks work, when they work, when they do not work, etc. Before we begin: on the ICTP website I have uploaded these notes. This is a PDF. It is a rather long PDF, so do not be scared; it's not as if I'm going to go through the entire PDF in one and a half hours. We'll do it in bits and parts. There are also three Jupyter notebooks, and we will go through those notebooks very briefly at the end of the class to give you some appreciation for what code in deep learning looks like, if you have never done deep learning before. Let us begin. I call this series of lectures, very boldly, principles of deep learning, but it's a little bit of an oxymoron because deep learning is a very new field and there is not so much agreement on what the principles of deep learning should look like. So this is my attempt, a long-standing program at this point of time, to lay down some principles for how we should think about deep networks. Okay. Just to tell you my personal perspective on what learning means, let us take a few minutes to set the stage. Intelligence. Well, each of us would define intelligence in very different ways, but most of us will agree that humans are intelligent; all of us are clearly intelligent, that is how we define ourselves to be human. You will also agree that a dog is a tiny bit less intelligent than a human being. It cannot do all the things we can, it cannot do integrals, but it can do something: it can fetch a frisbee when you throw it, and that counts for something, because it can walk, it can sense, it can smell. And there are creatures that are a tiny bit less intelligent than a dog and can affect the world around them in lesser ways than dogs or human beings.
But they can do things, they can work in communities, they can achieve things that are much larger than their physical capabilities by using their environment wisely, and this is the kind of intelligence that these biological organisms have. Now, if you define intelligence this broadly, then there is one goal for what it means to be intelligent that stands out in front of us. And that, I would argue, is survival. If you are an entity that needs to survive, and needs to survive in spite of changes in the environment, in spite of things that you may not be optimally suited to deal with, then you require some intelligence to adapt. This is how we would like to think of intelligence; this is biological intelligence, and many things in nature have flavors of such intelligence: things can gather food, things can find mates, can reproduce, and at the end of the day survive with respect to changes in the environment. And also lesser beings, let's say plants, which would be a tiny bit less intelligent: they cannot move around, but they can do certain rudimentary things. You might have a potted plant on your desk; it moves towards the window over a few months, and that is a notion of it trying to seek out more nutrients from the environment. Just as a way to begin the lecture, I wanted to give you an example of this. This is what is called a tunicate. Tunicates are creatures that live on the ocean floor. When they're born, they're actually animals, so they have a nervous system of ganglion cells, and they crawl around on the ocean floor until they find a nice big fat piece of rock with a lot of moss on it. And once this creature finds this nutritious rock, it settles there, and then it digests its own brain, because it has no need for that brain anymore. After that it is vulnerable to its surroundings, so it grows this tunic around itself to protect itself from its environment, and at that point it becomes much more like a usual plant that we know on the surface of the earth. Okay. I usually have this picture here as a metaphor for professors: they're all very smart when they are young, and then they grow old, and then they stop using their brains too much. I can make these kinds of jokes because I am still too young. So this is one notion of intelligence which morphs through its lifetime: it ceases to be a walking and acting animal and becomes more vegetative. This is the way I would like to think of intelligence, and when you define it like this, consider things like a chess-playing program or AlphaGo. AlphaGo is a machine that a company called DeepMind made. DeepMind is owned by Google now; they're a company based in London that works on artificial intelligence. They created this machine to play Go. Go is a game much like chess; it is very popular in Asia. After they made this machine, and I don't want to show the video, you can look at it later, they had this very famous match with the Korean Go champion called Lee Sedol. And it defeated Lee Sedol in the first three matches. There is a very beautiful movie, you can find it on Netflix, about their process of developing it and all the little tricks that they did to make sure that it performs well, etc. It was a very capable machine that could defeat this human player who was widely regarded to be among the best ones around.
And if you look at the movie, you will see that there is a little scene where an American Go champion is talking about how a mere human, even the best human at Go, can possibly defeat a machine as good as AlphaGo. And he says this: if you want to defeat a chess-playing program or a Go-playing program, all you have to do is pull the plug, and the machine cannot play anymore and you win by default. This is a rather trite statement, but it has a nugget of truth. The chess program cannot fight back, and this kind of intelligence is very narrow in how it interacts with the world. It can do one task, which is playing chess in a very, very good way, but it cannot do any other task, and it certainly cannot do the diversity of tasks that humans can. So a key indicator of intelligence, and this is how I define it as a roboticist, is the ability to move around and affect the world around you. And as soon as we define intelligence like this, the ability to move around and survive in the world, a few things become very integral to this definition. The first one I would call perception. Perception is the ability to sense things in the environment. It comes from different sensors: eyes, ears, nose, touch, and different intelligent beings will have different kinds of sensors. Once you obtain information from the sensors about your environment, you would like to do something with this information, you will have to crunch it. Cognition is the way that you crunch information about your surroundings. But then you have to do something with this information, otherwise you cannot move, otherwise you cannot affect the world around you. That is what I would call action. Action is the ability to affect things; action is the ability to move around. And these three act in unison. It is not a feed-forward, sequential decision-making process: if you just watch something on video, think about it, and then act, you will not be very intelligent. Think about how you behave when you lose your phone, or when you are searching for your keys before leaving the house: you actively take actions that will give you new kinds of information, and once you take the action, once you have that new information, you know what action to take next based on it. So there is a loop that connects perception, cognition, and action, and the interconnectedness of this loop is central to how you and I live in the physical world. This series of lectures is not going to focus on the entire loop. There are many ways of thinking about this: people in computer vision primarily deal with the first part of this loop, people in control theory deal with the third part. What we will think about is a very narrow problem, which is cognition. We are not going to worry about where the data comes from; we are not going to worry so much about how to use the data. We are only going to make predictions on this data. And we are going to define some principles for how we should crunch information that comes from our sensors to make decisions, without ever actually checking how good the decisions are. Okay. So here is the goal of the following lectures. The goal of learning, or the goal of machine learning to be more specific, is to crunch past data and build a prior for what you may see in the future.
It is very crucial to realize, to have it firm in your mind, that all that learning can do is build a prior. The kind of actions that you take will necessarily depend on what you have perceived at the very moment. So we should always think of learning as a way to summarize past data and make the process of taking actions more efficient. Virginia gave a talk on inference; you should think of learning as enabling inference rather than replacing inference. A chess-playing program necessarily uses the current move of the opponent to take the next move. Right? It would be a very bad program if it simply used the statistical distribution of all the moves that people have played as the 17th move of the game and played that move; that wouldn't work very well. So you should never think of a machine learning model as something that takes a data set from a hard disk and then makes predictions. This is how we like to formalize it sometimes, but that is not enough if you want to think of a real problem, a real system that makes predictions and takes actions. We always have to think of inference in addition to learning: learning is simply summarization of past data, and inference is the process that uses this prior to actually take the actions. The better you are at crunching past data, the less work you have to do at inference time and the better actions you will take. Okay. Any questions or comments on this motivation? Roughly speaking then, let me give you a very, very brisk summary of how people began to think of these ideas. The way I introduced it, these ideas began in the early 40s. These are two gentlemen: this is Warren McCulloch, who was a neuroscientist, and this is Walter Pitts, who was a logician. Walter Pitts was in Chicago; Warren McCulloch was at MIT for some time. Together they built what we would call the first abstraction of how a biological neuron works. A biological neuron is a complicated thing: it has dendrites, an axon, and synapses, and it interacts with the neurons around it in complicated ways using biochemical reactions. One very, very coarse way of abstracting away all the computations that are performed by the neurons in the biological brain is to imagine that it is a machine that has some inputs: x1, x2, x3, up to xn are, say, all the pixels of the image that you're trying to process. These are all the inputs that you get. You compute some function of these inputs; it could be a complicated function, it could be a simple function, that is just your choice. And you predict some output; it could be a complicated output, or it could be simply zero or one, whether or not there is an orange in this image. This is, as you can appreciate, a very coarse abstraction of how a biological neuron works. And this is what McCulloch and Pitts discussed in this paper, which many would call the beginning of thinking about neural networks. They wanted to imagine how we can build artificial systems that capture the computations in the brain. Around the same time, Alan Turing was also thinking about very similar things. He also wanted to capture, to summarize, the binary nature of the neuron in the brain, whether it fires or whether it does not fire after receiving certain stimuli, and to capture these principles in more abstract notions of computation. This was after he had developed his ideas on computation in the late 30s.
This was roughly the early 40s. And the neurons that McCulloch and Pitts developed, or that Alan Turing developed, and you can read these papers to know more, essentially contain all the germs of the neurons that we use today in artificial learning; the kind of neuron that we use has not changed very much. Of course, neuroscience has advanced quite a bit over the last eight decades since this happened, in understanding how the biological neuron can be modeled and understood, digging deeper into the little channels that open up when these neurons communicate with each other. But at the mathematical level, we haven't started playing too much with these models yet. Around the same time, in Cambridge, Massachusetts, Norbert Wiener, who was a mathematician, was trying to think about the kind of things that I said at the beginning of the lecture. People would call this cybernetics, which is a word that he coined, to capture the notion that an intelligent agent is something that has sensors, that takes actions, and that performs some computation as to how it should take those actions based on the inputs. Okay, so these are some interesting people to look up if you want to read the really early work on neural networks: McCulloch, Grey Walter from England, who made some of the first robots that could show autonomous behaviors, and Pitts. Well, that is the super early era. I would like to now move to the 60s. Let's say in the late 40s or early 50s, Claude Shannon, who was a mathematician, developed what we know today as information theory; he essentially began the field. And the premise of information theory for us, when we think about learning, is slightly different from how Shannon was thinking, or how communication theory thinks. In communication theory, you would like to say: I have a piece of data, I would like to summarize or encode this piece of data, add some redundancy, and then transmit it across a channel. This channel is a physical thing: the wireless signal that is connecting your phone to the Wi-Fi has disturbances that come from other phones connected to the same Wi-Fi, from EM radiation from the lab above, and many other things. Whatever way you use to encode your message has to be resilient to these disturbances. Shannon was interested in principles that define how well I can encode things and how well I can use the channel, so to speak, to transmit information from one place to the other, from the source to the destination. In machine learning, we think about this in a slightly different way. We are not so much interested in simply encoding data; we are interested in encoding data for a specific purpose. We would like to take signals that consist of images, text, sounds, etc., which are roughly speaking continuous signals, and understand from them certain abstract concepts, and you can give names to these concepts: ideas, objects, categories, phonemes, call them what you will. These are discrete objects. Why do we want to do this? Well, we would like to do manipulations with them. We would like to say how many dogs there are in this scene, what the dog is standing on, etc. We would like to make logical inferences upon such data. So we in machine learning are necessarily interested in throwing away a lot of redundant information from the data in order to reach these more abstract inferences about the images.
And in that sense, it is a little different from how classical information theory thinks about things. Information theory does not like to throw away stuff, because why would you throw anything away by choice? What you want to do is protect yourself from the disturbances that can happen in the channel by adding redundancy. We in learning want to throw away stuff, because we don't want the fact that a dog is a dog to depend on whether it is standing on the grass, or whether it is standing on the beach, or whether it is catching a frisbee. We want the concept dog to be encoded in our representation, and the rest of it is a nuisance to this representation. So this is how we will think about representation learning, and you will see more in the coming lectures. The first, I would argue, computational way of building a neural network was done by Frank Rosenblatt in the late 50s at Cornell and a Navy lab. What he built, and you can read this very beautiful article on it, is a five-ton machine that could perform a very simple task: it could distinguish between punch cards that are punched on the left and punch cards that are punched on the right. Now, all of us are way too young to ever have seen a punch card, but a punch card is about the size of a credit card, slightly bigger I think, and it was used for performing certain kinds of computations because you encode data as the pattern of holes punched into the card. Okay. And the same abstract model of a neuron that we saw for McCulloch and Pitts is what Frank Rosenblatt coded up: x1 to xd, think of them as inputs, input pixels. You multiply them by some learned weights w1 to wd; these are the parameters of your neuron. This is just one particular way of writing down the model of a neuron; it's a linear model. Apply a sign function: the sign function is something that takes its argument and returns a one if it is positive and a minus one if it is negative. And at the end of the day, this entire function, I'm using it just to set up some notation at this point, is a linear operation on the inputs x using the parameters w. We will write it like this; it returns a Boolean at the end of the day, which is what you want for classification. So if you wanted this machine to say oranges versus apples, then oranges would be plus one, apples would be minus one, and x would be the image, all the pixels in the image. And we will find the best w that is good at predicting apples and oranges correctly on a bunch of data that we obtained: images and the corresponding ground truth labels. Yeah. Now this is a linear model. In the 60s, Marvin Minsky, who was a famous artificial intelligence professor at MIT, and Seymour Papert began to study this, and they said: we know that this is a linear model, so it obviously cannot solve problems that look like this. This is the XOR problem: there does not exist a hyperplane or a straight line that splits it cleanly, which is what we are fitting here at the end of the day. And this clearly indicates limitations of such linear models. It's a quirk of history that while they were simply stating this as a fact, not necessarily dissing the perceptron, other people around them understood this as an obvious limitation of what would be called the connectionist approach, which is this way of writing down models of artificial neurons and doing computations with them.
And so people essentially stopped working on neural networks in the late 60s, and then there was a heyday of artificial intelligence in the 70s where people were using logic and motion planning and things like this to build intelligent behaviors. Then at the end of the 80s, in the late 80s, deep learning had a big resurgence, and it was quite similar to what we see today: everyone was working on this, everyone was very excited about this, primarily because Rumelhart and Hinton rediscovered backpropagation. I say rediscovered because it had already been discovered a good 20 years earlier in Japan, though it wasn't known so widely. But roughly speaking, people could train neural networks with multiple layers in the late 80s and the early 90s. At the same time, people began to incorporate ideas from neuroscience, Hubel and Wiesel's work in the 60s, and to capture these understandings of how the brain is structured into artificial neural networks. One big innovation of this kind was the convolutional neural network. Just like we had a linear function that connects inputs and outputs, and you apply a sign to it, a convolutional neuron would convolve w with x and return some summarization of this convolution vector. Convolutional neural networks were created because people knew that convolutional filters can give you features similar to those you see in the visual cortex. The Neocognitron is probably one of the more famous models. And Yann LeCun, who is a professor at NYU, built some of the first good convolutional networks in the early 90s. Now, if you talk to some of the older folks working on neural networks, they'll tell you that in the mid 90s it was, let's say, appreciated that neural networks did work well, as well as any other machine learning model, but it was very difficult to use them. It was extremely challenging for people who did not know what deep learning or neural networks are to get good results. So Yann LeCun would get good results with a convolutional network and nobody else would, because he was good at using them. And that doesn't help, right? You need models that many people can use well to solve their own problems. That is why models like support vector machines became so popular: they were very straightforward models that did not require you to think very much or do very much to solve a new problem. And they worked quite well, maybe not as well as the best neural network even back then, but they did work quite well. In the same sense, we will look at a library called PyTorch in a bit, but we should give equal credit to all these libraries, which popularized deep learning in the mid 2010s and made it so easy to use that everyone in this room can run a small network in a couple of days without having to do a PhD in deep learning. They deserve that credit; there is a lot of encapsulation of complex ideas happening in these libraries. And the goal of these lectures is to give us the ability to dig deeper into the libraries than what the syntax allows us to understand. Okay, so these libraries are very complicated. What we want to try to do is understand some of the moving parts, so that we become better at using these models, not just able to use these models. Okay. The watershed moment that leads us to today is the summer of 2012.
There is a competition called ImageNet, back in the day. This is a competition where you are required to classify about 1.3 million images. These are images taken from a website called Flickr, which people might or might not know about. It is a large number of images; 1.3 million is a pretty large number, so if you save it on your hard disk it requires about 50 gigs, roughly speaking, and there are 1000 different classes. Now these classes are things like different kinds of dogs, about 118 different breeds of dogs, most of which I do not know. There are different cars, there are different planes, and many other things that we see, both man-made and naturally occurring objects. The goal of this competition, or the reason they held this competition, was to build a machine learning model that could classify this data set as well as one can. Until about 2011, the best methods, which were essentially large ensembles or random forests, if you know what those are, would get something like a 25% error, with a lot of grief; these are large, complex systems. In the summer of 2012, Geoff Hinton, along with a few students in his group at the time, built a convolutional network that dropped this error from 25 to 15.3. To give you an appreciation for why this was important, it feels pretty silly, these computer scientists talking about their numbers and percentages, but in the five years or so before 2012 this number had come down from 30% to 25%. In one year, in the next iteration of the competition, it dropped to 15, and in the successive years after that this number has come down all the way to about three or four now, which is pretty crazy. So that is what got everyone's attention, and that is when people realized: aha, there might be something that these deep networks are good at, image classification specifically; we should work on that. The reason for this success is, roughly speaking, the availability of large GPUs. GPUs were becoming very powerful around that time; the shader cores were being refashioned to perform these computations because of the large number of parallel threads. So take the availability of GPUs together with nice data sets: ImageNet did not exist in the early 90s, so you would never know that a neural network works this well even if you could train one, and GPUs did not exist in the early 90s, so you wouldn't be able to train one to begin with. These two things, roughly speaking, led to this modern version of deep learning that we look at next. The problems that we are using to train these networks, and the understanding that we have gleaned as to how best we can build these models, are not that different. It is basically the same ideas that we had in the 90s put on steroids, because now we have access to better computation, so we can do more precise experiments and understand which ideas work well and which do not, but essentially they are the same ideas. Okay. Is that okay? Any questions before we go to the technical stuff? Questions? Okay, please go ahead. Okay, cool. Thank you. So hopefully that gives you some pointers as to what to read next. Let us begin formalizing, or at least writing down, what a neural network is the way we think of it today. This might be repetitive for some people who have seen machine learning before, but let me say it anyway in brief.
In machine learning, we are interested in solving problems of the following kind. You have inputs x, and you are making predictions y; we will always use this notation. Nature gives us some ground truth. It could be objects that are annotated in images by a group of people, and those would be the ground truth. It could be an experiment that you designed where you took a slice of a cell, put it under a microscope, and stained it in different ways to say these are the bad cells and these are the good cells; that would be y. It could be the different phonemes that you hear when you listen to an audio signal. Many different kinds of things, but the structure in which we are going to work is always this: imagine that someone else gives you the ground truth labels, and our job as a machine learning model, or a machine learning researcher or user, is to build a machine that can predict y accurately given new inputs x. Okay. To do this, we are going to have access to a data set. This data set comes with a promise: it comes with the promise that all the inputs and the outputs are drawn from the same distribution, so P is a joint distribution of x and y; this is how we classically formalize things. And, given that they are from this probability distribution, I am going to ask for n samples. This will be my training data set. I can do whatever I want with these n samples: I can remember them in a hash map, I can build a model that predicts y on these input images. But that is not what I want to do at the end of the day; I want to make predictions on new data, not this data. Okay, this is what I have to work with, and I am expected to do well on new data. Yeah. And as I said, the task in machine learning is to do well on new data, and just to appreciate why this is interesting or difficult, imagine that I give you images of size 100 by 100, and within every image there is, let's say, an apple or an orange. If I give you 50 images, that is 50 times 10,000 RGB pixels, and every one of these 10,000 pixels can take 256 cubed different values. You can develop many ways to take an image like this and predict correctly the output of every one of these 50 images. I mention one here, but in words: you can simply create a hash map and say, if so-and-so set of pixels has so-and-so values, then it's an orange; if so-and-so other set of pixels has so-and-so other values, then it is definitely not an orange. When we were in high school, most of us might have built something that says: if there is a large patch of orange color then I call it an orange in this image, if not I do not call it an orange. And this would be the slightly more grown-up version of that same idea. But the moral of the story is that, given a data set like this, given this much information, there exists a way to predict perfectly the labels y for each one of these images. Call it a hash map: you simply memorize everything, and every time you get a new image you compare it, look it up in your hash map, and return the answer that is stored in the hash map. This doesn't help us very much, because using this kind of a hash map you can get perfect accuracy on the training data set, but it is extremely bad at predicting new images.
So the point to understand here is that designing a predictor that does well on the training data set is trivial. That is not what we want to do. We want to build predictors that can generalize to new data outside the training data set. And of course, this doesn't make a lot of sense if you ask me to predict on arbitrary new data: if I am given a data set of oranges and you ask me to predict on a data set of grapes, new images that may or may not have grapes inside them, that doesn't make any sense. So in machine learning we have another fundamental assumption: that the test data set, which is also a bunch of images, let's say, is also drawn from the same distribution P(x, y). Okay, it comes from the same distribution as the training data set. And that is why you are able to say whether or not you can predict well on this new data. If the test data does not come from the same distribution, then all bets are off. It is very important to understand that this assumption underlies everything that we will ever do in machine learning. It is probably equally important to understand that this is only an assumption, and it will essentially never be true in practice. If you're a biologist looking at some biological data, things evolve quite differently with time. If you are Google answering image search queries or text search queries, the kind of stuff people search for changes by the day. If you are Netflix serving movies to people, people watch different movies on different days of the week, in different months; even your tastes change quite quickly. So the test distribution on which we want to make predictions correctly always evolves, and that is the root of all grief in machine learning. Machine learning is hard even if the test distribution were the same, but the fact of the matter is it is never the same, and that is what makes it really hard. Okay. So this is what we would like to do: we are searching for a predictor, we can also call it a model, that generalizes well in the sense that it predicts well on new data. We have the training data to check what it does. We do not have access to the new samples, so we can never make statements about the new samples other than doing some hacks around how to use the training data. Okay, so this is really the root of the problem. Another thing to appreciate: we are building a predictor or a model, which I will denote as f(x; w); it is parameterized by some parameters w, just like our perceptron, or just like the neuron in the McCulloch and Pitts model, was parameterized by the weights w. The hash map that I talked about is a very complicated function; hash maps are designed to be complicated. You can always find very complicated functions that fit your training data; it doesn't matter even if your training data is very large. But the central problem here is that there are many such complicated functions: as you increase the dimensionality of the weight space, or the number of parameters, you are required to search in a larger and larger space, and there are many, many solutions in this larger space; the larger the set of functions, the larger the number of solutions. And you will make mistakes in choosing which solution to pick, because all you have to check whether or not you have a solution is how well the function does on the training data. The abstract question in machine learning is to fit a model to the training data set, but make sure that the model is not too rich.
If the model is too rich, then we will make mistakes in picking elements from this very rich class, and those elements that we pick may not work well on the test data, so we should be as conservative as we can in selecting the size of this model. And we don't even get to check how the model works on the test data, because we don't have access to it. Yeah, so in some sense it seems a very, very ill-posed problem. I tell you that the true labels were created by some model, presumably, but you don't know the size of this model; you are basically making guesses in the blind. I don't tell you the kind of problems that I will check you on in the future, so again, you do not know what the future will look like. Okay. So what we'll do is try to reason about this a little more carefully and then work under the right set of assumptions and constructions to make sure that this entire process is well defined. Questions? Okay, no questions. So let's do a quick recap of linear regression. Everyone here has seen this, I'm sure, but this is just for notation. A linear regression model is a function, or predictor, that looks like this: it acts on x when you give it as an argument, it has two parameters, w and b, and it is an affine function, w transpose x plus the bias b. w is called the direction of the hyperplane, and b is the bias of this hyperplane; if you want to think of a two-dimensional line, b would be the intercept and w would be the slope. In pictures it will look a little bit like this if x is two-dimensional, x1 and x2. This is the output y that you are regressing; this is no longer a classification problem, we are trying to predict a real-valued thing, y is real-valued here. You will fit a hyperplane that minimizes the distance of the red points to the hyperplane: for this particular input, this is what you predict and this is what the true output is in your data set, and your goal is to fit a hyperplane, or select a hyperplane from among all possible hyperplanes, that minimizes the length of these vertical arrows. We would like to predict well on average over all the samples in the training data set. So we can write down an objective function that looks like this; it is an objective function that tells us how we can find the best w and b. What do we use to measure when a w and b are good? Well, we look at the true outputs that I have in my training data set, the red points here, and we check them against the predictions that the model makes, which is what I denote by y-hat i, the prediction of the model on the i-th sample. You may use some reasonable way to measure the discrepancy between the two; in this case I'm using the squared loss. And then take the average of this loss over all the samples in my training data set, which says that I'm not interested in the maximum error or a quantile error, but in one specific thing, which is the average error. Yeah. At the end of the day, we would like to find the weights and biases that minimize such loss functions, and that is why we'll use tools from optimization. This loss function is a quadratic loss, so we know a closed-form answer; you will recognize this expression from some time ago, from high school maybe. The loss functions that we will use in deep learning will be a bit more complicated, but the problems, in an abstract sense, will look exactly like this. Yeah.
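If you want to see this in code, here is a minimal sketch in Python with NumPy of fitting such an affine model by minimizing the average squared loss; the synthetic data, the dimensions, and the variable names are illustrative assumptions of mine, not taken from the lecture notebooks.

```python
import numpy as np

# Synthetic data: y = 2*x1 - 1*x2 + 0.5 + noise (purely illustrative).
rng = np.random.default_rng(0)
n, d = 50, 2
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0]) + 0.5 + 0.1 * rng.normal(size=n)

# Append a constant 1 to each input so the bias b is handled by the same solve.
X1 = np.hstack([X, np.ones((n, 1))])

# Minimize the average squared loss (1/n) * sum_i (w^T x_i - y_i)^2.
# For this quadratic loss there is a closed-form answer: the least-squares solution.
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ w                      # predictions on the training set
mse = np.mean((y_hat - y) ** 2)     # average squared error on the training set
print("weights (last entry is the bias):", w)
print("training mean squared error:", mse)
```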
Now, you can also set up polynomial regression using the same formulation as linear regression, by creating features that are 1, x, x squared, x cubed, all the monomial terms. If you have one-dimensional data, you get multi-dimensional features like this. To give you an example, it will look a bit like this. Suppose the true function that nature used to create its outputs is the green line; it is a sinusoid, to be specific. And the data set that you have are these blue points. Depending on what model you fit, you may get different kinds of answers. This is a model that is simply a constant, so the bias w0 is something that you choose to minimize the errors on average, and that fits the bias to some value here. If you fit a linear regression model, you will try to match the points this way, and of course both of these models have a lot of error; the blue points don't lie on the red line at all. If you use a more complicated model, in this case a ninth-order polynomial, then you will fit the blue points, but now you see the issue with this business. If you only measure the red function at the blue points, it looks like you have done your job perfectly: you fit them perfectly, you have zero residuals, you have zero error on the training data set. But what the red function does on stuff that is not a part of the training data set is very different from the green line. And this is exactly what it means to not do well on the test data set: test samples are going to be different from the blue circles, and you will see a discrepancy between what the red function predicts and what the green function predicts on the test samples. We would like to avoid such situations. We are slightly happier with situations that look a little bit like this, where we may let go of some points, but we capture the trend a little better, and that way we make slightly fewer errors on the test samples. Okay, it is not very easy to tell whether you are living in this world or in that world, because we only have the blue points to measure things by; in terms of the blue points, this is actually a worse model than that one. Okay, so let us look at our first neural network. It is a very simple neural network, something called the perceptron. It is exactly what Frank Rosenblatt built. It's a linear model, w transpose x, and you apply the sign function to convert it into a classifier. So if w transpose x is greater than zero, this function predicts a plus one; if it is less than zero, it predicts a minus one. And now you would like to ask yourself, how do I fit this model, how do I fit a perceptron? For regression, we know that we can minimize the squared error across all the samples. What should we minimize for classification? Here is a quantity that you would like to minimize: these are the true labels, and if your predictor does not predict the same thing as the true label, then you penalize it by a unit value, and you are trying to minimize the average number of mistakes over your n samples. This is some number between zero and one, and it is what you are trying to minimize: this is the zero-one loss. The problem with this kind of thinking, the problem with the zero-one loss specifically, is that it is not differentiable.
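Here is a minimal sketch of this phenomenon, assuming a noisy sinusoid and a handful of training points of my own choosing (the exact data on the slides may differ): polynomial regression is just linear regression on monomial features, and the highest-degree fit drives the training error to zero while the test error grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# A few noisy samples of a sinusoid (the "green line" in the slides).
n = 10
x_train = np.sort(rng.uniform(0, 1, n))
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=n)

# Fresh samples of the true function, standing in for test data.
x_test = rng.uniform(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in [0, 1, 3, 9]:
    # Monomial features 1, x, x^2, ..., x^degree: polynomial regression is
    # just linear regression on these features.
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")

# The degree-9 fit typically has near-zero training error but a much larger
# test error than the lower-degree fits: it fits the blue points, not the trend.
```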
This is an indicator of whether these two things match or do not match, and we cannot really use gradient-based methods, which are very powerful in optimization, to understand how to tweak the weights w to minimize this loss. In regression there was no such problem: we could solve the problem in closed form analytically, and if we wanted to do gradient descent on it, we could have done gradient descent just fine. Here there is no gradient, and that is why we cannot use gradient-based techniques, and that is why we cook up losses that are proxies of this loss. The zero-one loss is what we really want to minimize; we cannot, but we can define different kinds of proxies. The first one is something called the hinge loss, and I will draw the hinge loss for this case, and I will draw it in this slightly funny way. I will put y times w transpose x on the horizontal axis, and the reason I do this is that if w transpose x and y have the same sign, in the sense that their product is positive, then I am making a good prediction, because when I apply the sign function to w transpose x, I get the same thing as y. So everything on this side is where the model predicts correctly on this input x for this output y; this is the ground truth. And so we will draw the loss here: the hinge loss is the maximum of zero and minus y times w transpose x, and it will look a little bit like this. Okay, this is the hinge. You can think of different losses. The zero-one loss will look a little bit like this: you get a penalty of one for this particular sample if w transpose x and y do not match in sign, and if they do match in sign you do not get any penalty. That is the zero-one loss, and this is the hinge loss. There are other losses that people use; this is something called the exponential loss, which is e to the minus y times w transpose x. It is one when y times w transpose x is zero, so it will look a little bit like this, and there are many, many other proxies like this. Okay. The name of the game now is to set up a parametric form for the model, and we have used one particular parametric form, namely a linear function with a nonlinearity on top to convert it into a classifier, and to use a chosen surrogate loss to figure out which weights and which biases are the correct ones. Okay, I will ignore the biases in everything that we do. You can think of it as follows: w transpose x plus b is equivalent to some other w transpose times a different vector where I append a one to x, and that is why I will not worry about the biases; I will always assume that I have appended a one to my inputs and do not have to worry about them. Okay. Let us now think of one way to minimize things, to fit things. We are going to use a very simple algorithm. Think of the hinge loss: the hinge loss is the maximum of zero and minus y times w transpose x. If I am here, if I have weights that give me a penalty, that give me a loss, that give me an error, then by moving in this direction, which is the direction of the negative gradient, I improve the performance of the model; I reduce the amount of penalty that I get if I move to the right. If I am over here, I don't need to move anywhere, because I'm already making the correct prediction on that sample. So the gradient of the hinge loss with respect to w, on a sample where I make a mistake, is simply minus y times x. This is a vector of the same dimension as w.
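Here is a small sketch of these surrogate losses as functions of the margin y·wᵀx. Note that the hinge written here follows the lecture's definition, max(0, −y·wᵀx), which is also often called the perceptron loss; the more common SVM hinge uses max(0, 1 − y·wᵀx).

```python
import numpy as np

def zero_one_loss(margin):
    # margin = y * w^T x; penalty of 1 if the signs disagree, 0 otherwise.
    return (margin <= 0).astype(float)

def hinge_loss(margin):
    # The version used in the lecture: max(0, -y * w^T x).
    return np.maximum(0.0, -margin)

def exponential_loss(margin):
    # exp(-y * w^T x); equals 1 when the margin is 0.
    return np.exp(-margin)

margins = np.linspace(-2, 2, 9)       # values of y * w^T x
print("margin   0-1   hinge   exp")
for m, z, h, e in zip(margins, zero_one_loss(margins),
                      hinge_loss(margins), exponential_loss(margins)):
    print(f"{m:+.1f}    {z:.2f}   {h:.2f}   {e:.2f}")
```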
If w is a hundred-dimensional vector, minus y times x is also hundred-dimensional, because x is hundred-dimensional. And we are going to use a gradient like this to move: you move in the direction of the negative gradient, which is equal to y times x, and keep iterating over this to fit the model. This is the famous perceptron algorithm; people knew it back in the early 60s or the late 50s. It works as follows. We are going to take our data set, and at each iteration, let's say that we are at the t-th iteration right now, we are going to sample a datum with an index omega t. Omega t is an index that lies between one and n, so when I say I am going to sample a datum, I put a uniform probability mass function on the elements one to n, and I sample a random variable, the index, from this distribution. Let us say that we sample this particular datum, x omega t and y omega t. We are going to update the weights to improve the loss on this particular datum if we are making a mistake on it. If we do not make a mistake on this datum, then the weights stay the same. And you can do this for many, many iterations. What is going to happen? Suppose you have samples that are, let's say, plus ones over here and minus ones over here. When you fit a perceptron you are finding a hyperplane. So suppose your hyperplane begins as something like this; let us say everything to the left of the hyperplane is predicted to be plus one and everything to the right of the hyperplane is predicted to be minus one when you apply the sign. If you happen to sample this point, we will update the weights to classify this point a tiny bit better. So the hyperplane will be moved to fit this particular point, the one we made a mistake on, a bit better. It need not move all the way so that we perfectly classify the point, but it will definitely make things better. Now, given that you are at this location after your first iteration, you will sample another point randomly; this one is also misclassified, and so you will move the hyperplane a tiny bit further to classify this point, and you can do this forever. But does this process always stop? Well, it doesn't have to. If the data set is something that you cannot classify, like the XOR data set, then you know that no matter which hyperplane you pick, you couldn't possibly have exactly zero error; you will keep on changing your hyperplane by tiny amounts and never stop. But you can show very easily that if the data set is actually something that can be classified cleanly by some hyperplane, then this algorithm, which iteratively modifies the hyperplane to fit successive samples in the data set, will find a solution. The solution it finds need not be unique. In this case there are indeed two hyperplanes, or many to be precise, that all cleanly divide the data set, and this algorithm will find one of them depending on where you initialize the weights w, but you will get a solution. So the perceptron algorithm works when you have linearly separable data sets, these are linearly separable data sets, and it need not converge if the data set is not linearly separable, like so. Yeah. This is a short proof of when the perceptron algorithm works.
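A minimal sketch of the perceptron algorithm on a synthetic, linearly separable data set of my own making; the number of iterations and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic, linearly separable data: the label is the sign of a fixed hyperplane.
n, d = 200, 2
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0])
y = np.sign(X @ w_true)

w = np.zeros(d)                  # initialize the weights at zero
for t in range(10_000):
    i = rng.integers(n)          # omega_t: an index sampled uniformly from the data set
    if y[i] * (w @ X[i]) <= 0:   # did we make a mistake on this datum?
        w = w + y[i] * X[i]      # move along the negative gradient of the hinge loss

train_error = np.mean(np.sign(X @ w) != y)
print("training error:", train_error)   # 0.0 once a separating hyperplane is found
```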
What I want to end this section with is that we have seen a very powerful way of fitting models. It is something called stochastic gradient descent, and I showed it to you without actually saying the words: the perceptron algorithm is simply stochastic gradient descent on the hinge loss. SGD is a very old algorithm; it was discovered in the early fifties in the operations research community, and the perceptron algorithm is simply one particular instance of SGD. SGD is what we are going to use a lot in the following lectures; we're going to spend a lot of time talking about it, enough for you to get bored by it. But crisply, it looks like this. These are our training data points. We would like to minimize some loss of our particular model on these data points, so ell superscript i I will denote to be the penalty that you get for making incorrect predictions on the i-th datum in your data set. w are your weights, and your job is to find the best weights w that minimize the average loss over the entire data set. At each iteration, we update the weights: w at time t plus one is the old weights w at time t, minus the gradient, which is the gradient of the loss calculated on some randomly chosen input datum in your data set with respect to w, times a coefficient. This coefficient tells us how far we move in the direction of the negative gradient. You can imagine that if you move a little bit slowly, then you do not overshoot the location that you wish to find: if this is the loss, you want to find the smallest value of the loss in your domain, and if you take a very large step along the gradient, then you may end up on this other part. Okay, and that is why you would like to use a step size. This is something that you choose using the data set, and we'll see some ways of choosing it properly. But succinctly, this is stochastic gradient descent: at each iteration you sample an input datum and update the weights in the direction of the negative gradient. This input datum is sampled uniformly over the data set, and that is quite important. Yeah. I will use this kind of notation instead of carrying around these big gradients. There is nothing more here: the gradient is simply a vector of the same size as the number of weights, and we are going to update the weights using the data. Okay, any questions? No questions. So before we proceed, I will talk about one particular representation of the weights of a perceptron. We know, and pardon my omegas, let us call them omega bars, that the weights of the perceptron are updated this way if you are making a mistake on that particular sample, right? So, in a sense, the weights are always a linear combination of all the mistakes that you have made on the data set. So I can rewrite the weights at the end of training, after all the iterations of this weight-update business, in a slightly different way: let alpha i be the number of mistakes that I made on the i-th datum while fitting the data set; remember that the hyperplane keeps moving across many iterations, and what was a correctly labeled sample according to our model at first may be incorrect later, and then the model tries to fix it because it will sample that point at some later iteration. But after you choose to stop, after the perceptron algorithm converges, alpha i is the number of mistakes that were made on the i-th datum.
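In code, generic SGD looks almost identical to the perceptron loop; here is a sketch that applies it to the per-sample squared loss from the regression discussion, with a hand-picked step size. The step-size value and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic regression data (illustrative).
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.05 * rng.normal(size=n)

def grad_i(w, i):
    # Gradient of the per-sample squared loss l^i(w) = (w^T x_i - y_i)^2.
    return 2.0 * (w @ X[i] - y[i]) * X[i]

w = np.zeros(d)
eta = 0.01                       # step size, chosen by hand here
for t in range(20_000):
    i = rng.integers(n)          # sample a datum uniformly at random
    w = w - eta * grad_i(w, i)   # move along the negative stochastic gradient

print("estimated w:", w)
print("true w     :", w_true)
```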
This can be zero if you never made a mistake on that datum while fitting the data set, not just at the end but during the whole course of training, or it could be some other value. Okay, so the final weights are a linear combination, weighted by these dual variables as they are called, of the terms y i times x i, which comes directly from this additive term here, plus whatever your weight initialization was. And if you think of your weights as initialized simply at zero, then you can write down the final solution like this. It is slightly less useful in the sense that you do not know the values of the alpha i before you train, so it is not a closed-form answer for what w star is, but it is a very powerful way to think about what w star is. Okay: it is a linear combination of the inputs and the outputs of all your training data points. And the function that we are after is the sign of y hat, where y hat was equal to w transpose x at this point. If you replace w, rewriting it as the summation of y i times x i weighted by these alpha i, then you will be able to write the output like this. Now, the interesting thing to note here is that the function we have built to predict on images x is a very peculiar function: it is a weighted combination of inner products of the test datum x with all the training data points. Roughly speaking, this inner product between x i, which is one of your training data points, and x, which is your test data point, is checking how similar x is to this one particular x i. If these two are similar, then I up-weight their contribution in this summation, and if they are dissimilar, I down-weight their contribution in the summation. In this very precise sense, the perceptron is checking the similarities between the test input and the training data set, and then cleverly combining the outputs of those points to make the prediction. Okay, so we can think of it in terms of fitting a hyperplane, but you can also think of it in terms of some local interpolation of your input data points. Okay. This particular pattern, x i transpose x, the similarity or the inner product between the two, is very special, and you will see it many, many times when you do machine learning. That is why people give it a name; this kind of model is called a kernel machine. But first, before we get to that: how do we use a linear model to create a nonlinear model? Okay, so we know how to fit a linear model. All you have to do is write down a hinge loss for classification and then run stochastic gradient descent to fit this model. Let us see if we can make it a tiny bit richer. A linear model may make some mistakes: if these are the red points and these are the blue points, then these red points will be misclassified and these blue points will be misclassified. The concept of a feature space, or the concept of a kernel that corresponds to a feature space, is to take the points x and map them to some other space, just like when we did polynomial regression: we took the input x and created features one, x, x squared, and this was your new feature. You then fit weights using a linear model on this new feature set to fit a polynomial to the function. You can play the same game more generally: instead of creating monomial features, you could create some arbitrary map.
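A small sketch of making a prediction from this dual representation; the alpha values below are made up purely for illustration, whereas in practice they would be the mistake counts produced by running the perceptron.

```python
import numpy as np

def dual_predict(x_test, X_train, y_train, alpha):
    # w = sum_i alpha_i * y_i * x_i, so
    # w^T x = sum_i alpha_i * y_i * (x_i^T x): a weighted sum of similarities.
    similarities = X_train @ x_test           # inner products x_i^T x
    return np.sign(np.sum(alpha * y_train * similarities))

# Tiny illustrative example.
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
y_train = np.array([1.0, 1.0, -1.0])
alpha = np.array([1.0, 0.0, 2.0])             # hypothetical mistake counts

print(dual_predict(np.array([0.5, 0.2]), X_train, y_train, alpha))
```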
So you have a map phi between your original inputs x and your new inputs phi of x: every input in your training data set is mapped to a new input phi of x. And now the data set that we have is effectively pairs phi of x and y, and this is the one that we are trying to classify. So this is how a nonlinear map would look; this is a radial basis function kernel applied to these data points, and you will notice that the model is still linear in the features, so you are fitting exactly this model, sign of w transpose phi of x. But now, because phi is a nonlinear function of the inputs x, the effective decision boundary that you get for the model, the curve that separates the positive points from the negative points, is not linear anymore. It is a bit richer; you will be able to classify more complicated data sets using these functions, in particular the XOR data set that we saw, the one that linear functions could not classify. So you can imagine trying to classify it with this model. Any guesses as to what feature we would use to classify this? Let us say that this is the origin, and these are our x1 and x2; this is a two-dimensional data set, these are four points. What feature phi would you use to classify this correctly using a linear model? You can think of polar coordinates, for those of you who are thinking about it. Now, as we said, if this is our data set, then fitting the model to this data set is the same thing: again we are doing stochastic gradient descent, we pick an input that the model makes a mistake on, and instead of using y times x as the gradient, we now use y times phi of x as the gradient to update the model. Just like the weights were combinations of y i times x i weighted by alpha i, the dual variables, they are now combinations of y i times phi of x i weighted by the dual variables alpha i. Okay. Just like making the prediction used inner products x i transpose x before, where x is the test datum, you now use inner products phi of x i transpose phi of x. Nothing has changed in principle, but now we have a somewhat more powerful way of fitting functions: before this we could only fit hyperplanes, now we can fit more nonlinear functions. It is important to realize that this is not just a cheap trick: you are still fitting a linear model, but now you can take any data set and pretend that it is a linear problem in some feature space. Of course, the issue of the problem being complicated, of the data set being complicated, doesn't entirely go away. If you want to fit a function in high dimensions, then you will have to pin the function down at many, many points, exponentially many in the dimensionality of the support of the function. So just because we throw our input images x into a higher-dimensional space, the way the one-dimensional input was thrown into a larger space whose coordinates were the monomials, doesn't mean that we can do machine learning well. We can fit the function well, but in order to find the true nonlinear function we still have to get more data: if the feature space into which you throw things is large, then you will also have to get correspondingly large amounts of data.
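If you want to check one possible answer to the XOR question yourself, here is a sketch that runs the perceptron on a hand-picked feature map appending the product x1·x2 to the coordinates; this is just one choice of phi that happens to work, not necessarily the one intended on the slide.

```python
import numpy as np

# The four XOR points and their labels.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)

def phi(x):
    # One possible feature map: keep the coordinates and add the product x1*x2.
    return np.array([x[0], x[1], x[0] * x[1]])

rng = np.random.default_rng(4)
w = np.zeros(3)
for t in range(1000):
    i = rng.integers(len(X))
    if y[i] * (w @ phi(X[i])) <= 0:    # mistake, measured in feature space?
        w = w + y[i] * phi(X[i])       # perceptron update on phi(x) instead of x

preds = np.array([np.sign(w @ phi(x)) for x in X])
print("predictions:", preds, "labels:", y)   # the XOR points are now separated
```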
And that is why the concept of a feature space, even though it allows us to fit one class of functions well, is not the end of machine learning. It does not mean that you can always fit complicated data using simply linear models. That brings us to the concept of a kernel. As you can imagine, if you want to run a predictor with an expression like this, then every single time someone gives you a test datum you first compute the feature φ(x), you have precomputed the features φ(x_i) for all your training samples, you calculate this entire summation, and then you output the prediction sign(ŷ). This is an expensive thing to do at inference time. Think about it: presumably Google image search, or Facebook, is using a neural network today to check whether two images are similar, but in the early 2010s these were the kinds of models that were being used to serve search queries and to check the similarity between high-dimensional images. This summation becomes expensive because it grows linearly in the number of training data points: if you have a big data set, you have lots of training data points, and you have to do a lot of work at inference time. And these feature vectors φ can also be quite high-dimensional; for typical images that people used in the 2000s they would be a few thousand dimensions, so these inner products quickly get very expensive. So, kernels are a way to think about this. A kernel is simply a name given to the inner product between the features of two input points x and x'. Here is, let's say, a polynomial kernel. It takes quadratic features: if you have an input datum x, which is a real number, it creates a feature which is a three-dimensional quantity, so you are now solving a different classification problem where the inputs are three-dimensional and the outputs are one-dimensional again, but now you can fit quadratic functions. The inner product between φ(x) and φ(x'), if you write it down, you will see that it is some function of x and x', in this case (1 + x x')^2. Again, this function measures a notion of similarity between x and x' through their product. So kernels are simply functions of two inputs, x and x', that return a real number and that measure the similarity between the two, just like the inner product measures similarity. Different kernels have different names. If you are trying to compute quadratic features of high-dimensional vectors, you can write down a kernel that looks like this; again it measures similarity between x and x'. And this one is the radial basis function kernel, which, as you can see, measures the discrepancy between x and x': if the two points are far apart, the kernel is close to zero, and if they are identical, it is exactly one.
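As a small sketch of this, here is the quadratic kernel written both ways, as an explicit inner product of features and as the closed-form expression (1 + x x')^2, together with the radial basis function kernel; the sqrt(2) scaling of the middle feature and the bandwidth sigma are standard choices, not something fixed in the lecture.

```python
import numpy as np

def poly_features(x):
    """Explicit quadratic features of a scalar input x (one common choice)."""
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

def poly_kernel(x, x_prime):
    """Same quantity without ever building the features: (1 + x x')^2."""
    return (1.0 + x * x_prime) ** 2

def rbf_kernel(x, x_prime, sigma=1.0):
    """Radial basis function kernel: 1 for identical points,
    close to 0 for points that are far apart."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(x_prime)) ** 2) / (2.0 * sigma ** 2))

x, x_prime = 0.7, -1.3
print(poly_features(x) @ poly_features(x_prime))  # inner product of explicit features
print(poly_kernel(x, x_prime))                    # the same number, computed cheaply
print(rbf_kernel(x, x_prime))
```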
Now, the perceptron, we can again write down. We know that it is simply SGD for the hinge loss, and we can write it down simply as updating the dual variables: every time you make a mistake on a particular input, you increase the value of alpha_i by one for that input. How do you check whether you have made a mistake? You check the sign of y times the prediction for that particular sample. And you will notice that these are all features of images in the training data set; the point x_{omega_t} picked at step t during training is one of the inputs of your training data set. At test time it is a different point, but during training it is simply one of the elements of the training set. So you can create this large matrix, K(x_i, x_j), whose size is number of samples times number of samples, and you can simply read off elements of this matrix to check whether or not you have made a mistake. Okay, so this model is called a kernel perceptron, and it is simply a way of fitting a nonlinear decision boundary. It is still a linear model, but in this large feature space. Okay. Now, let us write down, so how much more time do I have, maybe about 10 minutes then, yeah, let us write down a slightly different kind of kernel. We were very ambiguous about what this function φ is, and we are now going to take a specific form of φ: I would like to pick a φ that takes my input x, uses some matrix S and some nonlinear function σ, and computes φ(x) = σ(S^T x). This is simply a choice; I am not saying it is a good choice, it is simply a choice. Okay, so σ is the nonlinearity, S is a matrix, and x is a vector, so φ(x) is also a vector. Let's say that if x lies in d dimensions, then the features φ(x) that you create lie in p dimensions. So now you can ask yourself: I want my function φ to be something that I know how to compute, so how do I choose S and how do I choose σ? As I said, there are many choices for these two things. People in the past have experimented with things like random features, where they say: I will choose σ to be, let's say, a sigmoid, which is simply the function 1/(1 + e^{-x}); it acts on real numbers and returns a real number. And they will choose the features as follows: S is a big matrix with d rows and p columns, and every element of S is drawn from some probability distribution. So S is a random matrix which multiplies your input x, and then σ acts element-wise on the result, which is a large vector. The sigmoid is a function that looks a little bit like this: it is zero as you go off to negative infinity, and one as you go to positive infinity. So this is one particular way of creating a feature, just like we created features using monomials for fitting a quadratic kernel. You can think of this as some other feature space that corresponds to some kernel, the kernel being simply φ(x)^T φ(x'). Okay. A random matrix S works surprisingly well, and you will see some examples of how people have used such random matrices. Again, you can do the same kind of business: just like you had w^T φ(x) as the model before.
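Here is a minimal sketch of the kernel perceptron just described, with the Gram matrix K(x_i, x_j) precomputed once and the dual variables incremented on every mistake; the fixed number of passes over the data is my own simplification, not part of the lecture.

```python
import numpy as np

def train_kernel_perceptron(X, y, kernel, n_epochs=10):
    """Kernel perceptron: keep one dual variable per training point and
    increment it whenever the current predictor misclassifies that point."""
    n = len(X)
    # Precompute the Gram matrix K[i, j] = kernel(x_i, x_j) once.
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(n_epochs):
        for t in range(n):
            # the prediction for point t only needs entries of the Gram matrix
            score = np.sum(alpha * y * K[:, t])
            if y[t] * score <= 0:        # mistake: up-weight this point
                alpha[t] += 1.0
    return alpha

# Example: the XOR points become separable with a quadratic kernel.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(train_kernel_perceptron(X, y, lambda a, b: (1.0 + a @ b) ** 2))
```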
Now you have w^T σ(S^T x) as the predictor. F is still only a function of w, because S is simply some matrix that we choose a priori, before we begin training, and freeze to that value; S is not a parameter. Okay, so this is the predictor. It is still a linear predictor in this feature space; the feature space simply happens to have this form. Again, you can minimize the loss over your data set for this particular chosen feature map. Okay. And you can read the very beautiful paper by Rahimi and Recht on random features, which were quite popular in the mid-2000s for problems ranging from text to speech. They showed that for shift-invariant kernels you can take σ to be a cosine nonlinearity with S a random matrix, and this has certain nice approximation properties in Fourier space. Okay, so we have talked about kernels. The reason we talked about kernels is because we had a linear model and we wanted to secretly make it a nonlinear model while still fitting only a linear function. We talked about how to choose a particular kind of feature, this certain kind of feature. And now I am going to change the game a tiny bit and say: look, I am not interested in selecting S to be a random matrix, I will make S a parameter itself. And that will be our first neural network. We are going to learn this feature matrix S in addition to the weights. Yeah, the reason for doing this is that we can always cook up a large feature space, but because S is frozen to be a fixed matrix, it does not give us the ability to adapt this feature space to the particular data set. The features that are good for images may be quite different from the features that are good for text, and the features that are good for images taken at night may be quite different from the features that are good for handwritten digits. Different data sets would like different features, just like when you are fitting data coming from a polynomial you know that you should use polynomial features instead of, let's say, sinusoids, and that will make your life a tiny bit better. Writing our predictor F as a function of both W and S gives us the freedom to choose S in different ways, instead of simply freezing it. But the problem has changed quite dramatically just by doing so. You are now trying to find two sets of parameters, W and S, that minimize the average error across the training data set; the loss does not change too much, but it is now a function of both W and S. And there are a few things that are important to appreciate about this small change. Before this, we said we were fitting a linear function. We cannot say that anymore: it is a nonlinear function of the parameters, because W and S interact multiplicatively; even if σ were the identity, they would still interact multiplicatively with each other. Okay, so it is now a nonlinear function of the weights, no longer a linear one. It is also a larger problem: before this we were only interested in finding W*, and now we are interested in finding both W* and S*, where S* can be a large matrix with d times p entries for d-dimensional inputs and p features. Before we were finding only p weights, plus a bias maybe, and now we are finding p times d weights in addition, plus maybe a bias if you wish. So, depending on what p and d are, this is much larger than before, which is why it is also a somewhat larger problem.
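As a hedged illustration of the random-feature idea with a cosine nonlinearity, here is the standard random Fourier feature recipe for approximating the RBF kernel; the particular scaling and the random phase b follow that recipe rather than anything spelled out in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, sigma = 5, 2000, 1.0

# Frozen random matrix S plus a cosine nonlinearity (in the spirit of Rahimi & Recht).
S = rng.normal(scale=1.0 / sigma, size=(d, p))   # chosen randomly before training
b = rng.uniform(0.0, 2.0 * np.pi, size=p)        # random phase for each feature

def phi(x):
    """Random Fourier features: phi(x)^T phi(x') approximates the RBF kernel."""
    return np.sqrt(2.0 / p) * np.cos(S.T @ x + b)

x, x_prime = rng.normal(size=d), rng.normal(size=d)
approx = phi(x) @ phi(x_prime)
exact = np.exp(-np.sum((x - x_prime) ** 2) / (2.0 * sigma ** 2))
print(approx, exact)   # the two numbers should be close for large p
```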
I say high-dimensional here in general, but it is a larger problem for sure. The most important thing to appreciate is that while the hinge loss was a convex function, the hinge loss is simply max(0, -y w^T x), which is a convex function of W, the new loss is not a convex function of W and S, because the two interact with each other multiplicatively. Nonconvex optimization problems are much harder than convex optimization problems. Roughly speaking, a convex optimization problem looks like a parabola that you are trying to descend into to find the smallest value the parabola takes. Nonconvex problems can look like this: if you begin here, then descending down will give you a solution that does not have as good an error as some other location; if you begin here, everything is nice; but if you begin over here, then you will reach a bad location, and that is an instance where you will not even fit the training data set correctly. So gradient-based optimization of nonconvex objectives is much harder than gradient-based optimization of convex objectives. And that is why, by simply making this one little choice, we have gained the ability to fine-tune the features to the data set, we do not have to pick them by hand, the training process will automatically pick a good value of S, but we have made our life much harder, because we are now solving a much harder optimization problem. Okay, it is a nonlinear, high-dimensional, nonconvex optimization problem, and such problems are very challenging. But this is a two-layer neural network. And that is our first view of what a neural network is. A neural network is simply something that uses this specific form for the features in the so-called first layer, combines them using a linear function, w^T times the features, in the second layer, and then predicts an output. And that is it. Okay. A deep neural network, and I will change the notation because I want to use w for something else later, a deep neural network is simply a function where you do this business many, many times. Instead of having v^T times your features, which is what we did in the previous section, you take the first layer's features, you multiply them by another matrix S_2, apply the nonlinearity, multiply by another matrix S_3, and you keep doing this a few times, until you have a predictor that depends on v and on S_1 all the way up to S_L, if you have L + 1 layers. So this is a deep neural network. It is not a very complicated object; it just looks complicated written like this, but this is really the only kind of expression you can write, there is not much else you can do to the expression itself. And it is a very natural extension of the kernel perceptron: the kernel perceptron was written so that we would have the ability to define fancier features than just an affine function of the inputs, and this is a feature map that gives us the ability to tune the features to the data set. Okay, so a deep network, in simple words, creates new features by composing together old features: these are the old features and these are the new features that are then used to fit the function. Okay. This composition of old features will be quite powerful, and it will give us very different kinds of features than what linear models can typically learn.
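To make the composition concrete, here is a minimal NumPy forward pass for the deep network written above, f(x) = sign(v^T σ(S_L^T ... σ(S_1^T x))); the parameters here are random and untrained, only to show how the features of one layer become the inputs of the next.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_net_predict(x, S_list, v):
    """Forward pass of the deep network described above: repeatedly multiply
    by a matrix S_l and apply the nonlinearity, then combine the last layer's
    features linearly with v and take the sign for classification."""
    h = x
    for S in S_list:             # S_1, ..., S_L are the feature matrices to be learned
        h = sigmoid(S.T @ h)     # new features built by composing the old features
    return np.sign(v @ h)

# Tiny example with random (untrained) parameters, just to show the shapes.
rng = np.random.default_rng(0)
d, p1, p2 = 10, 32, 16
S_list = [rng.normal(size=(d, p1)), rng.normal(size=(p1, p2))]
v = rng.normal(size=p2)
print(deep_net_predict(rng.normal(size=d), S_list, v))
```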
And this is really where the true power of deep learning comes in: the ability to learn features that are specific to the data sets we are dealing with. Yeah, so I will stop here for today, and tomorrow we will look at some more jargon around deep learning and move closer towards understanding, or trying to formalize, the key questions in deep learning. Okay. Thank you. Thank you. See you. Yeah, let me know if you have questions, feel free to send me an email. Okay, so I am sorry I missed a few questions in the chat; maybe I can answer them now. Yes, please. Yeah, so I would like to know whether it is actually possible to investigate the dynamics of a biological system using deep learning, and how deep learning takes nonlinearity into account. So yes, you can certainly think of investigating the behavior of a dynamical system using deep learning. We can make predictions on the outputs of biological systems; whether the data is a function of time or not only changes which models we would use for such systems. In the third lecture, we will look at some pretty cool ideas about how models of biological systems behave in general, and that is, somehow, very similar to how neural networks also learn. Okay. Why is it that the test data is unknown, shouldn't we split the data prior to model fitting? You can split your data in any way you want; the problem is that after you have done whatever you want to fit the model, someone will run it on a real problem that is unknown. Whatever data is fed to the model then is what I would like to call test data, and that is unknown. The training set you can use to certify the model in many different ways, which people call cross-validation, but that is simply our way of fitting the model; it does not change the fact that we do not know the test data. Cool. Thank you so much. I will see you tomorrow, same time. Are there any questions in the audience? No? Okay, if that is the case, thank you. Thank you very much. See you tomorrow at 2pm. Yeah, thank you.