In this chapter, we will see how to use neural networks to deal not only with vectors but also with sequences. We will start by looking at the differences between a classic, vanilla neural network and a recurrent neural network. Then we will go through the combinations of vectorial and sequential inputs and outputs and their respective applications: we will see how to map a vector to a vector, a vector to a sequence, a sequence to a vector, a sequence to a vector and back to a sequence, and finally a sequence to a sequence.

Here we introduce a new notation. Each of these circles now represents a whole layer of a neural network. Previously, we were representing with each circle a single scalar activation of a given layer; from now on, a circle represents the whole layer, so it stands for a vector. The input is called x, the hidden vectorial representation is called h, and the output of the network is called y hat, whereas our labels are still y.

We can draw a very nice correspondence with combinational logic, here on the right-hand side, where the output depends solely on the current input. A recurrent neural network introduces a recurrent connection coming from the previous state, the previous internal representation. In order to represent this network without recurrence, we replicate it over time indices, with all parameters shared among corresponding nodes. Here the analogy goes to sequential logic, where the current output of a system depends on the current input and on the state the system was in.

Neural networks are very powerful abstractions, which allow us to play with different kinds of data. The simplest kind of network we have seen so far takes vectors and produces vectors. Let's see how to represent this first type of neural network: we have an input, which is mapped into a hidden layer, and then we finally produce an output. Networks of this kind are used for image classification, bounding-box regression (so that we can perform localization), and in general any kind of f(x) regression, which can also be called function approximation. Basically, classic neural networks can be thought of as systems that are optimized over functions.

This case was vector to vector, and it can be represented with the following notation: we go from a vector x to a vector y hat. y is the label, and y hat is the prediction of our network. Either one can be a one-element vector, that is, just a scalar. For example, we could predict the price of an apartment, a scalar, given a set of features; or, given a scalar input, such as a position on the x-axis, we could predict a multi-dimensional y.

Let's see how a classic neural network is able to classify images representing the digits from 0 to 9 of the MNIST dataset. In this chart we see the projection of the vectors representing each digit onto the subspace on which we have maximized the variance. This is the view in which the digits look the most separated from one another, and we can see that it is still quite hard to find separating planes able to isolate each individual digit.
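As a concrete illustration of this kind of chart, here is a minimal sketch of how a maximum-variance projection can be computed with PCA. It is an assumption on my part that PCA is the tool behind the original figure; the sketch relies on scikit-learn and matplotlib and uses the small 8x8 digits set bundled with scikit-learn as a stand-in for MNIST.

```python
# A minimal sketch: project digit vectors onto the 2-D subspace of
# maximum variance and plot them, colored by class. The dataset and
# libraries are illustrative stand-ins, not the originals.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)        # X: (1797, 64) flattened 8x8 images
X2 = PCA(n_components=2).fit_transform(X)  # keep the top-2 principal components

plt.scatter(X2[:, 0], X2[:, 1], c=y, cmap='tab10', s=8)
plt.colorbar(label='digit class')
plt.title('Raw digits projected onto the max-variance plane')
plt.show()
```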
If we train a neural network to do this task, it will learn to move the space around, layer after layer, in order to reach a representation where the classes are linearly separable; the linear separation itself is performed by the last logistic unit. So after I train my network on this dataset, I re-project the output representation corresponding to each of these samples, and what we get is this: we can now clearly see that each digit can easily be separated from the others using simple linear separating hyperplanes.

What if we provide one input but would like a sequence as output? So we have one single input here, our x, and then a sequence of internal states, or hidden variables, which produce a sequence of outputs. Our input produces the first output; then, from the resulting hidden state, we produce the second output; and we can go on to produce, for example, a third. An application that uses this scheme is image captioning, where we provide one input image x but expect a sequence of words describing the content of the image. In this case I will write the prediction as y hat of an index t. This is vector to sequence.

Here we see an example of vector-to-sequence mapping. The input vector is an image, like the one we see here, and the output is a sequence of symbols representing the words of a sentence. The performance varies from left to right: first we have descriptions without errors, then descriptions with minor errors, then captions that are only somehow related to the image, and finally, on the right-hand side, completely unrelated captions. This task would have been impossible to tackle with a simple neural network, and it is our first example of how a recurrent neural network can be used in practical applications.

We now have two elements, vectors and sequences, so we can do all their combinations. Now let's do sequence to vector. In this case I provide my network a sequence of symbols, say 1, 2, 3, which are fed into the hidden layer, but an output is produced only at the end. We insert the first symbol into the network; then we insert the second symbol, which will be affected by the symbol we inserted before; and so on up to the last one, which again is affected by everything we have input before. At the end, we can ask the network to output a specific result.

Practical applications are equation solvers or, even better, a system that is able to learn to execute programs, which we could call a program executor. We saw before that the classic neural network, which maps a vector to a vector, can be thought of as a system that is optimized over functions. A recurrent neural network, instead, can be thought of as a system that is optimized over programs. We teach the system how to perform simple tasks: given some inputs and the internal state of the system, we expect an output, and everything is learned by gradient descent. So we are creating a system which is learning how to perform tasks given our data and our expected results.

In this case, we said, we input a sequence and we expect a vector output. So we have our input sequence, and then we expect our prediction, which I will just call the last element of a possible output sequence: it is like sequence to sequence where we keep only the last element.
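To make this wiring concrete, here is a minimal PyTorch sketch of a sequence-to-vector network; the layer sizes, vocabulary, and task are made-up toy values, not the program-executor model itself. The whole sequence is consumed step by step, and only the final hidden state is read out, exactly the "output only at the end" behaviour described above.

```python
# A minimal sketch of sequence-to-vector, assuming PyTorch.
# Each symbol updates the hidden state; a single output vector
# (here, class scores) is produced from the last hidden state only.
import torch
import torch.nn as nn

class SeqToVec(nn.Module):
    def __init__(self, vocab_size=128, emb=32, hidden=64, n_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.rnn = nn.RNN(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                    # x: (batch, seq_len) integer symbols
        h_seq, h_last = self.rnn(self.embed(x))
        return self.head(h_last[-1])         # read out only the final state

model = SeqToVec()
tokens = torch.randint(0, 128, (4, 20))      # four toy sequences of length 20
print(model(tokens).shape)                   # torch.Size([4, 10])
```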
Here we are talking about sequence-to-vector mapping. Our input is a sequence of instructions, more specifically a sequence of characters making up a program, and our output is simply the value printed on screen. So in this case the recurrent neural network has to learn the semantics of this programming language, and it also has to learn how to evaluate each instruction in order to make a reliable prediction. This is what we mean by optimization over programs.

Let's combine these last two schemes into a new system, so that we can input a sequence, which is condensed into the state of the system, and then, once the state of the system has been updated with the last element of the input sequence, it starts producing an output sequence. Let's draw it, just to have a better understanding of what I just said. We have our input sequence, say 1, 2, 3, which is fed into the system. After the system has accumulated a history of inputs and its state has been updated given the input sequence, we start outputting some results. The state keeps propagating forward; let's say we have three state updates after the input has stopped, and then we output our output sequence, which can be of any length. It does not have to match the input length.

This kind of scheme reflects the life of a PhD student. You take in a lot of information across the years: first year, second year, third year, maybe a fourth, and even a fifth (and then it becomes sad, but okay, let's move on). After these many years of input from a specific field and domain of study, you refine your internal state into a better and better representation. And then, once you have crafted your perfect state, someone wants to hire you, so that you can actually spit out years of good work based on your experience; all the companies are looking for is basically this very nice representation.

In the same way, we would like our network to perform a similar task with some specific input. For example, the input could be a sequence of symbols, such as the words in a sentence; then we have a representation of the concept that the sentence tries to express; and then we can output this concept in a different domain, perhaps in a different language. So this scheme reflects not only the PhD student's life but of course also neural machine translation, whose internal state shows very, very interesting characteristics, as we shall see.

In this case, we went from a sequence to a sequence, but we passed through a condensed representation, which is actually a vector. We can therefore also call this scheme an encoder-decoder: we have an encoder, which maps a sequence into a vectorial representation, and a decoder, which recovers the sequential nature of the domain we are working with from the internal vectorial representation. We can write this as an input sequence x of t, going through a hidden vectorial representation h without time index, and then expanding again into a prediction y hat of t, a vector over the time index t.
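Here is a minimal sketch of this encoder-decoder wiring, assuming PyTorch; the names and sizes are illustrative, and a real translation system would add attention, teacher forcing during training, and a proper stop symbol instead of a fixed output length.

```python
# A minimal encoder-decoder sketch (sequence -> vector -> sequence),
# assuming PyTorch. The encoder condenses the input into one hidden
# vector h; the decoder unrolls from h for as many steps as requested,
# so the output length need not match the input length.
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, vocab=1000, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, src, out_len, start_token=0):
        _, h = self.encoder(self.embed(src))       # h: the condensed vector
        tok = torch.full((src.size(0), 1), start_token, dtype=torch.long)
        logits_per_step = []
        for _ in range(out_len):                   # greedy step-by-step decoding
            out, h = self.decoder(self.embed(tok), h)
            logits = self.head(out[:, -1])
            tok = logits.argmax(-1, keepdim=True)  # feed prediction back in
            logits_per_step.append(logits)
        return torch.stack(logits_per_step, dim=1) # (batch, out_len, vocab)

model = EncoderDecoder()
src = torch.randint(0, 1000, (2, 7))               # input sequences of length 7
print(model(src, out_len=5).shape)                 # torch.Size([2, 5, 1000])
```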
As we said, the state of an encoder-decoder scheme has very interesting properties. Here we see a 2D projection, with maximum variance, of the embeddings of different words fed into a network used for neural machine translation. We input one sequence of words representing a sentence in a given language, obtained a compressed representation of the sentence, and then spat out another sequence of symbols representing the words of a sentence in another language. This chart represents the projection, onto the 2D plane with highest variance, of the words that are sent into this system. Look, for example, at the green cluster: all these green terms are names of months. Not surprisingly, they occupy a very tight region compared with the other words, because they can easily be swapped without making the sentence incorrect. The meaning will change, of course, since different months correspond to different time periods, but the correctness of the sentence is not affected. Moreover, if we instead insert whole sentences and project the embedding space onto a 2D space with maximum variance, we can see clusters of very similar meanings: here, for example, expressions for periods of time all appear close together. You can find more illustrations if you follow the reference below.

Finally, we have the last combination, which is the direct conversion of a sequence into a sequence. Let's draw it. We have our input sequence, let's say three symbols, which are fed into our network's hidden state, and the network immediately starts producing its output sequence as well. We send the first symbol and already expect some kind of output; then, given the first input and the second symbol, we expect a second prediction; and given the previous state and the new input, we expect the output to change again. I am sure every one of you is very familiar with a practical application of this scheme: yes, it is your lovely autocorrection, which tries to predict the next word given the characters we have just inserted. The pink circles represent the symbols, each character I input into my phone, and the blue circles represent the completion of the current word and perhaps a suggestion for the next word. We can call this one T9 or autocorrection: it outputs a sequence of symbols, for example words, given that I input a sequence of other symbols, in this case characters. So we can mix and match: we input characters and expect, perhaps, a probability distribution over the most likely word completions. This is our final combination, sequence to sequence.

The sequence-to-sequence mapping, we said, can be used for the autocorrection on your phone. Here I picked a very nice project in which the author made a package for the Atom text editor that can be used to write novels in a fantasy style, since it has been trained on fantasy corpora. Here we see "the rings of Saturn glittered while the", and then we ask for a prediction: "harsh eyes lead two men looked at each other". And then you can go on: "they were enemies, but", and let's see what happens, "the server robots were in concern". You can have fun with this project, which you can find on GitHub at the link below.
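As a last sketch, here is what this synced sequence-to-sequence wiring can look like in PyTorch: one prediction is emitted at every time step, for example a distribution over the next character. As before, the vocabulary and layer sizes are made-up toy values, not those of any of the systems mentioned above.

```python
# A minimal synced sequence-to-sequence sketch, assuming PyTorch.
# The network emits one prediction per input step, e.g. a distribution
# over the next character, as in the autocompletion example above.
import torch
import torch.nn as nn

class SeqToSeq(nn.Module):
    def __init__(self, vocab=96, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.RNN(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, x):                   # x: (batch, seq_len) characters
        h_seq, _ = self.rnn(self.embed(x))  # one hidden state per time step
        return self.head(h_seq)             # one prediction per time step

model = SeqToSeq()
chars = torch.randint(0, 96, (1, 12))       # a 12-character toy input
print(model(chars).shape)                   # torch.Size([1, 12, 96])
```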
A note on the diagrams: each circle represents a vector, and in depth, going from the bottom layers to the top layers, we represent the different layers of the network, while on the x-axis I represent the time index, so as we move from left to right we see different time steps. In the previous lessons, I was representing with a circle each individual activation of a specific layer; here, each circle represents the complete layer with all of its activations.

Stay tuned, because in the next video we will see the equations that govern the recurrent neural network and how they relate to the equations of our vanilla neural network. Moreover, we will see how to train this system with backpropagation through time.