On a surface level, the key feature that makes sequences and time series hard to handle, and different from the fixed-size vectors we're used to in deep learning, is that sequences come in different lengths. Sentences have more or fewer words or characters, utterances dictated to a phone can be longer or shorter, and the same goes for time series. All these sequences come with different lengths, but our fundamental deep learning architectures expect something of a fixed length. We're going to have to deal with that if we want to handle time series, and there are in fact a number of ways to do it. We need to take something like a sequence of words and convert it to an input of fixed length, so we can put it into a standard deep learning architecture like we know. There are five main ways to do that, which we're going to walk through quickly: truncation, bag of words, embeddings, convolution, and then what we'll mostly cover, recurrence. Let me walk through each of these one by one, because they're all simple but super useful. Well, recurrence is tricky, but the other ones are simple.

The first version says: I've got a set of sequences like abcaca. These are character sequences rather than word sequences, but it doesn't matter. If we wanted to make them all have a fixed length of four, we could simply take the ones that are too long and truncate them. Here I've kept the first four characters, abca; I could have kept the last four instead. You can truncate the end or truncate the beginning. Often with text, say with tweets, you truncate and keep the beginning. (Oh, but tweets are nice because they're fixed length, yay.) Some sequences are too short: the second one, cdd, has only three characters, so we pad it with some special padding character, here a zero, giving cdd0. Once I've done this, I have four observations, and all the sequences are the same length because we truncated or padded. It seems like an embarrassing way to do things, but it actually works sort of embarrassingly well.

The second method, a classic method used in natural language processing, is what's called a bag of words, or here you could think of it as a bag of characters. If I have only 26 characters, or only 100,000 words, I can just count how often each word or character shows up in the sequence. In the original sequence here there are three a's, one b, two c's, and zero of everything else. So I now have a 26-dimensional vector representation that captures how often each character appears. The second sequence has zero a's, zero b's, one c, and two d's. This works well for characters, where there are only 26 of them. It doesn't work so well for English words or Chinese characters, where there are tens of thousands or hundreds of thousands of them, and it's not used so much in modern natural language processing anymore.

Third, embeddings. We're going to talk a lot this week and next week about embeddings. The idea of an embedding is to take every word or every character and map it, somehow (we'll see how), to a real-valued vector. Here I've magically represented a by the vector (1.1, 0, 2.5, -2). How did that happen? Well, we'll see how to embed words in the future; we'll cover that. But if I have an embedding that maps every individual character or word to a vector, I can embed the whole sequence trivially: a happens three times, so take three times the a embedding; b happens once, so take one times the b embedding; c happens twice, so take two times the c embedding; d happens zero times, so take zero times the d embedding. Add them all up, or average them, which is probably better. Now we have a combined embedding for the whole sequence, which has the length of the embedding. We're going to see a bunch of fancier ways to embed, but this is one that I use surprisingly often. The trick is we'll have to learn how to go from any word in any language to an embedding; we'll cover that next week. A small sketch of these first three tricks follows below.
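To make those three tricks concrete, here is a minimal sketch in Python with NumPy. The length-4 cutoff, the tiny four-character vocabulary, and the randomly initialized embedding table are all assumptions for illustration, not anything fixed by the lecture; in practice the embedding table would be learned.

```python
import numpy as np

sequences = ["abcaca", "cdd"]              # toy character sequences from the example
vocab = "abcd"                             # assumed toy vocabulary (real text would use 26 letters or a word list)
char_to_id = {ch: i for i, ch in enumerate(vocab)}

# 1) Truncation / padding to a fixed length of 4 ("0" is the padding character).
def truncate_or_pad(seq, length=4, pad="0"):
    return seq[:length] + pad * max(0, length - len(seq))

fixed = [truncate_or_pad(s) for s in sequences]        # ['abca', 'cdd0']

# 2) Bag of characters: count how often each vocabulary item appears.
def bag_of_chars(seq):
    counts = np.zeros(len(vocab))
    for ch in seq:
        if ch in char_to_id:
            counts[char_to_id[ch]] += 1
    return counts

bags = np.stack([bag_of_chars(s) for s in sequences])  # [[3, 1, 2, 0], [0, 0, 1, 2]]

# 3) Averaged embeddings: map each character to a vector and average over the sequence.
embedding_dim = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))  # stand-in for a learned table

def average_embedding(seq):
    vecs = [embedding_table[char_to_id[ch]] for ch in seq if ch in char_to_id]
    return np.mean(vecs, axis=0)           # one fixed-length vector per sequence

print(fixed)
print(bags)
print(average_embedding("abcaca").shape)   # (4,)
```

Averaging rather than summing keeps the scale of the sequence vector independent of sequence length, which is usually what you want before feeding it into a standard network.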
Fourth, convolutions. We could also treat this sequence as a sequence of local pieces, here pieces of size three, and use a one-dimensional kernel. We take the first three characters, a, b, c, feed them as a fixed-size input into a kernel like in a convolutional net, and that gives us some hidden output. We then move that kernel down by a stride of, in this case, one character, take in b, c, a, and put out h2; step one further, take c, a, c, and put out h3. So we get a sequence of hidden outputs, which we run through a neural net, and as we go deeper and deeper we eventually get out some sequence of outputs whose length depends on how long the overall original sequence is. We've got some compression here, because we've taken something that compresses local regions, but we still need to take these outputs and convert them to a single number or vector. We could take the maximum of them, which might be a good way to detect something like "is there a person in this sequence?", or we could take the sum of them, which might be a good way to say how positive or negative a sequence is; either way, we need some way to pool over them. You'll see examples of these in the assignments, and there's a small sketch of a one-dimensional convolution with pooling below.

And finally, recurrence, but we're going to come back and cover that entirely separately. The whole week really is about doing recurrence.
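As a rough illustration of the convolutional approach, here is a minimal sketch using PyTorch's Conv1d. The embedding size, the number of filters, the kernel width of three, and the max/sum pooling choices are assumptions for illustration, not the specific architecture used in the assignments.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, num_filters = 5, 8, 16    # assumed sizes; id 0 is reserved for padding

embed = nn.Embedding(vocab_size, embedding_dim)
conv = nn.Conv1d(in_channels=embedding_dim, out_channels=num_filters,
                 kernel_size=3, stride=1)             # window of 3 characters, stride of 1

# A batch of already-padded character id sequences (0 is the padding id).
ids = torch.tensor([[1, 2, 3, 1, 3, 1],               # "abcaca"
                    [3, 4, 4, 0, 0, 0]])              # "cdd" padded to length 6

x = embed(ids)                   # (batch, seq_len, embedding_dim)
x = x.transpose(1, 2)            # Conv1d expects (batch, channels, seq_len)
h = torch.relu(conv(x))          # (batch, num_filters, seq_len - 2): one hidden vector per window

# Collapse the variable-length axis into a fixed-size vector.
max_pooled = h.max(dim=2).values      # good for "does this pattern occur anywhere in the sequence?"
sum_pooled = h.sum(dim=2)             # good for "how much of this is there overall?"

print(max_pooled.shape, sum_pooled.shape)   # torch.Size([2, 16]) for both
```

The point of the pooling step is exactly the conversion discussed above: the convolution outputs still vary in length with the input, and the max or sum is what turns them into one fixed-size vector a standard network can consume.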