So I was going to start by going back to clustering. We're going to talk about clustering again in the next lesson or two in terms of an application of it, but specifically what I wanted to do was show you k-means clustering in TensorFlow. There are some things which are easier to do in TensorFlow than in PyTorch, mainly because TensorFlow has a more complete API so far. So there are some things where you look in TensorFlow and there's already a method that does exactly what you want, but there isn't one in PyTorch; some things are just a bit easier and neater in TensorFlow. And I actually found k-means quite easy to do. What I want to do is show you a way to write custom TensorFlow code in a really PyTorch-like, interactive way, and we're going to try to avoid all of the fancy session-y, graph-y, scope-y business as much as possible. To remind you, the way we initially came at clustering was to say: what if we were doing lung cancer detection in CT scans? These are something like 512 by 512 by 200 volumetric arrays, which is too big to conveniently run a whole CNN over. So one thought for fixing that was to run some kind of heuristic that found all of the things that looked like they could vaguely be nodules, and then create a new dataset where you zoom into each of those maybe-nodules and create a small 20 by 20 by 20 cube or so, and you could then use a 3D CNN on that, or try a planar CNN. I wanted to remind you of this general concept because I feel it's something I maybe haven't stressed enough; I've kept showing you ways of doing it. Think back to lesson seven with the fish: I showed you the bounding boxes, and I showed you the heat maps. The reason for all of that was basically to show you how to zoom into things, and then create new models based on those zoomed-in things. In the fisheries case, we could really just use a lower-res CNN to find the maybe-fish and then zoom into those. In the CT scan case, maybe we can't even do that, so maybe we need to use this kind of mean shift clustering approach. I'm not saying we necessarily do; it would be interesting to see what the winners used. But particularly if you don't have lots of time, or you have a lot of data, heuristics become more and more interesting. The reason a heuristic is interesting is that you can do something quick and approximate that has lots and lots of false positives, and that doesn't really matter, because those false positives just mean extra data that you're feeding to your real model. So you can always tune it: how much time have I got to train my real model, and therefore how many false positives can I handle? As long as your preprocessing model is better than nothing, you can use it to get rid of the stuff that is clearly not a nodule: a thing in the middle of the lung wall is not a nodule, a thing that is all white space is not a nodule, and so forth. OK, so we talked about mean shift clustering and how the big benefit of it is that it allows us to build clusters without knowing ahead of time how many clusters there are.
Also, without any special extra work, it allows us to find clusters which aren't Gaussian (spherical, if you like) in shape. That's really important for something like a CT scan, where a cluster will often be something like a vessel, which is a really skinny, long thing. K-means, on the other hand, is faster; I think it's n squared rather than n cubed time. We have talked, particularly on the forum, about dramatically speeding up mean shift clustering using approximate nearest neighbors, which is something we started making some progress on today, so hopefully we'll have results from that maybe by next week. But the basic naive algorithm should certainly be a lot faster for k-means, so there's one good reason to use it. So, as per usual, we start with some data, and we're going to try to figure out where the cluster centers are. One quick way to avoid hassles in TensorFlow is to create an interactive session. An interactive session basically means that you can call .run() on a computation graph that doesn't return something, or .eval() on a computation graph that does return something, and you don't have to worry about creating a graph or a session, or having a session with-clause, or anything like that. It just works. That's basically what happens when you call tf.InteractiveSession(). By creating an interactive session, we can do things one step at a time. In this case, the first step in k-means is to pick some initial centroids: if we're going to create however many clusters (in this case n_clusters is six), where might those six clusters be? For a long time with k-means, people picked them randomly, but most practitioners realized soon enough that that was a dumb idea, and a lot of people have various heuristics for picking them. In 2007, a paper was finally published actually suggesting a heuristic. I tend to use a very simple heuristic, which is what I use here in find_initial_centroids. To describe this heuristic, I'll show you the code; I'm going to run through it quickly and then I'll run through it slowly. Basically, the idea is that we first pick a single data point index, and then we select that single data point, so we have one randomly selected data point. Then we find the distance from that randomly selected data point to every other data point, and we ask: what is the data point that is furthest away from it? We want both its index and the point itself. Then we append that to the initial centroids. So say I picked this point at random as my initial point; the furthest point away from it is probably somewhere over here. That would be the first centroid we picked. We're now inside a loop, so we go back and repeat the process: we replace our random point with the actual first centroid and go through the loop once more. If our first centroid is here, our second one might now be somewhere over here. So we now have two centroids. The next time through the loop, things get slightly more interesting: all_distances will now contain the distance between every one of our initial centroids and every other data point, so we've got a matrix in this case.
It's going to be 2 by the number of data points. So then we say: for every data point, find the closest cluster, that is, what's its distance to the closest initial centroid? And then: tell me which data point is furthest away from its closest initial centroid. In other words, which data point is furthest away from any centroid? So that's the basic algorithm. So let's look and see how we actually do that in TensorFlow. It looks a lot like NumPy, except that in places where you would expect to see np you see tf, and the API is a little different, but not too different. To get a random number, we can just use tf.random_uniform. We can tell it what type of random number we want; we want a random int because we're trying to get a random index, to choose a random data point, and it's going to be between zero and the number of data points we have. So that gives us a random index, and we can now go ahead and index into our data. Now, you'll notice I've created something called v_data. So what is v_data? When we set up this KMeans object in the first place, the data was passed in as a NumPy array, and then I called tf.Variable on it. Now, this is the critical thing that lets us make TensorFlow feel more like PyTorch. Once I do this, the data is basically copied to the GPU, and so when I call something using v_data, I'm calling this GPU object. There's one problematic thing to be aware of, which is that the copying does not actually occur when you call tf.Variable; the copying occurs when you run the initializer. So any time you call tf.Variable, if you then try to run something using that variable, you'll get back an uninitialized variable error unless you've run the initializer in the meantime. This is performance-oriented design in TensorFlow: you can set up lots of variables at once, then call the initializer, and it does them all at once for you. OK, so earlier on we created this KMeans object. We know that in Python, when you create an object, it calls __init__; that's just how Python works. Inside that, we copied the data to the GPU by using tf.Variable, and then inside find_initial_centroids we can access it in order to do calculations involving the data on the GPU. In TensorFlow, pretty much everything takes and returns a tensor. So when you call random_uniform, it gives us a tensor, an array of random numbers; in this case we just wanted one of them, so we have to use tf.squeeze to take that tensor and turn it into a scalar, because we're just indexing in here to get a single item back. Now that we've got that single item back, we expand it back into a tensor again, because inside our loop, remember, this is going to be a list of initial centroids; it's just that the list happens to be of length one at the moment. So one of the tricks in making TensorFlow feel more like PyTorch is to use standard Python loops. In a lot of TensorFlow code, the more serious, performance-intensive stuff, you'll see people use TensorFlow-specific loops like tf.while_loop or tf.scan or map and so forth. The challenge with those kinds of loops is that they create a computation graph of the loop: you can't step through it, and you can't use it in the normal Pythonic ways. So we can just use normal Python loops, if we're careful about how we do it.
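To make that concrete, here's a minimal sketch of the InteractiveSession / tf.Variable / .eval() round trip just described (TF 1.x-era API; the toy data and the variable names are my own, not the lesson's exact code):

```python
import numpy as np
import tensorflow as tf

data = np.random.randn(1000, 2).astype(np.float32)  # hypothetical toy dataset
n = len(data)

sess = tf.InteractiveSession()       # lets us call .run()/.eval() without a with-block
v_data = tf.Variable(data)           # declares the GPU copy of the data...
tf.global_variables_initializer().run()   # ...but the copy only actually happens here

# pick one random data point, as in the first step of the heuristic
idx = tf.squeeze(tf.random_uniform([1], 0, n, dtype=tf.int64))  # tensor -> scalar
first = tf.expand_dims(tf.gather(v_data, idx), 0)   # back to a (1, 2) "list" of centroids

print(first.eval())   # runs the graph and returns a regular NumPy array
```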
OK, so inside our normal Python loop, we can use normal Python functions. Here is a function I created which calculates the distance between everything in one tensor and everything in another tensor. all_distances looks very familiar, because it looks a lot like the PyTorch code we had: for the first tensor we add an additional axis at axis zero, and for the second we add an additional axis at axis one. The reason this works is broadcasting. A, when it starts out, is a vector, and B is a vector. Now, is A a column or is A a row? What's its orientation? The answer is that it's both and it's neither: it's one-dimensional, so it has no concept of which direction it's pointing. So what we do is call expand_dims with axis zero, so that's rows, which basically says to A: you are now definitely a row vector; you have one row and however many columns as before. Whereas with B, we add an axis at the end, so B is now definitely a column vector: it has one column and however many rows it had before. With broadcasting, what happens is that this one gets broadcast to this length and this one gets broadcast to this length, so we end up with a matrix containing the difference between every one of these items and every one of those items. So that's the simple but powerful concept of how we can do very fast GPU-accelerated loops, in less code than it would have taken to actually write the loop, and we don't have to worry about out-of-bounds conditions or anything like that; it's all done for us. So that's the trick here. And once we've got that matrix, because in TensorFlow everything is a tensor, we can call squared_difference rather than a regular difference, and it gives us the squares of those differences, and then we can sum over the last axis, which is the dimensions. So we're just creating a Euclidean distance here. That's all this code does: it gives us the distance between every element of A and every element of B. OK, so that's how we get to this point. So then, let's say we've gone through a couple of loops, so R contains a few initial centroids. We now want to find out, for every point, how far away it is from its nearest initial centroid. When we call reduce_min with axis=0, we know it's reducing across the centroid axis, because that's how we set up our all_distances function; so it's reducing across our centroids. At the end of this it says: for every piece of our data, here is how far it is from its nearest centroid. And that returns the actual distance, because we asked for the actual min. Then there's the difference between min and the arg version: argmax says, go through all of the points (we now know how far away each one is from its closest centroid) and tell me the index of the one that is furthest away. argmax is a super handy function; we used it quite a bit in part one of the course, but it's well worth making sure we understand how it works.
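Here's a hedged reconstruction of that broadcasting trick, generalized to points in d dimensions (so there's a third axis to sum over); shapes are annotated in the comments:

```python
def all_distances(a, b):
    # a: (n, d) data points; b: (k, d) centroids so far.
    # (1, n, d) - (k, 1, d) broadcasts to (k, n, d); summing over d
    # gives a (k, n) matrix of squared Euclidean distances.
    diff = tf.squared_difference(tf.expand_dims(a, 0), tf.expand_dims(b, 1))
    return tf.reduce_sum(diff, 2)

r = first                               # the (1, d) centroid list from the sketch above
dist = all_distances(v_data, r)         # (1, n) now; (k, n) once r holds k centroids
nearest_dist = tf.reduce_min(dist, 0)   # per point: distance to its closest centroid
farthest = tf.argmax(nearest_dist, 0)   # index of the point furthest from every centroid
```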
By the way, I think in TensorFlow they're getting rid of these reduce_ prefixes; I'm not sure, I think I read that somewhere. So in some version you may find this is called min rather than reduce_min; I certainly hope so. For those of you who don't have a computer science background, a reduction basically means taking something in a higher dimension and squishing it down into something of a lower dimension; for example, summing a vector and turning it into a scalar is called a reduction. So this is a very TensorFlow-ish API choice, assuming everybody's a computer scientist who wouldn't look for min but would look for reduce_min. So that's how we got that index. Generally speaking, you have to be a bit careful of data types. I generally don't really notice data type problems until I get the error, but if you get an error that says you passed an int64 into something that expected an int32, you can always just cast things, like this. We need to index into something with an int32, so we just cast it. And then this returns the actual point, which we append; and then this is very similar to NumPy: stacking together the initial centroids to create a tensor of them. OK, so the code doesn't look at all weird or different, but it's important to remember that when we run this code, nothing happens, other than that it creates a computation graph. So when we call k.find_initial_centroids, nothing happens. But because we're in an interactive session, we can now call .eval(), and that actually runs it; and when it runs, it takes the data that's returned, copies it off the GPU, and puts it back on the CPU as a NumPy array. So it's important to remember that after you call eval, we have an actual, genuine, regular NumPy array here. And this is the thing that lets us write code that looks a lot like PyTorch code, because we now know that we can take something that's a NumPy array and turn it into a GPU tensor like that, and we can take something that's a GPU tensor and turn it into a NumPy array like that. I suspect this might make TensorFlow developers shudder at how horrible it is; it's not really quite the way you're meant to do things, I think, but it's super easy and it seems to work pretty well. With this approach where we're calling .eval(), you do need to be a bit careful: if we were calling eval inside a loop, and we were copying a really, really big chunk of data back and forth between the GPU and the CPU again and again, that would be a performance nightmare. So you do need to think about what's going on as you do it; we'll look inside the inner loop in a moment and check. Anyway, the results are pretty fantastic. As you can see, this little hacky heuristic does a great job. It's a hacky heuristic I've been using for decades now, and it's the kind of thing which often doesn't appear in papers. In this case, a similar hacky heuristic did actually appear in a paper in 2007, and an even better one appeared just last year. But it's always worth thinking about how you can process your data to get it close to where you might want it to be; often these kinds of approaches are useful.
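Putting those pieces together, the whole heuristic might look roughly like this (my reconstruction following the description above, not the course's exact code):

```python
def find_initial_centroids(v_data, k, n):
    # seed with one random point, then k times take the point that is
    # furthest from its closest centroid so far
    idx = tf.squeeze(tf.random_uniform([1], 0, n, dtype=tf.int64))
    r = tf.expand_dims(tf.gather(v_data, idx), 0)         # (1, d)
    for i in range(k):
        dist = all_distances(v_data, r)                   # (centroids so far, n)
        farthest = tf.cast(tf.argmax(tf.reduce_min(dist, 0), 0), tf.int32)
        new_c = tf.expand_dims(tf.gather(v_data, farthest), 0)
        # the first pass replaces the random seed; later passes append
        r = new_c if i == 0 else tf.concat([r, new_c], 0)
    return r                                              # (k, d)

initial_centroids = find_initial_centroids(v_data, 6, n).eval()   # back to NumPy
```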
There's actually (I don't know if we'll have time to talk about it someday) an approach to doing PCA, principal components analysis, which has a similar flavor: basically picking random points and finding the furthest points away from them. So it's a good general technique, actually. All right, so we've got our initial centroids. What do we do next? Well, we're going to be doing more computation in TensorFlow with them, so we want to copy them back to the GPU. And because we copied them to the GPU, before we do an eval or anything later on, we're going to have to make sure we call tf.global_variables_initializer().run(). Question: can you explain what happens if you don't create an interactive session? So, what the TensorFlow authors decided to do in their wisdom was to generate their own whole concept of namespaces and variables and whatever else; rather than using Python's, there's TensorFlow's. A session is basically a kind of namespace that holds the computation graphs and the variables and so forth. Then there's the concept of a context manager, which is where you have a with-clause in Python and say "with this session": now you're going to do some stuff in this namespace. And then there's the concept of a graph. You can have multiple computation graphs, and you can say "with this graph", and create various computations in it. Where that comes in very handy is if you want to say: run this graph on this GPU, or stick this variable on that GPU. Without an interactive session, you basically have to create that session, and you have to say which session to use with a with-clause, and there are many layers of that; within it you can then create name scopes and variable scopes and so on. The annoying thing is that the vast majority of tutorial code out there uses all of these concepts, as if all of Python's OO and variables and modules didn't exist and you used TensorFlow for everything. So I wanted to show you that you don't have to use any of these concepts, pretty much. A question from the audience: I'm still thinking this through, but have you tried this? You've got six clusters up there. If you had initially said "I have seven clusters" or "eight clusters", what you would find after you hit your sixth is that you'd suddenly start getting centroids that were very close to existing centroids. So it seems like you could somehow intelligently define a width of a cluster, or look for a jump, with things dropping down in how far apart they are from some other cluster, and programmatically come up with a way to decide the number. Yeah, I think you could. Maybe then you're not even using k-means; I don't know. I think it's a fascinating question, and I haven't seen it done. There are certainly papers about figuring out the number of clusters in k-means, so maybe during the week you could check one out and port it to TensorFlow. That'd be really interesting. And to follow up on what you said about sessions: with a lot of tutorials, you could make the code simpler by using an interactive session in a Jupyter notebook instead. Yeah; I remember when Rachel was going through a TensorFlow course a while ago, she kept banging her head against the desk with sessions and variable scopes and whatever else, and that was part of what led us to think: OK, let's simplify all that.
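For reference, here's roughly the boilerplate pattern those tutorials use when there's no interactive session (standard TF 1.x API):

```python
graph = tf.Graph()
with graph.as_default():                 # ops defined here belong to this graph
    v = tf.Variable(tf.zeros([10]))
    doubled = v * 2
    init = tf.global_variables_initializer()

with tf.Session(graph=graph) as sess:    # the with-clause says which session to use
    sess.run(init)
    result = sess.run(doubled)           # explicit sess.run instead of .eval()
```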
All right. So step one was to take our initial centroids and copy them onto the GPU, so we now have a symbol representing them. The next step in the k-means algorithm is to take every point and assign it to a cluster, which is basically to say, for every point, which of the centroids is closest. That's what assign_to_nearest does. We'll get to it in a moment, but let's pretend we've done it: this will now be a list of which centroid is closest, for every data point. Then we need one more piece of TensorFlow machinery, which is that we want to update an existing variable with some new data. We can call update_centroids to do that updating, and I'll show you how that works as well. So basically, the idea is that we're going to loop through, doing this again and again; but even when we just do it once, you can see it's nearly perfect already. So it's a pretty powerful idea, as long as your initial cluster centers are good. So let's see how assign_to_nearest works. It's a single line of code, and the reason it's a single line of code is that we already have the code to find the distance between every piece of data and every centroid. Rather than calling tf.reduce_min, which returned the distance to the nearest centroid, we call tf.argmin to get the index of the nearest centroid. Generally speaking, the hard bit of doing this kind of highly vectorized code is figuring out this number: which axis we're working with. So it's a good idea to actually write down, on a piece of paper, for each of your tensors: it's time by batch by row by column, or whatever; make sure you know what every axis represents. When I'm creating these algorithms, I'm constantly printing out the shape of things. And another really simple trick, which a lot of people don't use: make sure that your different dimensions actually have different sizes. When you're playing around testing things, don't have a batch size of 10 and an n of 10 and a number of dimensions of 10. I find it much easier to think with real numbers, so have a batch size of 8 and an n of 10 and a dimensionality of 4; that way, every time you print out a shape, you know exactly which axis is which. OK, so this returns the nearest indices. Then we can go ahead and update the centroids. So here is update_centroids, and suddenly we have some crazy function. This is where TensorFlow is super handy: it's full of crazy functions, and if you know the computer science term for the thing you're trying to do, the function is generally called that; otherwise, the only way to find it is lots and lots of searching through the documentation. In general, taking a set of data and splitting it into multiple chunks of data according to some kind of criterion is called partitioning in computer science. So I got a bit lucky: when I first looked for this, I googled for "tensorflow partition", and bang, this thing popped up. So let's take a look at it. And this is where reading about GPU programming in general is very helpful, because in GPU programming there's a smallish subset of primitives which everything else is built on, and one of them is partitioning. So here we have tf.dynamic_partition.
It partitions the data into some number of partitions using some indices. And, generally speaking, it's easiest to just look at some code. So here's our data; we're going to create two partitions (we're calling them clusters), and it goes like this: zero, zero, one. So 10 goes into partition zero, 20 goes into partition zero, and 30 goes into partition one. OK, this is exactly what we want. And this is the nice thing: there are so many functions available that often there's the exact function you need, and here it is. So we just take our list of indices, convert it to a list of int32s, pass in our data, the indices, and the number of clusters, and we're done: this is now a separate array, basically a separate tensor, for each of our clusters. Now that we've done that, we can figure out the mean of each of those clusters, and the mean of each of those clusters is our new centroid. So what we're doing is saying: OK, which points are the closest to this centroid? These points are the closest; what's the average of those points? That's all that happened from here to here. So that's taking the mean of those points, and then we can basically say: those are our new clusters, and just join them all together, concatenate them together. So, except for that dynamic_partition (in fact, even including dynamic_partition), that was incredibly simple; but it was incredibly simple because we had a function that did exactly what we wanted. Because we assigned a variable up here, we have to run the initializer, and of course, before we can do anything with this tensor, we have to call .eval() to actually run the computation graph and copy the result back to the CPU. OK, so those are all the steps. Then we want to replace the contents of current_centroids with the contents of updated_centroids, and to do that we can't just say equals (everything's different in TensorFlow); you have to call .assign. This is the same as basically saying current_centroids = updated_centroids, but it creates a computation graph that does that assignment on the GPU. Question: how could we extrapolate this to other, non-numeric data types, such as words or images? Well, they're all numeric data types, really. An image is absolutely a numeric data type; it's just a bunch of pixels. You just have to decide what distance measure you want, which generally means deciding: you're probably using Euclidean distance, but are you doing it in pixel space, or on one of the activation layers of a neural net? For words, you would create word vectors for your words. There's nothing specifically two-dimensional about this; it works in as many dimensions as we like, and that's really the whole point. I'm hoping that maybe during the week some people will start to play around with some higher-dimensional datasets to get a feel for how this works; particularly if you can get it working on CT scans, that would be fascinating, using the five-dimensional clustering we talked about. OK, so here's what it looks like in total if we weren't using an interactive session. You basically say "with tf.Session().as_default()": that creates a session and sets it as the current session, and then within the with-block we can run things.
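Here are those remaining steps collected into one sketch (again my reconstruction of the code being described, with the function and variable names assumed):

```python
def assign_to_nearest(v_data, centroids):
    # index (not distance) of the closest centroid, for every point
    return tf.argmin(all_distances(v_data, centroids), 0)

def update_centroids(v_data, nearest_indices, n_clusters):
    # split the data into one tensor per cluster; each mean is a new centroid
    parts = tf.dynamic_partition(v_data, tf.cast(nearest_indices, tf.int32), n_clusters)
    return tf.concat([tf.expand_dims(tf.reduce_mean(p, 0), 0) for p in parts], 0)

v_centroids = tf.Variable(initial_centroids)   # the NumPy array from earlier
tf.global_variables_initializer().run()        # new variable, so initialize again

for i in range(10):
    old = v_centroids.eval()
    nearest = assign_to_nearest(v_data, v_centroids)
    v_centroids.assign(update_centroids(v_data, nearest, 6)).run()
    if np.allclose(old, v_centroids.eval()):   # Python-level convergence check
        break
```

Note that this builds new graph ops on every pass through the loop, and the eval calls copy the centroids back to the CPU each time; that's fine here because the centroids are tiny, but as mentioned above, you wouldn't want to do it with big tensors.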
And then k.run does all the stuff we just saw; if we go to k.run, here it is, doing all of those steps. So this is how you can create a complete computation graph in TensorFlow using a notebook: you do each piece one step at a time, and once you've got it working, you put it all together. You can see find_initial_centroids, .eval(), put it back into a variable again, assign_to_nearest, update_centroids. Because we created a variable in the process there, we then have to rerun the global variable initializer. We could have avoided this, I guess, by not calling eval and just treating it as a variable the whole time, but it doesn't matter; this works fine. And then we just loop through a bunch of times, calling centroids.assign(updated_centroids). Oh, I think I see a bug: what we should be doing after that is calling update_centroids again each time. There you go; I'll fix that during the week. And the nice thing is that because I've used a normal Python for loop here, and I'm calling .eval() each time, I can check: have any of the cluster centroids moved? And if they haven't, then I stop. So it makes it very easy to create dynamic for loops, which could otherwise be quite tricky in TensorFlow. OK, so that is the TensorFlow algorithm from end to end. And Rachel, do you want to pick out an AMA question? So, actually, I am helping start a company. I don't know if you've seen my talk on TED.com, but I show this demo of an interactive labeling tool, and a friend of mine said that he wanted to start a company to actually make that and commercialize it. So I guess my short answer is: I'm helping somebody do that, because I think it's pretty cool. More generally, I've mentioned before that I think the best thing to do is always to scratch an itch: pick whatever you've been passionate about, or something that's just driven you crazy, and fix it. If you have the benefit of being able to take enough time to do absolutely anything you want: I felt that the three most important areas for applying deep learning, when I last looked, which was two or three years ago, were medicine, robotics, and satellite imagery. Because at that time, computer vision was the only area that was remotely mature for machine learning and deep learning, and those three areas all very heavily used computer vision, or could heavily use it, and were potentially very large markets. Medicine is probably the largest industry in the world; I think it's three trillion dollars in America alone. Robotics isn't currently that large, but at some point it probably will become the largest industry in the world, if everything we currently do manually gets replaced with automated approaches. And satellite imagery is massively used by military and intelligence, which have some of the biggest budgets in the world. So yeah, those three areas. Can you keep going? No; I found some higher-voted questions. OK, next time. All right, I'm going to take a break soon; before I do, I might just introduce what we're going to be looking at next. We're going to start on our NLP, and specifically translation, deep dive. We're going to be really following on from the end-to-end memory networks from last week. One of the things that I find most interesting and most challenging in setting up this course is coming up with good problem sets, which have to be hard enough to be interesting and easy enough to be possible; and often other people have already done that.
So I was lucky enough that somebody else had already shown an example of using sequence-to-sequence learning for what they called spelling bee. Basically, we start with this thing called the CMU Pronouncing Dictionary, which has entries that look like this: "Zewiki", followed by a phonetic description of how to say "Zewiki". This is actually specifically an American pronunciation dictionary. The consonants are pretty straightforward; the vowel sounds have a number at the end showing how much stress is on each one: zero, one, or two. So in this case, you can see that the middle vowel is where most of the stress is: "zeWIKi". And here is the letter A, and it is pronounced "ah". OK, so the goal that we're going to be looking at after the break is the other direction: start with how you say it, and turn that into how you spell it. This is quite a difficult problem, because English is really weird to spell, and the number of phonemes doesn't necessarily match the number of letters. So this is where we're going to start, and we're going to try to solve this puzzle; then we'll use the solution to this puzzle to try to learn to translate French into English using the same basic idea. So let's have a 10-minute break and come back at 7:40. Just to clarify, and to make sure everybody understands the problem we're solving here: we're going to be told how to pronounce something, and then we have to say how to spell it. So the pronunciation is our input and the spelling is our target. It's like a translation problem, but a bit simpler. We don't have pre-trained phoneme vectors or pre-trained letter vectors, so we're going to have to do this by building a model, and we're going to have to create some embeddings of our own. In general, the first steps necessary to create an NLP model tend to look very, very similar. I feel like I've done them in a thousand different ways now, and at some point I really need to abstract this out into a simple set of functions that we use again and again. But let's go through it, and if you've got any questions about any of the code or steps, let me know. So the basic pronunciation dictionary is just a text file, and I'm going to grab just the lines which are actual words, so they have to start with a letter.
Now, something which I have... actually, let's come back to that. OK, so we're going to go through every line in the text file. Here's a handy thing that a lot of people don't realize you can do in Python: when you call open, that returns a generator which yields all of the lines, so if you just write "for l in open(...)", you're looping through every line in that file. So I can then filter for those which start with an uppercase letter (the words are all uppercase), then strip off any whitespace and split on whitespace. Those are the steps necessary to separate the word from the pronunciation, and then the pronunciation itself is just whitespace-delimited, so we can split that too. So those are the steps necessary to get the word, and the pronunciation as a list of phonemes. As we pretty much always do with these language models, we next need to get a list of all the vocabulary items; in this case, the vocabulary items are all the possible phonemes. So we can create a set of every phoneme that occurs, and then sort it. And what we always like to do is get an extra character, an extra object, in position zero, because remember we use zero for padding; that's why I stick an underscore, our special padding symbol, at the front. So here are the first five phonemes: this is our special padding one, which is going to be index zero, and then there's "ah" with its three different stress markers, and so forth. Now, the next thing we tend to do, any time we've got a list of vocabulary items, is to create a mapping in the opposite direction, going from phoneme to index. That's just a dictionary where we enumerate through all of our phonemes and put them in the opposite order. I know we've used this approach a thousand times before, but I just want to make sure everybody understands it. When you use enumerate in Python, it doesn't just return each phoneme; it returns a tuple containing the index of the phoneme and then the phoneme itself, that is, the key and then the value. So if we go "value comma key", that's now the phoneme followed by the index, and if we turn that into a dictionary, we have a dictionary where you can give it a phoneme and it returns the index. Here are all the letters of English, again with our special underscore at the front, and we've got one extra thing we'll talk about later, which is an asterisk. So that's our list of letters, and again, to go from letter to letter index, we just create a dictionary which reverses it. So now that we've got our phoneme-to-index and letter-to-index, we can use them to convert this data into numerical data, which is what we always do with these language models: we end up with just lists of indices. We can pick some maximum length word, so I'm going to say 15, and we create a dictionary which maps from each word to its list of phonemes, getting the index for each one.
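A minimal sketch of those parsing steps (the filename, the encoding, and the variable names are assumptions, not necessarily what the course notebook uses):

```python
# each entry looks something like: "ZEWIKI  Z W IH1 K IY0" (word, then phonemes)
lines = (l.strip().split() for l in open('cmudict-0.7b', encoding='latin1')
         if l[0].isupper())                # keep only lines that are actual words
pronounce_dict = {w.lower(): ps for w, *ps in lines}

phonemes = ['_'] + sorted(set(p for ps in pronounce_dict.values() for p in ps))
p2i = {v: k for k, v in enumerate(phonemes)}      # phoneme -> index

letters = '_abcdefghijklmnopqrstuvwxyz*'          # 28 symbols, padding at index 0
l2i = {v: k for k, v in enumerate(letters)}       # letter -> index
```

Notice that the line building `phonemes` uses the flattened, no-inner-brackets comprehension pattern, which is exactly what's discussed next.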
Yes, Rachel? OK, so this dictionary comprehension is a little bit awkward, so I thought this would be a good opportunity to talk about dictionary comprehensions and list comprehensions for a moment. First, let's look at a couple of examples of list comprehensions. The first thing to note is that when you write something like the string 'xyz', Python is perfectly happy to consider it a list of letters; Python treats it the same as the list ['x', 'y', 'z']. So you can think of this as two lists: a list of 'xyz' and a list of 'abc'. Here is the simplest possible list comprehension: go through every element of a and put it into a list. If I run that, it returns exactly what I started with. OK, so that's not very interesting. What if we now replace o with another list comprehension? What that does is return a list for each list. So this is one way of pulling things out of sub-lists: take the thing that was here and replace it with a new list comprehension, and that gives you a list of lists. Now, the reason I wanted to talk about this is that, quite confusingly, in Python you can also write this, which is different. In this case, I'm going through each object in our list a, and then through each object in that sub-list. And do you see what's different here? I don't have the inner square brackets; it's all just laid out next to each other. I find this really confusing, but the idea is that you're meant to read it like a normal for loop inside a for loop. So it goes through 'xyz' and then 'abc', and within 'xyz' it goes through each of x and y and z; but because there's no embedded set of square brackets, it ends up flattening the list. We just saw an example of the square-bracket version, and pretty soon we'll see an example of this version as well. These are both useful: it's very useful to be able to flatten a list, and it's very useful to be able to do things with sub-lists. And be aware that any time you have an expression like this, you can replace the thing at the front with any expression you like; we could say, for example, o.upper(), so you can map different computations onto each element of a list. Then the second thing you can do is put an if at the end to filter it: if o[0] == 'x'. (What did I do wrong there? ... Right, thank you.) OK, so that's basically it: you can create any list comprehension you like by putting computations here, filters here, and optionally multiple nested loops here. The other thing you can do is replace the square brackets with curly brackets, in which case you need to put something before a colon and something after it: the thing before is your key, and the thing after is your value. And then there's one more thing you can do: if the thing you're looping through is a bunch of lists or tuples or anything like that, you can unpack them into pieces, like so. So this is the word and this is the list of phonemes: the lowercase word will be our key in the dictionary, and the value will be a list, just like we did down here, and the list will be: go through each phoneme and look up phoneme-to-index. So now we have something that maps from every word to its list of phoneme indexes. We can also find out the maximum number of phonemes in any pronunciation: we just go through every one of those dictionary items, call len on each one, and take the max of that. So there is the maximum length; you can see that combining list comprehensions with other functions is powerful too.
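Here are those comprehension patterns side by side (toy examples of my own):

```python
a = ['xyz', 'abc']                     # Python treats each string as a list of letters

[o for o in a]                         # ['xyz', 'abc']                  identity
[[p for p in o] for o in a]            # [['x','y','z'], ['a','b','c']]  list of lists
[p for o in a for p in o]              # ['x','y','z','a','b','c']       flattened
[o.upper() for o in a if o[0] == 'x']  # ['XYZ']                         map + filter

# curly brackets with key:value give a dict comprehension, and tuples
# unpack in the loop; this maps each word to its phoneme indices
pdict = {w: [p2i[p] for p in ps] for w, ps in pronounce_dict.items()}
maxlen = max(len(ps) for ps in pdict.values())   # combining with other functions
```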
So finally, we're going to create our nice square arrays. Normally we'd do this with Keras's pad_sequences; just for a change, we're going to do it manually this time. The key is that we start out by creating two arrays of zeros, because all the padding is going to be zero; if we start off with all zeros, then we can just fill in the non-zeros. One of these is going to hold all of our phonemes; the other is going to hold the actual spellings, our target labels. So then we go through everything in the pronunciation dictionary, permuted randomly, and into the input we put all the items from the pronunciation dictionary, and into the labels we put letter-to-index of each letter. So we now have one thing called input and one thing called labels, which are nice rectangular arrays padded with zeros, containing exactly what we want. I'm not going to worry about this line yet, because we're not going to use it for the starting point; you see "dec" something there, just ignore that for now and we'll get back to it later. train_test_split is a very handy function from sklearn that takes all of these arrays and splits them all in the same way, with this proportion going into the test set; so input becomes input_train and input_test, and labels becomes labels_train and labels_test. That's pretty handy. We've often written that manually, but this is a nice quick way to do it when you've got lots of arrays to split. OK, so let's have a look at how many phonemes we have in our vocabulary: there are 70. And how many letters in our vocabulary? There are 28, because we've got the underscore and the asterisk as well.
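A sketch of that padding-and-splitting step, building on the names from the sketches above (it assumes pdict maps words to phoneme-index lists and that the words contain only our 28 symbols; the 0.1 test fraction is a guess):

```python
import numpy as np
from sklearn.model_selection import train_test_split

maxlen_p, maxlen_l = 16, 15
words = np.random.permutation([w for w in pdict
                               if len(w) <= maxlen_l and len(pdict[w]) <= maxlen_p])

input_ = np.zeros((len(words), maxlen_p), dtype='int64')  # phoneme indices, zero-padded
labels = np.zeros((len(words), maxlen_l), dtype='int64')  # letter indices, zero-padded
for i, w in enumerate(words):
    input_[i, :len(pdict[w])] = pdict[w]
    labels[i, :len(w)] = [l2i[c] for c in w]

(input_train, input_test,
 labels_train, labels_test) = train_test_split(input_, labels, test_size=0.1)
```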
OK, so let's go ahead and create the model. Here's the basic idea: the model has three parts. The first is an embedding. The embedding is going to take every one of our phonemes (maxlen_p is the maximum number of phonemes we have in any pronunciation), and each one of those phonemes is going to go into an embedding. The lookup table for that embedding has the phoneme vocab size, which I think was 70, and the output is whatever dimensionality we decide we want; in experimentation I found 120 seems to work pretty well. I was surprised by how high that number is, but there you go. So: we started out with a list of phonemes, and after we go through this embedding, we now have a list of embeddings; the table is 70 by 120. So the basic idea is to take this big thing, which is all of our phonemes embedded, and turn it into a single distributed representation which contains all of the richness of what this pronunciation says. Later on, we're going to do the same thing with an English sentence. And we know that when you have a sequence and you want to turn it into a representation, one great way of doing that is with an RNN. Now, why an RNN? Because an RNN, we know, is good at dealing with things like state and memory. When we're looking at translation, we really want something that can remember where we are. Say we were doing this simple phonetic transcription: the question "have we just had a c?" matters, because if we've just had a c, then an h is going to make a totally different sound than if we haven't. So we think an RNN is a good way to do this kind of thing, and in general this whole class of models, remember, is called seq2seq, sequence-to-sequence models, where we start with some arbitrary-length sequence and produce some arbitrary-length sequence. And so the general idea here is that taking that arbitrary-length input and turning it into a fixed-size representation using an RNN is probably a good first step. Question: you're using dropout in your LSTM; is it best practice to do dropout across time? OK, so, looking ahead: I'm actually going to be using quite a few layers of RNN, so to make that easier we've created a get_rnn function, and you can put anything you like in there, GRU or LSTM or whatever. And yes, indeed, I am using dropout. The kind of dropout that you use in an RNN is slightly different from normal dropout: it turns out that it's best to drop out the same things at every time step. There's a really good paper that explains why this is the case and shows that it is the case. That's why there's a special dropout parameter inside the RNN in Keras: it does this proper RNN-style dropout. So I've put in a tiny bit of dropout here; if it turns out that we overfit, we can always increase it, and if we don't, we can always turn it to zero. So what we're going to do is... yes, Rachel? One more question about that: can you explain consume_less='gpu'? Yeah, always do that. I don't know if you remember, but when we looked at doing RNNs from scratch last year, we learned that you can actually combine the matrices together and do a single matrix computation. If you do that, it's going to use a bit more memory, but it allows the GPU to be more highly parallel. If you look at the Keras documentation, it'll tell you the different things you can use, but since we're using a GPU, you probably always want to say consume_less='gpu'. OK. The other thing we learned about last year is bidirectional RNNs, and maybe the best way to come at this is to go all the way back and remind you how RNNs work. We haven't done much revision, and it's been a while since we've looked at RNNs in much detail. So just to remind you, this is our drawing of a totally basic neural net: square is input, circle is intermediate activation (hidden), triangle is output, and arrows represent affine transformations followed by non-linearities. We can then have multiple copies of those to create deeper networks, for example. And the other thing we can do is have inputs coming in at different places. In this case, if we were trying to predict the third character from the first two characters, we could use a totally standard neural network and have input coming in at two different places. And then we realized we could make this arbitrarily large, but what we should properly do is make everything where an input goes to a hidden state use the same weight matrix (this color coding, remember, represents the same weight matrix), hidden-to-hidden a shared weight matrix, and hidden-to-output a separate weight matrix. So then, to remind you, we realized we could draw that more simply, like this. RNNs, when they're unrolled, just look like a normal neural network in which some of the weight matrices are tied together. And if this is not ringing a bell, go back to lesson five, I think it is, where we actually built these weight matrices from scratch and tied them together manually.
Hopefully that reminds you of what's going on. Now, importantly, we can take one of those RNNs and have its output go to the input of another RNN. These are stacked RNNs, and stacked RNNs basically give us richer computations in our recurrent neural nets. This is what it looks like when we unroll it: you can see here that we've got multiple inputs coming in, going through multiple layers, and creating multiple outputs. But of course we don't have to create multiple outputs. (Isn't that working?) You could also get rid of these two triangles here and have just one output. And remember, in Keras the difference is whether or not we say return_sequences=True or return_sequences=False: the one you're seeing here is return_sequences=True, and this one here is return_sequences=False. So, what we've got is: input_train has 97,000 words, each of length 16, that is, 16 phonemes long, possibly with padding if necessary; and labels is of length 15, because we chose earlier on that our max length would be a 15-letter spelling. Phonemes don't match up to letters exactly. So, after the embedding: if we take one of those tens of thousands of words, remember it was 16 phonemes long, we put it into an embedding matrix which is 70 by 120. The reason it's 70 is that each of these phonemes is a number between 0 and 69; we basically go through and take each one of these indexes and look it up, so if this one here is five, we find row number five, and we end up with 16 by 120. Then part two of the question asks: are we then taking a sequence of these phonemes, represented as 120-dimensional floating-point vectors, and using an RNN to create a sequence of word2vec embeddings, which we then reverse to actual words? So: we're not going to use word2vec here. word2vec is a particular set of pre-trained embeddings; we're not using pre-trained embeddings, we have to create our own. We're creating phoneme embeddings. If somebody else later on wanted to do something else with phonemes, and we saved the result of this, we could provide "phoneme2vec", and you could download and use the fast.ai pre-trained phoneme2vec embeddings. This is how embeddings basically get created: people build models starting with random embeddings, then save those embeddings and make them available for other people to use. "I may be misinterpreting it, but I thought the question was getting at the second set of embeddings, when you want to get back to your words." Right, so let's wait until we get there, because we're going to create letters, not words, and then we'll just join the letters together; there won't be any word2vec here. So, we've got as far as creating our embeddings, and then we've got an RNN which is going to take our embeddings and attempt to turn them into a single vector; that's kind of what an RNN does. return_sequences is True by default, so this first RNN returns something which is just as long as what we started with; and if you want to stack RNNs on top of each other, every one of them is return_sequences=True until the last one isn't. That's why we have False here: at the end of this one, it gives us a single vector, which is the final state. The other important piece is bidirectionality.
With a bidirectional RNN, you can totally do this manually yourself: you take your input and feed it into an RNN, then you reverse your input and feed it into a different RNN, and then just concatenate the two together. Keras has something which does that for you, called Bidirectional, and Bidirectional actually requires you to pass it an RNN: it takes an RNN and returns two copies of that RNN stacked on top of each other, one of which reverses its input. So why is that interesting? It's interesting because often in language, what happens later influences what comes before. For example, in French, the gender of your definite article depends on the noun it refers to, so you need to be able to look backwards and forwards to figure out how to match the two together. Or in any language with tense: which verb form you use depends on the tense, and often also on details of the subject and the object. So we want to be able to both look forwards and look backwards; that's why we want two copies of the RNN, one going from left to right and one from right to left. And indeed, when you spell things (I'm not exactly sure how this would work), the later details of the phonetics might change how you spell things earlier on. Question: does the bidirectional RNN concatenate the two RNNs, or does it stack them? You end up with the same number of time steps you had before, but it basically doubles the number of features: in this case, we get 240. It just doubles them. And I think we had one question here: is the recurrence length of the RNN or LSTM the input length, max_length, that is, the 16? Yes, that's the 16. And the number 70 here, is that all the possible characters? Yes, all the possible phonemes; we're going from phonemes to characters. OK. So let's simplify this down a little bit, and basically say: we started out with a set of embeddings, and we've gone through a bidirectional RNN, and then we feed that to a second RNN, to create a representation of this ordered list of phonemes. And specifically, this is a vector: x at this point is a vector, because return_sequences is False. Once we've trained this thing, the idea is that this vector represents everything important there is to know about this ordered list of phonemes, everything we could possibly need to know in order to spell it. So the idea is that we could now take that vector and feed it into a new RNN, or even a few layers of RNN, and that RNN, with return_sequences=True this time, could go through and spit out, at every time step, what it thinks the next letter in the spelling is. And this is how a sequence-to-sequence model works: one part, called the encoder, takes our initial sequence and turns it into a distributed representation, a vector, using, generally speaking, some stacked RNNs; then the second piece, called the decoder, takes the output of the encoder and passes it into a separate stack of RNNs with return_sequences=True, and those RNNs are taught to generate the labels, in this case the spellings, or in our later case the English sentences.
Now, in Keras it's not convenient to create an RNN by handing it some initial hidden state; that's not really how Keras likes to do things. Keras expects to be handed a list of inputs: problem number one. Problem number two: if you do hand state to an RNN just at the start, it's quite hard for the RNN to remember, the whole time, what word it's meant to be translating. It has to keep two things in its head: what's the word I'm meant to be spelling, and what's the letter I'm trying to spell right now. So what we do with Keras is take this whole state and copy it. In this case, we're trying to create a word that could be up to 15 letters long, in other words 15 time steps, so we take this state and make 15 copies of it, and those 15 copies of our final encoder state become the input to our decoder RNN. It seems kind of clunky, but it's actually not difficult to do in Keras; we just take the output from our encoder and repeat it 15 times, so we literally have 15 identical copies of the same vector. That's how Keras expects to see things, and it also turns out that you actually get better results when you pass the RNN the state it needs again and again, at every time step. We're basically saying, at each step: we're trying to spell this word, we're trying to spell this word, we're trying to spell this word; and as the RNN goes along, it generates its own internal state, figuring out what we've spelt so far and what we're going to have to spell next. Question: why can't we have return_sequences=True for the second LSTM (not the bidirectional one; we already have one bidirectional LSTM)? We don't want return_sequences=True here, because we're trying to create a representation of the whole word we're trying to spell. There's no point having something that represents the first phoneme, then the first two, the first three, the first four, the first five, because we don't really know exactly which letter of the output will correspond to which phoneme of the input. And particularly when we get to translation, it gets much harder: some languages totally reverse the subject and object order, or put the verb somewhere else. That's why we try to package up the whole thing into a single piece of state which has all of the information necessary to build our target sequence. Remember, these sequence-to-sequence models are also used for things like image captioning. With image captioning, you wouldn't want something that created a representation separately for every pixel; when you're trying to caption an image, you want a single representation which somehow contains all of the information about what it's a picture of. Or if you're doing neural language translation: here's my English sentence, I've turned it into a representation of everything that it means, so that I can generate my French sentence. We're going to see later how we can use return_sequences=True when we look at attention models, but for now we're keeping things simple. Question: I have an underlying question about why we don't just treat text problems the same way we do images. Images have relationships between pixels and shapes that are complex and rely on positional information;
We have a question: why can't we have return_sequences equals True for the second bidirectional LSTM? Not bidirectional; for the second LSTM, we only have one bidirectional LSTM. We don't want return_sequences equals True here because we're trying to create a representation of the whole word we're trying to spell. There's no point having something saying "here's a representation of the first phoneme, of the first two, of the first three, of the first four, of the first five", because we don't really know exactly which letter of the output is going to correspond to which phoneme of the input; and particularly when we get to translation it gets much harder, since some languages totally reverse the subject and object order, or put the verb somewhere else. So that's why we try to package up the whole thing into a single piece of state which has all of the information necessary to build our target sequence. Remember, these sequence-to-sequence models are also used for things like image captioning. With image captioning you wouldn't want something that created a representation separately for every pixel; when you're trying to caption an image, you want a single representation which somehow contains all of the information about what this is a picture of. Or if you're doing neural language translation: here's my English sentence, I've turned it into a representation of everything that it means, so that I can generate my French sentence. We're going to see later how we can use return_sequences equals True when we look at attention models, but for now we're keeping things simple. The question is: I have an underlying question regarding why we don't just treat text problems the same way we do images; images have relationships between pixels and shapes that are complex and rely on positional information, so why doesn't that work with word or phoneme embeddings? Well, it does, it absolutely does, and indeed we can use convolutional models. But if you remember back to lesson five, we talked about some of the challenges with that. If you're trying to create something which can parse some kind of markup block like this, it has to remember that you've just opened up a piece of markup and you're in the middle of it; then in here it has to remember that it's actually inside a comment block, so that at the end it remembers to close it. This kind of long-term dependency, memory, and stateful representation becomes increasingly difficult to do with CNNs as the sequences get longer. It's not impossible by any means, but RNNs are one good way of doing it. But it is critical that we start with an embedding, because whereas with an image we're already given float-valued numbers that really represent the image, that's not true with text; with text we have to use embeddings to turn it into these nice numeric representations. Question: RNN is kind of a generic term here, right? The specific network we use is an LSTM, but there are other types we could use, like GRU? Yeah, and SimpleRNN; Keras supports all the ones we used in the last part of the course: SimpleRNN, GRU and LSTM. Question: so would LSTM be the best for this task? No, not at all; GRUs and LSTMs are pretty similar, so it's not worth thinking about too much. All right, so at this point we now have 15 copies of x, and we now pass that into two more layers of RNN. So this here is our encoder, and this here is our decoder. Now, there's nothing we did in Keras particularly to say "this is an encoder, this is a decoder"; the important things are the return_sequences equals False here and the RepeatVector here. So what does it have to do? Well, somehow it has to take this single summary, go through some layers of RNNs, and then at the end we say: okay, here's a dense layer, and it's time-distributed, so remember that means we actually have 15 dense layers, and each of these dense layers now has a softmax activation, which means we can then do an argmax on that to create our final list of letters. So this is kind of our reverse embedding, if you like. So the model is very little code, and once we've built it (and again, if things like this are mysterious to you, go back and re-watch lessons four, five and six to remind yourself how these embeddings work and how this time-distributed dense gives us effectively a kind of reverse embedding), that's our model: it starts with our phoneme input and ends with our time-distributed dense output. We can then compile it. Our targets are just indexes, remember, we turned them into indexes, so we use this handy sparse categorical cross-entropy. It's just the same as our normal categorical cross-entropy, but rather than one-hot encoding, we skip the whole one-hot encoding and just leave the target as an index. And we can go ahead and fit, passing in our training data (that was our rectangular array of phoneme indexes), our labels, and the test-set data we set aside to use for validation as well. So we fit that for a while. I found that for the first three epochs the loss went down like this, and for the second three epochs it went down like this; it seemed to be flattening out, so that's where I stopped it.
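Pulling all of those pieces together, the whole model we've just walked through might look something like this. This is a sketch under assumed sizes and names (70 phonemes in, 75 letters out, 16 and 15 time steps, 240 units), not the exact notebook code:

```python
from keras.layers import (Input, Embedding, LSTM, Bidirectional,
                          RepeatVector, TimeDistributed, Dense)
from keras.models import Model

n_phonemes, n_letters = 70, 75        # assumed vocabulary sizes
maxlen_in, maxlen_out = 16, 15

inp = Input(shape=(maxlen_in,))
x = Embedding(n_phonemes, 120)(inp)

# Encoder: squash the whole phoneme sequence down to one summary vector.
x = Bidirectional(LSTM(120, return_sequences=True))(x)   # 240 features per step
x = LSTM(240, return_sequences=False)(x)                 # single summary vector

# Decoder: 15 copies of that summary, two more RNN layers, then a
# time-distributed dense layer: 15 softmaxes sharing one set of weights,
# acting as the "reverse embedding" back to letters.
x = RepeatVector(maxlen_out)(x)
x = LSTM(240, return_sequences=True)(x)
x = LSTM(240, return_sequences=True)(x)
out = TimeDistributed(Dense(n_letters, activation='softmax'))(x)

model = Model(inp, out)
# Targets stay as plain integer indexes; no one-hot encoding needed.
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# model.fit(train_phonemes, train_letters[..., None],
#           validation_data=(test_phonemes, test_letters[..., None]))
```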
So now we can see how well that worked. What I wanted was not just to say what percentage of letters are correct, because that doesn't really give you the right sense at all; what I really wanted to know is what percentage of words are correct. That's all this little eval_keras function does: it takes the thing I'm trying to evaluate, calls .predict on it, then does the argmax as per usual to take that softmax and turn it into a specific number (which character is this), and then I check whether, for all of the characters, the real character equals the predicted character. So it returns true only if every single letter in the word is correct, and taking the mean of that tells us what percentage of the words it got totally right. And unfortunately the answer is: not very many, 26 percent.
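The eval_keras function itself isn't reproduced here, but a minimal sketch of the same idea (the name and argument layout below are mine, not the notebook's) might look like:

```python
import numpy as np

def word_accuracy(model, X, y):
    """Fraction of words where every predicted letter matches the target."""
    preds = np.argmax(model.predict(X), axis=-1)   # softmax -> letter indexes
    # all() over the letters: a word only counts if the whole spelling is right.
    return np.mean([np.all(p == t) for p, t in zip(preds, y)])
```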
So let's look at some examples. We can go through 20 words at random and print out all of the phonemes with dashes between them (so here's an example of some phonemes), print out the actual word, and print out our prediction. Here is a whole bunch of words that I don't really recognize. "Perturbations" should be spelled like that; we spelled it like that, slightly wrong. So you can see that some of the time the mistakes it makes are pretty understandable: "laro" could be spelled like that, so this seems perfectly reasonable. Sometimes, on the other hand, it's way off. And interestingly, what I found is that when it's way off, it tends to be on the longer words. The reason is that the longer the word (like this one, where it's terrible, which has by far the most phonemes, eleven of them), the harder it is to somehow create a single representation that contains all of the information of all of those eleven phonemes in a single vector. And then that single vector gets copied and passed to the decoder, and that's everything the decoder has to work with to create this output. So that's the problem with this basic encoder-decoder method. And indeed, here is a graph from the paper which originally introduced attentional models; this is for neural translation, and what it showed is that as the sentence length got bigger, the accuracy of the standard approach we've been talking about, the standard encoder-decoder approach, absolutely died. So what these researchers did was build a new kind of RNN model called an attentional model, and with the attentional model the accuracy stayed pretty good. So, goal number one for the next couple of lessons is for me not to have a cold anymore. But basically, we're going to finish our deep dive into neural translation, and then we're going to look at time series; although we're not specifically looking at time series, it turns out that the best way I've found for time series is not specific to time series at all, but you'll see what I mean. Reinforcement learning was something I was planning to cover, but I just haven't found almost any good examples of it actually being used in practice to solve important real problems. And indeed, have you seen the paper in the last week or two about using evolutionary strategies for reinforcement learning? Basically, it turns out that basically random search is better than reinforcement learning. That paper, by the way, is ridiculously overhyped. These evolutionary strategies are something I was working on over 20 years ago, and in those days these genetic algorithms, as we called them, used much more sophisticated methods than DeepMind's brand-new evolutionary strategies. So people are rediscovering these randomized metaheuristics, which is great, but they're still far behind where they were 20 years ago, though far ahead of reinforcement learning approaches. So, given that I try to teach things which I think are actually going to stand the test of time, and I'm not at all convinced that any current technique for reinforcement learning will, I don't think we're going to touch that. Part three? I think before that we might have a part zero, where we do practical machine learning for coders: decision tree ensembles and training/test splits and stuff like that. And then we'll see where we are. But I'm sure Rachel and I are not going to stop doing this in a hurry; it's really fun and interesting, and we're really interested in your ideas about how to keep this going. By the end of part two, you guys have put in hundreds of hours, maybe 140 hours on average; you've put together your own boxes, written blog posts, done hackathons; you're seriously in this now. And in fact, I've got to say, this week has been kind of special for me. This week has been the week where, again and again, I've spoken to various folks among you and heard how many of you have implemented projects at your workplace that have worked and are now running and making your business money, or achieved the career thing you've been aiming for, or won yet another GPU at a hackathon, or of course the social impact things: all these transformative and inspirational things. When Rachel and I started this, we had no idea if it was possible to teach deep learning to people with no specific required math background other than high school math, to the point that they could use it to build cool things. We thought we probably could, because I don't have that background and I've been able to, though I have been playing around with similar things for a couple of decades; so it was a bit of an experiment, and this week has been the week where, for me, it's become clear that the experiment worked. I don't know what part three is going to look like. I think it'll be a bit different, because it'll be more of a meeting of minds amongst a group of people who are at the same level and thinking about the same kinds of things; so maybe it's more of an ongoing keep-our-knowledge-up-to-date kind of thing, and it might be more of us teaching each other. I'm not sure; I'd certainly be interested to hear ideas. Okay, we don't normally have two breaks, but I think I need one today, and we're covering a lot of territory. Hang on, it's 8:34; yes, let's have a short break and come back at 8:40 for the last 20 minutes. Thank you. So: attention models. I actually really like these, I think they're great, and really, the paper that introduced them was quite extraordinary.
It introduced both GRUs and attention models at the same time; I think it might even have been before the first author had his PhD, if I remember correctly. It was just a wonderful paper, and very successful. The basic idea of an attention model is actually pretty simple. You'll see here: here's our encoder, and here's our embedding. And notice, remember that for my get_rnn helper return_sequences equals True is the default, so the encoder now is actually spitting out a sequence of states, and the length of that sequence is equal to the number of phonemes. And we know there isn't a one-to-one mapping of phonemes to letters, so it's interesting to even think about how we're going to deal with this: how are we going to deal with 16 states? And because the states started out bidirectional, state one represents a combination of everything that comes before the first phoneme and everything that comes after; state two is everything that comes before the second phoneme and everything that comes after; and so forth. So the states are, in a sense, all representing something very similar, but each with a different focus: each one of these 16 states represents everything that comes before and everything that comes after that point, but with a focus on that phoneme. So what we want to do now is create an RNN where the number of time steps is 15, not 16, because remember the length of the word we're creating is 15. So we're going to have 15 output time steps, and at each one we want the opportunity to look at all 16 of the encoder states, going in with the assumption that only some of those 16 are going to be relevant, though we don't know which. So what we want to do is basically take each of those 16 states and do a weighted sum: a sum of weights times encoded states, where the weights somehow represent how important each one of those 16 inputs is for calculating this output, and how important each of them is for calculating that output, and so forth. If we could somehow come up with a set of weights for every single one of those time steps, then we could replace the length-16 thing with a single thing. If it turns out that output number one only really depends on input number one and nothing else, then basically those weights are going to be one, zero, zero, zero, and so on; it can learn to do that. But if it turns out that output number one actually depends on phonemes one and two equally, then it can learn the weights 0.5, 0.5, zero, zero, and so on. So in other words, we want some set of weights, w_i equals some function, that returns the right weights to tell us which bits of the encoded input to look at. And it so happens we actually know a good way of learning functions: what if we made the function a neural net, and what if we learned it using SGD? Why not?
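To make the weighted-sum idea concrete before we get into the paper, here's a tiny NumPy sketch for a single output time step; all the names and sizes here are made up for illustration:

```python
import numpy as np

def softmax(e):
    e = np.exp(e - e.max())
    return e / e.sum()

encoder_states = np.random.randn(16, 240)   # one state per phoneme

# Score each of the 16 states (in the real model, this scoring is itself
# a small neural net learned with SGD), softmax the scores into weights
# that sum to 1, then collapse the 16 states into one context vector.
scores = np.random.randn(16)
weights = softmax(scores)                   # e.g. could learn [1, 0, 0, ...]
context = weights @ encoder_states          # shape (240,): the "single thing"
```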
So here's the paper, "Neural Machine Translation by Jointly Learning to Align and Translate". It's a great paper, not the clearest in my opinion in terms of understandability, but let me describe some of the main pieces. Here's the starting point; let me describe how to read this equation. When you see a probability like this, you can very often think of it as a loss function. The idea of SGD, basically, most of the time when we're using it, is to come up with a model where the probabilities the model creates are as high as possible for the true data and as low as possible for the other data; that's just another way of talking about a loss function. So very often, where we would write a loss function, a paper will say a probability. Earlier on they say that y is basically our outputs; it's very common for y to be an output. And what this is saying is that the probability of the output at time step i, at some particular time step, depends on (that's what this bar means) all of the previous outputs. In other words, in our spelling problem, when we're looking at the fourth letter that we're spelling, it depends on the three letters we've spelt so far; you can't have it depend on the later letters, that's cheating. So this is basically a description of the problem: we're building something which is time-dependent, where the i-th thing we're creating is only allowed to depend on the previous i minus one things. The comma basically means "and": it's also allowed to depend on (anything in bold is a vector) a vector of inputs; this here is our list of phonemes, and this here is our list of all of the letters we've spelt so far. So that whole probability we're going to calculate using some function, and because this is a neural net paper, you can be pretty sure it's going to turn out to be a neural net. And what are the things we're allowed to calculate with? Well, we're allowed to calculate with the previous letter that we just translated. What's this? The RNN hidden state that we've built up so far. And what's this? A context vector. What is the context vector? The context vector is a weighted sum of annotations h (these are the hidden states that come out of the encoder) and some weights. I'm trying to give you enough information that you can try to parse this paper over the week. So that's everything I've described so far: the weights. And now, the nice thing is that hopefully you've read enough papers by now that you can look at something like this and go "oh, that's just a softmax". Over time your pattern recognition starts getting good: you see something like this and you go "oh, that's a weighted sum"; you see something like this and you go "oh, that's a softmax". People who read papers don't actually read every symbol; their eye looks at it and goes "softmax, weighted sum, logistic function, okay, got it", as if it was pieces of code. Only this is really annoying code that you can't look up in a dictionary, can't run, can't check, and can't debug; but apart from that, it's just like code. All right, so the alphas are things that came out of a softmax. What goes into the softmax? Something called e. The other annoying thing about math notation is that you often introduce something and define it later; so here we are, later, defining e. What's e equal to? e is equal to some function of the previous hidden state and the encoder state. And what's that function? That function is, again, a neural network.
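Putting the pieces of that walkthrough back together, the equations (reconstructed here in the paper's notation: x is the input sequence, s the decoder state, h the encoder annotations) are:

```latex
% each output depends on the previous outputs and the input sequence:
p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i)

% the context vector is a weighted sum of the encoder annotations h_j:
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j

% the weights come out of a softmax over scores e_{ij}:
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}

% and each score is a little neural net (the "alignment model"), a function
% of the previous decoder state and each encoder annotation, trained jointly:
e_{ij} = a(s_{i-1}, h_j)
```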
Now, the important piece here is "jointly trained". Jointly trained means it's not like a GAN, where we train a bit of the discriminator and then a bit of the generator; and it's not like one of those manual attentional models where we first figure out where the nodules are and then zoom into them. Jointly trained means we create a single model, a single computation graph if you like, where the gradients are going to flow through everything. So we have to come up with a way to build a standard, regular RNN, but where the RNN uses as its input at each time step this context vector. So we're going to have to come up with a way of actually making this mini neural net (it's just a standard single-hidden-layer neural net) that's going to sit inside every time step of our RNN. This whole thing is summarized in another paper, and this is actually a really cool paper: "Grammar as a Foreign Language". Lots of names you'll probably recognize here: Geoffrey Hinton, who's kind of the father of deep learning, and who's now, I think, something like a director of science at Google; and Oriol Vinyals, who's done lots of cool stuff. This paper is kind of neat and fun. It basically says: what if you didn't know anything about grammar, and you attempted to build a neural net which assigned grammar to sentences? It turns out you end up with something more accurate than any rule-based grammar system that's been built. One of the nice things they do is summarize all the bits. And again, this is where, if you were reading a paper for the first time and didn't know what an LSTM was, and went "oh, an LSTM is all these things", that's not going to mean anything to you. You have to recognize how people write in papers: there's no point writing the LSTM equations out in full, so you're going to have to go and find the LSTM paper, or find a tutorial and learn about LSTMs, and when you're finished, come back. In the same way, they summarize attention. They say they've used or adapted the attention model from [2], and if you go and have a look at [2], that's the paper we just looked at. But the nice thing is that because this came a little later, they've done a pretty good job of summarizing it on a single page. So during the week, if you want to get the hang of attention, you might find it good to have a look at this paper and their summary. You'll see that the basic idea is a standard sequence-to-sequence model (which means encoder, hidden states, final hidden state, decoder) plus attention. So we have two separate LSTMs, an encoder and a decoder; and now be careful of the notation: the encoder states are called h, h_1 through h_(T_A), and the decoder states are d, which we also write as h_(T_A+1) through h_(T_A+T_B); the inputs are 1 through T_A. And here you can see it defining a single-layer neural net: we take our decoder state and our current encoder state, put them through an affine transformation and a non-linearity, then another affine transformation, stick the result through a softmax, and use that to create a weighted sum. So there it all is, in one little snapshot.
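For reference, here's their one-page summary reconstructed in the paper's notation (h_i are the encoder states, d_t the decoder state at step t; v, W_1' and W_2' are the learned weights):

```latex
% score each encoder state against the current decoder state with a
% single-hidden-layer net:
u_i^t = v^{\top} \tanh(W_1' h_i + W_2' d_t)

% softmax the scores into attention weights:
a_i^t = \operatorname{softmax}(u_i^t)

% collapse the encoder states into one context vector for this time step:
d_t' = \sum_{i=1}^{T_A} a_i^t h_i
```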
Again, don't necessarily expect this to make perfect sense the first time you see it, but hopefully you can see that these bits are all stuff you've seen lots of times before. So next week we're going to come back and work through creating this code and see how it works. Did you have something, Rachel? We have two questions. One is: would the weightings be heavily impacted by the padding done to the input set? Sure, absolutely; specifically, those weights will say "oh, the padding is always weighted zero", and it's not going to take very long to learn that pattern. And: is a shared among all (i, j) pairs, or do we train a separate alignment for each pair? No, a is not trained; a is the output of a softmax. What's trained is W1 and W2 (note the capital letters: they're matrices). So we just have to learn a single W1 and a single W2, but note that they're being applied to all of the encoded states and to the current state of the decoder. In fact, it's easier to abstract this all the way back and say it's some function: some function of the current hidden state and all of the encoder states. That's the best way to think about it; those are the inputs to the function, and we just have to learn a set of weights that will spit out the inputs to our softmax. Did you say you had another question? Okay, great. So, I don't feel like I want to introduce something new, so let's take one final AMA question before we go home. Advice on imbalanced data sets? Okay, unbalanced data sets. There's not really that much clever you can do about it, basically. A great example would be one of the impact talks, which discussed breast cancer detection from mammography scans: this thing called the DREAM challenge, where less than one percent (0.3 percent) of the scans actually had cancer. So that's very unbalanced. I think the first thing to try with such an unbalanced data set is to ignore it: just try it and see how it goes. The reason it often doesn't go well is that the initial gradients will tend to point to saying "they never have cancer", because that's going to give you a very accurate model. So one thing you can try is to come up with some kind of initial model, maybe some kind of heuristic, which is not terrible and gets you to the point where the gradients don't always point to saying they never have cancer. But the really obvious thing to do is to adjust the thing which is creating your mini-batches, so that in every mini-batch you grab, say, half of it as people with cancer and half as people without. That way you can still go through lots and lots of epochs. The challenge is that you're going to see the people who do have cancer lots and lots and lots of times, so you have to be very careful about overfitting. And then there are approaches in between those two extremes. So I think what you really need to do is figure out the smallest proportion of people with cancer that you can get away with: the smallest number where the gradients don't point to zero. Let's say it's 10 percent: create a model where in every mini-batch 10 percent of it is people with cancer and 90 percent people without, and train that for a while. The good news is that once it's working pretty well, you can then decrease the with-cancer proportion, because you're already at a point where your model isn't pointing off to zero; so you can gradually start to change the sample to have less and less. I think that's the basic technique.
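A minimal sketch of that mini-batch rebalancing trick; the function and the 10 percent figure are just illustrative, not from any particular library:

```python
import numpy as np

def balanced_batches(X, y, batch_size=64, pos_frac=0.1):
    """Yield mini-batches with a fixed fraction of the rare positive class,
    oversampling positives rather than throwing negatives away."""
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    n_pos = int(batch_size * pos_frac)
    while True:
        idx = np.concatenate([
            np.random.choice(pos_idx, n_pos),   # sampled with replacement
            np.random.choice(neg_idx, batch_size - n_pos, replace=False),
        ])
        np.random.shuffle(idx)
        yield X[idx], y[idx]
```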
Question: so in this example, where you're repeating the positive results over and over, you're essentially just weighting them more heavily? Yeah. Could you get the same results by just throwing away a bunch of the negatives? You could do that, and that's the really quick way to do it, but that way you're not using all of the information; the negative examples still carry information. Okay, thanks everybody, have a good week.