So this week there's obviously quite a bit of setup needed before you can get results — needing all of ImageNet and that kind of thing — and I know a lot of you are still working through that. I did want to mention a couple of reminders. One is that we have a section on the wiki about how to use the notebooks, and we really strongly advise that you don't just open up the notebook we give you and shift-enter through it again and again. You're not really going to learn much from that. Go back to that wiki page — it's the first thing mentioned in the first paragraph of the home page of the wiki — and the basic idea is: start with a fresh notebook, think about what you need to do first, and try to do that thing. If you have no idea, go to the existing notebook, take a peek, close it again, and try to re-implement what you just saw. As much as possible, don't just shift-enter through the notebooks. I know some of you are doing it because there are threads on the forum saying "I was shift-entering through the notebook and this thing didn't work", and somebody replies "that's because that thing isn't defined yet". So consider yourself busted.

The other thing to remind you about is that the goal of part two is to get you to a point where you can read papers, because you now know the best practices; any time you want to do something beyond what we've learned, you'll be implementing things from papers, or probably going beyond that and implementing new things. Reading a new paper in an area you haven't looked at before is, at least to me, somewhat terrifying. On the other hand, reading the paper for the thing we already studied last week hopefully isn't terrifying at all, because you already know what the paper says. So the assignment each week includes: read the paper for the thing you just learned about, go back over it, and please ask on the forums if there's a bit of notation or anything you don't understand, or if there's something you heard in class that you can't see in the paper, or something particularly interesting in the paper that you don't think we mentioned in class. That's why I really encourage you to read the papers for the topics we studied in class. For those of you who, like me, don't have a technical academic background, it's a really great way to familiarise yourself with the notation, and I'm actually looking forward to some of you asking about notation on the forums so I can explain some of it. There are a few key things that keep coming up in notation, like probability distributions. And if you're watching this later in the MOOC, again, feel free to ask on the forum about anything that's not clear.

I was also interested in following up on some of last week's experiments myself. The thing I think we were all a bit shocked about was putting this picture into the DeViSE model and getting out more pictures of similar-looking fish in nets. I was curious about how that was working and how well it was working — and I then completely broke things by training it for a few more epochs.
After doing that, I did an image similarity search again and got these three guys, which are no longer in nets. So I'm not quite sure what's going on here. The other thing I'll mention is that when I trained it with the starting point we looked at in class, which was just before the final bottleneck layer, I didn't get very good results; but when I trained it starting from just after the bottleneck layer, I got the good results you saw. Again, I don't know why that is, and I don't think this has been studied as far as I'm aware. So there are lots of open questions here.

But I'll show you something I did next. I think what has happened here is that when you train it for longer, it decides the important thing is the fish and not the net, and it now focuses on giving us the same kind of fish — these are clearly the exact same type of fish. So I started wondering: how could we force it to combine concepts? I tried the most obvious possible thing. I wanted more fish in nets, so I took the word vector for tench (that's a kind of fish) plus the word vector for net, divided by two to get the average of the two word vectors, asked for the nearest neighbour, and that's what I got. And then, just to prove it wasn't a fluke, I tried the same on tench plus rod, and there's my nearest neighbour.

Now, do you know what's really freaky about this? If you Google for the ImageNet categories you'll get a list of the thousand ImageNet categories, and if you search through them, neither net nor rod appears at all. I can't begin to imagine why this works, but it does. So this DeViSE model is clearly doing some pretty deep magic in terms of its understanding of these objects and their relationships. Not only are we able to combine things like this, but we're able to combine with categories it has literally never seen before — it's never seen a rod, we've never told it what a rod looks like, and ditto for a net. I tried quite a few of these combinations and they just kept working. Another one I tried — and this one I do understand — was searching for boat. Boat doesn't appear in ImageNet, but lots of kinds of boats do, so not surprisingly it figures out, generally speaking, how to find boats; I expected that. Then I tried boat plus engine and got pictures of power boats, and boat plus paddle and got pictures of rowing boats.

So there's a whole lot going on here, and I think there are lots of opportunities for you to explore and experiment based on the explorations and experiments I've done — and, more to the point, perhaps to create some interesting and valuable tools. I would have thought a tool to do a kind of image search — show me all the images that contain these kinds of objects — would be useful. Or better still, maybe you could train with things that are not just nouns but also adjectives, so you could search for pictures of crying babies or burning houses or whatever. I think there's all kinds of stuff you could do with this which would be really interesting, whether in a narrow organisational setting, or to create a new startup or a new open source project or whatever. So anyway, lots of things to try. More stuff this week.
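If you want to play with this yourself, here's a minimal sketch of that combined-vector lookup. I'm assuming you already have a dictionary of word vectors and a matrix of DeViSE-style image embeddings; the names here (word2vec_dict, img_vecs, img_paths) are just illustrative, not from the lesson notebook.

```python
import numpy as np

# Assumed to exist already:
#   word2vec_dict : maps words to vectors
#   img_vecs      : (n_images, dim) array of DeViSE image embeddings
#   img_paths     : list of the corresponding image filenames
def nearest_images(query_vec, img_vecs, img_paths, k=3):
    # cosine similarity of the query against every image embedding
    q = query_vec / np.linalg.norm(query_vec)
    sims = img_vecs @ q / np.linalg.norm(img_vecs, axis=1)
    best = np.argsort(-sims)[:k]
    return [img_paths[i] for i in best]

# "tench in a net": average the two word vectors, then search
query = (word2vec_dict['tench'] + word2vec_dict['net']) / 2
print(nearest_images(query, img_vecs, img_paths))
```

The same call with `word2vec_dict['tench'] + word2vec_dict['rod']` gives the tench-plus-rod search described above.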
I actually missed this — it wasn't from this week — but I was thrilled to see that one of our students has written this fantastic Medium post, a linear algebra cheat sheet. I think I missed it because it was posted not to the part two forum but to the main forum. This is really cool: Brendan has gone through and explained, I think, all the stuff I would have wanted to know about linear algebra before I got started. And I particularly appreciate that he's taken a code-first approach — how do you actually do this in NumPy, talking about broadcasting, and so on. You'll all be very familiar with this already, but for your friends who are wondering how to get started in deep learning, the minimal things you need to know are probably the chain rule and some linear algebra, and I think this covers the necessary linear algebra pretty effectively. So thank you, Brendan.

Other things from last week: Andrea Frome, who wrote that DeViSE paper — I emailed her and asked what else I should look at, and she suggested this paper, "Zero-Shot Learning by Convex Combination of Semantic Embeddings", which she's only a later author on, but which she says is in some ways a more powerful version of DeViSE. It's actually quite different and I haven't implemented it myself, but it solves some similar problems, and anybody who's interested in exploring this multi-modal images-and-text space might be interested in it. We'll put it on the lesson wiki, of course. And then one more paper involving the same author, in a similar area a little later, was looking at attention for fine-grained categorisation. A lot of these things, at least the way I think Andrea Frome was casting it, were about fine-grained categorisation — how do we build something that can find very specific kinds of birds or very specific kinds of dogs — but I think these kinds of models have very, very wide applicability.

Okay, so I mentioned we'd wrap up some final topics around computer-vision-ish stuff this week before we start looking at some more NLP-related stuff. One of the things I wanted to zip through is a paper which I think some of you might enjoy: "Systematic Evaluation of CNN Advances on the ImageNet". I've pulled out what I thought were some of the key insights, and some of these are things we haven't really looked at before. One key insight, which is very much the kind of thing I appreciate, is that they compared the original CaffeNet/AlexNet versus GoogLeNet versus VGGNet on two different image sizes — training on the original 227 by 227, or on 128 by 128. What this chart shows is that the relative difference between these architectures is almost exactly the same regardless of what size image you're looking at. This really reminds me of part one, when we looked at data augmentation and said, hey, you can work out which types of data augmentation to use, and how much, on a small sample of the data rather than on the whole dataset. What this paper is saying is something similar: you can compare different architectures on small images rather than on full-sized images. They then use this insight to do all of their subsequent experiments on a smaller 128 by 128 ImageNet model, which they say is ten times faster.
So I thought that was the kind of thing that not enough academic papers do — what are the hacky shortcuts we can get away with? They tried lots of different activation functions. It does look like maxout is way better — this is the gain compared to ReLU — but maxout has twice the complexity, so it doesn't quite count. What it really says is that something we haven't looked at before, ELU, works very well. ELU is very simple: if x is greater than or equal to zero, y equals x; otherwise it's e to the x, minus one. So ELU is basically just like ReLU except that it's smooth: where ReLU has a hard corner at zero, ELU looks exactly the same on the positive side but curves smoothly below zero on the negative side. So that's one thing you might want to try using. Another thing they tried, which is interesting, was using ELU for the convolutional layers and maxout for the fully connected layers. I guess nowadays we don't use fully connected layers very much, so maybe that's not as interesting; the main interesting thing here, I think, is the ELU activation function. Two percentage points is quite a big difference.

They looked at different learning rate annealing approaches — and you can use Keras to automatically do learning rate annealing — and what they showed is that linear annealing seems to work best. They also tried different colour transformations, and found that amongst the normal approaches to colour, plain RGB actually works best. But then they tried something I haven't seen before: they added two one-by-one convolutions at the very start of the network. Each of those one-by-one convolutions is doing a learned linear combination of the channels, with a nonlinearity in between, and they found that gave quite a big improvement at pretty much zero cost. So that's another thing I haven't really seen written about elsewhere, but it's a good trick.

They looked at the impact of batch norm — here it is, positive or negative. Adding batch norm to GoogLeNet didn't help; it actually made it worse. It seems that with really complex, carefully tuned architectures you've got to be pretty careful, whereas on a simpler network it helps a lot, and the amount it helps also depends somewhat on which activation function you use. So with batch norm, I think we kind of know that now: be careful when you use it — sometimes it's fantastically helpful, sometimes it's slightly unhelpful.

Question: is there any advantage in using fully connected layers for classification? Yeah, I think there is. They're terribly out of fashion, but for transfer learning they still seem to be the best: fully connected layers are super fast to train, and you seem to get a lot of flexibility there. I don't think we know one way or the other yet, but I do think VGG — the last carefully tuned architecture with fully connected layers — still has a lot to give us, and it really does seem to be great for transfer learning. And then there was a comment saying that ELU's advantage is not just that it's smooth, but that it goes a little below zero, which pushes the mean activation closer to zero. Yeah, that's a great point. Thank you for adding that.
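Just to make the ELU definition concrete, here's the function written out in NumPy (alpha is the scale of the negative part; the curve described above corresponds to alpha = 1):

```python
import numpy as np

def elu(x, alpha=1.0):
    # identical to ReLU for x >= 0, but smoothly saturates towards -alpha below zero
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1))

def relu(x):
    return np.maximum(x, 0)

x = np.linspace(-3, 3, 7)
print(relu(x))  # hard corner at zero
print(elu(x))   # smooth, slightly negative for x < 0
```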
Yeah — and any time you hear me say something slightly stupid, please feel free to jump in, because otherwise it's on the video forever. On the other hand, it does give you an improvement in accuracy if you remove the final max pooling layer, replace all the fully connected layers with convolutional layers, and stick average pooling at the end, which is basically what this is doing. So there's definitely an upside to fully convolutional networks in terms of accuracy, but there may be a downside in terms of flexibility around transfer learning; that's a little unclear still.

I thought this was an interesting picture that I hadn't quite seen before, so let me explain it. These are different batch sizes along the bottom, and we've got accuracy on the other axis. With a learning rate of 0.01, as you go above a batch size of 256, accuracy plummets. On the other hand, if you use a learning rate of 0.01 times batch size over 256, it's pretty flat. What this suggests to me is that any time you change the batch size, you should change the learning rate by a proportional amount — which I think a lot of us have worked out through experiment, but I don't think I've seen it explicitly mentioned before.

I think this is very helpful to understand as well: removing data has a nonlinear effect on accuracy. This green line is what happens when you remove images. With ImageNet, down to about half the size of ImageNet there isn't a huge impact on accuracy — so maybe if you want to really speed things up, you can use 128 by 128 images and just 600,000 of them, or even 400,000. But beneath that, accuracy starts to plummet. So I think that's an interesting insight. Another interesting insight — although I'm going to add something to this in a moment — is that rather than removing images, if you instead flip the labels to make them incorrect, that has a worse effect than not having the data at all.

But there are things we can do to improve matters there, and specifically I want to bring your attention to this paper, "Training Deep Neural Networks on Noisy Labels with Bootstrapping". What they show is a very simple tweak you can add to any training method which dramatically improves its ability to handle noisy labels. This is showing what happens if you add noise to MNIST, varying from 0.3 up to 0.5 — so up to half the labels. The baseline, doing nothing at all, really collapses in accuracy; but if you use their bootstrapping approach, you can go up to nearly half the images having their labels intentionally changed and it still works nearly as well. I think this is a really important paper to mention in this "stuff most of you will find useful" area, because most real-world datasets have noise in them. So maybe this is something you should consider adding to everything you train, whether it's Kaggle datasets or your own datasets, particularly because you don't necessarily know how noisy the labels are. Question: so noisy labels means incorrect? Yeah, noisy just means incorrect. And bootstrapping is some sort of technique? Yeah — bootstrapping here is the particular technique this paper describes, which you can read about during the week if you're interested.
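Since the bootstrapping idea is easy to express as a custom loss, here's a minimal Keras sketch of the "soft" bootstrapped cross-entropy from that paper — the target becomes a blend of the given (possibly noisy) labels and the model's own current predictions. The function name and beta value are mine; this is a sketch of the idea, not the authors' code.

```python
from keras import backend as K

def soft_bootstrap_loss(beta=0.95):
    # Cross-entropy against beta * given_label + (1 - beta) * model_prediction,
    # so the model can partially "override" labels it strongly disagrees with.
    def loss(y_true, y_pred):
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        target = beta * y_true + (1.0 - beta) * y_pred
        return -K.sum(target * K.log(y_pred), axis=-1)
    return loss

# model.compile(optimizer='adam', loss=soft_bootstrap_loss(0.95), metrics=['accuracy'])
```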
So, interestingly, they find that if you take VGG and add all of these things together, doing them all at once, you can get a pretty big performance hike — it looks in fact like VGG becomes more accurate than GoogLeNet if you make all these changes. So that's an interesting point, although VGG is very, very slow, and big. There's lots of stuff I noticed they didn't look at: data augmentation, different approaches to zooming and cropping, adding skip connections like in ResNet or DenseNet, different initialisation methods, different amounts of depth. And to me the most important omission is the impact on transfer learning. So these are all open questions as far as I know, and maybe one of you would like to create the successor to this — "more observations on training CNNs".

There's another interesting paper, although the main interesting thing about it is this particular picture, so feel free to check it out; it's pretty short and simple. This paper looks at accuracy versus the size and the speed of different networks. The size of a bubble is how big the network is — how many parameters it has. You can see VGG16 and VGG19 are by far the biggest of any of these networks; interestingly, the second biggest is the very old, basic AlexNet. So newer networks tend to have a lot fewer parameters, which is a good sign. Then on this axis we have basically how long it takes to run. So again, VGG is big and slow, and without at least some tweaks, not terribly accurate. So there are definitely reasons not to use VGG, even if it seems easier for transfer learning, or even if we don't yet know how to do a great job of transfer learning on ResNet or Inception. As you can see, the more recent ResNet and Inception based approaches are significantly more accurate, faster and smaller. This is why I was looking last week at trying to do transfer learning on top of ResNet — there are really good reasons to want to do that. So I think this is a great picture. These two papers really show that academic papers are not always just "here's some highly theoretical, wacky result"; from time to time people write these great, thorough analyses of best practices and everything that's going on. So there's some really great stuff out there.

One other paper to mention in this broad area of things you might find helpful is by somebody named Leslie Smith, who I think has got to be just about the most overlooked researcher around — Leslie Smith writes a lot of really great papers which I really like. This particular paper came up with a list of 14 design patterns which seem to be generally associated with better CNNs. It's a great paper to read — a really easy read, you wouldn't have any trouble with it at all, and it's not very long. I looked through all of these and just thought, yeah, they all make a lot of sense. So if you're doing something a bit different and a bit new and you have to design a new architecture, this would be a great list of patterns to look through. One more Leslie Smith paper to mention — and it's crazy that this isn't more well known — is something incredibly simple: a different approach to learning rates.
Rather than just having your learning rate gradually decrease — I'm sure a lot of you have noticed that sometimes if you suddenly increase the learning rate for a bit and then suddenly decrease it again, it settles into a better spot. What this paper suggests is actually continually increasing your learning rate and then decreasing it, increasing it, decreasing it, again and again — something they call cyclical learning rates. And check out the impact: compared to non-cyclical approaches it is way, way faster, and at every point it's better. This is something you could easily add, and I haven't seen it added to any library. So if you created a cyclical learning rate annealing class for Keras, many people would thank you. Well, actually, many people would have no idea what you were talking about — so you'd want to write the blog post explaining why it's good, show them this picture, and then they would thank you. Someone's commenting that it should be pretty easy to add to Keras because there are already lots of callbacks — yeah, exactly, the fit loop calls a bunch of callbacks at each step. And if I were doing this in Keras, I would start with the existing learning rate annealing callback code that's already there and make small changes until it starts working, because that's already code that does pretty much everything you want; there's a rough sketch of one possible version a little further down. The other cool thing about this paper is that they suggest a fairly automated approach to picking what the minimum and maximum learning rate bounds should be. This question of roughly what the learning rate should be is something we tend to use a lot of trial and error for, so check out this paper for a suggestion about how to do it somewhat automatically.

Okay, so there's a whole bunch of things I've zipped over. Normally I would have dug into each of these, explained it and shown examples in notebooks, but you hopefully now have enough knowledge to take this information and play with it. What I'm hoping is that different people will play with different parts and come back and tell us what you find, and hopefully we'll get some good new contributions to Keras or PyTorch, or some blog posts or papers — or, maybe with that DeViSE stuff, even some new applications.

So the next thing I wanted to look at, again somewhat briefly, is the Data Science Bowl. There are a couple of reasons I particularly wanted to dig into it. One of them — well, there's a million reasons: it's a million dollar prize, and there are 23 days to go. The second is that it's an extension of everything you've learned so far about computer vision. It uses all the techniques you've learned, but then some: rather than 2D images, these are 3D volumes; rather than 300 by 300 or 500 by 500, they're 512 by 512 by 200 — a couple of hundred times bigger than the stuff you've dealt with before. The stuff we learned in lesson seven about "where are the fish" — you're going to need a lot of that. So I think it's a really interesting problem to solve. And I personally care a lot about this, because my previous startup, Enlitic, was the first organisation to use deep learning to tackle this exact problem: trying to find lung cancer in CT scans.
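Going back to the cyclical learning rate idea above: here's a rough sketch of what a triangular-schedule Keras callback could look like. The class name, defaults and step_size are all mine — this isn't an existing Keras class, just one way you might write it.

```python
import math
from keras.callbacks import Callback
from keras import backend as K

class CyclicalLR(Callback):
    """Triangular cyclical learning rate schedule, in the spirit of Leslie Smith's paper."""
    def __init__(self, base_lr=1e-4, max_lr=1e-2, step_size=2000):
        super(CyclicalLR, self).__init__()
        self.base_lr, self.max_lr, self.step_size = base_lr, max_lr, step_size
        self.iteration = 0

    def on_batch_begin(self, batch, logs=None):
        # position within the current cycle, 0 -> 1 -> 0 every 2 * step_size batches
        cycle = math.floor(1 + self.iteration / (2 * self.step_size))
        x = abs(self.iteration / self.step_size - 2 * cycle + 1)
        lr = self.base_lr + (self.max_lr - self.base_lr) * max(0.0, 1 - x)
        K.set_value(self.model.optimizer.lr, lr)
        self.iteration += 1

# model.fit(X, y, callbacks=[CyclicalLR(base_lr=1e-4, max_lr=1e-2, step_size=2000)])
```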
The reason I made that Enlitic's first problem was mainly that I'd learned that if you can find lung cancer earlier, the probability of survival is ten times higher. So here is something where you can have a real impact by doing this well — which is not to say that a million dollars isn't a big impact as well.

So let me tell you a little bit about this problem. Here is a lung, shown in a DICOM viewer. DICOM, D-I-C-O-M, is the standard used for sharing most kinds of medical imaging, certainly CT scans. It's a format which contains two main things: a stack of images, and some metadata. The metadata is things like how much radiation the patient was zapped with, how the machine was positioned relative to the chest, what brand of machine it was, and so on. In most DICOM viewers you can just use your scroll wheel to zip through the slices, and all that's doing is going from top to bottom or bottom to top, so you can see what's going on. You were saying, John? I was just saying, do you want to orient us? Yeah — actually, I think what's more interesting is to focus on the bit that's going to matter to you. The inside of the lung is this dark area here, and these little white dots are what's called the vasculature — all the little vessels going through the lungs. As I scroll through, have a look at this little dot: you'll see it seems to move. The reason it's moving is that it's not a dot — it's a vessel going through space. So it actually looks like this, and if you take a slice through it, it looks like lots of dots; as you go through the slices, that's what it looks like. Eventually we get to the top of the lung, and that's why everything eventually goes white — that's the edge of the organ. You can see there are edges on each side, and then there's also bone.

Some of you have been looking at this already over the last few weeks and have often asked me how to deal with multiple images. What I've said each time is: don't think of it as multiple images. Think of it the way your DICOM viewer can, if it has a 3D button like this one does — that's actually what we're looking at. It's not a bunch of flat images, it's a 3D volume; it just so happens that the default way most DICOM viewers show things is as a bunch of flat slices. It's really important that you think of it as a 3D volume, because you're looking in this space.

Now, what are you looking for in this space? You're looking for somebody who has lung cancer, and what somebody with lung cancer looks like is that somewhere in this space there is a blob — a roughly spherical blob. It could be pretty small: around five millimetres is where people start to get particularly concerned. What that means for a radiologist flicking through a scan like this is that they're looking for a dot which doesn't move, but which appears, gets bigger, and then disappears — that's what a blob looks like as you scroll through slices. So you can see why radiologists very, very often miss nodules — blobs — in lungs: across all of this area, you've got to have extraordinary vision to see every little blob appear and then disappear again.
And remember, the sooner you catch it, the better — a 10x improved chance of survival. Generally speaking, when a radiologist looks at one of these scans, they're not even looking for nodules; they're looking for something else, because lung cancer, at least in the earlier stages, is asymptomatic — it doesn't cause you to feel different. So it's something that every radiologist has to be thinking about while they're looking for pneumonia or whatever else. So the basic idea is that we're going to try to come up with, in the next hour or so, some idea of how you would find these blobs, these nodules.

Each of these scans is generally about 512 by 512 by a couple of hundred. The equivalent of a pixel in 3D space is called a voxel — a voxel simply means a pixel in 3D space — so this here is rendering a bunch of voxels. Each voxel in a CT scan is a 12-bit integer, if memory serves me correctly. A computer screen can only show eight bits of greyscale, and furthermore your eyes can't necessarily distinguish all those greys perfectly anyway. So every DICOM viewer provides something called a windowing adjustment. The windowing adjustment here is the default window, which is designed to map some subset of that 12-bit space to the screen so that it highlights certain things. The units CT scans use are called Hounsfield units, and certain ranges of Hounsfield units tell you that something is some particular part of the body. You can see here that the bone is being lit up: we've selected an image window which is designed to let us see the bone clearly. So what I did when I opened this was switch it to "CT Chest" — sorry, "CT Lungs" — where somebody has already figured out the best window to see the nodules and vasculature in lungs. Now, for you working with deep learning, you don't have to care about that, because of course the deep learning algorithm can see all 12 bits perfectly well, so there's nothing to worry about there.

One of the challenges with the Data Science Bowl data is that there's a lot of pre-processing to do, but the good news is that there are a couple of fantastic tutorials. Hopefully you've found out by now that on Kaggle, if you click on the Kernels button, you get to see people's IPython notebooks where they show you how to do certain things. In this case, this guy has put together a full pre-processing tutorial showing how to load DICOM, convert the values to Hounsfield units, and so forth. I'll show you some of the pieces. DICOM you'll load with some library, probably PyDicom. PyDicom is a bit like PIL — instead of Image.open it's more like a DICOM read — and you end up with the 3D data and, of course, the metadata. You can see here, using the metadata: image position, slice location — the metadata comes through as attributes of the Python object. This person has also very kindly provided a list of the Hounsfield units for each of the different substances, which I think was just copied from Wikipedia, and shows how to translate everything into that range. It's great to draw lots of pictures, so here is a histogram for this particular scan: you can see that most of it is air, and then you get some bone and some lung. And there's the actual slice.
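For reference, here's roughly what that loading and Hounsfield conversion looks like with pydicom. The folder path is a placeholder, and older versions of the library use `import dicom` and `dicom.read_file` instead of `pydicom.dcmread`.

```python
import glob
import numpy as np
import pydicom  # older versions: `import dicom` and dicom.read_file(...)

def load_scan(folder):
    # read every slice in the folder and sort by position along the z axis
    slices = [pydicom.dcmread(f) for f in glob.glob(folder + '/*.dcm')]
    slices.sort(key=lambda s: float(s.ImagePositionPatient[2]))
    return slices

def to_hounsfield(slices):
    # raw pixel values -> Hounsfield units, using the rescale metadata
    image = np.stack([s.pixel_array for s in slices]).astype(np.int16)
    slope = float(slices[0].RescaleSlope)
    intercept = float(slices[0].RescaleIntercept)
    return (image * slope + intercept).astype(np.int16)

scan = load_scan('/path/to/one/patient')   # placeholder path
volume = to_hounsfield(scan)               # shape: (n_slices, 512, 512)
```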
The next thing to think about is the voxel spacing: as you move across one step of the x-axis or the y-axis, or from slice to slice, how far in the real world are you moving? One of the annoying things about medical imaging is that different scanners have different distances between slices — that's called the slice thickness — and different real-world meanings of the x and y axes. Luckily, that stuff is all in the DICOM metadata. So the resampling process means taking that list of slices and turning it into something where every step in the x, y or z direction equals one millimetre in the real world. It would be very annoying for your deep learning network if your different lung images were squished by different amounts, especially if you didn't give it the metadata about how much each was squished — so that's what resampling fixes. As you can see, it's using the slice thickness and the pixel spacing to make everything nice and even. There are various ways to do 3D plots, and it's always a good idea to do that.

Then something else that people tend to do is segmentation. Depending on time, we may or may not get around to looking more at segmentation in this part of the course, but effectively segmentation is just another generative model: it's a model where, hopefully, somebody has given you labels saying this is lung, this is air, and then you build a model that tries to predict, for something else, what's lung and what's air. Unfortunately for lung CT scans we don't generally have that kind of ground truth of which bit is lung and which bit is air, so generally speaking in medical imaging people use a whole lot of heuristic approaches — hacky, rule-based approaches — and in particular region-growing and morphological operations. I find this the boring part of medical imaging, because it's so clearly a dumb way to do things, but deep learning is far too new in this area to have developed the datasets we need to do it properly. The good news is that there's a button — which I don't think many people notice exists — called "Tutorial" on the main Data Science Bowl page, where these folks from Booz Allen Hamilton actually show you a complete segmentation approach.

Now, it's interesting that they pick U-Net segmentation. This is definitely the thing about segmentation I would be teaching you if we had time. U-Net is one of those things that outside of the Kaggle world I don't think many people are familiar with, but inside the Kaggle world we know that any time segmentation crops up, U-Net wins — it's the best. More recently there's actually been something called DenseNet for segmentation, which takes U-Net even a little bit further, and maybe that would be the new winner for newer Kaggle competitions when they happen. But the basic idea of things like U-Net and the segmentation DenseNet is this — maybe I can draw it. As with the generative models we did for style transfer, we generally start with a large image, do some downsampling operations to make it a smaller image, do some computation, and then make it bigger again with upsampling operations. What happens in U-Net is that there are additional neural network connections made directly from each downsampling stage to the corresponding upsampling stage — from here to here, from here to here, and so on.
And those connections basically allow it to do something like a residual learning approach: it can figure out the key semantic pieces at a really low resolution, but then as it upscales, it can learn what was special about the difference between the downsampled image and the original image at each stage, and so it learns to add that additional detail back in at each point. So: U-Net, and DenseNet for segmentation — really interesting, and I hope we find some time to get back to them in this part of the course. But if we don't, you can get started by looking at this tutorial, in which these folks show you the whole thing from scratch.

What they try to do in this tutorial is something very specific, which is the detection part. Think about the fisheries competition: we pretty much decided that if you wanted to do really well there, you would first find the fish, then zoom into the fish, and then figure out what kind of fish it is. Certainly in the earlier right whale competition, that was how it was done. In this competition, that's even more clearly going to be the approach, because these images are far too big to feed into a normal convolutional neural network. So we need one step that finds candidate nodules, and a second step that zooms into each possible nodule and figures out whether it's a malignant tumour or something else — a false positive.

The bad news is that the Data Science Bowl dataset does not give you any information at all, for the training set, about where the cancerous nodules are. I actually wrote a post on the Kaggle forums about this; I just think it's a terrible, terrible idea, because that information actually exists — the dataset this came from is called the National Lung Screening Trial, which has that information, or something pretty close to it. The fact they didn't provide it is, I think, horrible, for a competition which could save lives.

The good news, though, is that there is a dataset which does have this information. The original dataset is called LIDC-IDRI, and interestingly, it was recently used for another competition — a non-Kaggle competition called LUNA, L-U-N-A. That competition is now finished, and one of the tracks was specifically a false positive reduction track; the other track was basically a "find the nodule" track. So you can go back and look at the papers written by the winners. They're generally ridiculously short — many of them are a single sentence saying that for commercial confidentiality reasons they can't say anything — but some of them, including the winner of the false positive track, actually describe what they did. Not surprisingly, they all use deep learning. So what you could do — in fact, what I think you have to do to do well in this competition — is download the LUNA dataset and use it to build a nodule detection algorithm: the LUNA dataset includes files saying this lung has nodules here, here and here. Then run that nodule detection algorithm on the Kaggle dataset, find the nodules, and use those to do the classification. There are some tricky things with that. The biggest one is that most of the CT scans in the LUNA dataset are what are called contrast studies.
A contrast scan means the patient had a radioactive dye injected into them so that the things you're looking for are easier to see. For the National Lung Screening Trial, which is what the Kaggle dataset comes from, none of the scans use contrast. The reason why is that what we really want to be able to do is take anybody who's over 65 and has been smoking more than a pack a day for more than 20 years, give them all a CT scan, and find out which ones have cancer — and in the process we don't want to be shooting them up with radioactive dye and giving them cancer. So when we're doing these kinds of asymptomatic screening scans, we try to keep them as low-dose as possible. What that means is that you're going to have to think about transfer learning issues: the contrast in your images is going to be different between the LUNA dataset you build your nodule detection on and the Kaggle competition dataset. But when I looked at it, I didn't find that to be a terribly difficult problem — I'm sure you won't find it impossible by any means.

To finish this discussion, I wanted to refer to this paper, which I'm guessing not many people have read yet. It's a medical imaging paper, and it's a non-deep-learning approach to finding nodules — nodule segmentation. Yes, Rachel? You have a correction from our radiologist, who says that the dye is not radioactive, it's just dense — Isovue or similar. Okay, but there's a reason we don't inject people with contrast dye — the issues are contrast-induced nephropathy or allergy? Yeah, that's what I meant. I do know, though — again, radiologists correct me — that the NLST studies use a lower radiation dose than I think the LUNA ones do, so that's another difference.

Okay. So this is an interesting idea about how you can find nodules using more of a heuristic approach, and the heuristic approach they suggest here is clustering. We haven't done any clustering in class yet, so we're going to dig into this in some detail, because I think this is a great example of the kind of heuristic you can add on top of deep learning to make deep learning work in different areas. The basic idea, as you can see, is what they call a five-dimensional mean shift: they're going to try to find groups of voxels which are similar and cluster them together, and hopefully the things that look like nodules will cluster together in particular. The idea is that at the end of this segmentation, there will be one cluster for the whole lung boundary, one cluster for the vasculature, and then one cluster for each nodule. The five dimensions are x, y and z — that's straightforward — intensity, meaning the number of Hounsfield units, and then the fifth one, which is the tricky one: volumetric shape index. The basic idea there is that it's a combination of the different curvatures at a voxel, based on the Gaussian and mean curvatures. What the paper goes on to explain is that you can compute these from the first and second derivatives of the image — which basically means you subtract one voxel from its neighbour, and then take that whole thing and subtract one voxel's version of it from its neighbour's, giving you the first and second derivatives. That tells you the direction of the change of image intensity at that point.
So by getting these first and second derivatives of the image and putting them into this formula, you get something which basically tells you how sphere-like the structure that this voxel is part of appears to be. And that's great: if we can take all the voxels and combine the ones that are nearby, have similar Hounsfield units and seem to be part of similar kinds of shapes, we're going to get what we want. I'm not going to worry about this bit here because it's very specific to medical imaging — anybody who's interested, feel free to talk on the forum about what it looks like in Python. What I did want to talk about is the mean shift clustering, which is the particular approach to clustering they use. I've just received a comment that the voice from the other mic breaks up and is hard to understand. Okay — Rachel, can you say something again?

Clustering is something I've been an anti-fan of for a long time. It belongs to that group of unsupervised learning algorithms which always seem to be looking for a problem to solve. But I've realised recently that there are some specific problems that can be solved well with them, and I'm going to show you a couple — one today and one in lesson 14. Clustering algorithms are perhaps easiest to describe by generating some data to show them on. Here's some generated data: I'm going to create six clusters, and for each cluster I'll create 250 samples. So I basically say, okay, let's create a bunch of centroids by generating some random numbers — six pairs of random numbers for my centroids — then grab a bunch of random points around each of those centroids, combine them all together, and plot them. Each of these X's represents a centroid — a centroid is just the average point of a cluster of data — and each colour represents one cluster. So imagine this was showing you clusterings of different kinds of lung tissue: ideally you'd have some voxels coloured one colour for a nodule, and a bunch of other voxels coloured a different colour for vasculature, and so forth. We can only show this easily in two dimensions, but there's no reason you can't imagine doing it in five dimensions.

The goal of clustering is to undo this: given the data, but not the X's, how can you figure out where the X's were? Once you know where the X's are, it's pretty straightforward to find the closest points to each one and assign every data point to a cluster. The most popular approach to clustering is called k-means. K-means is an approach where you have to decide up front how many clusters there are, and then there are basically two steps. The first is to make a guess as to where those clusters might be, and a really simple way to do that — I wonder if I've got pictures of this — is to randomly pick a point, and then keep picking points which are as far away as possible from all the previous ones you've picked, then throw away the first one. So if I started here, the furthest away point would be down here — that would be the starting point for cluster one. What point is furthest away from that? Probably this one here. What's furthest away from both of these? Probably this one over here, and so forth.
So you keep doing that to get your initial points, and then you iterate: assuming these are the cluster centres, which cluster does every point belong to? Then you move the centres, reassign the points, and repeat a bunch of times.

Now, k-means — it's a shame it's so popular, because it kind of sucks. Sucky thing number one is that you have to decide how many clusters there are, and we don't know how many nodules there are. Sucky thing number two is that, without some changes to do something called kernel k-means, it only works if the clusters are all the same kind of shape — all nicely Gaussian-shaped. So we're going to talk about something way cooler, which I only came across somewhat recently and which is much less well known: mean shift clustering. Mean shift clustering is one of those things that seems to spend all of its time in serious-mathematician land — whenever I try to look up something about mean shift clustering I start seeing this kind of thing, and this is the first tutorial I could find that wasn't a PDF. So that's one way to think about mean shift clustering. Another way is a code-first approach, which is that this is the entire algorithm.

So let's talk about what's going on here. At a high level, we're going to do a bunch of loops — five steps. It would be better if I didn't do a fixed five steps but kept going until it was stable, but for now I'm just going to do five. In each step I'm going to enumerate through our data, which is X. Little x is the current data point I'm looking at. What I want to do first is find out how far away this data point is from every other data point, so I'm going to create a vector of distances, and I'm going to do that with the magic of broadcasting. Little x is a vector of size 2 — two coordinates — and big X is a matrix of size n by 2, where n is the number of points. Thanks to what we've now learned about broadcasting, we know that we can subtract a matrix from a vector, and the vector will be broadcast across the rows of the matrix, so this subtracts every row of big X from little x. If we then square that, sum it up along the coordinates, and take the square root, we get back a vector of the distances from little x to every element of big X — the sum here is just summing up the two coordinates. So that's step one: we now know, for this particular data point, how far away it is from all the other data points.

Now, the next thing we want to do is — well, let's look at the final step first. The final step is to take a weighted average. So let's draw this. We've got a whole bunch of data points, and we're currently looking at this one, and we now have a list of how far it is from all the other data points. The basic idea is that we now want to take a weighted average of all of those data points, weighted by the inverse of that distance — things that are a long way away get a very small weight, and things that are very close get a big weight. So I think this one is probably the closest.
And this is about the second closest, and this is about the third closest. So assuming those get most of the weight, the weighted average is going to be somewhere about here. By doing that at every point, we move every point closer to where its friends are — closer to the nearby points — and if we keep doing this again and again, everything moves until it's sitting right next to its friends.

So how do we take something which starts out as a distance and turn it into a weight, such that larger distances get smaller weights? The answer is we probably want a shape that looks something like that — in other words, a Gaussian. This is by no means the only shape you could choose; it would be equally valid to choose this shape, which is a triangle, or at least half of one. In general, though, note that if we're going to multiply every point by one of these weights and add them all together, it would be nice if all of our weights added to one, because then we end up with something on the same scale we started with. When you create one of these curves where it all adds up to one, generally speaking we call it a kernel. I mention this because you'll see kernels everywhere — if you haven't already, now that you've seen it you will. In fact, kernel methods are a whole area of machine learning that basically took over in the late 90s because they were so theoretically pure — and if you want to get published in conference proceedings, it's much more important to be theoretically pure than actually accurate. So for a long time kernel methods won out, and neural networks in particular disappeared. Eventually people realised that accuracy was important as well, and in more recent times kernel methods have largely disappeared again. But you still see the idea of a kernel coming up very often, because they're super useful tools to have: they're basically something that lets you take a number — in this case a distance — and turn it into another number you can use to weight everything and add it together to get a nice weighted average.

In our case we're going to use a Gaussian kernel. The particular formula for a Gaussian doesn't matter much — I remember learning this formula in grade 10 and it was by far the most terrifying mathematical formula I'd ever seen, but it doesn't really matter. Those of you who remember or have seen the Gaussian formula will recognise it; for those who haven't, it doesn't matter — it's just the function that draws that curve. So if we take every one of our distances and put it through the Gaussian, we get back a bunch of weights that add to one. Then, in the final step, we can multiply every one of our data points by its weight, add them up, and divide by the sum of the weights — in other words, a weighted average. You'll notice I had to be a bit careful about broadcasting here, because I needed to add a unit axis at the end of my dimensions, not at the start: by default, broadcasting adds unit axes to the beginning, which is why I had to do an expand_dims. If you're not clear on why, that's a sign you definitely need to do some more playing around with broadcasting — have a fiddle with it during the week, and feel free to ask if you're still not clear after you've experimented. But this is just a weighted sum: sum of weights times x, divided by sum of weights.
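Putting those pieces together, here's a minimal NumPy sketch of the loop just described — five fixed iterations, a Gaussian kernel, and a weighted average for each point. The function and variable names are mine and won't match the lesson notebook exactly.

```python
import numpy as np

def gaussian(d, bw):
    # Gaussian kernel: turns distances into weights (bw is the bandwidth)
    return np.exp(-0.5 * (d / bw) ** 2) / (bw * np.sqrt(2 * np.pi))

def meanshift(data, bw=2.5, n_iters=5):
    X = np.copy(data).astype(float)
    for _ in range(n_iters):
        for i, x in enumerate(X):
            # distance from this point to every point (broadcasting X - x)
            dist = np.sqrt(((X - x) ** 2).sum(axis=1))
            weight = gaussian(dist, bw)
            # move the point to the weighted average of all points
            X[i] = (np.expand_dims(weight, 1) * X).sum(axis=0) / weight.sum()
    return X
```

The `np.expand_dims(weight, 1)` is the unit-axis trick mentioned above: the weights need a trailing axis of size one so they broadcast across the coordinate axis of X.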
Importantly, there's a nice little parameter we can pass to the Gaussian which decides whether it looks like the curve I just drew, or like this, or like this. All of these add up to one — they all have the same area underneath — but they're very different shapes. If we make it look like this narrow one, we're going to get a lot more clusters, because only the things that are really close get a high weight, and everything else gets a tiny, meaningless weight. If we use something like this wide one, we get far fewer clusters, because even stuff that's further away gets a reasonable weight in the weighted sum. That parameter is called the kernel width — it has lots of different names; here we've called it bw, for bandwidth. There are actually some nice ways to choose it: a simple approach is to find the bandwidth which covers, say, one third of the data in your dataset — I think that's the approach scikit-learn uses. So there are ways to figure out the bandwidth automatically, which is just one of the very nice things about mean shift.

Okay, so we just loop a bunch of times — five times — and each time we replace every point with its weighted average, weighted by this Gaussian kernel. When we run this it takes about a second, and here are the results. I've offset everything by one just so we can see it; otherwise it would be right on top of the X's. You can see that for nearly all of them it's in exactly the right spot, whereas for this cluster — let's remind ourselves what that cluster looked like — for these two clusters, with this particular bandwidth, it decided to create one cluster rather than two. If we decreased our bandwidth it would create two clusters; there's no one right answer as to whether it should be one or two.

One challenge with this is that it's kind of slow, so I thought, let's try to accelerate it on the GPU. And because mean shift is not very cool, nobody seems to have implemented it for the GPU yet — or maybe it's just not a good idea. I decided to use PyTorch, because writing PyTorch really feels like writing NumPy: everything happens straight away. So I hoped I could take my original code and keep it almost the same, and indeed here is the entirety of mean shift in PyTorch. So that's pretty cool. You can see that everywhere it used to say np it now says torch: np.array is now torch.FloatTensor, np.sqrt is torch.sqrt, and everything else is almost the same. One issue is that Torch doesn't support broadcasting — we'll talk more about this in a couple of weeks — but basically I decided that wasn't okay, so I wrote my own broadcasting library for PyTorch. So rather than writing little x minus big X, I use sub for subtract — that's the subtract from my broadcasting library. If you're curious, check out the torch utils file and you can see my broadcasting operations there; if you use those (there's the same for multiplication), they'll do all the broadcasting for you. So as you can see, this looks basically identical to the previous code — but it takes longer. That's not ideal. One problem is that I'm not using CUDA. I could easily fix that by adding .cuda() to my x, but that made it slower still.
And the reason why is that all the work is being done in this for loop, and PyTorch doesn't accelerate for loops. Each pass through a for loop in PyTorch is basically launching a new CUDA kernel, and it takes a certain amount of time just to launch a CUDA kernel. (Annoyingly, when I say CUDA kernel, that's a different usage of the word kernel: in CUDA, a kernel is a little piece of code that runs on the GPU.) So it's launching a little GPU process every time through the for loop, which takes quite a bit of time, and it's also having to copy data all over the place. So although it gets the right result — that's good — I then tried to make it faster.

The trick is to do it by minibatch: each time through the loop, we don't process just one data point but a minibatch of data points. Here are the changes I made. The main one is that my for loop over i now jumps through one batch size at a time: instead of going 0, 1, 2, 3, it goes 0, 32, 64, and so on. So I create a slice which goes from i to i plus batch size — unless we've gone past the end of the data, in which case it just goes to the end. That slice refers to the chunk of data we're interested in, so we can say X indexed by that slice to grab all the data in this minibatch. Then I had to create a special version of the distance calculation — I can't just say subtract any more; I need to think carefully about the broadcasting operations here, because I'm going to return a matrix. Say the batch size is 32: I'm going to have 32 rows, and if n is a thousand, a thousand columns, and that matrix shows me how far away each thing in my batch is from every data point. When you do things a batch at a time, you're basically adding another axis to all of your tensors: suddenly you have a batch axis all the time. When we've been doing deep learning, that's something we've got pretty used to — the first axis in all of our tensors has always been a batch axis. But now we're writing our own GPU-accelerated algorithm. Can you believe how crazy this is? Two years ago, if you Googled for k-means CUDA or k-means GPU, you got back research studies where people wrote papers about how to put these algorithms on GPUs, because it was hard. And here's a page of code that does it. So this is possible, and here we are: we have built a batch-by-batch GPU-accelerated mean shift algorithm.

The basic distance formula is exactly the same; I just had to be careful about where I added unsqueeze — which is the same as expand_dims in NumPy — so I had to be careful about where I add my unit axes: one on the first axis of one piece and one on the second axis of the other piece. That's going to subtract every one of these from every one of these and return a matrix. Again, this is a really good time to stop and think about why this broadcasting works, because the broadcasting is getting more and more complex. And hopefully you can now see the value of broadcasting: not only did I avoid writing a pair of nested for loops here, but I also got to do it all on the GPU in a single operation, which makes this thousands of times faster. So here is a single operation which does that entire matrix subtraction. Yes, Rachel? I was just going to suggest that we take a break soon. Yeah. So that's our batch-wise distance function, and then we chuck that into a Gaussian.
And because this is just element-wise, the Gaussian function hasn't changed at all, which is nice. Then I've got my weighted sum, and I divide that by the sum of the weights, and that's basically the algorithm. Previously my NumPy version took a second; now it's 48 milliseconds, so we've sped that up by roughly 20 times. Yes, Rachel? "We have a question: I get how batching helps with locality and cache, but I don't quite follow how it helps otherwise, especially with respect to accelerating the for loop." Yeah. So in PyTorch the for loop is not run on the GPU. The for loop is run on your CPU, and your CPU goes through each step of the loop and calls the GPU to say: do this thing, do this thing, do this thing. This is not to say you couldn't accelerate this in TensorFlow in a similar way; in TensorFlow there's tf.while_loop and things like that where you can actually do GPU-based loops. Even so, if you do it entirely in a Python loop it's going to be pretty difficult to get this performance, and particularly in PyTorch it's important to remember that your loops are not optimized; it's what you do inside each loop iteration that's optimized. "And we have another question: some of the math functions are coming from torch and others from the math Python library. What is the difference? When you use the Python math library, does that mean the GPU is not being used?" Yes. You'll see that I use math.pi, which is a constant, and math.sqrt(2 * pi), which is also a constant; I don't need the GPU to calculate a constant, obviously. So we only use torch for things that operate on a vector or a matrix or a tensor of data. Okay, let's have a break. We'll come back in ten minutes, which would be two past eight, and we'll talk about some ideas I have for improving mean shift, which maybe you'll want to try during the week.

Okay. So basically the idea here is that there are two steps to figuring out where the nodules are, if any, in something like this. Step one is to find the things that might be kind of nodule-ish, zoom into them, and create a little cropped version. Step two, which is where your deep learning particularly comes in, is to figure out whether that is cancerous or not. Once you've found a nodule-ish thing, by far the biggest driver of whether or not it's a malignant cancer is how big it is, so that part is actually pretty straightforward. The other particularly important thing is how spidery it looks: if it looks like it's evenly reaching out to capture more territory, that's probably a bad sign as well. So size and shape are the two things you're going to want to find, and that's obviously a pretty good thing for a neural net to be able to do; you probably don't need that many examples of it. When you get to that point, there was obviously a question about how to deal with the 3D aspect. One answer is that you can just create a 3D convolutional neural net: a 10 by 10 by 10 volume is obviously not going to be too big, at 20 by 20 by 20 you might be okay, so think about how big a volume you can handle. There are plenty of papers around on 3D convolutions, although I'm not sure you even need one, because it's just a convolution in 3D.
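If you haven't seen a 3D convolution before, it really is the exact analogue of the 2D case. Here is a tiny hedged sketch in PyTorch (the tutorial mentioned later uses Keras, and the layer sizes here are made up for illustration) of a classifier over, say, 20 by 20 by 20 crops around a candidate nodule:

```python
import torch
import torch.nn as nn

# a toy 3D convnet over (batch, channels, depth, height, width) volumes;
# sizes are illustrative, not the course's actual architecture
model = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(2),                      # 20 -> 10 in each spatial dimension
    nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(2),                      # 10 -> 5
    nn.Flatten(),
    nn.Linear(32 * 5 * 5 * 5, 2),         # e.g. malignant vs not
)

x = torch.randn(4, 1, 20, 20, 20)         # a minibatch of 4 single-channel volumes
print(model(x).shape)                      # torch.Size([4, 2])
```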
The other approach you might find interesting to think about is something called triplanar. What triplanar means is that you take a slice through the X, the Y and the Z axes, so you end up with three images, one slice through each axis, and then you can treat those as three different channels, which means you can probably use pretty standard neural net libraries that expect three-channel images. So those are a couple of ideas for how to deal with the 3D aspect of it.

I think using the LUNA dataset as much as possible is going to be a good idea, because you really want something that's pretty good at detecting nodules before you start pointing it at the Kaggle dataset. The other problem with the Kaggle dataset is that it's ridiculously small, and there's no reason for it: there are far more cases in NLST than they've provided to Kaggle, so I can't begin to imagine why they went to all this trouble and a million dollars of prize money on something which has not been set up to succeed. That's not our problem; it just makes it a more interesting thing to play with. But after the competition's finished, if you get interested in it, you'll probably want to go and download the whole NLST dataset, or as much of it as possible, and do it properly.

Actually, there are two questions I wanted to read. One: "For the audio stream, there are occasional max-volume pops that are really hard on the ears for remote listeners; this might not be solvable right now but is something to look into." Okay. And then someone asked: "Last class you mentioned that you would explain when and why to use Keras versus PyTorch. If you only had brain space for one, in the same way some people only have brain space for vi or Emacs, which would you pick?" Okay. I've just reduced the volume a little bit, so let us know if that helps. And I would pick PyTorch: it feels like it does everything Keras does but gives you the flexibility to really play around a lot more. I'm sure you've got brain space for both, though. Question: "You mentioned there are other datasets of cancerous images that have labels and proper annotations, so can we build this thing on those?" That was my suggestion, and that's what the tutorial shows how to do. We have a kernel on Kaggle called "Candidate Generation and LUNA16" something something, which shows how to use LUNA to build a nodule finder, and it's one of the highest-rated Kaggle kernels. (We've now used "kernel" in three totally different ways in this lesson: Kaggle kernels, CUDA kernels, kernel methods. See if we can come up with a fourth.) And this looks very familiar, doesn't it: here's a Keras approach to finding lung nodules based on LUNA. Question: "Is there such a thing as VGG in 3D? Can we make the convolutions cubes instead of rectangles, like the 3D CNNs we were just talking about?" Yeah, absolutely, why not.

I mentioned an opportunity to improve this mean shift algorithm, and when you think about it the opportunity is pretty obvious: the amount of data is huge, you've got data points all over the place, and the ones that are a long way away are going to have weights so close to zero that we may as well just ignore them. So the question is how to quickly find the ones which are a long way away, and we know the answer to that: approximate nearest neighbours. What if we added an extra step, so that rather than using x to get a distance to every data point, we instead use approximate nearest neighbours to grab just the closest ones, the ones that are actually going to matter? Roughly, the idea is something like the sketch below.
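This is only a CPU illustration of the idea, using scikit-learn's exact nearest-neighbour index as a stand-in for the GPU LSH or spill-tree index being proposed; the point is just that each point gets averaged over its k closest points instead of the whole dataset.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def meanshift_nn_step(X, bw=2.5, k=100):
    # X: (n, d) points; k and bw are illustrative values
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    dist, idx = nn.kneighbors(X)                  # (n, k) distances and neighbour indices
    weight = np.exp(-0.5 * (dist / bw)**2)        # Gaussian weights; the normalising
                                                  # constant cancels in the weighted average
    new_X = (weight[..., None] * X[idx]).sum(1) / weight.sum(1, keepdims=True)
    return new_X
```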
That would basically turn this linear-time piece into a logarithmic-time piece, which would be pretty fantastic. We learned very briefly about one particular approach, locality-sensitive hashing, and I think I mentioned there's another approach I'm really fond of called spill trees. I really, really want us, as a team, to take this algorithm, add approximate nearest neighbours to it, and release it to the community as, as far as I know, the first ever super-fast, GPU-accelerated, approximate-nearest-neighbours-accelerated mean shift clustering. I think that would be a really big deal. So if anybody's interested in doing that: I believe you're going to have to implement something like LSH or spill trees in PyTorch, and once you've done that it should be totally trivial to add the step that then uses it. If you do that, I would invite you to team up with me, and we would then release this piece of software together and author a paper or a post together. My hope is that one of you, or a group of you, will make that happen. It would be super exciting, because we'd be showing people something pretty cool about what it's like to write GPU algorithms today. In fact, I found just during the break a whole paper about how to write k-means with CUDA; it used to be so much work, and that's without even including anything like approximate nearest neighbours. So hopefully that will happen. And look, it gives the right answer. I guess to do it properly we should also replace the hard-coded Gaussian kernel bandwidth with something that we figure out dynamically.

Alright, big change: we're going to learn about chatbots. We're going to start here, with Slate: "Facebook thinks it has found the secret to making bots less dumb." This talks about a new thing called memory networks, demonstrated by Facebook: you can feed it sentences that convey key plot points and then ask it various questions, and they published a paper on arXiv that generalises the approach. There was another long article about this in Popular Science, in which they described it as early progress towards a truly intelligent AI. They're excited about how a memory network gives the network the ability to retain information: you can tell the network a story and have it answer questions, and it even has this little GIF. In the article they've got this example of reading a story from Lord of the Rings and then asking various questions about it, and it all looks pretty impressive. So we're going to implement this paper, which is called "End-To-End Memory Networks". The paper was not actually demonstrated on Lord of the Rings but on something called bAbI (I'm not quite sure whether that's pronounced "babby" or "baby"), which comes from a paper describing a synthetic dataset: "Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks". I saw a cute tweet last week explaining the meaning of various kinds of paper titles, which basically said that "towards" means "we actually made no progress whatsoever", so we'll take this with a grain of salt. This paper introduced the bAbI tasks, and the bAbI tasks are probably best described by showing an example. Each task is basically a story; a story contains a list of sentences; a sentence contains a list of words; and at the end of the story is a query, to which there is an answer. The sentences are ordered in time. So: where is Daniel?
To answer that, we have to go backwards: this sentence tells us where John is, and this one tells us where Daniel is, "Daniel went to the bathroom", so Daniel is in the bathroom. That's what the bAbI tasks look like. There are a number of different structures. One is called "single supporting fact", which means you only have to go back and find one sentence in the story to figure out the answer. We're also going to look at "two supporting facts" stories, which are ones where you have to look twice. Reading in these datasets is not remotely interesting; they're just text files, with different files for the different tasks, and if you're interested in the other tasks you can check out the paper. We're going to be looking at single supporting fact and two supporting facts. They have versions with 10,000 examples and versions with 1,000 examples, and the stated goal is to be able to solve every one of their challenges with just a thousand examples. This paper doesn't reach that goal, but it makes some movement towards it.

So basically we're going to read that into a bunch of lists: a list of stories along with their queries, and we can start by looking at some statistics about them. First: for each story, what's the maximum number of sentences? The answer is 10. So Lord of the Rings it ain't, and in fact if you go back and look at the GIF, where it says "read story: Lord of the Rings", the whole of Lord of the Rings is something like "Frodo journeyed to Mount Doom. Frodo dropped the ring there." The total number of different words in this whole thing is 32, the maximum length of any sentence in a story is 8 words, and the maximum number of words in any query is 4. So we're immediately thinking: what the hell? This was presented by the press as the secret to making bots less dumb, showing that they took a story, summarised Lord of the Rings with major plot points, and asked various questions, and clearly that's not entirely true. If you look at the stories, the first word is always somebody's name, the second word is always some synonym for "move", then there's a bunch of prepositions, and the last word is always a place. These toy tasks are very, very, very toy. So immediately we're thinking maybe this is not "the secret to making bots less dumb" or "a truly intelligent AI"; maybe it's towards a truly intelligent AI.

Okay, so to get this into Keras we need to turn it into a tensor in which everything is the same size, so we use pad_sequences for that, like we did in the last part of the course, which adds zeros to make sure everything is the same length. The other thing we'll do is create a dictionary from words to integers, to turn every word into an index. So we turn every word into an index and pad everything to the same length, and that gives us inputs_train: 10,000 stories, each of 10 sentences, each of 8 words. Anything that's not 10 sentences long gets extra sentences of just zeros, and any sentence that's not 8 words long gets zeros appended. And ditto for the test set, except there we've just got a thousand.
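That prep step looks roughly like the following sketch; it is not the notebook's exact code, the toy story is made up, and names like word_idx and index_and_pad are illustrative.

```python
import numpy as np
from keras.preprocessing.sequence import pad_sequences

# toy stand-ins for the parsed bAbI data: each story is a list of tokenised sentences
stories = [[['Mary', 'moved', 'to', 'the', 'bathroom'],
            ['John', 'went', 'to', 'the', 'hallway']]]
queries = [['Where', 'is', 'Mary']]
answers = ['bathroom']

# dictionary from words to integers (0 is reserved for padding)
vocab = sorted({w for s in stories for sent in s for w in sent}
               | {w for q in queries for w in q} | set(answers))
word_idx = {w: i + 1 for i, w in enumerate(vocab)}

story_maxsents, sent_maxlen, query_maxlen = 10, 8, 4

def index_and_pad(story):
    # turn each sentence into indices, pad each to sent_maxlen words,
    # then pad the story itself out to story_maxsents sentences of zeros
    sents = pad_sequences([[word_idx[w] for w in s] for s in story], maxlen=sent_maxlen)
    padded = np.zeros((story_maxsents, sent_maxlen), dtype=int)
    padded[-len(sents):] = sents
    return padded

inputs_train = np.stack([index_and_pad(s) for s in stories])
queries_train = pad_sequences([[word_idx[w] for w in q] for q in queries], maxlen=query_maxlen)
print(inputs_train.shape, queries_train.shape)   # (1, 10, 8) (1, 4)
```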
So how are we going to model this? Not surprisingly, we're going to use embeddings, but we haven't done this before: we have to turn a whole sentence into an embedding, not just a word. There are lots of interesting ways of turning a sentence into an embedding, but when you're only going "towards" intelligent AI, you can instead just add the word embeddings up, and that's what happens in this paper. If you look at the way the dataset is set up, you can see why you can get away with that. Mary, John and Sandra only ever appear in one place and are always the subject; the verb is always some synonym of the same thing; the prepositions are always meaningless; and the last word is always a place. So to figure out what a whole sentence says, you can just add up the word vectors. The order doesn't make any difference, there are no "not"s, and there's nothing that makes language remotely complicated or interesting.

So what we're going to do is create an input for our stories, with the number of sentences and the length of each one, take each word and put it through an embedding (that's what TimeDistributed is doing here: it puts each word through the same embedding separately), and then use a Lambda layer to sum them up. So that is our very sophisticated approach to creating sentence embeddings. We do that for our story, and we end up with something which, rather than being 10 by 8 (10 sentences by 8 words), is now 10 by 20: 10 sentences, each turned into a length-20 embedding. We're just starting with a random embedding; we're not going to use word2vec or anything, because with this vocabulary we don't need that complexity. We do exactly the same thing for the query, except we don't need TimeDistributed this time, because there's just one query: we do the embedding, sum it up, and then use Reshape to add a unit axis to the front so it's the same basic rank. So now we have 10 sentence embeddings for the story and one length-20 embedding for the query.

Okay, so what is the memory network, or more specifically the more advanced end-to-end memory network? As per usual, when you get down to it, it's less than a page of code, but let's draw it before we look at the code. We have a bunch of sentences; let's just use four sentences for now. Each sentence contains a bunch of words, we turned each word into an embedding, and we summed those up to get an embedding for each sentence, each of length 20. Then we have the query, which is also a bunch of words that we got embeddings for and added up, giving an embedding for our question. To do a memory network, we take each of those sentence embeddings and combine it with the query by taking a dot product. So we end up with four dot products, one for each sentence of the story times the query. A dot product basically tells you how similar two things are: where one element is big and the other is big, or one is small and the other is small, both of those push the dot product up. So these are going to be four numbers describing how similar each of our four sentences is to the query. That's step one. Step two is to stick them through a softmax. Remember, a dot product gives a scalar, so we now have four scalars, and after the softmax they add up to one, and each one basically reflects how similar the query is to the corresponding sentence of the story.
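Stripped of the Keras plumbing, that attention step is just a dot product followed by a softmax; here's an illustrative NumPy version with made-up shapes and random data.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())           # subtract the max for numerical stability
    return e / e.sum()

story_emb = np.random.randn(4, 20)    # four sentence embeddings (sums of word vectors)
query_emb = np.random.randn(20)       # one query embedding

sims = story_emb @ query_emb           # four dot products: similarity of each sentence to the query
weights = softmax(sims)                # four positive numbers that sum to one
print(weights, weights.sum())
```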
We're now going to create a totally separate embedding of each of the sentences in our story, by creating a totally separate embedding for each word: basically a new random embedding matrix to start with, and again we sum the word embeddings together to give a new sentence embedding. This one they call C, I believe. Then all we do is multiply each of these C embeddings by the corresponding softmax output as a weighting, and add them all together. If the softmax outputs are called s1, s2, s3, s4, the result is c1 times s1, plus c2 times s2, and so on; you don't even need to divide by the sum of the s's, because they already add up to one. And that's it: that's our final result, a vector of length 20. We take that, put it through a single dense layer, and get back the answer, and that whole thing is the memory network. It's incredibly simple; there's nothing deep about it in terms of deep learning, there are almost no non-linearities, so it doesn't seem like it's likely to be able to do very much. But then, we haven't given it very much to do. So let's take a look at the code version of that drawing.

Question: "In that last step, you said we get the answer; is that really the embedding of the answer, so it has to do a reverse lookup?" Yeah, it's a softmax over the answer, and then you have to do an argmax.

So here it is: we've got the embedding of the story times the embedding of the query, the dot product, and then we do a softmax. Softmax works over the last dimension, so I just have to reshape to get rid of the unit axis and then reshape again to put the unit axis back on; the reshapes aren't doing anything interesting, so it's just a dot product followed by a softmax, and that gives us the weights. Then we take each weight and multiply it by the second set of embeddings, embedding C. To do that I just use a dot product again, but because of the unit axis this is actually just doing a very simple weighted average. Then reshape again to get rid of the unit axis so we can stick it through a dense layer with a softmax, and that gives us our final result.

What this is effectively doing is saying: how similar is the query to each of the sentences in the story? Use that to create a bunch of weights, and then these C embeddings are basically the answers. If sentence number one is where the answer is, we use its C embedding, and ditto for sentences two, three and four. Because there's only a single linear layer at the very end, it doesn't get to do much computation, so the C embeddings basically have to learn what answer each sentence represents. And again, that's lucky, because in the original dataset the answer to every question is just the last word of the relevant sentence ("where is Frodo's ring", or whatever), and that's why we can get away with this incredibly simple final piece.
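Putting those pieces together, a single-hop version looks roughly like this in Keras. This is a sketch following the description above rather than the course notebook verbatim, so the layer arrangement and the names (emb_dim, vocab_size and so on) are illustrative.

```python
from keras.layers import (Input, Embedding, TimeDistributed, Lambda, Reshape,
                          Activation, Dense, dot)
from keras.models import Model
import keras.backend as K

story_maxsents, sent_maxlen, query_maxlen = 10, 8, 4
vocab_size, emb_dim = 33, 20            # 32 distinct words plus the padding index 0

# story: 10 sentences of 8 word indices; sum each sentence's word embeddings -> (10, 20)
inp_story = Input((story_maxsents, sent_maxlen))
emb_story = TimeDistributed(Embedding(vocab_size, emb_dim))(inp_story)
emb_story = Lambda(lambda t: K.sum(t, axis=2))(emb_story)           # (10, 20)

# query: 4 word indices; sum the embeddings and add a unit axis -> (1, 20)
inp_q = Input((query_maxlen,))
emb_q = Embedding(vocab_size, emb_dim)(inp_q)
emb_q = Lambda(lambda t: K.sum(t, axis=1))(emb_q)
emb_q = Reshape((1, emb_dim))(emb_q)                                 # (1, 20)

# attention weights: dot product of each sentence with the query, then softmax
x = dot([emb_story, emb_q], axes=2)                                  # (10, 1)
x = Activation('softmax')(Reshape((story_maxsents,))(x))
weights = Reshape((story_maxsents, 1))(x)                            # (10, 1)

# second ("C") embedding of the story, weighted-averaged by the attention weights
emb_c = TimeDistributed(Embedding(vocab_size, emb_dim))(inp_story)
emb_c = Lambda(lambda t: K.sum(t, axis=2))(emb_c)                    # (10, 20)
x = dot([weights, emb_c], axes=1)                                    # (1, 20) weighted average
x = Reshape((emb_dim,))(x)

answer = Dense(vocab_size, activation='softmax')(x)                  # pick the answer word

model = Model([inp_story, inp_q], answer)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.fit([inputs_train, queries_train], answers_train, epochs=4, batch_size=32)
```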
So this is an interesting use of Keras: we've created a model which is in no possible way deep learning, but it is a bunch of tensors and layers stuck together, it has some inputs and an output, so we can call it a model, compile it with an optimizer and a loss, and then fit it. It's interesting how you can use Keras for things which don't really use any of the normal layers in any normal way. And as you can see, it works, for what it's worth: we solve this problem, and the particular problem we solved here is single supporting fact; in fact it worked in less than one epoch.

More interesting is two supporting facts, but before I do that I'll point out something interesting: now that this is trained, we could create another model which returns not the final answer but the value of the weights, and then go back and ask, for a particular story, what are the weights? So we do f.predict rather than answer.predict. For this story, with sentences about Daniel, Mary and Sandra, including "Sandra went to the bathroom", the query is "where is Sandra?" and the answer is "bathroom". The weights are here, and you can see that the weight for sentence number two is 0.98. So we can actually look inside the model and find out which sentences it's using to answer the question.

We have a question: "Would it not make more sense to concat the embeddings rather than sum them?" Not for this particular problem, because of the way the vocabulary and the sentences are structured. "You would also have to deal with the variable length." Well, we've used padding to make them the same length. But yes, if you wanted to use this in real life you would need to come up with a better sentence embedding, which presumably would be an RNN or something like that, because you need to deal with things like "not", and the location of subject and object, and so forth. One thing to point out is that the order of the sentences matters, so what I actually did in pre-processing was to add a "0:", "1:" and so on to the start of each sentence, so that the model can learn the time order of the sentences. It's just another token that I added, in case you were wondering what that was.

One nice thing with memory networks is that if they're not working, we can look and see why they're not working. Okay, so: multi-hop. Let's now look at an example of a two supporting facts story. It's mildly more interesting. We still only have one type of verb with various synonyms, a small number of subjects and a small number of objects, so it's basically the same, but now to answer a question we have to go through two hops. Where is the milk? Okay, let's find the milk: "Daniel left the milk there." So where is Daniel? "Daniel travelled to the hallway." So where is the milk? The hallway. That's what we have to be able to do this time. And what we can do is exactly the same thing as before, but we take our whole little model (the embedding, reshape, dot, reshape, softmax, reshape, dot, reshape, dense layer) and call all of that "one hop". So this whole picture becomes one hop, and then we take it and go back and replace the query with our new output, roughly as in the sketch below.
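A sketch of what that refactoring might look like, with the whole hop wrapped in a function and chained twice; again this follows the description rather than the actual notebook, and all names and sizes are illustrative.

```python
from keras.layers import (Input, Embedding, TimeDistributed, Lambda, Reshape,
                          Activation, Dense, dot)
from keras.models import Model
import keras.backend as K

story_maxsents, sent_maxlen, query_maxlen = 10, 8, 4
vocab_size, emb_dim = 40, 20            # made-up sizes for the two-supporting-facts task

def emb_sents(inp):
    # sum-of-word-embeddings sentence encoder, as before: (sents, words) -> (sents, emb_dim)
    x = TimeDistributed(Embedding(vocab_size, emb_dim))(inp)
    return Lambda(lambda t: K.sum(t, axis=2))(x)

def one_hop(u, inp_story):
    # one full memory-network pass: attention over the story, weighted sum of a second
    # story embedding, then a dense layer; the output becomes the next query
    emb_a = emb_sents(inp_story)                        # (10, 20) "A" embedding
    emb_c = emb_sents(inp_story)                        # (10, 20) "C" embedding
    x = dot([emb_a, u], axes=2)                         # (10, 1) similarities
    x = Activation('softmax')(Reshape((story_maxsents,))(x))
    w = Reshape((story_maxsents, 1))(x)                 # attention weights
    x = dot([w, emb_c], axes=1)                         # (1, 20) weighted average
    return Reshape((1, emb_dim))(Dense(emb_dim)(Reshape((emb_dim,))(x)))

inp_story = Input((story_maxsents, sent_maxlen))
inp_q = Input((query_maxlen,))
u = Lambda(lambda t: K.sum(t, axis=1))(Embedding(vocab_size, emb_dim)(inp_q))
u = Reshape((1, emb_dim))(u)                            # initial query embedding, (1, 20)

u = one_hop(u, inp_story)                               # hop 1: query -> updated query
u = one_hop(u, inp_story)                               # hop 2: look again with the new query

answer = Dense(vocab_size, activation='softmax')(Reshape((emb_dim,))(u))
model = Model([inp_story, inp_q], answer)
model.compile('adam', 'sparse_categorical_crossentropy', metrics=['accuracy'])
```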
So at each hop we replace the query with the result of the memory network, and that way the memory network can learn: okay, the first thing I need is the milk, go back and find the milk; now I have the milk, so update the query to "where is Daniel?"; now go back and find Daniel. The memory network in multi-hop mode basically does the whole thing again and again, replacing the query each time. That's why I took the whole set of steps and chucked them into a single function, so then I can just write: response and story equals one hop; response and story equals one hop on that; and you can keep repeating that again and again, and at the end get our output. Then it's the same model, compile, fit. I had real trouble getting this to fit nicely; I had to play around a lot with learning rates and batch sizes and whatever else, but I did eventually get it up to 0.999 accuracy.

This is kind of an unusual class for me to be teaching, because particularly compared to part one, where it was all about best practices, this is clearly anything but. I'm showing you something which was maybe the most popular request ("teach us about chatbots"), but let's be honest: who has ever used a chatbot that's not terrible? And the reason no one's used a chatbot that's not terrible is that the current state of the art is terrible. Chatbots have their place, and indeed one of the students in this class has written a really interesting analysis of this which hopefully she'll share on the forum, but that place really involves lots of heuristics, carefully set-up vocabularies, selecting from small sets of answers, and so forth. It's not a general-purpose "here's a story, ask anything you like about it and you'll get sensible answers". That's not to say we won't get there, and I sure hope we will, but the incredible hype we had around neural Turing machines and memory networks and then end-to-end memory networks is, as you can see when you just look at the dataset they worked on, kind of crazy.

That's not quite the final conclusion, though, because yesterday a paper came out which showed how to identify buffer overruns in computer source code using memory networks, and it kind of spoiled my whole narrative: somebody seems to have actually used this technology for something, effectively. And when you think about it, it makes some sense. In case you don't know what a buffer overrun is: if you're writing in an unsafe language, probably C, you allocate some memory that's going to store some result or some input, and you try to put into that memory something bigger than the amount you allocated; it basically spills out the end, and in the best case it crashes, while in the worst case somebody figures out how to get exactly the right code to spill out into exactly the right place and ends up taking over your machine. So buffer overruns are horrible things, and I can actually see that finding them does look a lot like this memory network: you have to see where that variable was set, then where the thing it was set from was set, and where the original thing was allocated. It's kind of like going back through the source code, and the vocabulary is pretty straightforward, just the variables that have been defined.
So that's kind of interesting. I haven't had a chance to really study the paper yet, and it's no chatbot, but maybe there is room for memory networks already after all.

Question: "Is there a way to visualise what the neural network has learned from the text?" Well, there isn't really a neural network here. If you mean the embeddings: yes, you can look at the embeddings easily enough. The whole thing is so simple that it's very easy to look at every embedding, and as I mentioned, we looked at visualising the weights that came out of the softmax. But we don't even need to look at it in order to figure out what it has learned. Given that this is just a small number of simple linear steps, we know it basically has to learn what each sentence's answer can be (sentence number 3's answer will always be "milk", sentence number 4's answer will always be "hallway", or whatever), and that's what the C embeddings are going to have to be. Then the embeddings used for the weights have to learn how to pick that out; they'll probably end up similar to the query embedding, and in fact I think you can even make them the same embedding, so that the dot products basically give you similarity scores. So this is really a very simple, largely linear model, and it doesn't require much visualising.

Having said all that, none of this is to say that memory networks are useless. They were created by very smart people with impressive pedigrees in deep learning; it's just that this is very early. This tends to happen in the popular press: they get over-excited about things. Although in this case I don't think we can blame the press; I think we have to blame Facebook for creating a ridiculous demo like this. It was clearly created to give people the wrong idea, which I find very surprising from people like Yann LeCun, who normally does the opposite of that kind of thing. So this is not really the press's fault in this case. But this may well turn out to be a critical component in chatbots and Q&A systems and whatever else; we're just not there yet. I had a good chat with Stephen Merity the other day, a researcher I respect a lot and also somebody I like, and I asked him what he thought was the most exciting research in this direction at the moment. He mentioned something I was also very excited about, called Recurrent Entity Networks, and the Recurrent Entity Networks paper is the first to solve all of the bAbI tasks with one hundred percent accuracy. Take of that what you will; I don't know how much it means, since they're synthetic tasks, and one of the things Stephen Merity has pointed out in a blog post is that even the basic coding of how the tasks are generated is pretty bad, with lots of duplicates, and the whole thing's a bit of a mess. But nonetheless this is an interesting approach, so if you're interested in memory networks it's certainly something to look at, and I do think it's likely to be an important direction.

Having said all that, one of the key reasons I wanted to look at memory networks is not only because it was, I think, the largest request from the forums for this part of the course, but also because it introduces something that's going to be
critical for the next couple of lessons, which is the concept of attention. Attentional models are models where we have to do exactly what we just looked at: basically figure out, at each point, which part of a story to look at next, or which part of an image to look at next, or which part of a sentence to look at next. The task we're going to be working on over the next lesson or two is translating French into English, and that is clearly not a toy task; it's a very challenging one. One of the challenges is that a particular French sentence, with its particular bunch of words, is likely to turn into an English sentence with a different bunch of words, and maybe these particular words here correspond to this translation here, and this one to this one, and this one to that one. So as you go through, you need some way of saying: which word do I look at next? That's going to be the attentional model. What we're going to do is take a proper RNN (a proper RNN like an LSTM or a GRU or whatever) and change it so that inside the RNN it actually has some way of figuring out which part of the input to look at next. That's the basic idea of attentional models.

Interestingly, during the time that memory networks and neural Turing machines and so on were getting this huge amount of press attention, attentional models were appearing very quietly in the background at exactly the same time, and it's the attentional models for language that have really turned out to be critical. You've probably seen all of the press about Google's new neural translation system, and that really is everything it's claimed to be: it really is basically one giant neural network that can translate pairs of languages, and the accuracy of those translations is far beyond anything that's come before. The basic structure of that neural net, as we're going to learn, is not that different from what we've already learned; it just has this one extra piece, which is attention. Depending on how interested you are in the details of this neural translation system, it turns out there are also lots of little tweaks, and the tweaks are around things like: okay, you've got a really big vocabulary and some of the words appear very rarely, so how do you build a system that can learn to translate those really rare words? And how do you deal with the memory issues around having huge embedding matrices of, say, 160,000 words? So there are lots of details, and the nice thing is that, particularly because Google has ended up putting this thing in production, all of these little details have answers now, and those answers are all really interesting. There aren't really, on the whole, great examples of all of those things put together, so one of the interesting things here will be that you'll have opportunities to do that. Generally speaking, the blog posts about these neural translation systems tend to be at a pretty high level: they describe roughly how the approaches work, but Google's complete neural translation system is not something you can just download and read the code of. So we'll
see how we go, but we'll do it piece by piece here. There's one other thing to mention about the memory network, which is that Keras actually comes with an end-to-end memory network example in the Keras GitHub repo which, weirdly enough, when I actually looked at it, turns out not to implement this at all. Even on the single supporting fact task it takes many, many epochs and doesn't get to 100% accuracy. I found it quite surprising to discover that once you start getting to some of these more recent advances, anything that's not just a standard CNN or whatever, it's less and less common that you actually find code that's correct and that works, and this memory network example is one of them. If you go to the Keras GitHub, look at the examples and download the memory network one, you'll find that you don't get results anything like this, and if you look at the code, you'll see it really doesn't do this at all. So I wanted to mention that as a bit of a warning: you're at the point now where you might want to take with a grain of salt the blog posts you read, or even some of the papers you read. They're well worth experimenting with, but you should start with the assumption that you can do it better, and maybe even with the assumption that you can't necessarily trust all of the conclusions you've read, because in my experience putting together this part of the course, the vast majority of the time the stuff out there is just wrong. Even in a case like this (and I deeply respect the Keras authors and the Keras source code) it's just wrong. So I think that's an important point to be aware of.

Okay, I think we're done, and we're going to finish five minutes early for a change; I think that's never happened before. So thanks, everybody. This week, hopefully you can have a look at the Data Science Bowl and make a million dollars, create a new PyTorch approximate nearest neighbours algorithm, and then, when you're done, maybe figure out the next stage for memory networks. Okay, thanks everybody.