 is an introduction to deep learning. So please welcome, Geoff. Good morning, thanks for coming. Okay, I'd like to start by thanking Tariq Rashid for his excellent gentle introduction to neural networks. I'm going to build upon that and hopefully show you how to develop some of the networks that have been used to get the really good computer vision results we've seen recently. Our focus this morning is mainly going to be on image processing. And in this talk I'm going to cover the principles and the maths more than the code. The reason is that it's quite a big topic, there's a lot to go through, and I've got to squeeze it into an hour, so a little less code, but I hope it's useful. So, a quick overview of what we're going to go through. We'll discuss the Theano library, which is the one I personally use, although there are also libraries like TensorFlow. We'll cover the basic model of what a neural network is, building on Tariq's talk. Then we'll go through convolutional networks; these are the networks that have been getting the really, really good results we've seen recently. Then we'll look briefly at Lasagne, which is another Python library that builds on top of Theano to make it easier to build neural networks; we'll discuss why it's there and what it does. Then I'll give you a few hints about how to actually build a neural network, how to structure it, what layers to choose, and a rough idea of how to train them: a few hints and tips to practically get going. And finally, time permitting, I'll go through the OxfordNet VGG network, which is a pre-trained network that you can download under Creative Commons from Oxford University. You can use that yourself; I'll go through why it's sometimes useful to take a network that somebody else has trained for you and tweak it for your own purposes. 
Now, the nice thing is there are some talk materials. This is based off a tutorial I gave at PyData London in May, and if you check out the GitHub repo, britefury / deep-learning-tutorial-pydata2016, all the notebooks are viewable on GitHub, so you should be able to see everything there in your browser. I would ask, though: please, please do not try and run this code during the talk. The reason is that the stuff that uses the VGGNet / OxfordNet models will need to download a 500 MB weights file, and you will kill the Wi-Fi if you all start doing that. So please do that in your own time, if that's okay. Also, if you want to get more in depth about Theano and Lasagne, I'll put up some slides: if you check out my Speaker Deck profile, there'll be this talk's slides, and there'll also be an intro to Theano and Lasagne as well. That will give you a breakdown of Python code using Theano and Lasagne, what it does and how to use it. And furthermore, if you don't have a machine available, or you don't want to set one up yourself, I've set up an Amazon AMI for you. So if you want to go use one of their GPUs, you can grab hold of that and run all the code there. Everything's all set up, and I hope it's all relatively easy to get into. All right, now time to get into the meat of the talk. And what better place to start than ImageNet? ImageNet is an academic image classification dataset. You've got about a million images, I think it might be even more now, divided into a thousand different classes: various different types of dog, various different types of cat, flowers, buckets, rocks, snails, whatever you can come up with. You've got a bunch of images that have been scraped off Flickr, and you've got to provide a ground truth of what each image is. 
And the ground truths were prepared by getting people to do it over Amazon Mechanical Turk. Now, the top-five challenge. What you've got to do is produce a classifier that, given an image, will produce a probability score of what it thinks it is, and you score a hit if the ground truth class, the actual true class, is somewhere within your classifier's top five choices for what it thinks the image is. In 2012, the best approaches at the time used a lot of handcrafted features. For those of you familiar with computer vision, these are things like SIFT, HOGs and Fisher vectors, stuck into a classifier, maybe a linear classifier, and the top-five error rate was around 25%. And then the game changed. Krizhevsky, Sutskever and Hinton, in their paper "ImageNet Classification with Deep Convolutional Neural Networks", bit of a mouthful, managed to get the error rate down to 15%. And in the last few years, more modern network architectures have gone down further; now we're down to about 5% to 7%, and I think people like Google and Microsoft have even gone down to three or four. I hope this talk is going to give you an idea of how that's done. Okay, let's have a quick run over Theano. Neural network software comes in two flavours, or it's kind of on a spectrum, really: you've got the neural network toolkits at the quite high level at one end, and at the other end you've got expression compilers. With a neural network toolkit, you specify the neural network in terms of layers. Expression compilers are somewhat lower level: you describe the mathematical expressions, the ones that Tariq covered, that are behind the layers and that effectively describe the network. It's a more powerful and flexible approach. Theano is an expression compiler. 
You write numpy-style expressions, and it compiles them to either C to run on your CPU, or CUDA to run on an NVIDIA GPU if you have one of those available. And once again, if you want an intro to that, there are my slides that I mentioned earlier. There's a lot more to Theano, so go check out the deeplearning.net website to learn more about it; that gives you the full description of the API and everything it'll do, some of which you may want to use. There are, of course, others. There's TensorFlow, developed by Google, and that's gaining popularity really fast these days, so that may well be the future. We'll see. Okay. What is a neural network? Well, we're going to cover a fair bit of what Tariq covered in the previous talk, but: it's got multiple layers, and the data propagates through each layer and is transformed as it goes. So we might start out with our image of a bunch of bananas, which goes through the first hidden layer and gets transformed into a different representation, and then gets transformed again in the next hidden layer. And finally, assuming we're doing an image classifier, we end up with a probability vector. All the values in that vector sum up to one, and our predicted class is the element in the probability vector that has the highest probability. Okay. And this is what our network kind of looks like. There are the weights that you saw in the previous talk that connect all the units between the layers, and you see our data being put in on the input, propagating through and arriving at the output. Breaking down a single layer of a neural network: we've got our input, which is basically a vector, an array of numbers. We multiply by our weights matrix, which is the crazy lines, and then we add a bias term, which is simply an offset. 
You just add a vector, and then you have our activation function, or non-linearity; those terms are roughly interchangeable. And that layer activation is what then goes into the next layer, or is the output if it's the last layer in the network. Mathematically speaking, x is our input vector and y is our output. We represent our weights by the weights matrix W; that's one of the parameters of our network. Our other parameter is the bias b. And we've got our non-linearity function f. Normally, these days, that's the rectified linear unit (ReLU). It's about as simple as they come: it's simply max(x, 0). That's the activation function that's become the most popular recently. In a nutshell: y = f(Wx + b), repeated for each layer as the data goes through. And that's basically a neural network, just that same formula repeated over and over, once for each layer. And to make an image classifier, we take the pixels from our image, splat them out into a vector, stretching them out row by row, run them through the network, and get our result. So, in summary, our neural network is built from layers, each of which is a matrix multiplication, then our bias, then our non-linearity. Okay, and how to train a neural network. We've got to learn values for our parameters, the weights and the biases for every layer, and for that we use backpropagation. We initialise our weights randomly (there'll be a little more on this later) and we initialise the biases all to zero. And then, for each example in our training set, we evaluate, as Tariq said, the network's prediction, what it reckons the output is, and compare it to the actual training output, what it should produce given that input. We measure our cost function, which is, roughly speaking, the error: the difference between what our network is predicting and what it should predict, the ground truth output. 
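Just to make the per-layer formula concrete, here's a minimal numpy sketch of a single dense layer with a ReLU activation. This is illustrative only; the names and sizes are mine, and a real network would stack several of these:

```python
import numpy as np

def relu(x):
    # Rectified linear unit: element-wise max(x, 0)
    return np.maximum(x, 0.0)

def dense_layer(x, W, b):
    # One layer of the network: y = f(Wx + b)
    return relu(W.dot(x) + b)

rng = np.random.RandomState(0)
x = rng.randn(784)              # e.g. a flattened 28x28 image
W = rng.randn(256, 784) * 0.01  # weights, initialised randomly
b = np.zeros(256)               # biases, initialised to zero
y = dense_layer(x, W, b)        # activations passed to the next layer
```

Repeating that formula once per layer, feeding each layer's output into the next, is the whole forward pass.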
Now, the cost function is kind of important, so we'll discuss it a little. For classification, where the idea is that given an input and a bunch of categories, you ask which category best describes the input, we use a function called softmax as the nonlinearity, or activation function, of the final layer, and it outputs a vector of class probabilities. The best way of thinking about it: say I've got a bunch of numbers, I sum them all up, and then I divide each element by the sum. That gives us roughly a proportion, or a probability, assuming all of the numbers we started with are positive. But the numbers in a neural network can also go negative, so softmax adds one little wrinkle: we take our input numbers, compute the exponential of each of them, sum those up, and divide each exponential by the sum of the exponentials. That's softmax. And then our cost function, our error function, is negative log likelihood, also known as categorical cross-entropy. To compute it, you take the log of the predicted probability for the true class. Let's say you have an image of a dog: you run the image through the network and see what the predicted probability is for "dog". You take the log of that probability, which is going to be negative or zero: if the predicted probability is one, the log is zero; if it's something like 0.1, the log is quite strongly negative. You negate that log. So the idea is that if the network is supposed to output "dog", it should give a probability of one; if it gives a probability of less than that, the negative log will be quite positive, which indicates high error. So that's your cost. Now, regression is different: rather than classifying an input and saying which category most closely matches it, you're trying to quantify something, measuring the strength of something or the strength of some response. Typically, with that, your final layer doesn't have an activation function, it's just the identity, linear, and your cost is going to be the sum of squared differences. 
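Here's a small numpy sketch of softmax and the negative log likelihood cost as just described. This is my own toy code, not any library's API; subtracting the max before exponentiating is a standard numerical-stability trick that doesn't change the result:

```python
import numpy as np

def softmax(z):
    # Exponentiate each input, then divide by the sum of the exponentials.
    e = np.exp(z - z.max())
    return e / e.sum()

def negative_log_likelihood(probs, true_class):
    # Negative log of the probability given to the true class: zero if the
    # network is certain and correct, large if it is confident and wrong.
    return -np.log(probs[true_class])

scores = np.array([2.0, 1.0, -1.0])  # raw outputs of the final layer
probs = softmax(scores)              # class probabilities, summing to 1
cost = negative_log_likelihood(probs, true_class=0)
```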
Then what we've got to do with our neural network is reduce the cost, reduce the error, using gradient descent. We have to compute the derivative, the gradient, of the cost with respect to our parameters, which is all the weights and all the biases in all our layers. The cool thing is that Theano does the symbolic differentiation for you. I can tell you right now that you don't want to be in a situation where you have this massive expression for your neural network and you've got to compute the derivative of the cost with respect to some parameter by hand, because you will make a mistake, you will flip a minus sign somewhere, and then your network won't learn, and debugging it will be a goddamn nightmare because it'll be really hard to figure out where it's gone wrong. So I would recommend getting a symbolic maths package to do it for you, or using something like Theano that just handles it all: you literally write d_cost_by_d_weight = theano.grad(cost, weight), and other toolkits do this as well, just to save you time and sanity. Then you update your parameters: you take your weights and subtract the learning rate, lambda, multiplied by the gradient. I'd generally recommend that the learning rate be somewhere in the region of 1e-4 to 1e-2, something in that region. Also, you typically don't train on one example at once: you take what's known as a mini-batch of about 100 samples from your data set. 
You compute the cost of each of those samples, average all the costs together, and then compute the derivative of the average cost with respect to all of your parameters. The idea is that you get about 100 samples processed in parallel, and when you run on a GPU that tends to speed things up a lot, because it uses all of the parallel processing power of the GPU. Training on all the examples in your entire training set is called an epoch, and you often run through multiple epochs to train your network, something like 200 or 300. So, in summary: take a mini-batch of training samples; run them through the network; measure the average error, or cost, across the mini-batch; use gradient descent to modify the parameters to reduce the cost; and repeat the above until done. All right, the multi-layer perceptron. This is a simple neural network architecture, and it's nothing we haven't seen so far. It uses only what are known as fully connected, or dense, layers; in a dense layer, each unit is connected to every single unit in the previous layer. To pick up from Tariq's talk, the MNIST handwritten digits data set is a good place to start, and a network with two hidden layers, both with 256 units, after 300 epochs gets about 1.83% validation error. So that's about 98.17% accuracy, which is pretty good. However, these handwritten digits are quite a special case: all the digits are nicely centred within the image, in roughly the same position and scaled to about the same size, and you can see that in the examples there. And our fully connected networks have one weakness: there's no translational invariance. Imagine you want a network to detect a ball somewhere in an image. What it effectively means is that it will only learn to pick up the ball in the positions where it's been seen so far. 
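Pulling the training recipe together, here's a toy pure-numpy sketch of mini-batch gradient descent on a simple linear regression problem. The model, data and learning rate are all made up for illustration; the gradient here is written by hand because the model is tiny, whereas for a real network you'd let Theano derive it:

```python
import numpy as np

rng = np.random.RandomState(0)
# Toy regression problem: y = X.w_true plus a little noise
X = rng.randn(500, 3)
w_true = np.array([2.0, -1.0, 0.5])
y = X.dot(w_true) + 0.01 * rng.randn(500)

w = np.zeros(3)        # parameters to learn
learning_rate = 0.05
batch_size = 100

def mean_cost(w):
    # Average squared error over the whole training set
    return np.mean((X.dot(w) - y) ** 2)

cost_before = mean_cost(w)
for epoch in range(20):                          # one epoch = whole training set
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):   # one mini-batch at a time
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        err = xb.dot(w) - yb
        grad = 2.0 * xb.T.dot(err) / len(idx)    # gradient of the average cost
        w -= learning_rate * grad                # gradient descent step
cost_after = mean_cost(w)
```

The shape of the loop is the same for a neural network: shuffle, slice into mini-batches, compute the average cost's gradient, step the parameters downhill, and repeat for a few hundred epochs.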
It won't learn to generalise across all positions in the image. One of the cool things we can do is take the weights that have been learned, take one of the neurons, one of the units, in the first hidden layer, take the strengths of the weights that link it to all the pixels in the input layer, and visualise them; that's what you end up with. You see that the weights of the first hidden layer effectively form a bunch of little feature detectors that pick up the various strokes that make up the digits. So it's kind of cool to visualise, but it shows you how the dense layers are translationally dependent. For general imagery, say if you want to detect cats, dogs, the various eyes and everything that makes up the various little creatures, you'd have to have a training set large enough to contain every single possible feature in every single location in the images, and a network with enough units to represent all this variation. So you'd have a training set in the trillions, a neural network with billions and billions of nodes, and you'd need about all the computers in the world and until the heat death of the universe in order to train it. So, moving on: convolutional networks are how we address that. Convolution is a fairly common operation in computer vision and signal processing. 
You slide a convolutional kernel over the image. Imagine the image pixels are in one layer; you take your kernel, which is a little grid of weights, little values, and you multiply each value in the kernel by the pixel underneath it, take those products and sum them all up. Then you slide the kernel over one position and do the same, slide it over again, do the same, and what you end up with is an output image. They're often used for feature detection. So, a brief detour: Gabor filters. If we produce these filters, which are the product of a sine wave and a Gaussian function, you end up with these little soft circular wave things, and if you do the convolution, you'll see that they act as feature detectors that detect certain features in the image. You can see how it roughly corresponds: the ones with the vertical bars roughly pick out the vertical lines in the image of the bananas, and the horizontal bars pick out the horizontal lines. So you can see how convolutions act as feature detectors, and they're used quite a lot for that. Back on track to convolutional networks. A quick recap: that's what our fully connected layer looks like, with all of our inputs connected to all of our outputs. In a convolutional layer, you'll notice that the node on the right is only connected to a small neighbourhood of nodes on the left, and the next node down is only connected to a small corresponding neighbourhood. 
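The sliding-and-summing operation can be written out directly in numpy. This is a deliberately naive sketch (real libraries use much faster implementations, and, like most deep learning code, it doesn't flip the kernel); the example kernel is a crude vertical-edge detector, loosely in the spirit of the Gabor filters just mentioned:

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image; at each position multiply each
    # kernel value by the pixel underneath it and sum the products.
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# An image with a vertical edge down the middle, and a kernel that
# responds strongly to exactly that kind of feature
image = np.array([[0., 0., 1., 1.],
                  [0., 0., 1., 1.],
                  [0., 0., 1., 1.],
                  [0., 0., 1., 1.]])
kernel = np.array([[-1., 1.],
                   [-1., 1.]])
response = convolve2d(image, kernel)  # strongest along the vertical edge
```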
The weights are also shared, meaning you use the same value for all the red weights, for all the greens and for all the yellows, and the values of these weights form that kernel, that feature detector. For practical computer vision, whether you're producing the kernels manually or learning them as in a convolutional network, more than one kernel has to be used, because you've got to extract a variety of features. It's not sufficient just to detect the horizontal edges; you want to detect the vertical ones and all the other various orientations and sizes as well. So you've got to have a range of kernels. You're going to have different weight kernels, and the idea is that you've got an image with one channel on the input and, say, three channels on the output; a typical convolutional network might actually have about 48 channels, or 256. I'll show you some examples of some architectures later, and you end up with very high dimensionality in this channels output. Okay. Each kernel connects to the pixels in all the channels in the previous layer, so it draws in data from all of those channels. However, the maths is still the same, because a convolution can be expressed as a multiplication by a weight matrix; it's just that the weight matrix is quite sparse. Conceptually the maths doesn't really change, and that's fortunate for us, because it means that the gradient descent and everything we've done so far still works. As for how you go about figuring that out, I'd just recommend letting Theano do it for you; I wouldn't try it by hand, I wouldn't recommend it. There's one more thing we need: down-sampling. If you've worked in Photoshop or GIMP or any of these other image editing packages, you might want to shrink an image down by a certain amount, say by 50%. 
You want to shrink the resolution, and for that we use one of two operations: max pooling or striding. With max pooling, you can see that the image up there is divided into four colour blocks; say the blue block has four pixels. We take those four pixels, pick the one with the maximum value, and use that; rather than averaging, we just take the maximum. That's max pooling. It down-samples the image by a factor of p, where p is the size of the pooling, and it operates on each channel independently. The other option is striding: you effectively pick a sample, skip a few, pick a sample, skip a few. It's even simpler, and it's often quite a lot faster, because a lot of the convolution operations support strided convolutions, where rather than producing the full output and throwing some of it away, they just jump over a few pixels each time. So that's faster, and you get similar results. Moving on. Yann LeCun used convolutional networks to solve the MNIST dataset in 1995, and this is a simplified version of his architecture. You've got a 28x28 input image, one channel because it's monochrome. You've got 20 kernels, 5x5, so they reduce the image to 24x24, but it's now 20 channels deep. Max pooling shrinks it by half. Then we have 50 kernels, 5x5, and now we've got a 50-channel image, 8x8. Max pooling shrinks it by half again, and then we flatten it and apply a fully connected, dense layer to 256 units, and finally a fully connected layer to our 10-unit output layer for the class probabilities. After 300 epochs of training, we get 99.21% accuracy; a 0.79% error rate is not too bad. And what about the learned kernels? It's interesting to think about what feature detectors it's picking up. If you look at a big dataset like ImageNet, this is the Krizhevsky paper that I mentioned right at the beginning. 
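Before moving on, here's a quick numpy sketch of the 2x2 max pooling just described, on a single channel (my own toy code; in practice the library applies this to every channel of every image in the mini-batch):

```python
import numpy as np

def max_pool(image, p=2):
    # Divide the image into p x p blocks and keep only the maximum value
    # from each block, shrinking each dimension by a factor of p.
    h, w = image.shape
    blocks = image[:h - h % p, :w - w % p].reshape(h // p, p, w // p, p)
    return blocks.max(axis=(1, 3))

image = np.array([[1., 2., 5., 6.],
                  [3., 4., 7., 8.],
                  [0., 1., 2., 0.],
                  [9., 0., 1., 1.]])
pooled = max_pool(image)  # 4x4 down to 2x2, keeping each block's maximum
```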
These are the kernels that get learned by the neural network, and for comparison you can see Gabor filters over there. The reason the colour ones are at the bottom is just because of the way they split the model across two GPUs, but if you look at the top rows you can see edges at various orientations. That's the first layer. Zeiler and Fergus took it a little further: they figured out a way of visualising how the kernels in the second layer respond, so you can see you've got kernels there that respond to slightly more complex features, things like squares, curved textures, little eye-like or circular features. And then further up, at about layer 3, you get somewhat more complex features still, where you've got things that recognise simple parts of objects. This gives you an idea of roughly how convolutional networks fit together: they operate as feature detectors, where each layer builds on the previous one, picking up ever more complex features. Now I'll move on to Lasagne. Specifying your network using mathematical expressions is really powerful, but it's quite low level; if you had to write out your neural network as mathematical and numpy expressions each time, it would get a bit painful. Lasagne is a library built on top of Theano that makes it nicer to build networks. Rather than just letting you specify mathematical expressions, its API lets you construct the layers of the network, and then get the expressions for its output or its loss. 
It's quite a thin layer on top of Theano, but the cool thing about having one of these mathematical expression compilers underneath is that if you want to come up with some crazy new loss function, or do something new and inventive, whatever it is you like, you can just write out the maths and let Theano take care of figuring out how to run it using NVIDIA's CUDA, so you don't have to worry about that yourself. It's quite easy to get going, you just do it all in Python, and it's great. That's why I happen to like it. Once again, slides are available if you want to dive into more detail. As for how to build and train neural networks, I think we'll start with a bit about architecture. If you want a neural network that's going to work well, I'm going to try and give you some rough ideas of what you want to use, and where, in order to get something that's going to give you good results. The early part of the network, just after your input layer, is going to be blocks consisting of some number of convolutional layers, two, three, four of them, followed by a max pooling layer that effectively down-samples (alternatively, you could also use striding). Then you have another block the same. You'll note the notation that's quite common in the academic literature: you specify the number of filters, the number of kernels, and then the 3 specifies the size. So often you use quite small filters, only 3x3 kernels. 
MP2 means max pooling that down-samples by a factor of two, and note that after the down-sampling you double the number of filters. Then finally, at the end, after your blocks of convolutional and max pooling layers, you have the fully connected, or dense, layers. Typically, if you've got quite a large resolution coming out of the convolutional part, you'll want to work out what the dimensionality is at that point and then roughly maintain it, or reduce it a bit, in your fully connected layers. You could have two or three fully connected layers if you like, and then finally your output. As for the notation for fully connected layers, FC256 just means 256 units. Okay, so overall, as discussed previously: your convolutional layers detect features in various locations throughout the image, your fully connected layers pull all that information together, and finally you produce the output. There are also some architectures you could look at for inspiration, the Inception networks by Google or ResNets, if you want to see what some other people have been up to. On to slightly more complex topics: batch normalisation. It's recommended in most cases; it makes things better, and it's necessary for deep networks. By the way, I should tell you: in deep learning, a deep neural network is simply a network of roughly more than four layers. That's all it is; that's what makes them deep. And if you want particularly deep networks of more than eight layers, you'll want batch normalisation, otherwise they just won't train very well. It can also speed up training, because your cost drops faster per epoch, though each epoch can take longer to run, and you can reach lower error rates as well. The reason it's good: sometimes you've got to think about the magnitude of the numbers. You might start out with numbers of a certain magnitude in your input layer, but that magnitude might be increased or decreased 
by multiplying by the weights to get to the next layer, and if you stack a lot of layers on top of each other, you can find that the magnitude of your values either exponentially increases or exponentially shrinks towards zero. Either one of those is bad; it screws the training up completely. So batch normalisation standardises the values, subtracting the mean and dividing by the standard deviation after each layer. You want to insert it, in fully connected layers, after the matrix multiplication but before adding the bias and before the nonlinearity. The nice thing is that Lasagne does that with a single call, without you having to do too much surgery on the neural network yourself. Dropout: it's pretty much necessary for training. You use it at training time, but you don't use it at prediction and test time, when you want to run a sample through. It combats overfitting. Overfitting is a particularly horrific problem in machine learning; it's going to bite you all the time. It's what you get when you train your model on your training data and it gets very, very good at the samples that are in your training set, but when you show it a new example that it's never seen before, it just dies, it fails completely. Essentially, it gets particularly good at those examples, picks out features of those particular training samples, and fails to generalise. Dropout combats this. What you do is randomly choose units in a layer and multiply a random subset of them, usually around half, by zero, and you keep the magnitude of the output the same by scaling it up by a factor of two. Then during test and prediction you just run as normal, with the dropout turned off. You apply it after the fully connected layers, normally; you can do it after the convolutional layers as well, but the fully connected layers towards the end are normally where you apply it. 
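Here's a minimal numpy sketch of that training-time dropout (illustrative only, my own names): drop a random half of the units, and scale the survivors by two so the expected magnitude stays the same. At prediction time you'd simply skip this function:

```python
import numpy as np

def dropout(activations, rng, drop_prob=0.5):
    # Zero a random subset of units; scale the rest by 1/(1 - drop_prob)
    # so the expected output magnitude is unchanged (2x for drop_prob=0.5).
    mask = rng.uniform(size=activations.shape) >= drop_prob
    return activations * mask / (1.0 - drop_prob)

rng = np.random.RandomState(0)
acts = np.ones(1000)          # pretend layer activations
dropped = dropout(acts, rng)  # roughly half zeros, the rest scaled to 2.0
```

Each training step draws a fresh random mask, which is what forces the units to learn features that don't depend on any particular subset of their neighbours surviving.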
That's how you do it in Lasagne. And to show you what it actually does: this is with the dropout turned off, so you see all the outputs going through. Those little diamonds represent our dropout; we pick half of them and turn them off, and you see the grey weight lines. What that effectively means is that during training, backpropagation won't affect those weights, because the dropout kills them off; and the next time around, you turn off a different subset of them. The reason it works is that it causes the units to learn a more robust set of features, rather than learning to co-adapt and develop features that are a bit too specific to particular units. So that's roughly how it combats overfitting. Dataset augmentation: because training neural networks is notoriously data-hungry, you want to reduce your overfitting and you need to enlarge your training set, and you can do that by artificially modifying your existing training set, taking a sample, modifying it somehow, and adding the modified version to the training set. For images, you take the image and shift it over by a certain amount, or up and down a bit, rotate it a bit, scale it a little, horizontally flip it. Be careful with that last one: if you've got images of people and you vertically flip them so they're upside down, that will just screw up your training set. So when you're doing dataset augmentation, think about your dataset and what it should output, and think about whether your transformations are a good idea. Okay, and finally, dataset standardisation. Neural networks train more effectively when your dataset has a mean of zero and unit variance, a standard deviation of one. So you want to standardise your input data, and with regression you want to standardise the output too; remember that in regression we're quantifying something, so we're producing real-valued outputs. 
You want to make sure those are standardised as well; I've personally been bitten when I haven't done that. But when you use your network, when you deploy it, don't forget to do the reverse of the standardisation, to get the output back into the scale and range that you want it in in the first place. To do the standardisation, you extract all the samples into an array; in the case of images, you go through all the images, extract all the pixels and splat them out into a big long array, keeping the RGB channels separate, and you compute the mean and standard deviation in red, green and blue. Then you zero the mean by subtracting it, and divide by the standard deviation, and that's standardisation. Okay: when training goes wrong, as it often will. As you train, you want to keep an eye on the value of your loss function; when it goes crazy and starts heading towards 10 to the 10, and eventually goes NaN, everything's gone to hell. So you've got to track your loss as you train your network. If you have an error rate equivalent to a random guess, like flipping a coin, it's not damn well learning anything; sometimes there just isn't enough data for it to pick up the patterns. It can also learn to predict a constant value. Let's say, for instance, that you have a data set divided into ten classes, but the last class only has about 0.5% of the examples. One of the ways the sneaky, horrid little neural network will figure out to cheat you is to simply never predict that last class, because then it's only going to be wrong in 0.5% of the cases, and that's actually a pretty good way of getting the loss down to a pretty low value, by concentrating on all the other classes and getting those 
right. And the problem is it's a local minimum; you can think of it as a local minimum of your cost function, and neural networks get stuck in those a lot, and it will be the bane of your existence. Most often they don't learn what you expect them to, or what you want them to. You'll look at it and think, as a human, I know the result is this, and the neural network will learn to pick up features and detect something quite different. So yeah, welcome to the bane of your existence. And I'm going to illustrate this with a really nice, cool example that is available online.

I'm going to talk about how you design a computer vision pipeline using neural networks. With a simple problem like handwritten digits, you could just throw it at one neural network and it'll do it, great, wonderful. For complex problems, one is often just not enough, and neural networks are not a silver bullet, so please don't believe all the hype that's around deep learning right now. It's theoretically possible to use a single neural network for a complex problem if you have enough training data, which is often an impractical amount. So for more complex problems, you've got to break the problem down into smaller steps, and I'm going to talk a bit about Felix Lau's second place solution to the competition to identify right whales. His first naive solution was to train a classifier to identify individuals. So I'm going to pull up his website... okay, cool. So effectively, these patterns on the head of the whale are what you use to identify an individual, and the challenge is, given an image of a whale, to figure out which individual it is. And this is the kind of image you get in the training set: you've got the ocean surrounding a little whale as he breaches, as he pokes his head over the surface, and you've got to figure out who he is from that picture. So Felix's first solution was just to stick that through a classifier and see what happens. So let me scroll and find... okay, baseline naive
approach, here we go. And what he found out is that it gave no better than random chance. So what he then did is he used what's called saliency detection, where he used a trick to figure out which parts of the image are influencing the network's output the most, and he found out that actually bits of the ocean were affecting it. Why would it do that? Okay, try a thought experiment. I want you to imagine that I give you this problem: you've got a bunch of images of right whales, and I say that's number one, that's number seven, number 13. But you've also been given really, really horrendous, horrible amnesia that has completely wiped from your mind the concept of what a whale is, what the ocean is, just about every human concept you have. So you are literally starting out with images and zero knowledge, no semantic knowledge about the problem at all. You can't even guess what it is; you're just given images, given numbers, and then told: from this training set, figure out what these are. What part of the image is the ocean, what part is the whale, what part of the image is actually helping you make that decision? And when you think about it from the perspective of a neural network, that's where every neural network is starting out from: it's starting out from zero knowledge, and that's why the initial solution didn't work very well. You could do it if you had a billion images with all the ground truths, if the marine biologists had gone and hand-classified a billion images of whales and put in enormous amounts of human effort; then the signal would eventually come through the noise, but we can't practically do that in real life. So, his solution: I mentioned the region-based saliency, so he found it had locked on to the wrong features. So he trained what's called a localiser. Now, I've told you about classifiers and regressors; localisers, what they do is they look at images and they find: my target, my point of interest, is over there in the image. And so what he did is he got the localiser to take that image of
the whale and find out that the head is there, and after that he ran it through the classifiers. The idea is he first trains a network to look for the whale, pick it out, crop it out from the image, and then just work on that piece. And then, furthermore, he trained a key point finder to find the front of the head and the back of the head, so he could then take the image of the whale and rotate it so that they're all in the same orientation and position. And after that, having got really uniform images of whales, he could then run them through the classifier. And eventually, training the classifier on oriented and cropped whale head images got him second place in the Kaggle competition. So I think that's kind of a nice illustration of how you've got to be careful how you use these things.

All right, how am I doing for time? Great, okay, might even have a bit extra to go through a few extra things; never know, we'll see. So, OxfordNet, VGGNet, and transfer learning. Using a pre-trained network is often a good idea. The OxfordNet VGG19 is a 19-layer neural network; it was trained on that big million-image dataset called ImageNet, and the great thing is they have generously made the neural network weights file available under a Creative Commons licence, with CC attribution, and you can get it there. There's also a Python pickled version that you can grab hold of as well. They're very simple and effective models: they consist of 3x3 convolutions, max pooling and fully connected layers; that's the architecture. And if you want to classify an image with VGG19, I'll show you an IPython notebook that will do that. So we're going to take an image to classify, which is our little peacock here, and we load in our network... oh sorry, beg your pardon... cool. So we've got our little peacock that we're going to classify; we're going to load in our pre-trained network. I think I'd better skip over the code a bit, this is going to be a bit too dull, but effectively you can go through the notebook yourself, it's on the GitHub. I hope
you don't mind if I spin through this quite quickly. So we're just going through a bit about what the model is like. Okay, so this is where we actually build our architecture, so you can see the input layer, we've got our convolutional layers, max pooling, this is all the Lasagne API, we'll skip all this. Finally we've got our output, which has a softmax non-linearity. There you go, we're going to drop all our parameters in... beg your pardon... okay, sorry, this is originally from my tutorial. So anyway, finally we show our image that we're going to classify, and we predict our probabilities here, and we notice the output is a vector of a thousand probabilities, and we find out that the predicted class is 84, with probability 98.99%, which is a peacock. And you can run that yourself. So the cool thing is that you can take the pre-trained network and just use it yourself.

And transfer learning is a cool trick, and this is the last trick I want to show you. Training a neural network from scratch is notoriously data hungry: you need a ton of training data, and preparing all that is time consuming and expensive. But what if we don't have enough training data to get good results, and we don't have the money to prepare it? Well, the ImageNet dataset is really huge, millions of images with ground truths, and what if we could somehow use it? What if we could somehow use the ImageNet dataset, with all this vast data, to help us with a different task? Well, the good news is we can, and the trick is this: rather than trying to reuse the data, you train a neural network like VGG19, or you download VGG19, and you're going to keep part of that network, throw away the end part of it, and stick some new stuff on the end that will output what we want. And that way, you effectively train just the bit that you've added, and then fine tune at the end; I'll go over that. But essentially, what you can do is reuse part of VGG19 to, say, classify images that weren't in ImageNet, for classes and different kinds of
object category that weren't mentioned in ImageNet. So you can reuse it. You can reuse it for localisation, where you want to find the location of an object, the location of that whale head maybe, or segmentation, where you want to find the exact outline of the boundary. And to do transfer learning, what we do is we're going to take VGG19, which looks like that, those are all our layers, and we're going to chop off those last three (the stuff on the left just gets hidden so we can show some text), we chop off those last three layers, and then we create our new ones, freshly initialised, on the end. Then what we do is train the network with your training data, but you're only going to learn the parameters, only train the parameters, on the new layers that you've created. And then you fine tune it, where you train the parameters on all the layers: having trained those initial new ones, you then fine tune the whole lot, you just do training, this time updating the parameters of all layers, and this will get you some better accuracy. And the result is a nice shiny new network with good performance on your particular target domain, which is going to be somewhat better than you could get starting out with just your own dataset.

Okay, so finally, some cool work in the field that might be of interest to you. Zeiler, I think I mentioned this briefly already, but in Visualizing and Understanding Convolutional Networks they decided to visualise the responses of the convolutional layers to various inputs. So you've seen these images where they decided to visualise what's going on; if ever you want to find out what your network's picking up, this is a good place to look for how to work out what your network is detecting. These other guys decided to figure out whether they could fool a neural network, so they decided to generate images that are unrecognisable to humans but recognised by the network. So for instance, the neural network reports a
high confidence that it is in fact a robin. It looks like horrible noise, but it thinks that's a cheetah; that's an armadillo; that's a peacock, I can't see a peacock there. They then went on to say, well, how can we generate images that do sort of make sense to a human? That's a king penguin, that's a starfish, and you can kind of see where it's picking things up: it's looking for texture, but it's not really looking for the actual structure of the object, so it's picking up certain things and ignoring other quite important features.

You can run neural networks in reverse; you can get them to generate images as well as classify them. So these guys decided to make them generate chairs: they give the orientation, the design, the colour, the parameters of the chair, and they train it to generate an image, so you end up with these chairs, and they're even able to morph them. This one got a lot of press: A Neural Algorithm of Artistic Style. If you've got the Prisma app on your phone, you'll know what that's all about. They took OxfordNet, and they extract texture features from one image and apply them to another. So you take a photo of, say, this waterfront, and you take a painting like, say, The Starry Night by Van Gogh, and it repaints the photo in the style of Van Gogh, or in the style of Edvard Munch's The Scream, or any of these others. It's very, very cool, and the nice thing is there are iPhone apps that do this now. And what these guys did, this is a bit of a masterpiece of work: these images of bedrooms are generated by a neural network, and the way they did it is they trained two neural networks, one to be a master forger and the other to be the detective. The master forger tries to generate an image, and the detective tries to tell: is that a real image of a bedroom, or is that one that's been generated by the forger? And the idea is you co-adapt them, to get them both better, so that the master forger gets better and better until it generates pictures like that, which is
kind of cool. And they even took it further, by combining some of the parameters: if you've seen some of the results from the king minus man plus woman equals queen stuff that's been done on some of the word2vec models, they did similar things with facial expressions as well. Anyway, I hope you found this helpful, I hope it's been good. You've been a great audience, thank you very much.

We have about nine minutes now for questions. It was a great talk, thank you. I actually have several questions. The first one: when you are modelling a neural network, how do you choose, or is there a way to choose, how many hidden layers and neurons are in them? Because I know that was an issue for me when I was modelling some. I'm not aware of any particular rule of thumb for how to design your network architecture. The rule of thumb I use is to look at things that have worked for other people and build off that. So, the OxfordNet architecture, where I found, or rather people found, that the small convolutional kernels worked well: a few of those layers followed by max pooling or striding, and those blocks repeated. I think there are some people who have tried things like grid search, where you try to get it to automatically alter the architecture, but given the fact that for something like an ImageNet model your training time can extend even into weeks, at least on really big GPUs, that can be impractical. So I'm afraid to say it's just rule of thumb as far as I know: just try it out and see if it works. I would look up the literature, see what other people have done, and just adapt it. I'm sorry I can't give you more information. Now, my second question: we saw that you guys are analysing images and numbers; is there a way that you can, like, make strings the input and recognise patterns in them? How would you do that, would you have to, like,
transform them somehow? I mean, for text processing, I think what people tend to do is use something like word2vec to convert each word into an embedding, which is like a 200 or 600 element vector, and then use what's called a recurrent neural network, where rather than just having it go straight through to the output, it goes partially through and feeds back into an earlier layer, so then it sort of has an idea of time. I've not implemented those models, I'm afraid; I'm outside my comfort zone there in terms of being able to advise you. But look at recurrent neural networks. They tend to use the word embeddings to convert the words into a vector. The more trivial way of doing that is to turn it into a one-hot representation, where if you've got 2,000 words in your vocabulary, you represent a word with a vector of all zeros except a one for the particular word that it is; but given the sparsity of that input, that often causes problems, which is why they use the embeddings. And the last question, sorry: could you train a neural network to do, like, maths, like addition, maybe multiplication? And if you can, would it maybe be faster than the usual way that processors do it? You can train it to do addition; I think actually there are some people who have taken a handwritten digit dataset, where you take two handwritten digits in an image, and the network figures out what they are and is trained to produce the sum. Multiplication, as far as I know, people haven't figured out how to get a neural network to do, so the models actually can't extend to certain things, which is interesting; there are certain things they just don't do very well. So I think it's quite limited. But as for would it be faster: no way would it be faster, because you're using a hell of a lot of mathematical operations just to do something that is a one-instruction operation on your processor. Sure, thanks for the talk,
really interesting, and great stuff at the end around the images. What are your thoughts on how neural networks could be applied to text analytics? Because most people don't do that. Text analytics is outside my area, so I don't know, but I would speak to Katharine Jarmul, who is here, and she gave a very, very good talk: a really good intro and a really good overview of what the text processing world is like, and she covered quite a few neural networks. Neural networks are some of the best models for it now, but it's outside my area of expertise; she knows her stuff on that, so I'd speak to Katharine Jarmul. Any other question? The name of neural networks comes from the science of the brain; do you know if they're used widely in brain science? Not sure. I think the model that we use for the neural networks I've been talking about here is quite different from how neurons in the brain work. My very, very basic layman's understanding of brain neurons is that they operate on spike rate: they generate output spikes, and it's the frequency of those that is roughly the strength of their output, I think, I don't know. So trying to liken these to one another... I don't think that they're that much alike. Where the similarity is, is that people looked at how neurons in the brain all hook up to each other, and they said, how can we make some models like this? But what we've got is something that seems to work well given our processors and seems to produce very good pattern recognition; as for similarities to the brain beyond that, I don't feel comfortable saying any more. Any other questions? Okay: have you heard of self-driving cars using deep learning to implement how they drive? I wonder how they would update the cost function, because it's a stream of video rather than a fixed static output. I've heard about it; I'm not sure how the hell they're doing it, I don't know. Hmm, I suppose if
you were to try and do something like that, one of the things you could do is prepare a bunch of footage where you say the human who's driving this car has done well, as in they haven't crashed it or killed anyone or done something like that, so the idea is that all of that's good; and maybe, if there's some footage of some accidents, you say that's bad, don't do that. Or, what you probably want to do is say: given this video, produce these outputs, as in steer like this, accelerate and brake like this, produce these decisions. So that's actually a little bit like the Atari game-playing neural networks that Google developed, the stuff where they got the really good scores on the video games, where they take the screen as input and decide whether to move up, down, left, right, or shoot. It's a similar thing, where instead of deciding whether to move up, down, left, right and shoot, you're controlling the steering wheel, the accelerator and the brakes. You could do it like that, but given my experience of it, and given the fact that, as I mentioned, if you have particularly rare examples, rare situations, quite often the neural network will just cheat: because they might make up only 0.0001% of your training set, it'll never actually bother to learn anything from them; the cost function will discover a local minimum that ignores them. I would not be very comfortable getting into a car that was controlled just by a neural network; I would not want to put my life in the hands of a vehicle like that. But that might be how you could build it, though I don't think it would be very good. Any other questions? Hi. Do you ever combine neural networks with other techniques, like approximation algorithms?
Approximation algorithms? Yeah, like optimisation techniques; I was thinking about the travelling salesman problem, for example. Um, I don't know, I haven't tried them for that. I wouldn't be surprised if someone has tried it, but I'm not aware, because I've not looked at it, I'm afraid. That's a difficult one; I'm afraid I don't know, I'm sorry. You'd have to figure out a way of coming up with some kind of cost function that measures how good its solution is, and I wouldn't know how to go about doing that for certain problems. I have time for one last question. Maybe it's kind of a technical question about Theano, but when you apply dropout, does the expression get recompiled, reoptimised to be efficient so as not to compute those dropped weights, or do the floating point operations still go to the GPU or CPU, but they're zero, so they don't affect the gradient? I think it's the second, because what you do is get a random number generator to generate either a zero or a one, and then you put that multiply into the expression. So I think it's not actually optimising; I think it'd be quite difficult to optimise, because the problem is that for every single sample in the mini-batch you're actually blocking out a different subset of the units, so I'm not even sure how one would go about optimising that in an efficient way: you've got to almost select which units you're dropping out, and then from that decide what operations you can save, and you've got to do that on the fly, and I think that'd be quite tough. So I would guess that it doesn't. Since there are no other questions, I'll thank Jeff for his wonderful talk, and say: enjoy lunch.
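(An illustrative footnote to that last answer: conceptually, dropout really is just an elementwise multiply by a freshly drawn 0/1 mask. The sketch below is plain NumPy, not actual Theano code; the drop probability is illustrative, and the 1/(1-p) rescaling is the usual "inverted dropout" convention, which the answer above doesn't mention.)

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                          # drop probability (illustrative)

x = rng.standard_normal(8)       # activations of one layer for one sample
mask = rng.random(8) >= p        # fresh 0/1 mask, redrawn every time
y = x * mask / (1.0 - p)         # "inverted" dropout: rescale the kept units

# Dropped units contribute exactly zero to y, so any gradient flowing back
# through them is zero too: backprop leaves their incoming weights untouched
# on this step, which is the greyed-out-weights behaviour described earlier.
```

Because the mask differs per sample and per mini-batch, the multiply is done every time rather than being compiled away, exactly as guessed in the answer above.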