Anyone? Anyone. It's a small size. Yes, yes. Do you need one of these? Sorry? You've got it. So I guess it's 10.02. What you should have, if you go into the presentations folder and look at the index file, is something which looks like this, which says copy the USB key and pass it on. Has anyone not got a copy of what's on the USB? Is that the USB key? Have you got a copy of what's on the USB key? Oh yeah, yeah. And you see the instructions file? Oh yeah. Okay, okay. What is it for? It's for the deep learning workshop. Which you're at. On a rainy Saturday morning. Okay. So in this, if you just do a right arrow you'll have some of these things. Basically there's the main presentation. You don't actually need to follow along in that at all. There are links to something which I'm calling cat painting. There's another to TensorFlow, which is a playground application. So these first two examples here are JavaScript, which you're going to run without needing the virtual machine at all. Then there's this whole VirtualBox thing, which is for the real meat of it. I think this is just me. So this morning's going to be broken into two halves. The first is from 10 to 11, which is labelled core by the organisers, so it's going to be less hard than the hardcore section which comes after the break, which is from 11.15 to 12.30-ish. We'll see whether there are people hanging around. I don't know. I have to leave by three. We'll get checked out at some point. So, this presentation. This is a deep learning workshop. It's labelled hardcore. This is kind of to dissuade people who didn't have a laptop and couldn't install the VirtualBox thing. This is not for them. So just a quick bit about me. I moved to Singapore in September 2013. 2014 was basically just me having fun. I had zero clients. I had a home office and I just did machine learning the whole time. I guess the reason I did that is back in the early 90s I did a PhD in the UK, in machine learning.
So after that, though, I went into the kind of quant finance thing. Moved to New York. Lived there for a long time. But now I've decided I should do what I actually want to do, which is machine learning. So I had a lot of fun during 2014. 2015 has turned into, like, serious mode. I actually have a proper client, a local Singaporean company, and I do natural language processing with them, which involves some deep learning. A whole bunch of cool stuff. And it was a good switch. Okay. So in this first hour, which is going to be simpler, I'm going to go over some of the history. I'm going to go very quickly. I'm going to go over what's doable. Then we move on to understanding, on a rudimentary level, the kind of mathematics which we're talking about. Then having a look at what a neural network does. Then playing around with other networks. And then I'll actually show a sample from the appliance thing. So, rather than sneaking in and avoiding it, how do you get one of these? I doubt it. And then I'll also show you other stuff which is on the virtual machine, because there's actually quite a lot of content. There's no way in which we can cover all of this. Next week I've actually been invited to India to give a six-hour thing, so maybe I can cover more of it there, but that's full already. So in the second part, there are going to be two essentially kind of modules which people can take more seriously. Please pick up the USB key at the front as you're leaving. And so that will be much more hands-on. Unless people really want me to, I'll try and keep my talking to a minimum so you can actually play with it yourselves. Okay, so let's get going. Deep learning. What we're talking about is neural networks with multiple layers, regularly fed with lots of data. History-wise, this has been around for a long, long time. In the 80s people thought this was kind of one of the answers to the whole brain thing.
And thought that this was going to work and we were going to map the brain and everything was going to have AI any time now. What happened in the mid-90s is that people discovered that it wasn't quite so easy as all that, and they had problems solving very, very simple toy examples, and an AI winter set in. Going forward a decade, there was a big improvement. People suddenly realized that if you had deeper networks and bigger networks, things started to work again. And after another five years people discovered you could train these things on GPUs. So GPUs are obviously for graphics and for gamers, which basically means processing a lot of polygons, which is essentially matrix math. And people discovered that NVIDIA had this whole environment where you can actually program these, and now people have built layers upon layers of programming environment, so the GPUs are what powers this stuff. Now in your own machine you might have a GPU — I see a lot of Apples, which is fine — but the VM doesn't require it, and it's all timed so that these things take about five minutes to train at most. But trust me, the GPU would be useful if you're doing this to any large extent. So the list of kind of who's involved is a list of who stuck with it from the mid 90s. I did not stick through the mid 90s; I took a vacation in finance. And the interesting thing is that if I'd started training one of the modern networks back in the mid 90s, it wouldn't have finished yet, because the computer hardware has advanced so quickly it was actually worth taking a 20-year holiday in finance. Come back, and now I can afford a GPU card which will run it overnight. So here is a list of some of the key players. Hinton's a huge name from way back. He was taken up by Google. LeCun is the inventor of CNNs; he's at Facebook. Andrew Ng, who's popular from the Coursera course — he's at Baidu, doing a lot of their natural language processing.
Apple has been acquiring things. There's kind of one stand-out from the original era, called Bengio, who's still at Montreal and has a large group there doing excellent university-style work. On the other hand, a lot of these companies are publishing. Google is one of the main people publishing this stuff. They've been very, very open about this and they've been producing excellent software. More about that later. So here's a quick overview of what deep learning can do now, in areas such as speech recognition, language translation, captioning, reinforcement learning. This stuff — particularly the speech recognition and language translation — had come to kind of an asymptote in terms of the training, where people were making small increments on various scores, of like a percent a year or something. And then the deep learning guys came along and essentially threw away all of the elegance that people had been developing in, say, linguistics, and just trained the thing off tons of data, and suddenly scores went up massively. And this is one of the reasons why people are so interested in it: its effectiveness. So on the speech recognition front, since Jelly Bean on an Android phone, they've been sending your data up to the cloud to actually do your speech recognition. However, since Lollipop, which is 2014, it's actually taking place on your phone. So if you've got an Android phone, for sure there is a deep neural network inside. (All these latecomers are going to want one of these USB keys, and not avoid looking at me. You will soon find out you need it.) My guess is that Apple also has this stuff; they are very much less open about it. Then there's translation. So Google have got deep models which are on the phone, I believe, whereby they can look at images and translate on the fly, into the picture. Now, coming from, say, the 90s, this is insane technology, which is now doable. So it's incredible that it actually picks up the fonts.
This is crazy stuff. 26 languages — insane. House numbers. So this is what, when you're doing reCAPTCHAs, you've been giving Google: training data for their house number recognition algorithm. So typically in a captcha you'll have two boxes, one of which will be a fairly recognizable number and another one will be a rather less recognizable number. And what they will do is they will say: well, we'll detect whether you're human, and we'll take your response on the one we don't know, and then you're providing validation data the whole time. Now if they're not sure of something, they'll just ask another 10 people, because Google can do this at Google scale. So they have a huge, huge training set of this. The interesting thing is that they can measure how good humans are at it, because they can cross-validate. They're better than humans at this now. So this is used for street maps. (You're going to want a USB key. It's a very interesting psychological experiment. What I found in Singapore is I will get all the keys back, which is super, right? And everyone will refuse them as well.) Now one of the other things which has spurred innovation is there's a big competition called ImageNet where — and I'll talk more about this in a bit — they have huge numbers of images which people are trying to classify into a thousand classes. And because there's this competition every year, people can gauge how well everyone is doing. And big companies, because they want to attract good people, pour lots of development time into this, or lots of GPU time. And you can check — there's a guy called Karpathy who actually verified for himself how good he was at classifying these images. And the neural network stuff is better than humans now — or better than Karpathy. It's kind of interesting. So this ImageNet actually has a lot of classes across a whole spectrum; we've got mites and container ships and motor scooters.
But it's also got very detailed stuff about dogs. It doesn't have much about cats. It has lots of dogs and lots of flowers, I think. So they're either going for a broad spectrum of things or some very narrow distinctions, and where humans miss out is in distinguishing between the dogs, because it's very difficult to distinguish between these things. The networks, though, tend to get the hang of it. Captioning. So this is something which is done by Google: if you just have random images which you've snapped at a party — or if I took pictures here — it would actually be able to put labels on them, because they have Google-scale data of labels and pictures. They can then say: well, what label would someone generate for this? And I'll describe how this is done in the second half. Some of these are pretty good. Some of these are not very good. You can have a look at exactly what's going on — you can see them in more detail in the presentation. Here we've got "a person riding a motorcycle on a dirt road", which is excellent, or "a herd of elephants walking across a grass field". Next one across we've got "two dogs playing in the grass" — now, there are three dogs playing in the grass, so they're not great. Or "a close-up of a cat lying on a couch" — this is a cat actually on a bed. "A skateboarder does a trick on a ramp" — it's like this. And then on the last column, this one in particular is pretty nasty: "a refrigerator filled with lots of food and drinks". So some are not so good. And particularly the road sign one is a horrendously bad mistake — or could be a horrendously bad mistake. So, something else which Google has been playing around with: they bought a company called DeepMind in the UK. They want to be able to play games.
Now, the reason that they're particularly interested in games is because of advertising, where some of the mechanics are kind of like a game: they're playing to maximise their value by showing new stuff, and they know what your moves were after seeing previous stuff, and so it's a question of optimising what they can do next to you. Obviously they don't want to... if they give you the same advert again and again, they know a certain percentage of people with certain demographics will eventually click on the advert; other people will be put off if they start doing that. So there's a game which they're playing. Also they want to test out new adverts on you all the time, so it may be that if you're not clicking one advert they show continuously, they'll try another one on the same topic, or they'll say, okay, we'll give you a SurveyMonkey, or we'll do various other things to you. Anyway, the nice papers that they've produced have been about Atari — there's a paper where they play Atari games. So the Atari console is an old console, from the 70s or 80s, where you have a cartridge which you plug in and you have this little controller, and it has this below-VGA kind of resolution when it comes onto your television. But because it's so old and it's all downloadable, you can run these machines in emulators much faster than real time, and there are whole open source packages to do all this, and you can then essentially embed a game which is playable. (Anyone who hasn't had a USB key should find one and not ignore it.) So they can learn to play these games essentially by looking at the screen and playing with the paddle, and all that they have is the pixels coming out of the screen and the paddle. So there are no instructions provided apart from "your score is X", and the rule that they're told is: play however you can, looking at these pixels, and improve the score.
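To make the "all you get is the state and the score" idea concrete, here is a minimal editorial sketch — not DeepMind's code. It uses a tabular Q-learner on a hypothetical five-state corridor in place of a deep network over Atari pixels; the agent never sees the rules of `step`, only the resulting state and score change.

```python
import random

N_STATES = 5
ACTIONS = [0, 1]  # 0 = left, 1 = right

def step(state, action):
    """The 'emulator': the agent never sees these rules, only the result."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0  # score only at the far end
    return nxt, reward

def train(episodes=2000, steps=20, alpha=0.5, gamma=0.9, eps=0.3):
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = random.randrange(N_STATES)
        for _ in range(steps):
            # Mostly exploit the best-known action, sometimes explore.
            a = random.choice(ACTIONS) if random.random() < eps else \
                (1 if q[s][1] >= q[s][0] else 0)
            nxt, r = step(s, a)
            # Q-learning update: nudge toward reward + discounted future value.
            q[s][a] += alpha * (r + gamma * max(q[nxt]) - q[s][a])
            s = nxt
    return q

random.seed(0)
Q = train()
# After training, the learned values prefer "right" in the interior states.
```

The Atari work replaces the Q-table with a deep network over raw pixels, but the "improve the score with no other instructions" loop is the same shape.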
And from this it learns to play more than half of these Atari games better than humans can, in like two hours. So this is very, very interesting behavior. For instance, when it's learning Space Invaders — these are tiny pictures; this one looks like a brain but that's just their visualization — this is just showing different styles of play. So if you're playing Space Invaders here, this is like a beginning game: they're taking pot shots. Over here you can see how they're clearing out some column — though this is a very unfortunate Space Invaders setup, because you're doing this. Over here, this is how to shoot the last guy. So it does all of this on its own, without understanding the rules, only by pressing buttons. (So while you're running in front you should pick up a USB key.) So this is very interesting, that it can do this. Not that anyone needs a Space-Invaders-playing machine, but the fact that this is learnable — see, games are kind of different from recognizing images of house numbers, because with games you don't know what you're looking for. You know that sometimes the score will increase, but you don't know why it increased. So maybe, for instance, when you're shooting the last guy in Space Invaders you have to time it, so that when you get the score, what you should have done happened ages ago. So the training is very much more — well, it turns out not to be tricky at all, and that's kind of one of the crazily good things about this. And then the obvious latest thing, this year, is that they did the same with Go. Now, people looking at the rate of progress had decided that Go should happen in a decade or two, and suddenly it was doable in like six months. So this was very exciting, and in fact this thing has been all around the papers — and the mechanics of this are very similar to one of the things on your machines right now.
So in some ways the deep learning thing is surprisingly shallow: very few concepts will get you to doing what everyone else is doing, and part of the reason for the rate of advance is not so much that some people understand it really deeply and go really deep into it; it's that lots and lots of people are trying lots of crazy stuff, and it's suddenly working much better. On the other hand, there's this AI effect, and the AI effect is: as soon as you can do it — oh yeah, of course my phone understands me, that's obvious, you know; of course, why wouldn't a machine be able to drive a car, it's just an engineering problem now — which is kind of true, but for the AI people it's kind of annoying, right? But, you know, in a way it's all good. So here we're going for some light mathematical content, a very light brush across how this stuff works, from basics. This is the same stuff as was originally done in the 80s and 90s — nothing new; things have actually got simpler rather than more complicated. By combining very simple mathematical operations you build up to something much more complicated, like playing Go or recognising images. So here is the typical single neuron, and this is essentially what all of these things are made of.
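A single neuron is just a weighted sum of its inputs passed through a simple non-linearity, and a multi-layer network is that stacked. As an editorial sketch in Python (NumPy assumed; the weights are hypothetical toy numbers, just to show the shapes involved):

```python
import numpy as np

def relu(z):
    # The non-linearity: just keep the positive part, max(0, z).
    return np.maximum(0.0, z)

def neuron(x, w, b=0.0):
    # One neuron: multiply inputs by weights, sum, then the non-linearity.
    return relu(np.dot(w, x) + b)

def mlp(x, layers):
    # A multi-layer network is the same thing stacked: each layer is a
    # matrix multiply followed by the ReLU.
    for W, b in layers:
        x = relu(W @ x + b)
    return x

# Hypothetical toy numbers.
x = np.array([1.0, -2.0, 0.5])
w = np.array([0.4, 0.1, -0.2])
y = neuron(x, w)  # 0.4*1.0 + 0.1*(-2.0) + (-0.2)*0.5 = 0.1
```

Note that the whole layer is one matrix multiply plus an elementwise max — exactly the operations GPUs are good at.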
So at the bottom you will have some input, which will be a whole bunch of real numbers, or some features which are just numbers. Then you'll have some weights: you'll multiply these numbers by some weights individually, and then in the middle you'll sum them up. Then after that you'll apply a non-linearity. In the 80s and 90s people spent a lot of time figuring out the best non-linearity; it turned out to be irrelevant, basically, because you can just take the max of zero and the number — you just take the positive part. This is called a ReLU, for rectified linear unit, and this works, and it suddenly makes this very, very easy to do on a GPU: what you're doing is actually a matrix multiply, and the non-linearity is just a max. (Please take a USB key.) Okay, so this, if we had lots and lots of them, turns into a multi-layer neural network, where essentially all your inputs might be connected to all of your next layer, those then connected to all of the next layer, and so on to your outputs. So here we'd have essentially three matrix multiplies, and some of this non-linearity at every stage. Now the trick is: how do we train this? Because if we go back to this thing — if I have some outputs and some inputs, it's basically like a linear regression, or something very similar. In this situation I don't know what the hidden units should be, because in order to get from my input to my output, the hidden units will have some values, some intermediate values, and I have no way of assigning them. Now one of the clearest ways of seeing this is: if I permuted the hidden units, the outputs of the network wouldn't change — the actual function would be the same — but the hidden unit representation would obviously be completely different. So the question is how you make these things deeper, and this tied people up to a large extent before the AI winter. So this should be familiar to everyone here: supervised learning. We pick a training case — essentially we have an input, we know the output. We then feed the input through to find
out what we actually get, and then we jiggle around all the little weights until the actual output is close to the target. And how do we jiggle them around? We use gradient descent. So gradient descent, basically: you have a loss function, which is how bad your solution is, and you compute the derivative with respect to every weight in your network — in this case we have 3 times 5, which is 15 weights, so we have 36 weights here in total, plus maybe some bias units. We would then jiggle each of these 36 weights so that at each stage my output becomes a little bit more like what I intended it to be. So if I have an image recognition thing saying "is this picture a cat", and my network says dog, I will change every little weight in this image recognition thing to be more cat-like, so the output switches from being dog to cat. And the crazy thing is, if you give it enough images, that will do the whole trick. So, gradient descent. Let's train a proper network — since everyone has installed the USB key, what we will do is play with layers of different widths, layers of different depths, and this stochastic gradient descent thing. So in your presentation, which you may have open on your screen, there will be this cat painting, which you can click on, and you should get something which looks like this, but it won't have the words "simple network" at the top — the picture of a cat. Has anybody got a picture of a cat showing up on their screen? Hands up anyone who has a picture of a cat. Okay, got a big void over here. Does anyone not understand why they have no picture of a cat, or did they not install anything? I'm just going to leave a couple of moments. Anyone who needs to install, please — it's only going to get worse; you'll realise, oh, I should have done it an hour ago, it's going to really hit me. Okay, so what this page does: this is a JavaScript library called ConvNetJS — a convolutional network library — though this demo is actually a simple multi-layer neural network. The
first one it comes up with is a single layer of some neurons, and if you start this network training, basically what it is doing — you'll see the code for this in the box — is it will start trying to fit, as best it can, stuff onto this cat image. Okay, so here we go. So here what we've got is four neurons in one layer, and it's trying to predict the colour of each pixel in this cat image, and you can see that it's not doing a great job. Here's a clear example: you can see these four neurons are producing four lines, and it's trying as best it can to fit this cat image. If we then try slightly higher — this is a single layer of 12 neurons, and I encourage people to play around with this — you can see that a single layer of 12 neurons is doing a better job, but it can only cut up space with these lines; it doesn't have any understanding of areas or curves or anything like that. It's a very simple thing it's doing. Single layer of 48: you can see that as we increase the number of lines, obviously we're going to get higher and higher recall of the image, but it's definitely overfitting — we have no concept of "cat" in here at all. But if we keep a fixed 48 neurons and have two layers of them, suddenly, where initially we started to see lines, now we start to see it honing in on areas a bit more. Oh, your computer is way faster than mine — I'm not doing anything special on this machine. Okay, so you can see that two layers is quite a lot better. If you go to four layers — so this is still 48 neurons; the number of connections is different — now this takes a while, but gradually this thing will start, and you can see that there are curves in here starting to formulate how to divide up this picture into nonlinear regions. Now, this is JavaScript — there's some matrix math in the JavaScript library, but it's all just available to have a look at; it's very straightforward. You can vary this thing, play around with it, and gradually you'll see this thing
converging on a decent cat image, and if you want to go deeper, it's fine. So what this is really showing — Chuck, I just copied that line multiple times because I wanted to be able to divide the 48 neurons up in lots of different ways; there was no particular reason. I found that a single layer of 12 or a single layer of 48 would work better and better but still demonstrate the point; then let's put it in multiple layers. I'm not talking about any particular heuristic for doing that. And again, this is an issue which many people in the 90s thought was terribly important, in that people wanted to have the maximum expressive power from the minimum size network, and were really concerned about how to make it as compact as possible, which also has a kind of generalisation intuition. (Please, have one of these. Have one.) It has a kind of intuition that you'll be able to generalise better if you have fewer neurons, because you're then tying them to learning as much as possible about the situation. But it turns out really that throwing more neurons at this is better — and then potentially stripping them away — but more data and just bigger seems to work a whole lot better. There are some interesting reasons, and I'll go into that next time. So here's one I made earlier: a simple network, a wider one, a two-ply one, a deep one. So this is what happens if you leave it for some time, and you get a decent picture of a cat. I'm not saying this is a good way of drawing cats by any means; it's just that it has sufficient structure — versus these little linear things, which we saw were terrible if we had the same number of neurons arranged as lines, which would be a very linear kind of model. Once you can learn this internal structure — which you can do on your machine, in JavaScript, like this — it's a step up. So now let's have a look at what's going on inside these networks. So one of the issues when you're learning this
stuff is: what kind of input features can you use, and also what features are being learnt at every level, and then how does the curve of training work — because what we saw before was just it getting better and better with training time, and how this thing looks is actually kind of important. So in your presentations folder there's a TensorFlow thing which you should be able to click through to, which I believe I can click through to. (Does anyone still need to install this stuff?) So what we have here is a Google JavaScript thing which is kind of advertising for the actual TensorFlow library. What we're going to work on later is a thing called Theano, which is a Python library which has GPU and CPU back ends. Obviously, because you've only got CPUs in your machine, or on your virtual machine, it's going to be using those, but with the same code and a command line flag switched it will produce GPU code as best it can. Google hired one of the guys who did Theano from the beginning, and they built this thing called TensorFlow. TensorFlow is a very well engineered Theano, essentially: it's written in C++, and it's got all this Google engineering team behind it. On the other hand, Google's machines for their developers tend to be rather beefy, and the reason that we're not doing TensorFlow here is because I want you to be able to run all these models. Theano is conservative with the memory it requires; TensorFlow will laugh at your 8GB machine — it was not sustainable to try and do that. So these are things that will run easily: if you can load the virtual machine, these things will run, because Theano has the benefit there. Google, on the other hand — they have engineers; they produced this just to get people interested in neural networks. It's a JavaScript library, and it's all open source. So here, what we have on the left-hand side is some input data, and here we've got some very typical toy examples — we've got blue dots inside other colour dots — and then
we have features. Basically, rather than just working on the image like we did with the cat, we work on particular features, which are, say, x going this way, or x going the other way, or the squared terms, or a checkerboard. These are features which, by clicking on them, you can turn on or off — whether this feature is available or not. Then you can change the number of hidden layers that you have — so this will expand this thing here — and you can change the number of neurons in each layer. Here are two output neurons, combined to give essentially the boundary. If you press the big play key — this, hopefully, is what your machine does. So in a sense this is easy, because of my input features: I've got this as an input feature and this as an input feature, and it then uses the hidden units to combine them into these kind of diagonal features, and then these two combine those diagonals into four shapes, and then at the end you get a blob in the middle. So what we could do is turn that off. So this you would think could be sufficient — I think this is sufficient. So you can play around with this; I would recommend playing around with this. You can also see that there's a training curve here, which typically will have kinks in it as it kind of discovers new things; it will gradually work its way down. This one is fairly obvious, except it doesn't have the same symmetry, I guess. So here's an example where it's not getting the hang of it at all — there are too few of my...
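Why the choice of input features matters can be shown in a few lines. This is an editorial sketch, not the playground's code: a hypothetical version of its "blue dots inside other dots" data, where a single linear neuron fails on the raw coordinates but succeeds perfectly once you hand it the squared features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the playground's circle data (hypothetical sizes):
# label 1 inside a circle of radius 1, label 0 outside.
X = rng.uniform(-2.0, 2.0, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(float)

def accuracy(features, w, b):
    # A single linear neuron: predict 1 when w . features + b > 0.
    pred = (features @ np.asarray(w) + b > 0).astype(float)
    return float((pred == y).mean())

# With raw x1, x2 a single line cannot wrap around the circle.
raw_acc = accuracy(X, [-1.0, -1.0], 1.0)

# But in the squared features x1^2, x2^2 the same rule IS a line:
# x1^2 + x2^2 < 1 is exactly 1 - x1^2 - x2^2 > 0.
sq = np.stack([X[:, 0] ** 2, X[:, 1] ** 2], axis=1)
sq_acc = accuracy(sq, [-1.0, -1.0], 1.0)  # 1.0: perfectly separable
```

Without the squared features, the network has to invent an equivalent of them in its hidden layers — which is exactly what you watch it do in the playground.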
OK, of course — so here the features I've got are symmetric in various ways which the sample data is not, so these features are insufficient to be able to figure it out. But if I want to do asymmetry — well, let's just get this one; I'll just give the game away. Here is another classic: the spiral problem. So in a way these toy examples are one of the reasons there was an AI winter, in that people spent a long, long time trying to find minimal ways to do these toy problems, and any time someone solved it, people would point out, well, you helped your network too much by giving it the raw data it needed — which is kind of true, but on the other hand very unhelpful, because it prevented people from moving on to better problems. So if we turn on more stuff here, we can do a better job, or maybe we need some more hidden elements. So this may need some playing around with until it figures it out. Ohhh, come on. Anyway — if you have a machine, which we strongly recommend — has everyone got this? Can everyone see this? Because we're going to move on. This looks like a fail, but one thing I found by playing with this is that sometimes you'll get a training run which just doesn't get there, and other times it will find it out just nicely, and this is the problem of being caught in a local minimum. Now, to your question about sizing: these local minima used to be a big problem, in that people would say, well, you know, you'll get caught in local minima, let's count the number of local minima, let's figure them out. The thing is, if you add a few more neurons, you find a way around these local minima heuristically — you just get more chance of not having that problem. And so in a way people's insistence on trying to make things as compact and efficient as possible was giving rise to a whole bunch of problems. With these big networks it doesn't happen, because you just throw huge amounts of data and huge amounts of processing power at it. Now, people were also worried about processing
power in the early days, but the reality is your brain's processing is cheap — you've got huge, huge numbers of neurons in there; processing is cheap. So in a way we should be thinking about: if the cost were free, what algorithm should we be using? Because the brain can do this with very, very little energy. So there's a whole different set of arguments about what you should be focusing on, rather than trying to make the network small. Okay. So, things to do if you want to play around with this: try and find minimal sets of features, minimal layers, minimal widths — and then find the disadvantages of going minimal. Okay, so now we're going on to something slightly bigger. This is the ImageNet competition, which I mentioned before. It consists of 15 million higher-resolution images and 22,000 different categories. So basically they've taken something called WordNet, which is a kind of ontology of the English language, and they've picked classes from that. Basically they collect huge numbers of images, and then they say: well, let's get humans to label these in terms of these classes. So they find ones which are labeled with a class, they then check with Mechanical Turk workers, and once they've verified cheaply that this is an image of a hamburger or a hot dog, then they know what class to give it. So this then turns into a huge competition. It's a downloadable data set for training, but they also have a server for the testing, and you can submit, I think, 20 things a day to test your algorithm — because they also don't want you learning the test data. Now, companies are very interested in doing this, because it's a big thing to win the ImageNet competition. Baidu was caught cheating by submitting huge, huge numbers of requests, and they were banned for a while. So the stakes are high, in some sense, even though it's looking at pictures of hot dogs. Okay, so we've
got consommé, hot dog, cheeseburger, plate. It's probably clear this one's a hot dog; sorry, the images are all pretty small, and it's not that you could do a much better job if you could see the image at higher resolution, these really are small images. So I can see that this is a hot dog with mustard, but you can imagine it might be labelled as other things, and some of these plates are not at all clear: why am I going to call this a hot dog when it's mostly a huge plate? So some of these labelings are not so clear-cut. And so there are essentially two competitions inside this one competition: one is top-1, can you get your top label to be the right label, and the other is top-5, is what they consider the right label somewhere in your top five. Top-5 is obviously much easier than top-1, and actually top-1 is fairly debatable even for human labellers.

So one thing I should explain before we talk much more about these image things is convolutional neural networks. Now, even though neural networks do discover features on their own, there's no reason to say "well, I won't give it any help". One thing about images is that they are organised: you have up/down/left/right inside an image, so why not let the network know about that? If you just said "here's a bunch of pixels" and gave it no hints, it would not know that this pixel is related to that pixel. As pixels in an image get closer together, they're more likely to be similar, so why not give it that intuition, just hardwire it in? So, convolutional networks: because you've got this up/down/left/right structure, and images are fairly translation-invariant, what you can do is apply a filter, pretty much like a Photoshop filter. It's a convolutional filter which you pass over the whole image; imagine a filter that blurs, or sharpens, or averages, or finds edges. These pass over the image and produce another image with those properties. In fact you pass a whole variety of different filters over it, producing a whole variety of new images, which are blurred, or pick out left edges, or right edges. But the parameters of this convolutional thing: suppose it's a 3x3 array, it's got 9 weights, and those 9 weights are applied uniformly across the whole image. So rather than saying "I've got a whole image at 720p and a weight for every cell, so I've got a huge number of weights and every pixel is a special snowflake on its own", this convolutional thing says "I've got 9 weights, and I'm going to learn these 9 weights averaged across the whole image". This is a huge reduction in the number of weights you have to train; on the other hand, you're learning from much less data for each weight you do train. But this is one of the secret sauces for making the image stuff work. So here's what's going on as a matrix operation: you have your image plane, and a tiny little convolutional filter which gets multiplied against each patch and summed up to produce your output image. Your output image will be just a few pixels smaller, or you can assume there are zeros around the edge; either way the output is just a bit less data.

(Question from the audience: are the filters trained separately from the model?) No, no, no: every parameter I've mentioned here has a gradient and therefore will be learned. It's a single optimisation; you're optimising everything in one go. And this is the crazy thing: for every difference between your output and its target, you need to know the gradient of every single parameter in your model. Now, some of these may be a long way away from the output, so you need some method of figuring out the gradient, and that's where the whole back-propagation thing comes in.
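To make the weight-sharing point concrete, here is a minimal NumPy sketch of a "valid" 2-D convolution. This is not how any framework actually implements it (real libraries use heavily optimised routines); it just shows that the kernel has only 9 weights however large the image is, and that the output shrinks by kernel_size minus 1 unless you pad with zeros.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a small kernel over the image ('valid' mode: no zero padding).
    The same few kernel weights are reused at every position."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # multiply the patch by the kernel element-wise and sum
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A 5x5 image and a 3x3 kernel give a 3x3 output: the image shrinks by 2.
img = np.arange(25.0).reshape(5, 5)
identity = np.zeros((3, 3))
identity[1, 1] = 1.0              # a do-nothing filter: copies the centre pixel
result = conv2d_valid(img, identity)
```

Whatever the image size, `kernel` here never has more than its 9 weights, which is the reduction in trainable parameters described above.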
But yes, every single parameter here has a gradient, and therefore you can train every single parameter.

So now it's time to do the VirtualBox thing. Since this is only the core session, we're going to do this quickly: you need to start the virtual machine, and if you can't make it work, there is a coffee break, so we can make it work then. Then you can open Jupyter. There is some information in the instructions about the passwords if anyone really needs them, but you don't need the password. When you start the virtual machine, it will run through booting up a Linux box; once it's done that, you can go to your regular browser in the host, go to localhost:8080, and you should get a nice page with a file tree on it. If you haven't done it already, you would go to Import Appliance, find the appliance file, and accept the defaults. (I get the feeling the coffee break may be delayed by 10 minutes; just anticipating.) So here we have the new machine which we've just added; it's switched off, and it will switch on. This is Fedora 24, which is less than a month old. Does everyone have this happening, or does anyone not? Is anyone horribly confused because they have no VirtualBox on their machine? If anyone needs help, I think dealing with it during the coffee break is the right thing to do. Eventually the screen will scroll up and you'll get a login prompt; the login prompt is not relevant, it's not what you're looking for. What you are looking for is something like this: Jupyter, which is a way of running Python notebooks. I'm sure everyone here has either used it or seen it. What I'm interested in is this ImageNet GoogLeNet notebook, and maybe I'll give you a quick tour. This is a working notebook, and I'll explain what it's got in it. GoogLeNet is the 2014 entry to the ImageNet competition, or
rather, they published it at the beginning of 2014, I guess. This is the number of layers they're choosing, so it's not a simple network anymore. At the beginning you put in your image, which is 224 by 224, you go through this stack of network layers, and at the end you get a one-of-a-thousand classification of what that image is. So this is the GoogLeNet thing; you can step through this notebook, and for argument's sake I will step through it too. The first cell here loads up Theano and friends, and there is a 27 MB file of the GoogLeNet parameters, which is pre-trained. What happens next, and this might take a while, is it builds essentially that full network with all those layers, and then the next thing we do is fill it with the parameters. Keep stepping through: there is a function which prepares images, because we need to do the data wrangling, and then here is a picture; we prepare the image, and we print the argmax of the network's output. The output of this network is a vector of a thousand different numbers, and argmax just picks the biggest one, which comes back as "tabby". Now this is kind of amazing, inasmuch as 27 megabytes of data is not enough to store all pictures. This is a random picture of a cat from the internet; it hasn't learnt this picture, it knows about cats. And 27 MB is tiny compared to what your brain contains. In fact it can be shown that you can reduce the 27 MB file by a factor of 10 or 50 just by crunching down the bit resolution: this stuff is all stored in floats, but it doesn't need to be, it can be stored in something like 5-bit numbers. So this thing can be trained; the reason you've got a pre-trained network is that otherwise we would be in a pre-GPU situation. Training this takes probably a few days using GPUs, and GPUs are a factor of 10 or 20 over CPU speed on this.
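The argmax step, and the top-1 versus top-5 scoring mentioned earlier, are easy to sketch in plain NumPy. The five-way score vector below is made up for illustration; the real network outputs a thousand numbers.

```python
import numpy as np

def predict_top1(scores):
    """Pick the single highest-scoring class, as the notebook's argmax does."""
    return int(np.argmax(scores))

def in_top_k(scores, true_label, k):
    """ImageNet-style top-k scoring: is the true label among the k best guesses?"""
    return true_label in np.argsort(scores)[-k:]

# Toy 5-class "network output" instead of the real 1000-way vector.
scores = np.array([0.10, 0.05, 0.60, 0.20, 0.05])
```

With these scores, class 2 wins top-1; class 3 would count as correct under top-2 scoring but not under top-1, which is exactly why the top-5 competition is so much easier.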
Then there's another thing: an image directory here with a few images in it, and you can see for yourselves, hopefully, what's going on. We have "tabby, tabby cat", which is okay. We have "golf ball", which is less good; on the other hand, I have to say owls may not be part of its training, it may not even have a category "owl". I don't know for sure, but I somehow doubt it knows what this could be. On the other hand it does know quite a lot about... I don't know what number 4 is. Is that "Band-Aid"? I think it's been distracted by the text into saying Band-Aid. "Nipple", "muzzle", "golden retriever", "rubber eraser": that's not so great. And the last one, "Siamese cat". So this is an example of what's running on your machine right now, and there's a whole bunch of other stuff, so let me talk about the other stuff.

First of all, I should talk about GPUs. An Intel Core i7-something is like a thousand-dollar chip, and it can do around 140 billion floating point operations per second, which is huge in terms of the whole Moore's Law story. But if we go for an Nvidia board, and this is now a very old board, because GPU generations age quickly and each year something much better comes out, we can do 5 trillion flops, and the price is now probably 100 bucks on eBay. Nvidia has come out with a whole new line: there's the 1080, and the 1060, which is like a $250, sorry, in Singapore more like a $399 board, will beat this old one easily, at probably 8 teraflops. And when you look at the supercomputer rankings, number 500 on the list comes in at something like 200 teraflops. So these cards, which anyone can buy for a couple hundred bucks, are say a 20th of the speed of a top-500 supercomputer. That's an insane amount of speed for peanuts, and we have the gamers to thank for it; fortunately now there are 4K screens, which is a whole new level of performance that they need. On the other hand, Nvidia does actually
take the research community pretty seriously. Whereas originally the new cards were aimed at people who design air turbines or do serious modelling, in engineering or biology or whatever, who are very interested in float64s, actual doubles, the machine learners don't even need float32s. So Nvidia has now incorporated float16, 16-bit floats, as part of their offering, because it means your models can be twice as big and you need much less of a GPU. And so some of their GPUs, while great for gamers, have also got all the features that neural network people want. There's no real need for a gamer to have a 12-gigabyte Titan X; you don't need that many textures, because where would you get them from? Maybe some gamers really do need it, but I have a Titan X in my machine at work, and a room full of gamers who wonder why I've never plugged a monitor into it. The reason is that it's a deep learning thing: it's basically designed around exactly what you need to learn these models.

Okay, we're almost out of core material now, so I'm going to give you a quick overview of what else is in the virtual machine, so that if you decide not to go hardcore, at least you know what's on the thing. First off, what are we going to do? Because this is a data science meetup, we're not going to do the reinforcement learning, the reinforcement learning to play games; it may be fun, and if you really wanted to you could do it while you're here, but there are two things which seem more relevant to me. One of them is anomaly detection. It basically says MNIST again, and I'll introduce MNIST very soon. The question is: how do we detect credit card fraud, or pick out anomalous signals? It can be done with neural networks. I would recommend the neural network as a tool of last resort: try everything else first, and then try the black box. The
black box can be surprisingly effective, but if you try the black box on day one, you've lost all intuition about how it's working; playing with the data is very valuable. On the other hand, if you have no idea how else to do it, this thing can work well. So, the anomaly detection thing: MNIST is a big digit recognition data set, and what we'll do is try to find the most malformed handwritten digits in the data set. If you're trying some linear method, it's difficult to see how you would go about this from a normal data science point of view; neural networks do the rest for you. The other thing is a commerce one. I label this "commerce", and I think some people will see immediately what it's for. Basically, training one of these big convolutional networks is very time consuming and takes huge amounts of data. But what if you could take a pre-existing, pre-trained neural network, pass in images of your own classes of stuff, and have it categorise them? In particular, this one is going to be about cars. In these big ImageNet models, cars aren't really a focus; "grille" I think is one of the classes, but it doesn't really know much about cars, it's all about dogs. So what I do here: I've got a couple of data sets of old cars and new cars, and I ask, can we make this network, which was trained on many dogs, learn the difference between these cars, without retraining the network at all? So this is taking all the hard work that Google or Microsoft or whoever has done to train this network, and asking: can I just repurpose it, so I can recognise what kind of car this is, what kind of blouse this is, whatever? So these two things, the anomaly detection and the commerce thing, are coming up in the hardcore section. Okay, but also on the drive there's a 2015 network, which has got a bit more complicated.
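The repurposing idea above can be sketched abstractly: keep a pretrained network frozen as a feature extractor, and train only a tiny new classifier head on its features. In the sketch below, a fixed random projection stands in for the pretrained network and a hand-rolled logistic regression is the new head; every name and number here is invented for illustration, this is not the workshop's actual notebook code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen, pretrained network: a feature extractor whose
# weights are NEVER updated during our training.
W_frozen = rng.normal(size=(64, 16))
def features(x):
    return np.tanh(x @ W_frozen)

# Toy "old cars vs new cars" data; labels are defined in feature space so
# the frozen features happen to be informative for the new task.
X = rng.normal(size=(200, 64))
w_true = rng.normal(size=16)
y = (features(X) @ w_true > 0).astype(float)

# Train only a small logistic-regression head on top of the frozen features.
w, b = np.zeros(16), 0.0
F = features(X)                       # compute features once; they never change
for _ in range(500):
    z = np.clip(F @ w + b, -30, 30)
    p = 1.0 / (1.0 + np.exp(-z))      # sigmoid
    w -= 0.5 * F.T @ (p - y) / len(y) # gradient step on the head only
    b -= 0.5 * np.mean(p - y)

train_acc = np.mean((p > 0.5) == (y == 1))
```

The expensive part (the pretrained network) is computed once and reused; only the 17 head parameters are learned, which is why this is so much cheaper than training the big network from scratch.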
This is another Google one, called the Inception network: more complicated still, but the scores are better. There's a version of it on the drive; you can have a look, see how it's constructed, play with it; it will do better on the dog pictures and so on. Another way to abuse these pre-trained networks, which is kind of fun: you may have seen this Deep Dream stuff, right? If you start looking into it, I have to say, some of these images cannot be unseen. Basically, to generate these images, people give the network an input image, or maybe even just noise, and say "I want this to be as much like a fish as possible". Then, instead of optimising the network's weights, you optimise the image, to make the network respond to "fish" as strongly as possible. You put some other constraints on it too, that you want it to stay as much like a natural image as possible, so you don't just want individual pixels flipping to make it a fish, you want a nice fishy image, and you get this kind of thing happening. There are obviously various other tricks they use to make the really cool images. What is on the drive is another nice application called style transfer. Basically, you take your own image, in this case a picture of a canal, and you take a style, which will be some highly stylised art, here "The Starry Night", but you can put in Picasso or Cézanne or whoever, and it recasts your image in that style. It's kind of cool to play with, and you can do it yourself. There are iPhone apps for this as well, but when you do it on the phone, I believe the image is sent up to the cloud and they run a GPU on it; here you can do it for yourself. It takes maybe a minute to do one of these on your CPU, so it's not too bad; on the other hand, this is a somewhat simplified scenario, working on fairly small images.

Then there's some fun language processing. We haven't talked about how to deal with variable-length things; all of this so far has been fixed-size fields.
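The "optimise the image, not the weights" trick behind Deep Dream can be shown with a deliberately tiny stand-in: a single fixed linear "fish neuron" instead of a trained network, gradient ascent on the input, and an L2 penalty playing the role of the "stay image-like" constraint. All names and sizes here are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a trained network's "fish" output: a fixed linear score.
w_fish = rng.normal(size=100)
def fish_score(img):
    return float(img @ w_fish)

img = rng.normal(size=100) * 0.01        # start from faint noise
start = fish_score(img)

# Gradient ascent on the IMAGE, not the weights:
# maximise fish_score(img) - 0.05 * ||img||^2.
for _ in range(100):
    grad = w_fish - 0.1 * img            # gradient of the penalised objective
    img = img + 0.1 * grad
```

After the loop, the image excites the "fish neuron" far more than the starting noise did; in real Deep Dream the score comes from a deep network and the "image-like" constraints are more elaborate, but the mechanics are the same.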
For variable-length input, what you do is have a little network with some parameters, and you walk it across the input, giving it a memory of where it was last: there's an internal state which remembers its last input, so effectively it looks backwards in time recursively. The thing is, once you step this forwards and ask "what is the dependence of my output on this parameter back there", you can work out the derivative, and as long as you can work out the derivative, you can train the thing. These networks are getting much bigger now, and people can take this idea, that being differentiable means being trainable, very far. So here is what's called an LSTM unit: basically you take your input, combine it via some weights with a gating function, which interacts with a memory cell that has a forgetting mechanism, and produce some output; then you do this again and again, and because each step is related to the previous one, the units all link up and you can work out the derivatives through the whole network. You can play with this, though I have to say the example on the drive takes too long to train, which is why I won't dwell on it. Here's a very simple example: there's a bunch of poetry on the drive, and the task is, can I predict the next letter? If it has learned a good model of the English language, and I start it with "Shall I compare thee to a", you want "s", "u", "m" to come next. Basically we've trained it on this poetry to try to output new poetry. Now, when it starts, it's not very good at poetry; the output looks like Perl or something. Then as you train it, a hundred epochs in, it's begun to get the idea of words, with some repetition, some word-ish stuff. If you go to a thousand epochs, it's beginning to look much
more like poetry, though not particularly understandable poetry. If instead we run it on a bigger network for a shorter time, but on Shakespeare's plays, it's begun to understand that people speak in turn; it doesn't understand rhyming, but it understands introducing characters, and how the whole thing is formatted, even the stage directions. I think that's pretty interesting: these recurrent networks can learn a whole bunch of stuff, and this leads on into translation and a whole lot of other things you can do. But it's not practical to train one on your machine while we sit here.

And reinforcement learning: this is topical because of the whole Go thing. As I said, one of the problems here is that there's no training data; you only train by doing something and seeing what the other player does. So you train the Go thing either by looking at lots of games, which they did at first, they looked at quite a few games to get it to the level of not being a terrible player, but then they had it play against itself: you play white, then you switch around and play black. And obviously Lee Sedol did not expect it to be as strong as it was, because in the first game he played it like it was an amateur, and he was surprised. One of the interesting things is how the commentary changed after each game. After the first one, people said "well, the computer played a surprisingly good move, for a computer". After the second game it was "it made some interesting choices, maybe we need to think more about what it's doing". By the third game, which the computer lost, I think when it made a bizarre move people were saying "there must be some good reason why it made that move, what are the gods of Go telling us here?", and in fact it was a stupid move. So it's interesting how people's perception changed through time. In the end it was able to beat him solidly,
and now there's a Chinese player who considers himself better and wants to play it. I guess Google has not switched off the learning on this thing, so we'll see how that goes. Obviously we don't want lots of unemployed Go masters, right? It seems like a noble pursuit, and then two days afterwards people just say "well, it's a game". It's insane; or rather, it's not insane, it's human nature. So what we have on the drive is a version of a phone game called Bubble Breaker. Bubble Breaker is a Candy Crush kind of idea, but anyone can install it, it's free, and there's actually a playable version in the Python notebook, which is kind of neat. I've played far too much Bubble Breaker on my phone, and I was curious whether there was a better strategy, or whether one could be learned at all. One of the differences between Bubble Breaker and Go is the mechanics: if you've got a contiguous block of one colour and you click it, those cells disappear, the cells above fall down to take up the remaining space, and if you've emptied a whole column, new columns arrive. It's a solitaire game; your score is basically the square of the number of cells you remove in each click. But you also don't know what's coming in on the left-hand side, so it's not a perfectly predictable game, you have to hope; in fact a board may not even be solvable, because you may get a complete domino pattern coming in which you can never clear; play it and you'll understand what I'm talking about. So this is kind of fun, and the thing is, after learning for 5 hours on my GPU, it's definitely a better player than I am, and it has interesting tactics, though you have to play the game a bit to understand why they're interesting. I don't necessarily recommend playing it too much, but I see many people playing it far too much on the trains.

Okay, so: pre-break wrap-up; sorry, this did take a little over. Deep learning is cool.
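As an aside, the Bubble Breaker mechanics described above, click a connected same-colour block and score the square of its size, are easy to state in code. This is a sketch of the rules, not the notebook's implementation.

```python
def connected_block(grid, r, c):
    """Flood-fill: all cells reachable from (r, c) via up/down/left/right
    moves that share the colour of (r, c)."""
    colour = grid[r][c]
    seen, stack = set(), [(r, c)]
    while stack:
        y, x = stack.pop()
        if (y, x) in seen:
            continue
        if 0 <= y < len(grid) and 0 <= x < len(grid[0]) and grid[y][x] == colour:
            seen.add((y, x))
            stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return seen

def click_score(grid, r, c):
    """Score for clicking a block: the square of its size."""
    return len(connected_block(grid, r, c)) ** 2

# 3x3 board with three colours (1, 2, 3)
board = [[1, 1, 2],
         [1, 2, 2],
         [3, 3, 2]]
```

Clicking the 1s block (three cells) scores 9; the 2s block (four cells) scores 16, which is why big contiguous blocks are worth hoarding.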
Some of the hype is probably deserved, although when people are calling their start-ups "AI", I would say: don't do that. AI is a distant thing which is worth striving for but is not here yet; still, you can now do all of this stuff which was previously completely inaccessible to machines. The next section is going to be much more hands-on. If you're working on an iPad it's probably not going to happen, but if you've got the VM installed and working, there's nothing stopping you using these models and playing around with them, and there's lots more to play with in the VM. The whole thing is open source: my GitHub account has this deep-learning-workshop repository, and please add a star if you like it. It has a whole bunch of stuff, and every time I give one of these talks, more stuff appears; there's also an update script which will bring what you have up to the latest level. Hopefully it installs properly and everyone can use it, because it makes quite a nice self-contained thing. And: break time.

There we go; I will now relax. I was going to take a vote on whether people just want me to flip through this. What I've found at previous workshops is that people are often happy just to watch me do it, but I would like people actually interacting with it; I think that's better for you. So I'll just give you an introduction, then let people run into problems, and I can run around; if there's a common problem, I'll talk about it. So, this is now getting harder, because I'm going to explain how this stuff actually gets implemented. In practice people use frameworks for this: rather than spelling everything out as matrix maths, you want to say "I want another layer", "I want an LSTM unit", "I want a CNN", "I want pooling". You want to describe the network, not the matrix operations. The benefit of doing that is that if you can explain to the
machine the scheme that you want, it then has freedom to map it onto hardware as best it can. So there are a variety of frameworks people use. Going from the simplest: Caffe, where you basically describe the network as a list in a config file; you explain exactly what you want it to do and it implements that as matrix operations, but there's a bit less flexibility in what you can do, and that's it. Torch is more flexible, supported by Facebook and Twitter; it has another fundamental set of operations, and you coordinate them using Lua as the language. Then a higher-level idea is the category of Theano and TensorFlow. This workshop is based on Theano, which is a Python library produced by Bengio's lab in Montreal; TensorFlow is very much a well-engineered version of Theano, produced by Google. There are a lot of good reasons for loving TensorFlow, but having a small machine is not one of them, sorry. The reason you want this kind of framework is that it can optimise the computation for you. You describe what you want in Python: "I want this network to consist of these layers, here's my input, here's my output". It then goes away, takes that, and converts it into a big expression tree of the actual operations which need to be done, and it can output code which runs those operations on its own. If you tell it you have BLAS installed, or a GPU installed, it will output the right code for the right platform; it will make decisions like "would it be quicker to do this operation on the CPU, or shall I pay the cost of shipping it to the GPU and running it there?". Because you're describing things at a higher level, it can make good compiler-style decisions for you. It makes use of NumPy and BLAS where it can, and it can write C++ for you, or
CUDA, or OpenCL. It's a cool library, but it's also been a library built by accretion, in that as graduate students pass through the lab, they add the stuff they need to make their own work run, and then they leave; I think people there would admit it has a bit of a duct-tape feel to it. This is where TensorFlow comes in, designed from the ground up to be what Theano aims to be. On the other hand, Theano has been built over a long time for smaller machines, so they know to keep things self-contained, whereas TensorFlow allocates resources generously. Maybe Theano will eventually get that level of engineering, but there's hardly any incentive for Google to help with that, partly because Google has now also produced their own TPU: instead of using GPUs for everything, they have ASICs designed for these networks. They've decided to put millions of dollars into building their own piece of silicon, about the size of an SSD, and my guess is they calculate that if you can run this stuff on ASICs, the running cost will pay for itself within a couple of months, just because they need to run so many CNNs; they know the structure, they've got it nailed down, so they can blow the whole thing onto silicon and get rid of the GPU or CPU farms.

So, Theano basics. This is all in the Jupyter thing: there's a notebook called Theano basics, and we're going to run through it (just kill this other one first). I'll walk through this with you. The first thing we do is import Theano. Then this shows how, instead of using actual Python variables, you use Theano objects. This is a symbolic thing: instead of x being a number, it refers to an object which can be manipulated symbolically. So if we then say y = 3*x**2 + 1, what's the type of y? It's
yet another symbolic variable. That symbolic variable knows its relationship to x, but if you ask for the value of y, there is no value of y; there is only the symbolic graph back to the x it came from. If we look at what y actually is, you can think of it as a tree of operations building up from x to the final output. If you ask, it says its top operation is "add", because the final operation in 3*x**2 + 1 is adding the one; and if we pretty-print it, this is how it thinks about it internally: a constant 3, times this thing squared, plus 1. Similarly, here is a drawn graph of what it thinks. Up to this point we haven't actually computed any value of y at all, and I don't think we've created any C code yet. But if I run this next line, it will take a surprising amount of time, because when it evaluates y at that value of x, it writes some C++ code corresponding to the symbolic tree, puts in the value of x, and gets the C++ to tell it the answer. So this shows how the symbolic thing can be implemented as it chooses, rather than as I choose. We can then make this into a function, which we can call as a simple f, and it will do the right thing. And here's the interesting bit: if you look at how it's actually running this function, it has reorganised the tree you saw earlier into a particular composed function; it reorganises the tree to make it more efficient, or to use operations it knows how to do more natively. For example, the BLAS libraries define particular kinds of matrix multiply-adds that are really efficient, so it should use those wherever it can. Also, in Theano you've got different tensor types.
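What Theano is doing with these trees can be imitated in a few lines of plain Python: build an expression graph, evaluate it on demand, and derive gradients node by node with the chain rule. This toy has none of Theano's tree optimisation or code generation; it's just the shape of the idea.

```python
# A miniature expression graph: each node knows how to evaluate itself
# at a value of x, and how to give its own derivative with respect to x.

class Var:                                   # the input x
    def eval(self, x): return x
    def grad(self, x): return 1.0            # d x / d x

class Const:
    def __init__(self, v): self.v = v
    def eval(self, x): return self.v
    def grad(self, x): return 0.0            # constants have zero gradient

class Add:
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self, x): return self.a.eval(x) + self.b.eval(x)
    def grad(self, x): return self.a.grad(x) + self.b.grad(x)

class Mul:
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self, x): return self.a.eval(x) * self.b.eval(x)
    def grad(self, x):                       # product rule
        return (self.a.grad(x) * self.b.eval(x)
                + self.a.eval(x) * self.b.grad(x))

# y = 3*x**2 + 1, as in the notebook
x = Var()
y = Add(Mul(Const(3.0), Mul(x, x)), Const(1.0))
```

Here `y.eval(2.0)` gives 13.0 and `y.grad(2.0)` gives 12.0, matching d/dx (3x² + 1) = 6x; because every operation in the tree knows its own derivative, gradients of the whole tree come for free, which is the trick the whole training process relies on.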
A "tensor" here just means a multi-dimensional array: a vector is a 1-tensor, a matrix is a 2-tensor, a cube would be a 3-tensor. It's not really a general-relativity tensor, just a multi-dimensional matrix. If you've used NumPy you'll recognise the various indexing tricks you can also play; but the key thing is that this is not using the NumPy library, unless you happen to be running on the CPU. You can do these indexing tricks on the GPU as well; it's all completely transparent to you. So there's a whole variety of stuff you might expect to be there, and then here is the big trick. Suppose x is a scalar input and y is log(x). I can then ask for the gradient of y with respect to x, evaluate it at 2, and it knows the gradient. The key thing is that because each operation in the tree is a well-known operation whose gradient it knows, it can symbolically derive the gradients of every element within that tree. This is the trick required to make the whole thing work, because once you know the gradients everywhere, you can apply small steps to everything. And that's what we then start to go through: here we're doing matrix operations, we can square matrices, and we can have updates. This is the low-level Theano stuff, and it takes a while to get your head around. In particular, and you can play with this later, if you put in an operation which shouldn't work, like transposing a matrix inappropriately, then when you're building the thing symbolically at the beginning it doesn't necessarily know what input shape you're going to give it; only when it evaluates at the end does it realise the dimensions don't match up, because you can leave a lot unknown and fill in the blanks later. And then you get an impenetrable error message, because it complains about its own tree, which has been optimised away from your code, not
having the right dimensionality. So in all this development, you type away at the network, get it looking right, then you press the go button. If it's right, it's fantastic: it works and it's beautiful. If it doesn't, you'll be faced with a page of error messages which has nothing to do with your code directly, only third-hand: it's like GCC producing machine code and complaining that some pop instruction is wrong, with no idea why that instruction would ever exist. You can't trace it back to any particular node, because during the tree manipulations it can do all sorts of optimisations and may actually have removed some of your nodes; you've no idea what's going on. Anyway, there's a whole lot this can do, and in particular the authors are interested in neural-network-style updates, so they've taken care of the matrix side, the gradients, and the updating mechanism. That's the very quick introduction to Theano.

Okay, so now here's where you start doing it yourself. There's a data set I mentioned earlier called MNIST. It's a data set from decades back, and it used to be that getting good results on it made a major paper; now it's the "hello world" of neural networks. If your network theory can't do this, throw away the theory. It has 50,000 28x28 images of tiny handwritten digits; people used to use it as a benchmark, now it's a hello world, and very soon ImageNet will be the hello world; these things move along quickly. So here you take some input, some hidden layers, and get some output. The game is to take an image and say which of the 10 digits it is, and what you typically do is have an output layer with 10 different strengths, then pick the one which
Now if you're wrong, how do you train it? You say: I said 6 and it should be 8, so I should have a 0 in this position and a 1 in that position, and you penalise everything that pushed it towards saying 6. You just do this again and again. The thing is, you don't have to do that much; this thing trains quite quickly. So at the top, what we have here is this notebook, and I'll step through it. We're loading Theano, and then there's another library called Lasagne. Lasagne is about layers, and the reason you use it is so you can talk to Theano in terms of layers instead of raw matrix operations. This is just loading some data; here's the data. I guess this is the key part: here I'm saying I've got an input layer which is the same size as the image, and now I want the output. This is a way of setting up the whole thing. I can then say: for my loss function I want categorical cross-entropy, because I want to classify things into the right category; Lasagne knows about all the things neural network people typically want. This gets all the parameters. Here I'm setting up a training function, a validation function, and there's a prediction function. Now, the training depends on the updates, and the updates depend on the gradient, so this is just using Theano's gradient function to produce a whole bunch of updates. There's updates.sgd for stochastic gradient descent, but because it's a library designed by people doing cutting-edge research, there are all kinds of updates: momentum updates, Adam, AdaDelta, whatever. These update schemes are essentially ways of making stochastic gradient descent converge faster. From a purist point of view, if you know exactly which direction is downhill and you take a small step, you're doing the right thing; but if you're taking small steps in the same direction all the time, really you want to take a larger step.
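Those update schemes can be sketched in miniature. Here's a toy, made-up comparison of plain gradient descent against a momentum update on f(x) = x²; it has nothing to do with Lasagne's internals, but it shows why momentum covers ground faster when steps keep pointing the same way:

```python
# Minimising f(x) = x**2, whose gradient is 2x. Learning rate and momentum
# coefficient are invented for illustration.

def grad(x):
    return 2 * x

def sgd(x, lr=0.1, steps=20):
    # plain stochastic gradient descent: small step downhill each time
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def sgd_momentum(x, lr=0.1, mu=0.9, steps=20):
    # momentum: accumulate a velocity, so repeated steps in the same
    # direction build up into one larger step
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * grad(x)
        x += v
    return x

# in the first few steps, momentum gets much closer to the minimum at 0
print(sgd(10.0, steps=3), sgd_momentum(10.0, steps=3))
```

With the same learning rate, the momentum version travels much further in the first few steps, at the cost of some overshoot and oscillation later; that trade is what the fancier updates (Adam, AdaDelta) try to manage automatically.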
So there's a whole bunch of ways of guessing where you're going to end up, so you can leap there more quickly. This used to be a huge focus of research; now people know this stuff tends to work even when you don't expect it to, so it's flipped from a big problem into kind of a non-issue. So finally we get to some training, and this thing trains like this. My machine may be quicker than yours, I'm not sure; it wasn't a particularly expensive machine, and if you have a Mac I really don't know. This first MLP does not do a very good job: its error rate is something like 8%, which is fairly terrible, but at least it works. And this is showing what it knows about a 0, what it knows about a 1, what it knows about a 2. You can see that if you matched a digit against these templates, that would give a reasonable guess. So that's a very simple example of Lasagne in combination with Theano, training on MNIST.

There's another example, and this is where I start to say: here's something you can play with. It's called MNIST CNN. In the previous example we used the pixels individually, all independent of each other. MNIST CNN instead says: let's create convolutional layers to look at this, so we can then judge which class the image is from. One of the tricks here is that I've chosen to have just three convolutional filters, so that I can draw a picture of what they represent in R, G and B. The point of doing that is you can see what features these layers pick out, and this is where I'd encourage you to go through it yourself: if you want to see the result you can just press Cell, Run All and see how it works.
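To make "convolutional layer" concrete, here's a tiny hand-rolled convolution, with an invented 5x5 image containing a vertical stroke and a filter that responds to it. This is just the idea, not the notebook's code:

```python
# A single convolutional filter: slide a small kernel over the image and
# record how strongly each patch matches it. Image and kernel are made up.

image = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
kernel = [
    [-1, 1, -1],
    [-1, 1, -1],
    [-1, 1, -1],
]  # responds to a bright vertical line with dark sides

def conv2d(img, k):
    kh, kw = len(k), len(k[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = [sum(k[a][b] * img[i + a][j + b]
                   for a in range(kh) for b in range(kw))
               for j in range(len(img[0]) - kw + 1)]
        out.append(row)
    return out

fmap = conv2d(image, kernel)
# the feature map peaks where the kernel lines up with the vertical stroke
print(fmap[0])  # → [-3, 3, -3]
```

In the notebook, the kernels aren't hand-written like this; the network learns them, which is exactly the point being made about the R, G and B filters below.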
Basically the green channel is picking up the up-stroke of every curl, the blue is picking up the down-strokes, and the red the right-hand edges. The point is, once you have that information, it's easier to run your categorisation. And the other important thing is that it has learned that these are important features all on its own. We told it this is an image, but we never told it that it's good to look for horizontal and vertical lines; it figured that out by itself. I'm going to leave that and not run through it, but it's worth having a look at if you can.

Now I'm going to describe two things which you're going to work through yourselves. I'll briefly describe what's in there, and then you're on your own, but I'll come around, and if people hit a common problem I'll talk about it. Question? Okay: how do we detect outliers in all of this? What we have so far is some inputs and some known outputs. But suppose you particularly want to detect anomalies. You may have a few anomalous cases, but it's not that you want to learn what those cases look like, because the anomalies may have completely different commonalities between them, or none at all, and you may have almost no data for them. So the idea here, and there are obviously other ways of doing anomaly detection, is: let's figure out what good data looks like. Everything in the training set is, on average, valid data; the data is its own training example, so effectively you only have one label. What people do is train the network to reproduce its own input, and what you find is that the hidden layer learns something useful. So we have a network, which I'll show you, that looks like this, because we're going to run it on MNIST.
It takes a 28x28 image, forces it through a hidden layer which is much smaller, and then makes that hidden layer produce its guess at the input: the target value for the output layer is the input itself. This is called an autoencoder. The trick is that by making the hidden layer sufficiently small, in order to reproduce the input with low error it has to learn a generalised representation of what your data is. So then, if you have something which goes in through the hidden layer and comes out looking very different, you know it's something which doesn't match the rest of your data, which is an anomaly. It's a very simple trick for forcing your network to learn a representation without you having to specify it in the first place. So there's a notebook called anomaly detection, which is number 8. I'm not going to run through it here; it's on your machine. It's a fairly simple neural network, you can understand what's going on, and you could just run all of it, but there are exercises at the bottom which say: let's change a few things around. You can even try introducing bugs and seeing what horrible error messages you get out. It's a simple network, so you can train it quite quickly. The other one is the pre-built networks notebook. You've got this ImageNet pre-trained network on the drive already, and what it has done, in the same way... Can I ask a question?
On the anomaly detection part, we have the hidden layer which is much smaller. Is there some indication of how you decide how large it should be? Because a lot of companies will say: we have some fraud, we have some anomalies, but it's not labelled, and we want to find them. The nice thing about this example is that it's very quick to train and retrain. You can set the number of hidden units: for example, try 5; my guess is 5 won't work so well. On the other hand, if you use 500 it will just learn to reproduce the output directly, so there's some happy medium. And this is where people legitimately criticise the neural network stuff as black magic: whilst stochastic gradient descent has a mathematical foundation, there's also what you might call graduate student descent. The idea there is you throw lots of graduate students at the problem and it optimises. This is why people who come through graduate school doing this are in quite high demand: they know how to make these things work, and there is an art to it that's difficult to put your finger on. It's a question of playing with them; it's like the 10,000 hours thing, you just have to subject yourself to misery for quite a long time. And this is kind of a hallmark of programming. People say "oh, I love to write code", but in fact a programmer's job is mainly debugging. You have to love bugs if you're a programmer, because everyone writes bugs, and if you're not writing bugs you're not writing things sufficiently advanced for you. You've got to love debugging and hitting your head against a wall, and then you can achieve something. That's why it's quite nice to play around with these pre-built things: you can add extra layers, play around and see what happens. But yes, 50 seems about right; I heard it on the internet.
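The autoencoder idea is worth seeing in miniature. In the sketch below, a fixed projection onto the line y = x stands in for the learned bottleneck (the data and the "encoder" are entirely invented): valid points reconstruct well, and the anomaly doesn't.

```python
import math
import random

# Toy autoencoder idea: compress each 2-D point to 1 number (the bottleneck)
# and reconstruct it; points with a large reconstruction error are anomalies.
# A fixed projection onto y = x stands in for the trained hidden layer.

def encode(p):            # 2 numbers -> 1 number (the bottleneck)
    return (p[0] + p[1]) / 2.0

def decode(h):            # 1 number -> reconstructed 2 numbers
    return (h, h)

def reconstruction_error(p):
    q = decode(encode(p))
    return math.hypot(p[0] - q[0], p[1] - q[1])

normal = [(t, t + random.uniform(-0.1, 0.1)) for t in range(10)]  # near y = x
anomaly = (8.0, 1.0)                                              # far from y = x

# threshold set from what "good data" looks like
threshold = max(reconstruction_error(p) for p in normal) * 2
print(reconstruction_error(anomaly) > threshold)  # → True
```

The real notebook learns the bottleneck from the data instead of fixing it, but the decision rule, flag anything whose reconstruction error is well above the training set's, is the same.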
That being said, I built this thing from zero and it worked first time, which was good. And the same goes for these other notebooks: basically the first thing I did worked, I was relieved, and I left it there. People tend to publish the paper when it works at all; then someone comes along with "oh, I can make it work better", but that's hardly the headline. The fact that it works at all is still surprising at the moment.

So yes, this is a data science, feature engineering kind of story. Don't be surprised that you have to do the same kind of stuff for neural networks that you're used to elsewhere, unless you do gradient boosted trees the whole time or whatever. In the early days a lot of people thought the best way to train these hidden layers was to train them one at a time, with this kind of bottleneck approach, so yes, there's a history of doing that. Is it questionable? I don't know; it's a question of trying and seeing. In terms of network depth, people have intuitions about how deep is enough. The fact that this works with one hidden layer is probably good enough for the problem, so don't overthink it, because you might overtrain the whole thing; in a way you want to suppress the number of parameters you have. On the other hand you want something that works: you want to make sure it does actually find an anomaly that you know about, but maybe one you don't ever want to tell it about; you don't want it to ever learn that anomaly. Just be aware you're playing a funny game when you optimise the structure against your validation set.

On to the pre-trained models. We've got this whole ImageNet-trained thing, trained on a thousand classes from ImageNet. As I mentioned before, these are not necessarily commercially useful classes, but while doing this it has learned useful features for vision. It knows a lot about dogs, which means it probably knows how to identify fur.
So in these different layers, even though you haven't told it anything beyond the labels: if you train a network to recognise faces, the first layer sees the raw picture, the next layer identifies lines and edges, then you start to identify curves, then shapes, then things like eyes and noses, then faces. This hierarchy looks very much like what people find, or want to find, in the brain, and it does this completely unhelped. Now, if you make the network much shallower, it won't work as well; make it sufficiently deep and it works pretty well. The latest ImageNet winner from Microsoft actually has another trick. They take a network which works pretty well, train it as well as they can with a certain number of layers, then take a pair of layers and say: let's insert more layers in between, to learn where I'm making errors. It's called a residual network: the inserted layers learn the residuals, to pump up the learning between those two layers and cancel out the errors, so it's like a boosting kind of thing. The Microsoft network which won ImageNet recently is 152 layers deep. These people just want to apply computing power; they're basically trying to win ImageNet, so it's gone beyond what you'd practically use, but computing power is cheap.

Sorry, could you quickly explain how you calculate a residual? So you've got these classes you're training towards, and you've got a layer here, hidden layers, and you know what error is produced at the top by these inputs and outputs. You say: if I were to fix that, how would I reduce that error by adding extra stuff? If you insert a network which initially has random, mean-zero output, it can then gradually bend the result, so you have more opportunity to learn hidden structure. And where do you insert this network? You essentially put it alongside.
You have this one path which is working, and then an extra set of wires which initially contribute roughly zero, and it learns how to fix things up; eventually you can just look at it as one block. And once it's done fixing those two layers, what does the fixer do? Do you still need it? Yes: it becomes another layer, a legitimate layer, and then why not put more fixers in between those? So you kind of fractal this thing out into 150 layers, having started with 5. And quickly, how do you find the effect of a particular layer on the final output? You can differentiate. But okay, I have to say one thing about this: it's a Microsoft paper which came out at the beginning of the year, the code is all online, it's implemented in all the different libraries, and you can just go and read it. With what I've described here, you'd have the vocabulary to understand an awful lot of this stuff just by reading the paper, and these people want to spread the ideas. If you start getting into the autoencoder stuff it starts getting very functional, very abstruse, whereas a lot of the CNN stuff, recognising digits or doing language, is very much "we added these layers together, we figured out that if you do this it tends to work"; it's practical, in a way it's kind of shallow.

Okay. So given a trained ImageNet network, we can use the features it has learnt. It knows about dogs, so it probably knows about fur; it probably knows about eyes; it probably knows about scales, because it knows about fish. It knows a lot of useful stuff about vision. But instead of choosing one particular class, because the class we actually want isn't a pre-trained class at all, what we find is that this last output layer, where for the real deal we only look for the top spike, the rest of it is characteristic of the image as well. It's the notebook called number 5, commerce. Oh no, there is a beautiful image here. No, it's not... it's somewhere here, I'm sure I have it; I'll produce a beautiful image for you while you're doing this.
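The residual trick described a moment ago can be sketched like this (toy 1-D functions, entirely made up): the inserted "fixer" starts at zero, so adding it can't hurt, and training then bends it to absorb the remaining error.

```python
# Residual / skip-connection idea: a working block computes f(x); we insert a
# fixer g initialised near zero, so the output is f(x) + g(x). Functions and
# weights here are invented for illustration.

def f(x):              # the existing, already-working pair of layers
    return 2 * x

def make_fixer(w):     # inserted layers, with weight starting at ~0
    return lambda x: w * x

def block(x, fixer):
    # skip connection: the identity path around f plus the fixer's correction
    return f(x) + fixer(x)

# a zero-initialised fixer changes nothing, so inserting it is safe
print(block(3, make_fixer(0.0)))  # → 6

# after "training", a non-zero fixer corrects f toward a better target
print(block(3, make_fixer(0.5)))  # → 7.5
```

That safety property, inserting layers that initially do nothing, is what lets the depth be grown from a handful of layers to 150-plus without breaking what already works.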
Basically, instead of these thousand values having one peak, there's lots and lots of information in the non-peak values. What you can do is train an SVM to recognise the difference between this set of thousand values and some other class's thousand values. So you can very simply take images you've never seen before, from classes of images you've never seen before, and distinguish between them just by applying an SVM to the output, instead of the one-hot classification. So this is something to go in and have a look at. What I'm doing here is classifying cars into classic cars and modern cars, and this is the exercise for the reader, which I'd encourage people to try: it's all arranged in the file directory structure such that if you create a new thing other than cars, say pianos, and it explains this at the bottom of the sheet, if you create a pianos directory and say "I want to distinguish between upright pianos and grand pianos", just by stuffing images into those directories you can let it learn. And if you get any nice results, please tell me and we'll include it as a good example. I have to say, this is the thing which worked first time: I figured out this is what someone was doing, tried it, it worked first time, done. It kind of validates the whole approach.

So why, from an intuition perspective, use the SVM here, given that an SVM always tries to maximise the margin? I guess this comes down to being as lazy as possible: I knew scikit-learn had an SVM where I could more or less type "SVM" and it would just find the difference. The thing is, between these thousand-dimensional vectors, I don't know anything about the structure of what they're identifying, other than that it's probably stuff which responds to things in the world. For the cars in particular, it loves wheels, so it's probably something which responds to wheels, and the classic cars have more angular lines.
So if the network is interested in wheels, there'll be lots of wheel-like responses, or lots of eyes: it may be identifying wheels as eyes, and where the modern cars have more spokes it may think those are flowers. The space it's identifying in is this last layer; it has nothing to do with cars per se, but it can distinguish using what it knows about vision pretty well. I chose an SVM just because it was quick to do; I don't think you could distinguish it from any other method, particularly since the space is a thousand-dimensional, so it's going to be easy to draw a separating line.

So the features were trained on one dataset, and now we're applying a new SVM to identify the vintage cars. Could something similar distinguish dogs from cats? Oh, dogs and cats is easy, because it obviously already knows tabby cats. But could you use a dogs dataset to train a CNN model and then apply it to cars? That's precisely what we're doing here: we're using the pre-trained network, trained on all thousand classes of ImageNet, which are primarily dogs and flowers, with maybe some machinery in there, and then we're using its output, its preferences in terms of how much like a husky this car looks. In some ways the SVM will figure out that the husky-dog dimension is not relevant to this problem. Right. And you can imagine that ImageNet itself is not particularly helpful here, because they've done a spread of stuff plus really specific dog stuff. For a commercially trained model, what you'd want is to pick classes uniformly across the English language, so it could identify all sorts of different stuff, and then spend a long time training a model with good versatility in that sense. On the other hand, you need a ton of data and resources to head towards that.
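The "features plus simple classifier" trick can be sketched as follows. A nearest-centroid rule stands in for the SVM, and the 4-dimensional "network outputs" for classic versus modern cars are invented (the real vectors are 1000-dimensional):

```python
# Transfer-learning sketch: treat the network's output strengths as feature
# vectors and fit a trivial classifier on top. All numbers are made up.

classic = [[0.9, 0.1, 0.2, 0.0], [0.8, 0.2, 0.1, 0.1]]  # "network outputs"
modern = [[0.1, 0.8, 0.7, 0.2], [0.2, 0.9, 0.6, 0.1]]   # for labelled examples

def centroid(vs):
    # average each dimension across the example vectors
    return [sum(col) / len(vs) for col in zip(*vs)]

def classify(v, cents):
    def d2(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(cents, key=lambda name: d2(v, cents[name]))

cents = {"classic": centroid(classic), "modern": centroid(modern)}
print(classify([0.85, 0.15, 0.15, 0.05], cents))  # → classic
```

The notebook uses a proper SVM rather than centroids, but the structure is the same: the pre-trained network turns an image into a feature vector, and a cheap classifier separates classes the network was never trained on.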
If I were doing this in a commercial setting, I'd say: let's just use what we can pick off the shelf and get running today, and then decide whether we need more specific information, like really having thousands of images of clothes, so the model knows what to look for in a collar, whereas here it's probably treating collars as some kind of plumage. So it's entirely possible that you might want a big network trained on something more relevant to your data. On the other hand, this is a neat trick, and I like neat tricks. Moreover, this worked first time: it's trained on just ten old cars and ten new cars, and it seems to work, and you should have fun with that.

Now, before I finish, I can talk about fancier tricks. This is the kind of fancy trick people use to translate language. At the bottom you have your input sentence, "Economic growth has slowed down in recent years", and above it the output. It's actually tough to see how you'd do this, because there's some word flipping going on, and some adjective agreement. This diagram shows a recurrent network running forwards and backwards over the input, and then a thing which comes up and decides where the network should be applying attention to get its results, feeding into the recurrent network which then produces the output words. This whole thing is some crazy mess of stuff directing attention, which is some kind of weighting vector, but the key point is that the whole thing can be differentiated, and therefore we can train it. You just put in parallel texts of English and French, and it converges to a crazy extent. But for that you need one of these GPUs, and we have a guy here who will now talk about the GPU. I'd also encourage people to start looking at the anomaly notebook, which is the simpler example, or the commerce one.
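That attention "weighting vector" can be sketched in a few lines: score each source position, softmax the scores into weights, and take a weighted sum of the encoder states. All the numbers below are invented:

```python
import math

# Minimal attention sketch: the decoder scores each source word, the scores
# become weights via softmax, and the context is a weighted sum of states.

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one vector per word
scores = [0.1, 2.0, 0.3]   # decoder's relevance score for each source word

weights = softmax(scores)  # everything here is differentiable, so trainable
context = [sum(w * s[d] for w, s in zip(weights, encoder_states))
           for d in range(2)]

print(weights.index(max(weights)))  # → 1 (attention lands on word 2)
```

Because every step is differentiable, the scoring function can be trained end-to-end along with everything else, which is the key point made above.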
They're both simple in their own ways: the anomaly thing, the commerce thing. So, the GPU. What do you want, this one or this one? I'm surprised this works, actually, because it's a special extension. Oh, can you introduce yourself? Hi everyone, my name is Frihan and I'm currently a researcher at DSO National Labs. Today I'm going to talk about deep learning hardware. Martin alluded to the point that at some stage you need a GPU if you really want to do serious training. Some of you might think GPUs mean big server racks, and that you have to get an Amazon EC2 instance to run your deep learning stuff, but actually no: you can build a box like this one, and it has a full-fledged GPU which can do all your deep learning work. This is a GTX 970; I'll do the pricing later, but it costs about $400, and the setup is interesting. Right now I'm going to demonstrate that this box itself is running the notebook you'll all be using anyway. But first, the motivation for building it: we wrote an article on KDnuggets that describes how it was done. The real motivation is that I do deep learning at work, where I have my own GPU clusters, racks of Titan X and Kepler cards. The Titan X is for development, and once models are done they go to the server side and run on K40s and K80s for months. But I wanted to do some deep learning at home, and for that I need a real card; the CPU just doesn't cut it. So we came up with this box design. And for those of us who sometimes need to run on datasets we cannot put on the cloud, because of privacy or an NDA, this box comes in handy; that was actually the primary motivation. These are all prices in USD, and the reason we chose this form factor is that it's small enough to carry. It was actually designed for computer gamers and their LAN parties, where they bring their own computer.
So we took all the parts from the gaming world and chose the form factor such that everything fits together nicely. These are pictures of the innards and the GPU itself, and this is how it looks; everything slots in nicely. One bad thing about this design is that the heat density is very high, so if you're running it overnight you should point another fan at the top. But we find it's okay, because how often do you train a network? If you're running for more than a week, I recommend you run on Amazon; if you're just running overnight, a small desk fan is fine and your card will not burn up. I've burned many, many GPUs by running them for a month, and they do catch fire. So, a pro tip: the GPU itself is rated to run at about 90 degrees Celsius, around that region, though that's not recommended; 70 is the recommended range. But your PCI connectors inside are not rated to go up to 90, so from time to time, if there's vibration, the connectors might touch the card, and they will melt. It will just melt; there's no flame, but you get burnt components after a while. So if you're ever doing anything serious, run on a proper server setup. Okay, so this is how it looks, and the prices are here.

Now, how do you talk to this box? This is one of those machines Martin was talking about that has no display ports or anything; you can see there's no cable running to it. The way I talk to this box is purely by Ethernet, and I use a secure shell. Usually what I'll do is chuck this box in the corner of my house, put it on my home network, run it off my home power supply, and then run my stuff. So I'll show you how to talk to the box; this is an Ethernet adapter. To get your code and the data onto the box, you can do it a few ways. One way is to keep everything on the box, data and code;
the other way is just-in-time: you transfer via SSH, run it, and then pull the results back, which is a bit slower. So I'm just going to SSH into my box, and we'll run the Jupyter notebook; let me increase the font. You shouldn't be seeing a terminal for more than a moment, and then we'll switch over to the GUI, which is nicer. I run the Jupyter notebook, and now this is what you all should be getting on your own machines. I'm running X forwarding, so this is actually my file browser, transferring the image to my Mac, and similarly I have all the notebooks for today; it looks exactly like your stuff. So it's running the code right now. One thing to point out: at this moment it's not using the GPU; it's Python producing the CUDA code and compiling and so on, but when it gets to the compile-and-train step, that's when it flips over to the GPU. And it's done: it's exactly your example, in a notebook.

How does it know that only that block of code for training needs to run on the GPU? If you're using Theano, there's an option file that says device gpu0; you just put it in, and whenever you run any code, if it detects a GPU it switches over. It's totally transparent: you never have to tell your code that you have a GPU. But you do need to set up the whole stack. The one I'm using is NVIDIA, so I use CUDA, and this is my setup; you can ignore the last part. I'll show you the framework picture. The way the GPU works is that you can't talk to the card directly; there are a few layers. On the physical card, the NVIDIA GTX 970, you run the driver; on top of that is the computational toolkit, which is CUDA; then the deep learning framework you're using, which can be Theano, TensorFlow or Caffe; and finally your application goes on top. So you have to go through the whole stack.
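For reference, the Theano "option file" mentioned here is `~/.theanorc`; a minimal GPU setup of the era looked roughly like this (the exact device name, `gpu0`, is machine-specific, and newer Theano releases changed the device naming, so treat this as a sketch rather than gospel):

```ini
; ~/.theanorc -- point Theano at the first GPU and use 32-bit floats
[global]
device = gpu0
floatX = float32
```

The same flags can be passed per run through the `THEANO_FLAGS` environment variable instead of the file; either way, code that was written for the CPU runs unchanged, which is the transparency being described above.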
One really bad thing about NVIDIA's support for Linux machines is that it tends to be pretty brittle; it tends to break with every update. This morning it actually broke, so I was in a panic from about 8 to 9.30 going through all the patches and such. What people usually do is find one configuration that works for them, and then stop and freeze it until the next major LTS comes on board. For those of you exploring this stuff: on Ubuntu 16.04 we're facing a lot of problems, so consider hanging around on 14.04 for a while. Even NVIDIA themselves recommend 14.04, although 16.04 is the latest LTS. The KDnuggets article explains how to do all this.

Now, the notebooks you have, with their datasets, are actually customised and optimised to run on a CPU, so you wouldn't see much effect from the GPU. So I've brought along another example, which is NVIDIA DIGITS. I'm not here to sell DIGITS, everyone has different views, but I'm showing that if you run on a GPU, and I think we can run it now to see the effect, maybe I can make the fans sound really loud. Let's see. No fire, no fire. Let me run my Facebook experiments. Okay, I think I can just run it, and then you can see the GPU utilisation and watch the temperature climb. This is a very large dataset, 60GB of Facebook images, and I can't tell you what it's for; it's one of my customer use cases: bring the box to the customer, they give you the data, and you don't need to know what's going to happen with it. So this is NVIDIA DIGITS. It's designed by NVIDIA, riding on top of Caffe, and what it does is set all the stuff up nicely for you. For example, with datasets: say you have a classification task, you just pass it the path to the folder containing all your data, and it will nicely do the partitioning into test, validation and train for you. You can even do all sorts of preprocessing,
like rescaling and mean subtraction, and all of this is done asynchronously and non-blocking, so you can carry on doing your other stuff; you just let it run in the background and it does everything nicely for you. It's really the lazy man's way of doing things. It offers other features as well: when you're designing a model, you can select from basic, famous networks, like GoogLeNet, which has been mentioned already, and then customise them. It also provides visualisations, so you can see exactly what the network looks like. Most of the settings for Caffe are already up front; I just select them, and I can change them. And if you really want to customise the network itself, it offers that option as well: you can modify it here, in the Caffe protobuf format, and describe the network yourself. Personally, if you're really serious about deep learning, this is not the way to go. What we usually do is run this with our interns: we give the interns all the networks we need and let them go and figure out the optimal learning rate and everything. The good thing is that it's running off a web server, so you never need to expose the underlying machine to the interns or to any unauthorised user; they do everything through the interface itself. Unlike your Jupyter notebook: the reason I'm running it this way, rather than serving a web page from the box and connecting from my Mac, is that Jupyter allows code execution, so you can't easily run it non-locally. If you want to run it non-local you need a lot of security first; I could actually take control of your machine through a Jupyter notebook, just in case you were wondering why it's not done that way. Okay, it looks like it's really getting busy. Let's take a look. Right, the temperature and utilisation are climbing.
But anyway, this is the dataset I ran for more than a day, and it doesn't really work that well. It's trying to identify something really abstract: people's affiliations based on their Facebook profile pages. Someone is interested in this; it turns out to be a really hard problem, though there are certain classes we can identify pretty easily. Okay, that's all for the demo. You all have the notebooks, so you can have fun with them in the meantime. Any questions on this? I need to stop this run, it's getting really hot; you can smell it, right? I can smell it. Let me stop it before it catches fire. The good thing is that it saves checkpoints as well. One thing you learn as you do deep learning is that it's important to save checkpoints, because things will go wrong: your machine will die, or catch fire. So having the checkpoints saved out is pretty important. If the box catches fire and we replace it with another device, will training continue to work? Yes it will, because you have the checkpoints saved. Let me show you what a checkpoint looks like. When you have checkpoints saved, like all of these, you can restart the training from that point in time. Because a lot of the time today, when we do deep learning training, CNNs, GoogLeNet and all this, no one actually trains them from scratch any more, from random weights. What people do is write the architecture, take the pre-trained weights from GoogLeNet or Microsoft's network, put the weights in, and carry on training from there. Similarly, if you've already trained for maybe 200 epochs and you have a checkpoint saved, you just take those weights, put them in, and carry on training after the fire. It would be a good business, selling trained weights. Okay, any questions about this little box? I'll just give my last comment. The most common question I get
last comment: the most common question I get on online forums is where the break-even point is between this type of setup and getting an EC2 instance. We did the math. If you want to run a deep learning problem for one month or more, then you should go for EC2; anything less, where you maybe just want to play around for one or two days, then this kind of personal setup will work fine. So if your training requires a total time of about one month, you do the math on the cost you'd pay, and that's about the break-even point. We do routinely run training that goes on for more than a month, so usually in our case what we do is sample the data down to maybe 10 gigabytes, put it on this box, design a network, see its performance, and when it's good enough we throw it onto our clusters to do whatever we need to do. Okay, any questions? Any questions? Okay, so just a couple more minutes. So I've explained how, given one of these, if you run it just under burning temperature, you can do this kind of stuff and learn to translate languages. If I can just quickly: we've already seen GoogLeNet as an image-processing thing, and these are the components you need for image labelling. It's crazy that this works. It's the same as we saw before: they've taken a whole bunch of labelled images, presented each of them to GoogLeNet or something similar to get an initial vector, and then said, "using that initial vector, I want to produce that sentence", and done that hundreds of millions of times, and it produces decent labelling. So the next generation of competition after ImageNet could be "let's caption these things properly", or all sorts of other higher-level games people want to play now, because ImageNet seems to be asymptotically getting as good as humans. Once you get as good as humans, what does it mean to be better than humans? Because you're actually competing against a crowd of humans at that point,
because the humans are going to disagree with you about as much as they agree with each other. This chain of LSTMs can end, because the last element it produces is a stop-the-sentence symbol. So you have the initial state, or say a start-of-sentence symbol, and then a series of words. Say I have a vocabulary of 50,000 words; the output of this thing will be a number from one to 50,000 for my current word. But we then translate that via something like a word embedding, which you get from word2vec or something similar, and word2vec you can learn from text without any other signal. So you pass this one number through a word2vec kind of thing, you run an LSTM with its hidden state derived from the image end, and this gives you 50,000 intensities for the next word. You pick the best word, and you start again. It may be that you do it in a greedy fashion, where you just pick the highest-scoring word, or maybe you run it a few times, picking several of the slightly less likely words, and then see which of the resulting sentences is more likely, which would be more like a beam search. There's a whole bunch of people doing this kind of cool stuff because they want to label images. I don't upload that many images myself. Someone asked: does it label the image with individual words, or is it like a short paragraph? Sorry? It's a paragraph, so the words it will spew out will be "a person riding on a motorcycle on a dirt road". The thing is, the model doesn't know anything about the structure of the English language, other than what people write in captions. What they've done is run word2vec on a huge corpus, and word2vec on a huge corpus may tell you about parts of speech implicitly, we don't really know exactly what's going on, but somehow that is enough for it to string together these words into something which looks meaningful and is, with high probability, what a person would have written. What?
Some of these, and then it gets worse. So, wrapping up: deep learning may deserve some of the hype. Getting the tools in one place is very helpful, because one of the things about this VirtualBox image is that setting this stuff up is a bit of a labour, particularly when you start to get to the GPUs. Nvidia gives you this huge blob of code which is not open source, and that's kind of a sore point. The question is whether all the libraries agree; the Linux people can't see what Nvidia really needs, so your C compilers may get out of step. It becomes a bit of a black art setting these things up, so having everything in one place is useful, and that's why it's nice to have a notebook which is ready to go. You'll find quickly that if you want to do more complicated stuff, having a GPU is very helpful. If you like this stuff, take this project home and play around with it, and if you find a problem or something needs more explanation, put an issue on the repository and I can fix it up. Rather than complaining on Meetup, it's best to do it there, so we can keep track of everything. And that's it. I will be sticking around here until I get chucked out, so if you want to play around with it, I'll be around for any questions. Okay, one more thank you to Martin and Hui Han for sharing with us. Ah, good question: we are missing two white ones and six black ones, like Martin said. So we'll hang around; please feel free to ask them any questions you want, I understand people are really shy about questions. I'll end there. I think that's it for today's Meetup. Thank you for coming down on a Saturday at 10am in such nice weather. Thank you, guys.