I'm not really a mobile person, but I can talk a little bit about the models, so that's what I'm going to do: smaller networks, since mobile will require everything to be smaller.

So, about me; hopefully someone here hasn't seen this before. My background is in machine learning, start-ups and finance, and I moved to Singapore in 2013. Basically, in 2014 I just read papers about deep learning and natural language processing, played with robots, played with drones; that was my year of fun. Since 2015 I've had a serious natural language processing job here with a local company, doing natural language processing and some deep learning, and I've happened to write a few papers along the way. Sam and I are now just starting an eight-week deep learning developer course, so that's something else we can talk about. Also, Google has been good enough to give us a GDE qualification just now, so we're kind of smarting from the shock anyway.

So basically what we're talking about here are CNN models. The basic flow I've got here is: you put in an image, typically. This is typically for visual stuff, though it can also be applied to voice or even text, but here the use case is vision. You pass it through a number of neural network layers. At each stage it produces another little image, but at each stage the image is also subtly different from the layer before. The game in training this model is to make it so that, by the end of flowing through, these images have been featurised in such a way that you can then determine what is in the image. In this case it will be 'car', but there will be a whole bunch of categories at the end to classify against. This is what people have been using CNNs for, and since 2012 it has been just about the hottest thing in vision.

So what's a CNN? Basically it uses the fact that pixels in an image are organised: they have a layout, a natural relationship with each other, compared to just random inputs from sensors around the world. An image comes from sensors which are organised in a grid. The idea here is that we're going to use the whole of this image as a feature for the next level of the network, and there will be successive levels; that's what makes it a deep network. The way you create these features is by applying what is essentially a Photoshop filter to the previous layer. In mathematical terms, that Photoshop filter is a convolutional kernel, and that's why these are called convolutional neural networks, or CNNs.

So online, and this is my little thing, there's RedCatLabs.com, where you can actually play with one of these convolutional filters. Basically it takes this image as its input, passes it through this little mathematical operator with these parameters, and out comes another little image.
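As a concrete sketch of what that little operator is doing, here's a minimal NumPy version, assuming a single-channel image; the kernel values are just an illustrative edge-detect-style example, not the demo's actual parameters:

```python
import numpy as np

# A minimal sketch of a single-channel 3x3 convolution, as in the demo:
# slide the kernel over the image, multiply element-by-element with the
# patch underneath, sum it up, and add a bias to get each output pixel.

def convolve2d(image, kernel, bias=0.0):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            patch = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(patch * kernel) + bias
    return out

image = np.random.rand(28, 28)              # stand-in for the demo image
kernel = np.array([[ 0, -1,  0],            # nine parameters...
                   [-1,  4, -1],
                   [ 0, -1,  0]], float)    # (an edge-detect-style example)
filtered = convolve2d(image, kernel)        # ...plus one bias parameter
```

In a real network the model learns those nine numbers (plus the bias) for itself, and a GPU does this scan massively in parallel rather than in Python loops.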
So if I were to change these parameters (and it's super difficult for me to do because I can't see what's going on), I can actually start altering the image. You can see that by changing the parameters I get some altered version of the original. And the point is that if I can now let a model choose the parameters which best suit its purpose, i.e. identifying what's in the picture, the model will learn how to do that, in some mysterious way, just by updating these parameters.

Mathematically, what's going on is that this CNN filter takes the image at the back, passes this little matrix across the whole thing, multiplying each element by the element under it and summing up the results to produce the answer in the middle. So it's essentially scanning this kernel across the image; and because of the way a GPU is organised, for instance, it can do this extraordinarily fast. So this is how you take these nine parameters here, plus typically one extra bias parameter, and map one image into an updated image. It could be blurred, it could be sharpened, it could have edges detected: it changes the image into something of a slightly different nature.

What people do with this, and in fact what has powered this field from the beginning, is a competition called the ImageNet competition, or ILSVRC. In this competition you have 15 million labelled images of different stuff, in 22,000 categories, of which a thousand have been chosen for the competition. The question is: for this image, which is a picture of a hot dog, which of these categories does it belong to? It's extremely difficult for me to see from here, but there are various foods along here; I guess this one has got lots of hot dogs in it, and this one's more kind of burgers, maybe. The reason there's this nice display is that this was actually scored by a human: Andrej Karpathy, who's now head of AI at Tesla, rated himself over weeks of doing this manually. So we now know that the ImageNet task has about a 5% human error rate, for one particular human who went to Stanford.

Having got all of these images and the ground truth, people then started training vision models on this, and up until 2012 people were inching their way up in performance. But when people came along and suddenly did this using CNN methods rather than a classical OpenCV kind of method, the field was revolutionised and the error rates came well down. Back in 2012 there was a network called AlexNet, which would take in an image here and pass it through these various blocks, creating blocks of features, up until the end when you get to the categories. What you find is that the features created at the beginning are things like edges or colours, just highlighting particular pieces of the image. The next few levels up start to look for textures: is it seeing fur? Is it seeing a cheetah kind of pattern, or brick, or whatever? But as you move further up it starts identifying shapes, and further up still, shapes which are really characteristic, like dogs' noses or eyes.
As you move further up this hierarchy you get more and more sophisticated features, until you get to the actual answer. Now, the crazy thing is that these features are created just within the process of giving it a photo and telling it the category. There are a thousand categories, many of which are dogs, but it basically learns to do this featurisation, including all of the detection at these interim stages, all on its own. Interestingly, these stages also correspond to some of the feature levels seen in the brain, which is suggestive, or at least interesting.

So, having done this essentially eight- or ten-layer network, two years later Google came along with something called Inception v1, also known as GoogLeNet. That's 2014, and it's quite a lot more layers. Move along to 2015 and the next version of Inception is getting to be quite a big network. Basically, by increasing the number of parameters they were able to fit this better and better and better. But the downside is that the more parameters you've got, the more you've got to store, and if you're on a mobile device, the more parameters you're storing the bigger your app is. So in some sense the push to get the highest recognition rates on this task is contrary to what you want for your phone app. The early models, which would be these ones, are super big: hundreds of millions of parameters. The later ones, because of this kind of crazy modular design, have got the parameter count down while still improving performance. In this chart, this axis is how long it takes to compute, this is how well it performs, and the size of the circle is how many parameters it takes. You can see there are various trade-offs, and obviously you want to be up in this corner, with a small circle, if you can.
So what we have is this: online, in my GitHub account, I've got a thing called deep-learning-workshop (there will be links in the presentation) which has all of these examples, including lots of different ways of tackling these things. I've got a simple example here where I'm going to use this GoogLeNet. I won't go through it in detail, but having loaded up the model, I can give it an image, which I then have to clean up a bit, and if you print out the maximum argument (argmax) of the predictions, it says this is a tabby, tabby cat. This works quite quickly on my laptop, no GPU required; this is kind of an old network now. I can run it on multiple images in this directory: the first one comes out as Siamese cat, which is pretty good; golf ball for this one, which is not so good, but then it probably hasn't seen many owls, and this kind of owl won't be in its data set, so there's no way it can know; Band-Aid, which is clearly not a good idea; nipple, muzzle, golden retriever, not very good. This tabby cat one, yes, it understands, but it's also interesting that the next best choices are tiger cat, Egyptian cat, lynx, Persian cat. So it understands the cattiness of this image just from seeing a whole bunch of these things.

So, what about mobile? What we've seen, or what the history has shown, is that better performance means a larger network, in most cases. But for mobile we're going to want to compress these networks, kind of downgrade or squeeze them, or even restructure them so they're more suitable for putting on mobile devices.

One quick reason for thinking about this is energy usage. Back up at the top here: a 32-bit integer add costs 0.1 picojoules, so let's call that 1. A 32-bit float add costs you 9 on that scale, and a 32-bit multiply costs you 37. But to retrieve a parameter from static RAM, like on-chip cache, costs you 50, and to retrieve it out of main memory costs you 6400. So you can see that storing many, many parameters and pulling them into the processor is much more expensive than creating them in place, if you can do that. Storing the data compressed in any way is a huge win, because the compression is essentially free: compared to pulling an extra byte from memory, which costs me 6400, creating that byte out of some decompression algorithm costs me 1. I can waste tons of cycles uncompressing compared to just pulling this from memory. So that's one reason why storing a compressed model may in fact be free to you, because pulling it off the disk is more expensive than uncompressing it in your processor nowadays.

So, compressing and downgrading. There are several techniques which are commonly used. People are interested in sparsity, which is all about how many zeros are in the weights; in whether you can do this at a lower precision than 32-bit; and in whether you can quantise the weights in some way. For sparsity, this is all assuming you've got a pre-trained network: you take GoogLeNet or one of these big networks and say, well, let's crunch this down. I'm not going to do anything special from a structural point of view; I'm just going to make what I've got as small as possible. So basically I look at all the weights, and if they're small, I set them to zero.
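As a minimal sketch of that magnitude-based pruning (NumPy, with made-up shapes and a made-up threshold):

```python
import numpy as np

# A minimal sketch of magnitude pruning on a pre-trained layer's weights.
# The shape and the threshold here are illustrative, not from a real model.
weights = np.random.randn(3, 3, 50, 50).astype(np.float32)

threshold = 0.5                      # hypothetical cut-off for "small"
mask = np.abs(weights) >= threshold  # True where the weight survives
pruned = weights * mask              # small weights become exactly zero

sparsity = 1.0 - mask.mean()
print(f"{sparsity:.0%} of weights set to zero")
# During retraining, the same mask would be re-applied after each update,
# so the zeroed positions stay zero.
```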
Then during retraining I'll just keep all the zeros at zero, because I can store zeros cheaply: the weights become a bit stream with a little flag saying 'is this a zero or a proper number?'. It may be that 30% of my weights are zero anyway, and that's saved me a third of the model size.

Another thing you can do, once you have a slightly less precise model, is quantise the weights down to byte-sized values. In fact Google's TPU, their fancy new chip for doing these computations, is built around this: it has 65,536 simultaneous 8-bit multiplies. So they're reducing all of their weights to a pretty small size, a pretty small resolution, but they can do those multiplies hyper-fast. When you're training one of these networks and the resolution of what you're doing is pretty low, the trick is to quantise on the forward pass, because you want to know what you're actually going to get, but when you calculate the gradients going backwards you kind of cheat and use the true, full-precision gradient of what you're trying to do. This shouldn't work, in a kind of mathematical sense, but it turns out to work well enough. And what people find is that even as low as six bits per parameter will probably work; eight bits may be more than you need. Compare this to proper scientific computing, which wants 64-bit double-precision numbers: this is a whole different world. Obviously storing six bits per parameter is a bit nasty because the bytes won't even line up, but unpacking that is simple arithmetic.

Another simple thing people do is cluster the weights. If you've got weights in some distribution, from very, very small around zero to very, very big, I might want just four actual values that I'll use. So I pick four separate levels (or maybe five, including zero), and instead of the actual weight I use the quantised weight, and then fiddle around with the levels so that they best represent my data. The point of doing that is storage: for an entire layer, say, if I have four buckets, I only need to store a two-bit index into my quantisation levels for each weight, plus the quantisation levels themselves at maybe eight- or sixteen-bit resolution. This kind of thing works really well; it's a compression of the weights that works pretty well.

Another thing people do is start to actually restructure the networks. In the earlier diagram I just had a three-by-three kernel which I was passing across for my CNN, but the big networks also use seven-by-sevens or five-by-fives. A simplification people have found interesting is this: instead of a single five-by-five operation, which has five-times-five-plus-one parameters, that's twenty-six parameters, I can stack a three-by-three on top of a three-by-three. The coverage area of that is the same as a five-by-five, but it's kind of factorised out, so it's slightly less flexible in what it can represent; on the other hand, it saves me parameters. If I have fifty input channels, the first way costs me twenty-six times fifty, which is thirteen hundred; the second way costs me twenty times fifty, which is a thousand. So I've saved about a quarter of the weights that way.
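As a sanity check on that arithmetic, here's a hedged Keras sketch comparing the two options; the input size and channel counts are made up, and the exact totals differ from the rough per-channel count above, but the saving comes out similar:

```python
from tensorflow.keras import layers, models

# Hypothetical input: a 32x32 feature map with 50 channels.
inputs = layers.Input(shape=(32, 32, 50))

# Option 1: a single 5x5 convolution (50 output channels).
five = models.Model(inputs, layers.Conv2D(50, 5, padding='same')(inputs))

# Option 2: two stacked 3x3 convolutions covering the same 5x5 area.
x = layers.Conv2D(50, 3, padding='same')(inputs)
stacked = models.Model(inputs, layers.Conv2D(50, 3, padding='same')(x))

print(five.count_params())     # 62,550
print(stacked.count_params())  # 45,100 -- roughly a quarter fewer
```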
Another thing people can do, and this is just counting, is the following. Suppose we have to do a three-by-three over a channel, for fifty different input channels. In the diagram I showed, it looks very simple, just one plane into another plane, but the reality is that a three-by-three kernel is doing a three-by-three times all the input channels, and all of those parameters are independent. So it allows any channel on the first plane to interact with any other channel, and that seems kind of excessive, particularly since with fifty channels in the preceding layer I have three-by-three-plus-one, so ten, times fifty: five hundred parameters. What people are now playing with is separable convolutions: for every output channel you run one and the same three-by-three kernel over all the input channels, and then you take a weighted sum across all the channels. Essentially this separates the idea of texture, which lives within a channel, from the interaction between the channels, as two different things. It's a factorisation of the big volume of parameters into something which is now three-by-three-plus-one, so ten parameters for the kernel, plus fifty-plus-one parameters for summing it all up in a weighted way. So I've changed five hundred parameters into sixty-one, and that's another huge win in making the model smaller. But it has made the assumption that the model can be factorised like that, so you have to be careful that the factorisation isn't destroying your performance, and every time you do this you're going to want to be retraining and experimenting. In my presentation there are links to a model called Xception, where they started doing this kind of factorisation, and even to a natural language paper doing the same thing.

In practical terms, if you're building one of these models it's kind of important to understand the trade-offs that are being made. On the other hand, most of the time you're going to be using a pre-defined model. The very common thing for these tasks is to go to Google, or TensorFlow in particular, where there's a model zoo; download the weights for one of these networks, and just use it. Typically, rather than categorising things as tabby cat or golf ball, you'll take off the top layer and do what's called transfer learning, or you'll kind of abuse these pre-trained weights in various other ways. But because Google has done all of this work, you can download a fully optimised pre-trained network in two shakes of a lamb's tail. So using these pre-defined models is super easy. There should be hardware coming to phones soon, too; I guess this is why Apple is coming up with their own kind of interface for these things, because they're soon going to want to put in hardware which will do this, but you won't see that on the surface.

As for SqueezeNet, which I just added for completeness: it has AlexNet-level accuracy (AlexNet being one of the first huge advances in this stuff) but with 50 times fewer parameters. So this is a model with less than half a meg of parameters; half a meg is less than many JavaScript libraries, so this is quite a small model. SqueezeNet has a variety of tricks: they play around with one-by-one and three-by-three convolutions, and they even eliminate the fully connected layer at the end.
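The talk doesn't go into SqueezeNet's internals, but for a flavour of how those one-by-one and three-by-three convolutions combine, here's a rough Keras sketch of a SqueezeNet-style 'fire' module (the structure is from the SqueezeNet paper; the channel counts here are illustrative):

```python
from tensorflow.keras import layers

def fire_module(x, squeeze_ch=16, expand_ch=64):
    # Squeeze: cheap 1x1 convolutions cut the channel count right down...
    s = layers.Conv2D(squeeze_ch, 1, activation='relu')(x)
    # ...Expand: parallel 1x1 and 3x3 convolutions, concatenated, so the
    # (relatively expensive) 3x3 kernels only ever see squeezed inputs.
    e1 = layers.Conv2D(expand_ch, 1, activation='relu')(s)
    e3 = layers.Conv2D(expand_ch, 3, padding='same', activation='relu')(s)
    return layers.Concatenate()([e1, e3])

inputs = layers.Input(shape=(56, 56, 96))   # illustrative feature map
outputs = fire_module(inputs)               # -> shape (56, 56, 128)
```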
What's now appeared in TensorFlow, and is available in TensorFlow Slim, is a whole family called MobileNet. You can see some of the original networks here: GoogLeNet, VGG16 (and this is a log scale, so here's AlexNet). With MobileNet you can basically choose smaller models with many fewer operations, and you can look at the trade-off in performance versus model size. By changing one parameter, and they have all these pre-trained parameter sets set up for you, you can choose whatever model you want, at however accurate you need it to be. The way you do this in Keras, which is also coming into TensorFlow soon but is fully TensorFlow-aware, is that there's now a MobileNet thing in Keras. Essentially, you import MobileNet, you pre-process the image the way they want, and then you just say model equals MobileNet, predictions equals model dot predict of image. That is all you would need to do for a Keras model; it's all pre-baked, and the first time you run it, it will download the parameters it needs.
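Spelled out as a script, that Keras usage looks roughly like this; the image path is a stand-in, and the weights are fetched on first run:

```python
import numpy as np
from tensorflow.keras.applications.mobilenet import (
    MobileNet, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing.image import load_img, img_to_array

model = MobileNet(weights='imagenet')   # downloads parameters on first run

# 'cat.jpg' is a stand-in path; this MobileNet expects 224x224 inputs.
image = load_img('cat.jpg', target_size=(224, 224))
x = preprocess_input(img_to_array(image)[np.newaxis, ...])

predictions = model.predict(x)
print(decode_predictions(predictions, top=5))  # e.g. tabby, tiger cat, ...
```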
So, wrapping up. Basically, you've got to understand what kind of structure is underlying the model you're playing with, because if you're going to want to cut up the network, you need to know where it can be cut. If I'm just doing simple transfer learning, I can cut it off at the end; but if I'm doing something more clever, like a style transfer where I need to look at some intermediate layers, I've got to understand where the good cut points are, and your small model may not have those cut points, because it's been sliced and diced in such a way. But for many, many tasks, particularly for mobile app use cases, tiny models are pretty good; they work well enough. There's a whole lot more behind this, obviously.

So, just quickly: you probably know about the deep learning meetup group here, since you're here. Our next meetup is going to be on the 21st of September, here again. We've got someone from Google, actually a native Singaporean coming from the Google Brain team in Mountain View; he's going to talk about some of the new and upcoming stuff, and I'm sure we'll have some other cool things to fill it out, but we're kind of focusing on him for the moment. What we always try to do is have a talk which addresses people starting out and something from the bleeding edge, and we'd love to have more lightning talks. So if anyone has anything they can be enthusiastic about and is willing to talk for five minutes, everyone would love to hear it; lightning talks can never be a disaster.

Another thing, which I think Sam will talk a little about at the end, and which we've been pre-announcing for a long time: the eight-week deep learning developer course. We're actually going to start it, on the 25th of September. The format will be twice-weekly three-hour sessions, which will include both instruction (teaching, slides, Python notebooks, whatever) and also individual projects, because we figure that someone doing an interview or building their resume needs actual projects which they have ownership of. One of the problems with doing a Udacity course, and there are great courses out there, is that even though you did the coursework, in an interview all you can really say is 'I did this really interesting topic', like everyone else did. What we've aimed to do here is make this very individual, so that the project you do is yours. I'm not sure what people will want to look at: maybe it's their own Fitbit heart-rate measurements, or their cat's feeding patterns, or whatever it is; or they want to do something really jazzy with images or text; or even something for their company. Bring it along, you can do that, and we'll make sure you get something worth talking about at the end of it.

The regular cost of this is going to be $3,000, but because we now have WSG funding approved, Singapore citizens and PRs can get something like a 70% rebate. If you work with the company to make that happen, or with your employer, or however it works, we'll try to make that happen. If you want to have a look at some more details, there's a website at reddragon.ai/course. We sent this out during the week to everyone who pre-registered, and I encourage people to have a look, even if you pre-registered and were wondering 'shall I bother?'; you'll see there's a document on what we're covering. I guess one of our FAQs is: what about the Andrew Ng course which is out there? We're not going to set ourselves against that, or against the one by Jeremy Howard. We're aiming at getting this stuff into people's hands practically. Understanding from the very basics is very good, but we want to have the newest and hottest stuff; if there's a new paper, we want to be talking about it, and we want to do the practical stuff, which is going to be difficult to do online. It's not going to be easy. And questions: while Sam sets up his thing, I'll take questions.