Some of you finished part one in the last few days; some of you finished it in December. I did ask those of you who took it in person to revise the material and make sure it was up to date, but let's do a quick summary of the key things we learned. I came up with these five things, and I'm interested to hear if anybody has other key insights they feel they came away with.

So, the five things. First: stacks of differentiable non-linear functions, with lots of parameters, can solve nearly any predictive modeling problem. When we say "neural network", a lot of people suggest we should really say "differentiable network". If you think about things like the collaborative filtering we did, it was really just a couple of embeddings and a dot product, and that got us quite a long way; there's nothing very "neural" looking about that. But we know that when we stack certain kinds of non-linear functions on top of each other, the universal approximation theorem tells us we can approximate any computable function to arbitrary precision, and we know that if the stack is differentiable, we can use SGD to find the parameters which match that function. So this, to me, is the key insight. But some stacks of functions are better than others for some kinds of data and some kinds of problems.

One way to make life very easy, we learned, is transfer learning, and nearly every network we created in the last course used transfer learning, particularly in vision and in text. Transfer learning generally was: throw away the last layer; replace it with a new one that has the right number of outputs; pre-compute the penultimate layer's output; then very quickly create a linear model that goes from that to your preferred answer. You now have something that works pretty well, and then you can fine-tune more and more layers, working backwards, as necessary. We learned that when fine-tuning those additional layers, generally the best approach was to pre-compute the output of the last of the layers you're not fine-tuning, so you only have to calculate the weights of the remaining ones; that saved us lots and lots of time. (And remember, that slide is meant to say "conv layers"; let's fix up the previous one as well, "stacks of differentiable". Conv layers are slower; dense layers are bigger.)

There's an interesting question I've added here: remember, in the last lesson we looked at ResNets and Inception networks, and in general more modern nets tend not to have any dense layers. So what's the best way to do transfer learning on those? I'm going to leave that as an open question for now. We'll look into it a bit during this class, but it's not a question anybody has answered to my satisfaction. I'll suggest some ideas, but no one's even written a paper that attempts to address it, as far as I'm aware.

Given that transfer learning gets us a long way, the next thing that gets us a long way is trying to create an architecture which suits our problem: both the data and the loss function. For example, if we have autocorrelated inputs, in other words each input is related to the previous input (each pixel is similar to the next-door pixel, or in a sound wave each sample is somewhat similar to the previous sample), that's the kind of data
we tend to like to use CNNs for, as long as it's of a fixed size. If it's a sequence, we like to use an RNN; if it's a categorical output, we like to use a softmax. So there are ways, we learned, of tuning our architecture: not so that it makes it possible to solve a problem, because any standard dense network can solve any problem, but because it makes training a lot faster and a lot easier if you've made sure that your activation functions and your architecture suit the problem. So that was another thing I think we learned.

And something I hope everybody can narrate is the five steps to avoiding overfitting. If you've forgotten them, they're both here and discussed in more detail in lesson three: get more data; fake more data using data augmentation; use more generalizable architectures, that is, architectures that generalize well (particularly we looked at batch normalization); use regularization techniques, as few as we can, because by definition they destroy some information (we looked particularly at dropout); and finally, if we have to, reduce the complexity of the architecture.

The general approach we learned, and this was absolutely key, is: with a new problem, first of all start with a network that's too big and isn't regularized, so it can't help but solve the problem, even if it has to overfit terribly. If you can't do that, there's no point starting to regularize yet. So we start out by trying to overfit terribly; once we've got to the point where training accuracy is near 100% and validation accuracy is terrible because it's overfitting, then we start going through these steps until we get a nice balance. That's the process we learned.

And then finally, we learned about embeddings as a technique that allows us to use categorical data: specifically the idea of using words, or the idea of using latent variables (in this case with the MovieLens dataset).

So those are the five main insights I thought of. Did anybody have any other key takeaways that they think people revising should think about, or remember, or things they found interesting? No? OK, that's good; if you come up with something later, let us know.

One question: "How does having duplicates in the training data affect the model created? And if you're using data augmentation, do you end up with duplicate data?" Duplicates in the input data aren't a big deal, because we shuffle the data and select things randomly. Effectively you're weighting that data point higher than its neighbors, so in a big dataset it makes very little difference. If you've got one thing repeated a thousand times and then only another hundred data points, that is going to be a big problem, because you're weighting one data point a thousand times.

As you will have seen, we've got a couple of big technology foundation changes, and the first one is that we're moving from Python 2 to Python 3. Python 2, I think, was a good place to start, given that a lot of the folks in part one had never coded in Python before, and many of them had never written very substantial pieces of software before. A lot of the tutorials out there (for example one of our preferred starting points, Learn Python the Hard Way) are in Python 2, and a lot of the existing code out there is in Python 2. So I thought Python 2 was a good place to start. Yes, two more questions. One is: are you going to post the slides after this?
I will post the slides, yes. And the other is: could you go through steps for underfitting at some point? You have those steps for overfitting, for how to deal with overfitting. Yeah, let's do that in a forum thread; why don't you create a forum thread asking about underfitting? You don't need to do that in the part 2 forum; you could do it in the main forum, because lots of people would be interested in hearing about it. And if you want to revise that, lesson 3 started by talking about underfitting.

OK, so Python 2 seemed like a good place to start, but I don't think we should keep using it, for a number of reasons. One is that since then the IPython folks have come out and said that the next version won't be compatible with Python 2; that's a problem. Indeed, from 2020 onwards Python 2 will be end-of-life, which means there won't be patches for it; that's a problem too. Also, we're going to be doing more stuff with concurrency and parallel programming this time around, and the features in Python 3 are a lot better. And then Python 3.6 was just released, which has some very nice features, in particular the new string formatting; to some people that's no big deal, but to me it saves a lot of time and makes life a lot easier. So we're going to move across to Python 3. Hopefully you've all gone through the process already, and there are some tips on the forum about how to have both running at the same time, although I agree with a suggestion I read from somebody, which was: go ahead, suck it up, and do the translation once now, so that you don't have to worry about it.

Much more interesting and much bigger is the move from Theano to TensorFlow. Theano, we thought, was a better starting point because it has a much simpler API; there are very few new concepts to learn to understand Theano, and it doesn't have a whole new ecosystem. TensorFlow lives within Google's whole ecosystem: it has its own build system called Bazel, its own file serialization system called protobuf, its own profiler; it's got all this stuff to learn. But if you've come this far, then you're already investing the time, and we think it's worth investing that time in TensorFlow, because there's a lot of stuff that, just in the last few weeks, it has started being able to do that's pretty amazing.

So Rachel wrote this post (here it is) about how much TensorFlow sucks, for which we got invited to the TensorFlow Dev Summit and got to meet the TensorFlow core team. So, looking at moving from Theano to TensorFlow: we went to the Dev Summit, and we were pretty amazed at all the stuff that has literally just been added. TensorFlow 1.0 just came out, and here are some of the things; if you Google for "TensorFlow Dev Summit videos", you can watch the videos about all of this.

Perhaps the most exciting thing for us is that they're really investing in a simplified API. If you look at this code, you can create a deep neural network regressor on a mixture of categorical and real variables, using an almost R-like syntax, and fit it in two lines of code. You'll see that those two lines at the bottom, the two lines to fit it, look very much like Keras. François Chollet, the Keras author, has been a wonderful influence on Google, and everywhere we looked at the Dev Summit we saw Keras API influences. So TensorFlow and Keras are becoming more and more one, which is terrific.
So one thing is that they're really investing in the API. A second is that some of the tooling is looking pretty good. TensorBoard has come a long way: things like these graphs, showing you how your different layers' activations are distributed and how that changes over time, can really help you debug what's going on. If you get some kind of gradient saturation in a layer, you can dig through these graphs and very quickly find out where. This was one of my favorite talks, actually. If I remember correctly, the presenter's name was Dandelion, and his signature was an emoji of a dandelion; very Google. (Actually, I think it's "dandelion lion". You're right, yes: it was two emojis, a dandelion and a lion.) If you watch this video, it has a walkthrough showing some of the functionality that's there and how to use it, and I thought that was pretty helpful.

One of the most important ones to me is that TensorFlow has a great story about productionization. For part one I didn't much care about productionization; it was really about playing around, about what we could learn. At this point, I think we might be starting to think about: how do I get my stuff online, in front of my customers? These points are talking about something in particular called TensorFlow Serving. TensorFlow Serving is a system that can take your trained TensorFlow model and create an API for it, which does some pretty cool things. For example, think about how hard it would be, without the help of some library, to productionize your system: you've got one request coming in at a time, and you've got n GPUs. How do you make sure you don't saturate all those GPUs, that you send each request to one that's free, that you don't use up all of your memory? Better still, how do you grab a few requests, put them into a batch, put the batch onto the GPU at once, get the results out of the batch, and send them back to the people that requested them? Serving does all that for you. It's very early days for this software, and a lot of things don't work yet, but you can download an early version and start playing with it; I think that's pretty interesting.

"With the high-level API in TensorFlow, what's going to be the difference between the Keras API and the TensorFlow API?" That's a great question. In fact, tf.keras will become a namespace, so Keras will become the official top-level API for TensorFlow; and in fact Rachel was the person who announced that. Yes, go on. "Oh, I was just going to add that TensorFlow is introducing a few different libraries at different levels of abstraction. There's a high-level API that appears everywhere and is basically the Keras API, and I think there's a layers API below that." Yeah.
Yeah, so it's being mixed in in lots of places, and all the stuff you've learned about Keras is going to be very helpful, not just in using Keras on top of TensorFlow, but in using TensorFlow directly.

Another interesting thing about TensorFlow is that they've built a lot of cool integrations with various cluster managers and distributed storage systems and so forth, so it'll fit into your production systems more neatly and use your data in whatever place it already is. If your data's in S3 or something like that, you can generally throw it straight into TensorFlow.

Something I found very interesting is that a couple of weeks ago they announced a machine learning toolkit which brings really high-quality implementations of a wide variety of non-deep-learning algorithms to TensorFlow. These are all GPU-accelerated, parallelized, and supported by Google, and a lot of them have a lot of work behind them. For example, for the random forest there's a paper (they actually call it TensorForest) which explains all the interesting things they did to create a fast, GPU-accelerated random forest.

Two more questions. One is: "Can you, or will you, give an example of how to solve gradient saturation with the TensorFlow tools?" I'm not sure that I will; we'll see how we go, because I think the video from the Dev Summit, which is available online, already shows you that. So I'd say look at that first and see if you still have questions; all the videos from the Dev Summit are online. Someone also asked whether there's an idea for using deep learning on AWS Lambda. Not that I've heard of, no. In general, Google has an as-a-service version of TensorFlow Serving called Google Cloud ML, where you can pay them a few cents a transaction and they'll host your model for you. There isn't really something like that through Amazon, as far as I'm aware.

And then finally, in terms of TensorFlow: I had an interesting and infuriating few weeks trying to prepare for this class, trying to get something working that would translate French into English, and every single example I found online had major problems. Even the official TensorFlow tutorial missed out a key thing, which is that the lowest layer of a language model really should be bidirectional, as this one shows (a bidirectional RNN), and theirs wasn't. I tried to figure out how to make it work: horrible. I tried to get it to work in Keras: nothing worked properly. Basically, the issue is this: modern RNN systems, like a full neural translation system, involve a lot of tweaking and mucking around with the innards of the RNN, using things that we'll learn about, and there just hasn't been an API that really lets that happen. I finally got it working by switching to PyTorch, which we'll learn about soon. I was actually going to make the first lesson about neural translation, but I've pushed it back, because TensorFlow has just released a new system for RNNs which looks like it's going to make all of this a lot easier. So this is an exciting idea: there's an API that allows us to create some pretty powerful RNN implementations, and we're going to be absolutely needing that when we learn to create translations.

Oh yes, there is one more.
Again, it's early days, but there's something called XLA, the Accelerated Linear Algebra compiler, I think. It's a system which takes TensorFlow code and compiles it, and for those of you that know something about compiling, you'll know that a compiler can do a lot of clever stuff: identifying dead code, unrolling loops, fusing operations, and so on. XLA tries to do all that. At this stage, it takes your TensorFlow code and turns it into machine code, and one of the cool things that lets you do is run it on a mobile phone with almost no supporting libraries, using native machine instructions on that phone, with much less memory.

One of the really interesting discussions I had at the summit was with Scott Gray, who some of you may have heard of. He was the guy that massively accelerated neural network kernels when he was at Nervana; he had kernels that were two or three times faster than Nvidia's. I don't know of anybody else in the world who knows more about neural network performance than him. He told me that he thinks XLA is the key to creating performant, concise, expressive neural network code, and I really like that idea. Currently, if you look in the TensorFlow codebase, it's thousands and thousands of lines of custom-written C++; the idea is that you throw all that away and replace it with a small number of lines of TensorFlow code that get compiled through XLA. So that's something that's got me pretty excited.

So TensorFlow is pretty interesting. Having said that, it's kind of hideous. The API is full of not-invented-here syndrome. It's clearly written by a bunch of engineers who have not necessarily spent much time learning about the user interface of APIs, and it's full of these Google-isms, in terms of having to fit into their ecosystem. But most importantly, it's still, like Theano, a system where you have to set up the whole computation graph and then you go "run". That means that if you want to do stuff in your computation graph that involves conditionals, if-then statements ("if this happens, do this other part of the loop"), it's basically impossible. (That's Rachel's point.)

It turns out that there's a very different way of programming neural nets, which is dynamic computation, otherwise known as define-by-run. There are a number of libraries that do this: Torch, PyTorch, Chainer, DyNet are the ones that come to mind. And we're going to be looking at one whose early version (I wouldn't say it was "released") was put out about a month ago, called PyTorch. I've started rewriting a lot of stuff in it, and a lot of the more complex stuff suddenly becomes so much easier; and because it becomes easier to do more complex things, I often find I can create faster and more concise code using this approach. So even though PyTorch is very, very, very new, it's coming out of the same people that built Torch, which really all of Facebook's systems are built on top of, and I suspect Facebook is in the process of moving across from Torch to PyTorch. It's already full of incredibly powerful stuff, as you'll see. So we will be using more and more PyTorch during this course.

There is a question.
"Does pre-compiling mean that we'll write TensorFlow code and test it, and then when we train a big model, we pre-compile the code and train our model?" Yeah, so if we're talking about XLA, XLA can be used in a number of ways. One is that you come up with some different kind of kernel, some different type of factorization, something like that; you write it in TensorFlow, you compile it with XLA, and then you make it available to anybody, so when they use your layer they're getting this compiled, optimized code. It could also mean that when you use TensorFlow Serving, Serving might compile your code using XLA and be serving up an accelerated version of it. One example which came up was RNNs. RNNs nowadays, as you'll learn, often involve complex customizations: a bidirectional layer, then some stacked layers, and an attention layer, fed into a separate stacked decoder. You can fuse all that together into a single layer called a bidirectional attention sequence-to-sequence, and indeed people have actually built that kind of thing. So there are various ways in which neural network compilation can be very helpful.

"What is the relationship between TensorFlow and PyTorch?" There's no relationship. TensorFlow is Google's thing; PyTorch is, I guess, kind of Facebook's thing, but it's also very much a community thing. TensorFlow is a huge, complex beast of a system which uses all kinds of advanced software engineering methods all over the place. In theory, that ought to make it terribly fast; in practice, a recent benchmark actually showed it to be about the slowest, and I think the reason is that it's so big and complex that it's hard to get everything to work together. In theory, PyTorch ought to be the slowest, because define-by-run means there's way less optimization the system can do; but it turned out to be amongst the fastest, because it's so easy to write code, so much easier to write good code. So it's interesting. They're such different approaches, and I think it's going to be great to know both, because there will be some things that are fantastic in TensorFlow and some things that are fantastic in PyTorch. They couldn't be more different, which is why I think they're two good things to learn.

All right, so, wrapping up this introductory part: I wanted to reset your expectations about how you've learned so far versus how we're going to learn from here. Part one, to me, was about showing you best practices. Genuinely, it was: here's a library, here's a problem; use this library, in these steps, to solve this problem; do it this way, and lo and behold, we've gotten into the top ten of this Kaggle competition. I tried to select things that had established best practices, so you now know everything I know about best practices.
I don't really have anything else to tell you there, so we're now up to stuff I haven't quite figured out yet, and nor has anybody else, but which you probably need to know. Some of it, for example neural translation, is an example of something that is solved: Google solved it. But they haven't released the way they solved it, so the rest of us are trying to put everything together and figure out how to make something work as well as Google made it work. More often, it's going to be: here's a sequence of things you can do that gets some pretty good results, but there are a thousand things you could do to make it better that no one's tried yet; that's interesting. Or, thirdly, it could be: here's a sequence of things that solves this pretty well, but gosh, we wrote a lot of custom code there, didn't we? I'm sure this could be abstracted really nicely, but no one's done that yet.

Those are the three main categories. So generally, at the end of each class, it won't be "OK, that's it, that's how you do this thing." It'll be more like: here are the things you can explore. The homework will be to pick one of these interesting things and dig into it, and generally speaking, that homework will get you to a point that probably no one's reached before, or at least that probably no one's written down before. I found, as I built this course, that for nearly every single piece of code I'm presenting, I was unable to find anything online which did that thing correctly. There was often example code that claimed to be something like it, but again and again I found it was missing huge pieces, and we'll talk about some of those missing pieces as we go. One very common one was that it would only work on a single item at a time, not on a batch, and therefore the GPU was basically totally wasted. Or it failed to get anywhere near the performance claimed in the paper it was meant to be based on.

So generally speaking, there will be lots of opportunities, if you're interested, to write a little blog post about the things you tried and what worked and what didn't. You'll generally find that there's probably no other post like it out there, particularly if you pick a dataset that's in your domain area; it's very unlikely that somebody's written it.

"Yeah, going back: can we use TensorFlow and Torch together?" So, I didn't say Torch, I said PyTorch. Torch is very similar, but it's written in Lua, which is a very small embedded language, very good for what it is, but not very good for what we want to do. PyTorch is kind of a port of Torch into Python, which is pretty cool. So, can you use them together?
Yeah, sure. We'll see a bit of that: in general you can do a few steps with TensorFlow to get to a certain point, and then a few more steps with PyTorch. You can't integrate them into the same network, because they're very different approaches, but you can certainly solve a problem with the two of them together.

All right. For those of you who have some money left over, I would strongly suggest building a box, and the reason is that you're paying 90 cents an hour for a P2 instance. I know a lot of you are spending a couple of hundred bucks a month on AWS bills. Here is a box that costs $550 and will be about twice as fast as a P2; it's just not good value to use a P2, and it's way slower than it needs to be. Also, building a box is one of those things it's just good to learn: understanding how everything fits together. I've got some suggestions here about what box to build for various budgets. You certainly don't have to, but this is my recommendation. A couple of points from me.

More RAM helps more than people who discuss this stuff online quite appreciate. Twelve gig of GPU RAM means twice-as-big batch sizes, which means half as many steps to get through an epoch, which means more stable gradients, which means you can use higher learning rates. So more RAM, I think, is often underappreciated. The Titan X is the card that has 12GB of RAM. It's a lot more expensive, but you can get the previous generation's version second-hand; it's called the Maxwell. So there's the Titan X Pascal, which is the current one, and the Titan X Maxwell, the previous generation. The previous generation is not a big step back at all, and it still has 12GB of RAM; if you can get one used, that's a great option. The GTX 1080 and 1070 are absolutely fantastic as well; they're nearly as good as the Titan X, but they have 8GB rather than 12GB. Going back to a GTX 980, the previous generation's consumer top-end card, you halve the RAM again. So of all the places you're going to spend money on a box, put nearly all of it into the GPU; each of these steps up (1070, 1080, Titan X Pascal) is a big step up.

And as you will have seen from part one, if you've got more system RAM it really helps, because you can pre-compute more stuff and keep it in RAM. Having said that, there's a new kind of hard drive called an NVMe drive: non-volatile memory. NVMe drives are quite extraordinary; they're not that far away from RAM-like speeds, but they're drives, they're persistent. You have to get a special kind of motherboard, but if you can afford it (it's something like four or five hundred bucks for an NVMe drive), it really allows you to put all of your currently-used data on that drive and access it very, very fast. So that's my other tip.

"Doesn't the batch size also depend heavily on the video RAM?" That's what I was referring to with the 12 gig:
I'm talking about the RAM that's on the video card, on the GPU. "Does upgrading RAM allow bigger batch sizes?" You can't upgrade the RAM on the card; you buy a card that has x amount of RAM. The Titan X has 12GB, the GTX 1080 has 8GB, the GTX 980 has 4GB; that's on the card. Upgrading the amount of RAM in your computer doesn't change your batch size; it just changes the amount you can pre-compute. Unless you use an NVMe drive, in which case system RAM is much less important.

You don't have to plug everything in yourself. You can go to Central Computers, which is a San Francisco computer shop, for example, and they'll put it all together for you. There's a fantastic thread on the forums; Brendan, who's one of the participants in the course, has a great Medium post explaining his whole journey to getting something built and set up. So there's lots of stuff there to help you.

All right: it's time to build your box, and while you wait for things to install, it's time to start reading papers. Papers, if you're a philosophy graduate like me, are terrifying. They look like "Theorem 4.1" and "Corollary 4.2" on the left; that is an extract from the Adam paper, and you all know how to do Adam in Microsoft Excel. It's amazing how most papers manage to make simple things incredibly complex, and a lot of that is because academics need to show other academics how worthy they are of a conference spot, which means showing off all their fancy math skills. So if you really need a proof of the convergence of your optimizer, rather than just running it and seeing if it works, you can study Theorem 4.1 and Corollary 4.2 and so on.

In general, though, the way philosophy graduates read papers is: read the abstract to find out what problem they're solving; read the introduction to learn more about that problem and how previous people have tackled it; jump to the section at the end called "experiments" to see how well the thing works; and if it works really well, jump back to the section with the pseudocode and try to get that to work, ideally finding in the meantime that somebody else has written a blog post in simple English, like this example with Adam. So don't be disheartened when you start reading deep learning papers. Even if you have a math background, even if you have a PhD in math, they're still terrifying. ("Yeah, I still feel disheartened"; Rachel was complaining about a paper just today, in fact.) You will learn to read the papers.

The other thing I'll say is that, even now, you'll see a bit that says "and then we use a softmax layer", followed by the equation for a softmax layer, and you'll look at the equation thinking "what the hell?", and then realize, "oh, I already know what a softmax layer is." Or it'll say "and then we use an LSTM", and literally, still, in every paper, they write out the damn LSTM equations, as if that's any help to anybody. But OK, it adds more Greek symbols.
So be it. Talking of Greek symbols: it's very hard to read and remember things you can't pronounce. So if you don't know how to read the Greek letters, Google the Greek alphabet and learn how to say them. It's just so much easier when you can look at an equation and, rather than going "squiggle something, squiggle something", say "alpha something and beta something". I know it's a small thing, but it does make a big difference. We're all here to help each other read papers, and the reason we need to read papers is that, as of now, a lot of the things we're doing only exist in very recent paper form.

OK. So, I really think writing is a good idea; in fact, I hope all of your projects will end up in at least one blog post. If you don't have a blog, medium.com is a great place to write. We would love to feature your work on fast.ai, so tell us about what you create; we're very keen for more people to get into the deep learning community. When you write this stuff, say: "hey, this is some stuff based on this course I'm doing, and here's what I've learned, here's what I've tried, and here's what I found out." Put the code on GitHub. It's amazing: even just from us putting our little AWS setup scripts on GitHub for the MOOC, Rachel had like a dozen pull requests within a week (more than that, actually) with all kinds of little tidbits: "oh, if you're on this version of Mac, this helps with this bit", or "I've abstracted this out to make it work in Ireland as well as in America", and so forth. So there's lots you can do.

I think the most important tip here is: don't wait to be perfect before you start writing. What was that tip you told me, Rachel? "You should think of your target audience as the person who's one step behind you. So maybe your target audience is someone that's just working through the part one MOOC right now." Yeah: your target audience is not Yann LeCun or Geoffrey Hinton; it's you, six months ago. Write the thing you would love to have seen, because there will be far more people in that target audience than in the Geoffrey Hinton target audience.

How are we going for time, Rachel? "It's seven forty-five, so this might be a good time for a break." OK, well, let's just get through this, and then we can get on to the interesting part. I've tried to lay out what I think we'll study in part two. As I say, what I was planning, until quite recently, to present today was neural translation, and then two things happened: Google suddenly came up with a much better RNN and sequence-to-sequence API, and then two or three weeks ago a new paper came out for generative models which totally changed everything. So that's why we've redone things, and we're starting with CNN generative models today. Anyway, yes, we have a question: "Where do we find the current research papers?" OK, we'll get to that for sure; let's do that after our break.

So, assuming things go as planned, the general topic areas in part two will be CNNs and NLP beyond classification. If you think about it, pretty much everything we did in part one was classification, or a little bit of regression. We're now going to be talking more about generative models. It's a little hard to define exactly what I mean by generative models, but we're talking about creating an image, or creating a sentence; we're creating bigger outputs. OK, so: CNNs beyond classification.
Generative models for CNNs means that the thing we produce could be a picture showing "this is where the bicycle is, this is where the person is, this is where the grass is"; that's called segmentation. Or it could be taking a black-and-white image and turning it into a color image, or taking a low-res image and turning it into a higher-res image, or taking a photo and turning it into a Van Gogh, or taking a photo and turning it into a sentence describing it. NLP beyond classification could be taking an English sentence and turning it into French, or taking an English story and a question and turning them into an answer to that question about that story; that's chatbots and Q&A. We'll also be talking about how to deal with larger datasets, which means both datasets with more things in them, and datasets where the things themselves are bigger. And finally, something I'm pretty excited about: I've done a lot of work recently finding some interesting things about using deep learning for structured data and for time series. For example, we heard about fraud. Fraud combines both of those: time series (transaction histories and click histories) and structured data (customer information). Traditionally that hasn't been tackled with deep learning, but I've actually found some state-of-the-art, world-class approaches to solving those problems with deep learning, so I'm really looking forward to sharing that with you.

OK, so let's take a ten-minute (or eight-minute) break; come back at five to eight. Thanks very much.

OK, so we're going to learn about this idea of artistic style, or neural style transfer. The idea is that we're going to take a photo and make it look like it was painted in the style of some painter. So our inputs are a photo (the content image) and a style image, and these two are going to be combined to create a generated image which will hopefully have the content of the photo and the style of the painting.

The way we're going to do this is to assume there is some function whose inputs are the photo, the style image, and some generated image that I've created, and which returns a number: this function will be higher if the generated image really looks like this photo in this style, and lower if it doesn't. If we can create this loss function, one that basically says "here's my generated image" and returns a number saying how much that generated image looks like that photo in that style, then we can use SGD. And we would use SGD not to optimize the weights of a network; we would use SGD to optimize the pixel values of the generated image. We would be using it to optimize the value of this argument.

We haven't quite done that before, but conceptually it's identical. We can find the derivative of this function with respect to this input, and then we can try to optimize that input, which is just a set of pixel values, to maximize the function. So all we need to do is come up with the function: a function which tells us how much some generated image looks like this photo in this style. And the way we're going to do that, step one, is going to be very simple.
We're going to split it into two functions. F-content takes the photo and the generated image, and returns a bigger number if the generated image looks more like the photo, that is, if the content looks the same. Then there'll be a second function, F-style, which takes the style image and the generated image, and returns a higher number if the generated image looks like it was painted in the same style as the style image. So we can split the problem into two pieces and add them together. Now we need to come up with these two parts.

The first part is very easy. What's a way we could create a function that returns a higher number if a generated image is more similar to some photo? What loss function would do that? The really obvious one is the values of the pixels: the mean squared error between the pixels of the generated image and the pixels of the photo. That pixel-wise mean squared error would be one way of doing this part. The problem, though, is that as I start to turn the photo into a Van Gogh, those pixel values are going to change. They'll change color (the Van Gogh might be very blue-looking), and they'll change their relationships to each other; what used to be a straight line might become a curve. So really, the pixel-wise mean squared error is not going to give us much freedom in trying to create something that still looks like the photo.

So here's an idea instead: let's look not at the pixels, but take those pixels and stick them through a pre-trained CNN like VGG, and look at, say, the fourth or fifth or eighth convolutional layer's activations. Remember back to those Matt Zeiler visualizations, where we saw that the later layers captured things like "how much does this look like an eyeball?", or "how much does this look like a star?", or "how much does this look like part of a dog?". The later layers deal with bigger objects and more semantic concepts. So if we use a later layer's activations as our loss function, we can really change the style and the color and all kinds of things, and we'd really be asking: does the eye still look like an eye? Does the beak still look like a beak? Does the rock still look like a rock? If the answer is yes, then OK, that's good: this is something that matches in terms of the meaning of the content, even though the pixels look very different. And that's exactly what we're going to do. For F-content, we're going to say it's just the VGG activations of some convolutional layer (which one, we can experiment with). So that's actually enough for us to get started.
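To make the plan concrete, here's a tiny runnable sketch of the content half of the idea. All the names (features, f_content) are mine, and features is a toy stand-in for the VGG conv-layer activations we'll actually use; it's just here to show the shape of the computation SGD will minimize:

```python
import numpy as np

def mse(a, b):
    return float(((a - b) ** 2).mean())

# Toy stand-in for "activations of a late VGG conv layer".
# In the real notebook this is a forward pass through VGG16, not this.
def features(img):
    return img.mean(axis=-1)

def f_content(photo, generated):
    # Lower means the generated image's *content* matches the photo;
    # SGD will reduce this by nudging the pixels of `generated`.
    return mse(features(photo), features(generated))

photo = np.random.rand(288, 288, 3).astype(np.float32)
generated = np.random.rand(288, 288, 3).astype(np.float32)
print(f_content(photo, generated))
```

The style half, F-style, will get its own comparison later, and the two losses simply get added together.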
Let's try to build something that optimizes pixels, using a loss function based on one of the VGG network's convolutional layers. So: this is the neural-style notebook, and much of what we're going to look at will look very familiar. The first thing you'll see which doesn't look familiar is this thing called limit_mem. Remember, you can always see the source code for something by putting two question marks after it: limit_mem is just three lines of code, which somebody has kindly already pasted in the forum (there's also a sketch of them at the end of this walkthrough).

One of the many things I dislike about TensorFlow for our kind of work is that all of its defaults are production defaults. One of those defaults is that it will use up all of the memory on all of your graphics cards. I'm currently running this on a server with four graphics cards, which I'm meant to be sharing with my colleagues at the university here. If nobody else can use any of the graphics cards every time I run a notebook, they're going to be really, really pissed, and this nice little gig I have of running these classes will disappear very quickly. So I need to make sure I run limit_mem as soon as I start running a notebook. Honestly, I think this is a poor choice by the TensorFlow authors: somebody putting something into production is going to take the time to optimize things and doesn't care about the defaults, whereas somebody hacking something together to quickly see if it works very much wants nice defaults. This is one of the many, many places where TensorFlow makes odd, annoying little decisions. But anyway: every time I create a new notebook now, I copy this line in and make sure I run it, and then it does not use up all of your memory.

OK. I've got a link here to the paper that we're looking at, and indeed we can open it, so now is a good time to talk about how helpful it is to use some kind of paper-reading system. I really like this one; it's free, and it's called Mendeley Desktop. With Mendeley, as you find papers, you save them into a folder on your computer; Mendeley automatically watches that folder, and any PDF that appears there gets added to your library. It's really quite cool, because it then finds the arXiv ID, and you can click this little button here and it will go to arXiv (yes, I'll explain what arXiv is in a moment) and grab all of the information, such as the abstract and so forth, and fill it out for you. This is really great, because now, any time I want to find what I've read about style, I can type "style" and up pop all of the papers.

Believe me: after a long time of reading papers without something like this, I can tell you that it basically goes in one ear and out the other. Literally, I've re-read papers a year later and only at the end realized I'd read them before; I didn't remember anything else about them. Whereas this way, I really find that my knowledge builds, and as I find references, I'm immediately there looking at them. The other thing is that as you start reading a paper, your notes and highlights are saved, and they're also duplicated on your mobile devices and your other computers, all synced up. It's really cool.
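As promised, here's roughly what that limit_mem function comes down to, a sketch assuming the TensorFlow backend (the exact fast.ai utils version may differ slightly):

```python
import tensorflow as tf
import keras.backend as K

def limit_mem():
    # Ask TensorFlow to grow GPU memory allocation as needed, instead of
    # grabbing all the memory on every visible GPU when the session starts.
    cfg = tf.ConfigProto()
    cfg.gpu_options.allow_growth = True
    K.set_session(tf.Session(config=cfg))
```

Call it once at the top of every notebook, right after your imports.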
It's really cool so Talking about archive is a great time to answer a question we had earlier About how do you find papers? so the vast vast vast majority of deep learning papers Get put up on archive org before a long long long time before they're in any journal or conference So if you wait until they're in a conference proceedings, you're like Many many months or maybe even a year behind so pretty much everybody uses archive um, you can go to the AI section of archive and kind of See what's there But that's not really what anybody does um, what everybody does instead is um Use something well My favorite is archive sanity the archive sanity preserver And um, this is something that the wonderful andrew capati built um, and what it lets you do is to create a library of articles that Somebody tells you to read or that you're interested in or you come across And as you create that library by clicking this little save button It then recommends More papers like it or even once you start reading a paper You go show similar And it will then show you Other papers That are similar to this paper and it seems to do a pretty damn good job of it So you can really explore and get lost in in that whole area um So that's one great way to do it and then as you do that, um, you'll find that if you go to archive One of the buttons that it has is a Bookmark on mendeley button. So like even from the abstract here Bang straight into your library and the next time you light up mendeley. It's all there And then you can put things into Folders So the different parts of the course i've created folders for them and kind of kick track of what i'm reading that way A good little trick to know about archive.org is that the Um You often want to know where it's from and if you go to the first page on the left hand side You can see the date here and another cool tip is that the file name The first four digits are the year and month um for that file. Um, so there's a couple of annual tips um as well as archived sanity um another really great place for Finding papers is twitter Now if you haven't really used twitter before i haven't really used twitter for this purpose before it's hard to know where to start um So i try to make things easy for people by favoriting Lots of the interesting deep learning papers that i come across So if you go to germy p howards page and click on likes Um Do that um, you'll find that there is a thousand links to Papers here and as you can see there's generally a few every day That's useful for a number of reasons one is to get some ideas of papers to read But perhaps more importantly is to see who's posting these cool links Right, and then you can follow them as well actual can you throw that box to that gentleman? The black Just speak okay It's not a question just just information about the archive. There is someone who has built a skill on amazon Alexa And actually by asking Alexa to give the most recent paper from archive and actually she reads abstract for you And you can filter the most, you know, okay, that's Papers for you. Thank you um Great. Um, the other place which I find extremely helpful is um, uh reddit machine learning Again There's a lot less Goes through reddit than goes through twitter, but generally like the really interesting things tend to turn up here um, and um You can often see kind of the discussions of it. 
For example, there was a great discussion of PyTorch versus TensorFlow there in the last day or two. So those are a couple of good places to get started. Anything I missed, Rachel? "I think that's good. I have two questions on the image stuff, when you're ready." Go ahead. "One of them was whether the app Prisma is using something like this." Yes, Prisma is using exactly this. "And the other is: is it better to calculate F-content from a higher layer of VGG and use a lower layer for F-style, since abstract concepts are captured in the higher layers and the lower layers capture textures?" Probably; let's try it, shall we? We haven't learned about F-style yet, so we're just going to look at F-content first.

OK, so I've got some more links to things you can look at here in the notebook. The data I've linked to in the lesson thread on the forum: I've just grabbed a random sample of about 20,000 ImageNet images, and I've also put them into bcolz arrays, so you can set up your paths appropriately. I haven't given you this pickle; you can figure out how to get the file names easily enough, I'm not going to do everything for you. I've grabbed one of those photos (thank you to the person who's listing all the other things to pip install; that's very helpful), and this is going to be our content image. Given that we're using VGG, as per usual we're going to have to subtract the mean ImageNet pixel value and reverse the channel order, because of course that's what the original VGG authors did. So we create an array from the image by running it through that preprocessing function. Later on, we're going to be running things through a network and generating images, and for those generated images we'll have to add back that mean and undo the reordering; that's what this deprocessing function is for.

Now, I've hand-waved over these functions and how they work before, but I'm going to stop hand-waving for a moment, because it's actually quite interesting. Have you ever thought about how it is that we're able to take x, which is a four-dimensional tensor (batch size by height by width by channels; notice this is not the same as Theano, which was batch size by channels by height by width; we're not doing that any more, and this order is kind of more natural), and subtract from it a vector? How is that working? It works because of something called broadcasting. Broadcasting refers to any operation where you have arrays or tensors of different dimensions and you do element-wise operations between them. How does that work?

This idea actually goes back to the early 1960s, to an amazing programming language called APL. APL stands for "A Programming Language", and it was written by an extraordinary person called Kenneth Iverson.
Originally, APL was a paper describing a new mathematical notation, one designed to be more flexible and far more precise than traditional mathematical notation. Iverson then went on to create a programming language that implemented it; APL refers to the notation, which he described as "notation as a tool for thought". Unlike the TensorFlow authors, he really understood the importance of a good API: he recognized that mathematical notation can change how you think about math, and so he created a notation which is incredibly expressive. His son has gone on to carry the torch, and continues to support a direct descendant of APL called J. If you ever want to see what I think is the most elegant programming language in the world, go to jsoftware.com and check it out.

Now, how many of you here have used regular expressions? OK; and how many of you, the first time you looked at a complex regular expression, thought "that is totally intuitive"? You will feel the same way about J. The first time you look at a piece of J, you'll go, "what the bloody hell?", because it's an even more expressive, and much older, language than regular expressions. Here's an example of a line of J. What's going on is that this is a language which, at its heart, almost never requires you to write a single loop, because it does everything with multi-dimensional tensors and broadcasting. Everything we're going to learn about today with broadcasting is a very diluted, simplified, crapified version of what APL created in the early 60s. That's not to say anything rude about Python's implementation, which is one of the best; but J and APL totally blow it away. So if you want to really expand your brain and have fun, check out J. In the meantime, how does Keras / Theano / TensorFlow broadcasting work? Let's look at some examples.

(Oh, I like those too: the Import AI and WildML newsletters. That's from Rachel; thank you, Rachel.)

Here is a vector, a one-dimensional tensor, minus a scalar. It makes perfect sense that you can subtract a scalar from a one-dimensional tensor, but what is it actually doing? What it's actually doing is taking this 2 and replicating it three times, so this is actually an element-wise [1, 2, 3] minus [2, 2, 2]. It has broadcast the scalar across the three-element vector [1, 2, 3]. So that's our first example of broadcasting.

In general, broadcasting follows a very specific set of rules: you take the two tensors, and first of all take the shorter tensor, the one with fewer dimensions, and prepend unit axes to the front. What do I mean by "prepend unit axes"? Here's an example: take the vector [2, 3] and prepend three unit axes to the front; it's now a four-dimensional tensor of shape (1, 1, 1, 2). If you turn a row into a column, you're adding one unit axis; if you then turn that into a single slice, you're adding another. So you can always make something into a higher dimensionality by adding unit axes.
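Here's a small NumPy sketch of those rules, ending with the mean-subtraction case we started from (the per-channel means are the standard ImageNet values the VGG authors used):

```python
import numpy as np

v = np.array([1, 2, 3])
print(v - 2)                        # scalar broadcast -> [-1  0  1]

# Prepending unit axes: shape (2,) becomes (1, 1, 1, 2)
w = np.array([2, 3])
print(w[None, None, None, :].shape)

# (5, 1, 3, 2) minus (2,): w is treated as (1, 1, 1, 2), then every
# length-1 axis is copied to match, so the result is (5, 1, 3, 2)
x = np.ones((5, 1, 3, 2))
print((x - w).shape)

# The preprocessing case: a batch of images (batch, height, width, channels)
# minus the per-channel mean, channels reversed to the order VGG expects
rn_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32)
imgs = np.random.rand(4, 224, 224, 3).astype(np.float32) * 255
preproc = (imgs - rn_mean)[:, :, :, ::-1]   # mean broadcasts over all pixels
print(preproc.shape)                        # (4, 224, 224, 3)
```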
It's a scalar and turns it into a vector of length one And then what it does is it finds anything which is of length one and duplicates it Enough times so that it matches the other thing So here we have something which is a four-dimensional tensor of size five one three two All right, so it's got Two columns three rows one slice and five cubes something And then we're going to subtract from it a vector of length two So remember from our definition It's then going to automatically reshape this by prepending unit axes until it's the same length And then it's going to copy this thing three times This thing one time and this thing five times so The shape is five one three two All right, so it's going to subtract this vector from every row every slice every cube So you can play around with these little broadcasting examples and like Try to get a real feel for how to make broadcasting work for you So in this case we were able to take a four-dimensional tensor and subtract from it a three-dimensional vector Knowing that it is going to copy that three-dimensional vector of channels to every row To every column to every batch Right, so in the end it's like it's just done what we mean Okay, it's subtracted the mean average of the channels from all of the images the way we wanted it to but It's been amazing how often I've taken code that I've downloaded off the internet and made it I often 10 or 20 times more in terms of lines of code just by using lots of broadcasting And the reason I'm talking about this now is because we're going to be using this a lot All right, so play with it All right, and as I say if you really want to have fun play with it in j Okay So that was a diversion, but it's one that's going to be important throughout this. Okay, so we've now basically got the The data that we want So next thing we need is a vgg model Here's the thing though When we're doing generative models, we want to be very careful of throwing away information And one of the main ways to throw away information is to use max pooling When you use max pooling Let's say 2 comma 2 max pooling you're throwing away three quarters of the previous layer and just keeping the highest one right In generative models when you use something like max pooling You make it very hard to undo that and get back the original data So if we were to use max pooling with this idea of our f content, right and we say oh, what does the Fourth layer of activations look like if we've used max pooling then we don't really know What three quarters of the data look like slightly better Is to use average pooling instead of max pooling because at least with average pooling We're using all of the data to create that average We've still kind of thrown away three quarters of it But at least it's all been incorporated into calculating that average so The only thing I did to turn vgg 16 into vgg 16 average was to do a search and replace in that file From max pooling to average pooling All right, and it's just going to give us some slightly smoother slightly nicer results And you're going to see this a lot with generative models We do little tweaks just to try to make just to try to lose as little information as possible Okay, you can just think of this as vgg 16 Shouldn't we use something like resnet instead of vgg since the residual blocks carry more context? 
A question: shouldn't we use something like ResNet instead of VGG, since the residual blocks carry more context?
We'll look at using ResNet over the coming weeks. It's a lot harder to use ResNet for anything beyond basic classification, for a number of reasons. One is that the structure of ResNet blocks is much more complex, so if you're not careful you'll end up picking something on one of the little arms of the ResNet rather than at one of the additive merge points, and that's not going to give you any meaningful information. You also have to be careful because the ResNet blocks, most of the time, are just slightly fine-tuning their previous block by adding residuals, rather than adding new types of information. Honestly, the truth is I haven't seen any good research at all about how or where to use ResNet or Inception architectures for things like generative models or transfer learning. We're going to try to look at some of that in this course, but it's far from straightforward.
Two more questions. Should we put in batch normalization? In part one of the course I never actually added batch norm to the convolutional part of the model, so that's kind of irrelevant here, because we're not using any of the fully connected layers. More generally, is batch norm helpful for generative models? I'm not sure we have a great answer to that.
Would the pre-trained weights change if we're using average pooling instead of max pooling? That's a great question. Clearly the optimal weights would change, but having said that, it's still going to do a reasonable job without retraining, because the relationships between the activations aren't going to change. Again, this would be an interesting thing to try: download ImageNet, fine-tune the network with average pooling, and see whether you can actually see a difference in the outputs. It's not something I've tried.
OK. So here is the output tensor of one of the late layers of VGG16. If you remember, VGG is built from blocks: a number of 3x3 convolutions in a row, then a pooling layer, then another block of 3x3 convolutions, another pooling layer, and so on. This is the first convolution of the last block of conv layers, so maybe the third-last layer of the convolutional section of VGG. That means a large receptive field, with very complex concepts being captured at this late stage.
What we're going to do now is create our target: when we put our bird through VGG, what is the value of that layer's activations? One of the things I suggested you revise was the section of the Keras FAQ about how to get layer outputs. One simple way is to create a new model which takes our model's input as its input, but instead of using the final output as the output, uses this layer as the output. That gives us a model which, when we call .predict, returns this set of activations.
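As a sketch, that Keras FAQ trick looks something like this, assuming model is the VGG network built above and img_arr is the preprocessed bird image; the layer name is the one we're about to work with:

```python
from keras.models import Model

layer = model.get_layer('block5_conv1').output
layer_model = Model(model.input, layer)

# img_arr: the bird image, preprocessed into a (1, height, width, 3) array
bird_activations = layer_model.predict(img_arr)
```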
Now, we're going to be using those activations inside the GPU, as a target. To get something which (a) lives on the GPU and (b) can be used symbolically in a computation graph, we wrap it with K.variable.
To remind you about K: in the docs, Keras's backend module, keras.backend, is always imported as a capital K; I don't know why. K refers to the API that Keras provides for talking to either Theano or TensorFlow with the same interface. Both Theano and TensorFlow have a concept of variables and placeholders, and dot-product functions, subtraction functions, softmax activations and so forth, and this K module is where all of those functions live. So K.variable is just a way of creating a variable which, if we're using Theano, will be a Theano variable, and if we're using TensorFlow, will be a TensorFlow variable. Where possible I'm trying to use this rather than TensorFlow directly, but I could absolutely have used tf.Variable and it would work just as well, because we're using the TensorFlow backend.
So this has now created a symbolic variable that contains the activations of block5_conv1. What we want to do is generate an image, and use SGD-style optimization to gradually make that image's activations at this layer look more and more like this variable. How are we going to do that? Let's skip over the next cell for a moment and think about the pieces.
We're going to need to define a loss function, and the loss function is just the mean squared error between two things. One is the target we just created: the value of our chosen layer for the bird image. The other is whatever the value of that layer is at the moment. And what is `layer` equal to? At this stage it's just a symbolic object; there's nothing in it, and we'll have to feed it with data later. Remember, this is the interesting way you define computation graphs with TensorFlow and Theano: you define the graph with these symbolic things now, and you feed it with data later. So we've got this symbolic thing called `layer`, and we can't actually calculate anything yet; at this stage we're just building a computation graph. And of course, any time we have a computation graph, we can get its gradients.
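Here's a sketch of that step, continuing from the previous snippet; K.mean over K.square is just one way to spell the MSE, and the notebook may use a helper for it:

```python
from keras import backend as K

# Concrete target: the bird's block5_conv1 activations, wrapped so they
# live on the GPU and can sit inside the computation graph
targ = K.variable(layer_model.predict(img_arr))

# Symbolic loss: nothing is computed yet -- 'layer' only gets a value
# when we feed an image into the model later
loss = K.mean(K.square(layer - targ))
```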
Yes, we have a question: for readability, can you scroll down when you're going over code snippets so that they're centred? Yes. Thank you.
OK. So now we have a computation graph that calculates the loss function we're interested in; this is our f-content. If we're going to optimize our generated image, we're going to need the gradients. Here we get the gradients, and again we use K.gradients rather than TensorFlow's or Theano's gradients just so we can use whichever backend we like. The function we're trying to get gradients of is the loss we just built, and we want them with respect to, not some weights, but the input of the model. The input to the model is the thing we want to change so as to minimize our loss. So those are the gradients.
Now that we've done that, we can create our function. The input to the function is just the model's input, and the outputs of the function are the loss and the gradients.
That's nearly everything we need; the last step is to actually run an optimizer. Normally when we run an optimizer we use some kind of SGD, but the S in SGD stands for stochastic, and in this case there's nothing stochastic: we're not creating lots of random batches and getting different gradients every time. So why use stochastic gradient descent when we don't have a stochastic problem to solve? In fact, there's a much longer history of deterministic optimization methods, going back to Newton's method, which many of you will be familiar with.
The basic idea of these much faster deterministic methods is this. SGD says: which direction is downhill? Take a small step that way (learning rate times gradient), then another small step, and so on, because we have no idea how far to go, and since it's stochastic the direction will be totally different next time we look. A deterministic optimizer instead finds the direction to go, and then finds the optimal distance to travel in that direction. The way we find that optimum is: go a small distance, then twice as far, then twice as far again, and keep going until the slope changes sign. Once the slope changes sign, we've "bracketed" the minimum of the function along that direction. Then we use bisection to find the minimum: take the halfway point between the two, ask whether the minimum is to its left or its right, take the halfway point again, and so on. Using bracketing and bisection to find the optimum in a given direction is called a line search, and all of these optimization techniques rely on that basic idea.
Once you've done the line search, you've found the optimal value in that downhill direction. That doesn't necessarily mean you've found the optimal value across the entire space.
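To make the line search concrete, here's a toy sketch of bracketing and bisection. This is not what scipy actually does internally (its line searches are more sophisticated), just the idea described above, working on the directional derivative and assuming the function initially slopes downhill:

```python
def line_search(dg, t0=1e-3, tol=1e-8):
    """Minimize g(t) along a ray, given dg(t) = g'(t) and dg(0) < 0."""
    # Bracketing: keep doubling the step until the slope changes sign
    lo, hi = 0.0, t0
    while dg(hi) < 0:
        lo, hi = hi, hi * 2
    # Bisection: the minimum is where the slope crosses zero
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if dg(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# e.g. g(t) = (t - 3)**2, so g'(t) = 2*(t - 3); the minimum is at t = 3
print(line_search(lambda t: 2 * (t - 3)))
```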
So what we then do is repeat the process: find the downhill direction now, use a line search to find the optimum in that direction, and repeat. The problem is that at a saddle point you'll often find yourself going backwards and forwards in a rather unfortunate zigzag. So the faster optimization approaches, when they pick a new direction, don't just ask which direction is downhill; they ask which direction is the most downhill but also the most different from the directions already taken. That's called finding a conjugate direction.
The good news is you don't really need to know those details. All you need to know is that there's a module called scipy.optimize, and in scipy.optimize there are lots of handy deterministic optimizers. The two most commonly used are conjugate gradient (CG) and BFGS; they differ in the detail of how they decide which direction to go next, i.e. which direction is both the most downhill and the most different from the previous directions. The particular version we're going to use is a limited-memory BFGS (L-BFGS). The important thing for us is not how it works, but how to use it.
Yes, there's a question about `loss` plus `grads`. `[loss]` is a list containing a single thing, the loss. `grads` is already a list: the gradients of the loss with respect to all of the inputs. In Python, `+` on two lists simply joins them together, so this is a list containing the loss followed by all of the gradients.
Someone asked whether ant colony optimization could be used here. Ant colony optimization lives in the class known as metaheuristics, like genetic algorithms or simulated annealing. There's a wide range of optimization algorithms designed for functions which are very difficult to optimize because they're extremely bumpy, and those techniques all use a lot of randomization to avoid getting stuck in the bumps. In our case we're using mean squared error, which is a nice smooth objective, so we can use the much faster methods. (That also answers the next question, about whether this is a convex or a non-convex optimization.)
OK, great. So how do we use one of these optimizers? Basically, you call the optimizer, which in this case is fmin_l_bfgs_b, and you have to pass it three things: a function which returns the loss value at the current point, a starting point, and a function which returns the gradients at the current point. Unfortunately, we have one function which returns the loss and the gradients together, which is not what it wants. So a minor detail is that we create a simple little class, and the details really aren't important. All this class does is that when its loss method is called, it calls the function we created, passing in the current value of the data; it gets back the loss and the gradients, stores the gradients, and returns the loss. Later, when the optimizer asks for the gradients, it returns the gradients it stored.
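That class is tiny. Here's a sketch close to what's in the notebook, assuming fn is a K.function built from [model.input] to [loss] + grads as above, and shp is the shape of the model's input array:

```python
import numpy as np

class Evaluator:
    """Turn one Keras function returning (loss, grads) into the two
    separate callables that scipy's deterministic optimizers expect."""
    def __init__(self, fn, shp):
        self.fn, self.shp = fn, shp

    def loss(self, x):
        # scipy hands us a flat float64 vector; reshape it for the model
        loss_, self.grad_values = self.fn([x.reshape(self.shp)])
        return loss_.astype(np.float64)

    def grads(self, x):
        # Return the gradients stashed by the most recent loss() call
        return self.grad_values.flatten().astype(np.float64)
```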
So all this little class does is turn a Keras function that returns the loss and the gradients together into two functions: one which returns the loss, and one which returns the gradients. It's a pretty minor detail, but it's a handy thing to have in your toolbox, because it means you now have something that lets you use deterministic optimizers on Keras functions.
So all we do is loop a small number of times, calling that optimizer each time, and we need to pass in a starting point, which is just a random image. Here's what a random image looks like. Let's go ahead and run that so we can see the results... there it comes. Here's one I prepared earlier, and here, at the end of the tenth iteration, is the result.
So remember what we did. We started with this random image, and we called an optimizer which attempted to minimize this loss function, where the target was the value of block5_conv1 for our bird image, and the thing being compared to it was the same layer for the generated image. We ran that optimizer a bunch of times, calculating the gradient of the loss with respect to the input of the model, the very pixels themselves, and after ten iterations it had turned the random image into this. So this is the image that optimizes the loss on block5_conv1. You can see it still looks like a bird, but by this point it really doesn't care what the background looks like. It cares a lot what the eye looks like, and the beak, and the feathers, because those things all matter to ImageNet for correctly deciding it's a bird. If we look at an earlier layer instead, say block4_conv1, you can see it gets the details more correct. So when we do our artistic style, we can choose which layer will be our f-content: if we choose an earlier one, the optimizer has fewer degrees of freedom to make it look like a different kind of bird, and the result will look more like our original bird.
And here's a video showing how that happens; there are the ten steps. It's often helpful to be able to visualize the iterations of your generators at work, so feel free to borrow this very simple code. You can just use matplotlib. We actually used this in the last class, when we animated our little linear optimizer: you define a function that gets called at each step of the animation, then call animation.FuncAnimation passing in that function, and that's a nice way to animate your own generators.
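The driving loop itself is only a few lines. A sketch, assuming the Evaluator above and an input shape shp; the noise amplitude and the clipping range are incidental choices:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def rand_img(shape):
    # Low-amplitude noise as a starting point
    return np.random.uniform(-2.5, 2.5, shape) / 100.

x = rand_img(shp)
for i in range(10):
    x, min_val, info = fmin_l_bfgs_b(evaluator.loss, x.flatten(),
                                     fprime=evaluator.grads, maxfun=20)
    x = np.clip(x, -127, 127)   # keep pixel values in a sane range
    print('Iteration %d, loss %.2f' % (i, min_val))
    # (save or plot x.reshape(shp) here if you want to animate the steps)
```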
There's a question about the checkerboard artifact, the geometric pattern appearing here. This is actually not a checkerboard artifact exactly; checkerboard artifacts, which we'll look at later, look a little bit different, and that was my interpretation mistake, not the questioner's. Honestly, I'm not exactly sure why this particular kind of noise has appeared. It's an interesting question.
And how would batching work here? It doesn't; there's no batching to do. We have a single image which is being optimized, so there's really nothing to batch. We'll look at a version which uses a very different approach and does have batching shortly.
Has anyone tried something like this by averaging or combining the activations of multiple bird images, to create some kind of prototypical or novel bird? Generative adversarial networks do something like that, but probably not quite.
Where can people get the pickle file? You don't; you have to build the list of filenames yourself from the files you've downloaded.
Someone says: in this example we started with a random image, but what if we started with the actual image as the initial condition? We would get the original image back, right? I would assume so, yeah; I can't see why we wouldn't, since the gradients would obviously be zero. And they're interested to find out where we initialize for the artistic styling problem. We're going to get there; that was just a follow-up.
One more: would it be useful to use a tool like Quiver to figure out which VGG layer to use? It's so easy to just try a few and see what works.
OK, so we're nearly out of time. I haven't got through as much as I hoped, but let's finish off this piece. We're now going to do f-style. F-style is nearly identical; all of the code is nearly the same. The only differences are, first, that we're not going to feed in a photo, we're going to feed in a painting. Here are a few styles we could choose from: we could do Van Gogh, we could do this sort of drawing, or we could do the Simpsons. So we pick one of those and create the style array in the same way as before, and chuck it through VGG. This time, though, we're going to use multiple layers. I've created a dictionary from the name of each layer to its output, and we're going to use that to grab a number of the block outputs.
So we create a target as before, but we're going to use a different loss function, called style loss. Just like before, it's going to use the MSE, but rather than the MSE on the activations themselves, it's the MSE on something called the Gram matrix of the activations. What is a Gram matrix? Very simply, it's the dot product of a matrix with its own transpose. So here it is: the dot product of some matrix with its own transpose, divided by a count to make it an average. And what is this matrix that we're taking the dot product of? We start with our image, which remember is height by width by channels, and we change the order of the dimensions so it's channels by height by width. Then we do a batch flatten, which takes everything except the first dimension and flattens it out into a vector.
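In backend code, that's just a couple of lines. A sketch matching the steps just described; the num_elements normalization assumes the TensorFlow backend:

```python
from keras import backend as K

def gram_matrix(x):
    # x: one image's activations, shape (height, width, channels).
    # Reorder to (channels, height, width), then flatten all but the
    # first dimension -> a (channels, height*width) matrix
    features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
    # (c, h*w) dot (h*w, c) -> (c, c), averaged over the number of elements
    return K.dot(features, K.transpose(features)) / x.get_shape().num_elements()
```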
So this is now going to be a matrix where the rows are the channels and the columns are a flattened version of the height by width. If the input is c by h by w, the result has c rows and h times w columns. When you take the dot product of something with the transpose of itself, what you're basically creating is something a lot like a correlation matrix: you're asking how similar each row is to each other row. You can think about it a number of ways: a dot product is basically a cosine, and a correlation matrix is basically a normalized version of this. If it's not clear to you, write it down on a piece of paper on the way home tonight. Think about taking the rows of a matrix and flipping them around into columns, then multiplying the rows by the columns: that's the same as comparing each row to each other row.
So that's what this Gram matrix is. It says, for every channel, how similar are its values to those of each other channel? If channel one is very similar to channel three across most of the image, then entry (1, 3) of the result will be a high number. It's a kind of weird matrix: it's like a fingerprint of how the channels relate to each other in this particular image, or how the filters relate to each other in a particular layer for this particular image. The most important thing to recognize is that there is no geometry left here at all. The x and y coordinates are totally thrown away; they've literally been flattened out. So this loss function, by definition, can in no way contain anything about the content of the image, because it's thrown away all of the x and y information. All that's left is some kind of fingerprint of how the channels relate to each other, how the filters relate to each other.
The style loss, then, says: for two different images, how do these fingerprints differ? How similar are they? And it turns out that if you now do the exact same steps as before, using that as your loss function, and run it through a few iterations, it looks like that: a lot like the original Van Gogh, but without any of the content.
So the question is: why? And the answer is: nobody the fuck knows. A paper came out two weeks ago called "Demystifying Neural Style Transfer", with a mathematical treatment in which they claim to have an answer to this question, but from the point this technique was created, a year and a half ago, until now, no one really knew why it happens. The important thing the authors of the original paper realized is that if you can create a function that gives you content loss, and a function that gives you style loss, and you add the two together and optimize them, you can do neural style. They don't say in the paper how they came up with it. All I can assume is that they tried a few different things, knowing they had to throw away all of the geometry, and at some point they looked at this and went: oh, that's it.
So now that we have this magical thing (there's the Simpsons version), all we have to do is add the two together.
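The style loss itself is one line on top of gram_matrix, summed over the chosen layers. A sketch, where style_layers (the symbolic layer outputs) and style_targs (K.variables holding the painting's activations) are assumed to have been built the same way as the content target:

```python
from keras import backend as K
# gram_matrix: as defined in the earlier sketch

def style_loss(x, targ):
    # MSE between the two channel-correlation "fingerprints"
    return K.mean(K.square(gram_matrix(x) - gram_matrix(targ)))

# Index [0] drops the batch dimension so gram_matrix sees (h, w, c)
loss = sum(style_loss(l[0], t[0]) for l, t in zip(style_layers, style_targs))
```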
So here's our bird, the source image, and here are our style layers; I'm actually going to take the top five now. Here's our content layer: I'm going to take block4_conv2. As promised, for our loss function I'm just going to add the two together: the style loss over all of the style layers, plus the content loss, and I'm going to divide the content loss by ten. That weighting is something you can play with, and in the paper you'll see they play with it too: how much style loss versus how much content loss. Get the gradients, create the Evaluator, solve it, and there it is. So other than the fact that we don't really know why the style loss works, everything else kind of fits together. There's the bird as Van Gogh, there's the bird as the Simpsons, and there's the bird in the style of a bird picture.
There's a question: since the publication of that paper, has anyone used any other loss functions for f-style that achieve similar results? Yeah, as I mentioned, just a couple of weeks ago there was a paper (I put it on the forum) that tries to generalize this loss function. It turns out that this particular loss function actually seems to be about the best they could come up with. But there you go.
So, it's nine o'clock and we've run out of time, so we're going to move some of this lesson to the next one. But to give you a sense of where we're heading: we're going to take this thing where you have to optimize every single image separately, and instead train a CNN which learns how to turn a picture into the Van Gogh version of that picture. That's basically what we're going to learn next time. We're also going to learn about adversarial networks, where we create two networks: one designed to generate pictures like this, and the other designed to classify whether a picture is a real Simpsons picture or a fake one. Then you run one through the other, generate, discriminate, generate, discriminate, and by doing that we can take any generative model and make it better, by basically having something else learn to pick the difference between the real and the fake. And finally we're going to learn about a particular thing that came out three weeks ago called the Wasserstein GAN, which is the reason I actually decided to move all of this forward. Generative adversarial networks basically didn't work very well at all until about three weeks ago; now that they do work, suddenly there's a shitload of stuff that nobody's done yet, which you can do for the first time. So we're going to look at that next week.