Thank you. So I'm Paul O'Grady, and I'm going to talk about PyTorch and autograd. Just a little bit about me: I'm currently a data scientist at Zalando in Dublin, an online fashion retailer. I'm also a member of the PyCon Ireland organizing committee; we'll have our conference in October this year, if you're interested in coming to Dublin. And I'm on Twitter, if you'd like to follow me.

A quick overview of what I'll talk about today. I'll give some background on matrix computations and deep learning, and then I'll get into PyTorch: tensors and variables. I'll talk about autograd, which is a module within PyTorch, and discuss gradients and back-propagation. I'll then give a simple hello-world example of how to use the framework, just linear regression. And at the end I'll illustrate the difference between a define-by-run framework, which is what PyTorch is, and a define-and-run framework.

Okay, so a quick background. Before deep learning, GPUs, and the frameworks built around GPUs, computation was basically constrained to CPUs. The sorts of tools you would have used, and that we still use, are things like BLAS, which gives you basic matrix computation routines, and LAPACK, which gives you routines for solving linear systems, that sort of thing. And in Python you'd be using NumPy, which you're probably familiar with. All these tools are based around performing computations on the CPU. NumPy supports multi-dimensional arrays, but at that time there wasn't great support for tensor operations and the like.

With deep learning, which you've probably heard of, the models are quite large and you need a different paradigm for computing the models and updating the gradients, so people started using GPUs, and that is basically what enabled deep learning. Here we use tensors, and there are a number of frameworks: you probably know TensorFlow and Theano, as well as PyTorch, which I'll talk about now. We do the computation on the GPU because it's an order of magnitude faster.

Another thing these frameworks offer, in addition to doing your computation on the GPU, is support for calculating gradients, through either symbolic or automatic differentiation. That's important for machine learning algorithms because you usually have to calculate gradients to feed into an optimization algorithm. Before these tools you'd do that by hand: pen and paper, work out what the gradients are for your update rule. Now there are tools that will do that for you straight from the cost function.

The great thing about all this is that Python is at the center of it. All the frameworks I mentioned have bindings for Python, and I suppose the new kid on the block is PyTorch; that's what I'll talk about.

So, PyTorch. The first thing to mention is that Torch itself has been around a long time. It's a Lua framework, and a lot of great work has been done within that community to build Torch. PyTorch has bindings around it, and they still share common C libraries; Lua Torch is still its own separate project as well.
PyTorch is quite a new framework; I think the beta release was in January. The key difference between PyTorch and other frameworks is that it's a define-by-run framework, as opposed to define-and-run; we'll discuss that later on. The reason I like PyTorch is that it looks more like the programming you know, and it feels more Pythonic to me. With other frameworks you feel there's a lot of setup going on before you actually get to do your training, which is fine, but with PyTorch it just feels a little bit nicer.

The main components of PyTorch: there's the nn module, which allows you to build and train neural networks, which is what deep learning is. These networks can become very complicated, with many different types of modules and layers, and PyTorch is used at the cutting edge of research, so it supports all of that. Another thing PyTorch has is autograd, which gives you nice support for automatic differentiation; I'll talk a bit about that. And once you have your gradients from the automatic differentiation, you can use them to update the model parameters of your deep learning network, or your linear regression, using the optimization routines in the framework.

I'm going to give a few code examples here. The version of PyTorch these examples are written for is 0.1.12, running on Python 3.5.

To give you a quick overview of the status of the project: this is a tweet I saw recently, which shows the aggregate activity on GitHub repos for deep learning libraries. PyTorch is fifth there, which is quite good for such a young project. At the top you obviously have TensorFlow, with Google and big companies behind it, but even after a few months PyTorch is doing very, very well in fifth spot.

So, first, tensors, and I have a simple example here. Just to be clear, when we're talking about matrix multiplication and that sort of thing, you would not use native Python lists. You can model a matrix as a list of lists, but you wouldn't implement it that way because it's very inefficient: lists can contain items of arbitrary type, so there's a lot of memory-handling overhead that you want to avoid when you're doing big computational jobs. So you're always going to use a dedicated framework or module; you might use NumPy.

The way PyTorch handles this is with tensors. These tensors can be initialized with lists, as you can see here, but the underlying memory layout mirrors more a C array, where the individual items are laid out contiguously; that's the foundation of these kinds of data structures. If you're used to NumPy, then once you have your tensor, or ndarray, you can perform many different types of operations and extract many different kinds of information: you can determine, say, the size of the tensor, and you can add, both in place and out of place. Here's a simple example where I set up a tensor of ones, add two tensors of ones together, and then do an in-place add. PyTorch supports in-place operations, and they are identified by an underscore at the end of the method name.
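To make that concrete, here's a minimal sketch of the kind of thing I mean; the exact values are my own, not from the slides:

```python
import torch

x = torch.Tensor([[1, 2, 3], [4, 5, 6]])  # build a tensor from a list of lists
print(x.size())                            # torch.Size([2, 3])

a = torch.ones(2, 3)
b = torch.ones(2, 3)
c = a + b                                  # out-of-place add: a is left untouched
a.add_(b)                                  # in-place add: trailing underscore mutates a
print(c)
print(a)
```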
So, I mentioned NumPy. You may be in a situation where you have your data already prepared or transformed using NumPy, and you want to get it into these PyTorch tensors, and the simple way of doing that is the NumPy bridge that PyTorch supports. Here I have a NumPy array, y_np, which I've defined, and I can multiply it with the float tensor I created previously by calling the numpy method on the tensor, and it's as simple as that. If you look at the result, it's basically a NumPy array. That's handy, so it's very easy to mix NumPy and PyTorch.

You can also go the other way around. Say we want to matrix-multiply two tensors: we set up z, which is a NumPy array, and then using torch.from_numpy we create a tensor, and we get the result. A two-by-three multiplied by a three-by-two gives you a two-by-two matrix, and because we're using short integers, you get back a torch.ShortTensor. So that's all handy stuff you get from the framework.

Other operations you can do: you can change the shapes of your tensors, as you'd expect. Another nice thing about PyTorch is that you can be very explicit about where the computation happens. I mentioned the GPU, and that's where you do most of your computation, but you can say so explicitly: you set up your tensor, and then you can specify that it should be on the GPU. Likewise, you can say you'd like to move it to the CPU. It's very explicit, so you know where things are being moved to. You do have to be careful with your copies and so on, but it's nice to be explicit like that, and that's a nice feature.

Another important component of PyTorch is variables. Variables are imported from the autograd package, and a variable is basically a thin wrapper around a tensor. What the variable gives you is a computation graph, and when it comes to back-propagation and updating your model parameters in machine learning, it allows you to do all of that as well, such as accumulating gradients. For machine learning and deep learning it's important to know the history of how a value was calculated. That suggests a DAG, a directed acyclic graph, or some sort of computation graph, and that's basically what the variable gives you: it keeps a history of how its value was calculated and gives you the structure of the graph that was used to create it. That's useful when it comes to the automatic differentiation; it's what facilitates the back-propagation of gradients.

I just want to mention that all the slides here are in the context of the learning step of machine learning. Once your model is learned, you don't need any of this back-propagation and gradient machinery. You can still use variables, but the way to say that you don't need the gradient bookkeeping, because you just want to run at inference time or test time, is to set the volatile flag on your variable. You'd do that if you're running the model in production.
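Roughly, the pieces I've just described look like this; a sketch, assuming the 0.1.x-era API (the volatile flag, in particular, was removed in later releases):

```python
import numpy as np
import torch
from torch.autograd import Variable

t = torch.ones(2, 3)                                # a FloatTensor
a = t.numpy()                                       # view the tensor as a NumPy array
t2 = torch.from_numpy(np.arange(6).reshape(2, 3))   # and go back the other way

# Be explicit about where the computation happens
if torch.cuda.is_available():
    t = t.cuda()                                    # move the tensor to the GPU
    t = t.cpu()                                     # ...and explicitly back to the CPU

# Inference/test time: volatile switches off the gradient bookkeeping
v = Variable(torch.ones(2, 3), volatile=True)
```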
So here's a simple example of a variable; it's imported from autograd. You can see it's initialized with a tensor; I've just got two simple tensors here, X and Y. I then add them together to get Z, and then I just echo out the value of Z after the addition. As I said, variables keep history, so there's an additional property on the Z variable, because it's an autograd variable, called creator, which is this one here. That basically tells you the last operation that was performed to calculate the result, and in this case it shows you it's an addition. I can then take Z and do, say, a sum operation on the contents of that tensor and assign that to S. If I look at the creator for S, you can see it identifies that the last operation was a sum.

So that's the creator property, and it's an interesting thing: the creator gives you a reference into a data structure where you can chase the references and construct the graph for yourself. I have a simple recursive function here that prints out the different operations, and the data when it reaches the data. By passing in s.creator you can see the different operations that have been applied to S; for S here I've also added a constant, and you can see the operations in turn, the add-constant, then the sum, then the add, and then it displays the data that was there. That's a nice feature: you can imagine that for big deep learning networks you could construct a graph like this, or a fancier one, with, say, the dot library.

A variable also keeps a version: each time the variable S is mutated, it increments this version number, and that's used in the machinery of autograd. So that's a nice feature as well.
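To make that creator-chasing idea concrete, here's a rough sketch of a graph walker like the one I described. The attribute names (creator, previous_functions) are from the 0.1.x internals and are an assumption on my part; later releases renamed them to grad_fn and next_functions:

```python
import torch
from torch.autograd import Variable

def walk(node, depth=0):
    # Recursively chase references from a creator down to the leaf data
    pad = "  " * depth
    if isinstance(node, Variable):
        print(pad + "data: %s" % (node.data,))      # leaf: show the raw data
    else:
        print(pad + type(node).__name__)            # e.g. Add, Sum, AddConstant
        for prev, _ in getattr(node, "previous_functions", []):
            walk(prev, depth + 1)

x = Variable(torch.ones(5), requires_grad=True)
y = Variable(torch.ones(5), requires_grad=True)
s = (x + y).sum() + 1                               # AddConstant <- Sum <- Add
walk(s.creator)
```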
Okay, so I'll talk quickly about autograd. Autograd provides classes and functions that implement automatic differentiation of arbitrary scalar-valued functions. Scalar-valued functions: we're usually talking about cost functions, loss functions, that sort of thing, and they're usually scalar-valued, which is why it's highlighted there. A quick note that the type of AD implemented in autograd is reverse-mode automatic differentiation.

This autograd module obviously comes through Lua Torch, but it was originally inspired by another Python project, also called autograd. So it's gone from Python into Lua, and now full circle back into Python through PyTorch. The define-by-run paradigm is inspired by the Chainer framework. The PyTorch docs also mention that the autograd implementation is pretty quick, which is great. And to use the automatic differentiation, the only change you need to make is to swap your tensors for variables; it's a very, very simple thing to do.

Okay, so we're going to talk about back-propagation and automatic differentiation, so I'll just quickly discuss differential calculus, to illustrate why we need all this. You may or may not remember from school or college: you have a function, and you'd like to know the gradient of the tangent at a given point. The equation here calculates the derivative from first principles, f'(x) = lim as h goes to 0 of (f(x + h) - f(x)) / h; in practice you take your original function and, using the rules of differential calculus, you get a derivative. Then, using the initial x input and the new derivative, which is also a function, you just apply the derivative to that input, and the value that comes out tells you the gradient, basically the slope of the tangent line. So we're really talking about points on a curved function and calculating the gradient there, the slope of a line.

We use these slopes, or gradients, to indicate something of value to us. In terms of machine learning and optimization, they allow us to identify the extrema of functions, the maximum points and minimum points, and the way we do that is by looking at the gradients of the functions. That's why we use differential calculus.

So, back-propagation. Because we're talking about multi-layer networks, it's difficult to calculate the derivatives of those kinds of networks directly, so there's an algorithm that will do it, a famous one you probably know about: back-propagation. It calculates the gradients of your model parameters with respect to the loss function, pushes the gradients back through the network, and allows you to update the model parameters to get a better solution. The rule we use to achieve this is the chain rule, a standard rule in differentiation; back-propagation iteratively applies it through all the different layers, and that's how it pushes the gradients back through the network and lets you update the parameters.

Autograd does this back-propagation for us, and it's implemented simply as a backward method on our variables. So the stuff we've been doing with variables has been setting us up for calculating derivatives and doing backprop.

As a simple example, just to convince ourselves that back-propagation gives us what we expect, recall that the derivative of the sine function is the cosine function. I set up a simple x variable, which is one period of a sine wave, from zero to two pi, and I calculate the sine of that variable using torch.sin, which goes into out. Another feature of variables is the additional grad property, which will echo out the gradients for that variable if they've been calculated; at this point they haven't been, so it returns nothing. Then I calculate the gradients by calling the backward method, and at that point the gradients are computed for you.
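A minimal version of that check might look like the following; a sketch against the Variable API used in the talk (note that backward on a non-scalar output needs a gradient argument, and that in some versions .grad is a plain tensor rather than a Variable):

```python
import numpy as np
import torch
from torch.autograd import Variable

# One period of a sine wave, 0 .. 2*pi
x = Variable(torch.linspace(0, 2 * np.pi, 100), requires_grad=True)
out = torch.sin(x)

# Backward with a tensor of ones gives the elementwise derivative d(out)/dx
out.backward(torch.ones(x.size()))

g = x.grad.data if hasattr(x.grad, "data") else x.grad
print(np.allclose(g.numpy(), np.cos(x.data.numpy()), atol=1e-5))  # True
```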
This populates the gradients, so the next time you look at grad you'll see there are gradients there. First I look at out; I convert it to NumPy just to make it a bit easier to read. For the inputs, which run from zero to two pi, the output is zero, one, zero, minus one, zero, which is basically a sine. Then, looking at the gradients, we can see the same shape shifted: one, zero, minus one, zero, one, which is basically the cosine. So we've convinced ourselves that it has calculated the gradients for us, and we do one more check where we compare directly against cosine. That's great, it worked, and we're happy with all that.

So how does all this work? Deep in the code of PyTorch, the sine function is actually an instance of a Sine class, which inherits from Function, imported from autograd. You can see two methods here: you can see the sine computation in the forward method, but there's also backward, and if we look closely, we see that the backward method just calls cos, which is the correct thing to do; we know in advance that the derivative of sine is cosine. If you've studied calculus you know the standard tables of functions and their derivatives; that's basically what this encodes, and it's what specifies the gradient formula that gives you a gradient. And if you actually wanted to extend PyTorch, this is the pattern you'd use: you'd subclass Function and create a class where you have your forward method, and then you'd define the gradient function in the backward method.
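Here's roughly what that pattern looks like, sketched against the old-style Function API from that era (forward and backward as instance methods with save_for_backward; newer releases use static methods and a ctx object instead). The class name is my own:

```python
import torch
from torch.autograd import Function

class MySine(Function):
    def forward(self, x):
        self.save_for_backward(x)            # stash the input for the backward pass
        return torch.sin(x)

    def backward(self, grad_output):
        # The gradient formula: d/dx sin(x) = cos(x), combined via the chain rule
        x, = self.saved_tensors
        return grad_output * torch.cos(x)

# usage: y = MySine()(some_variable)
```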
Just one more simple example: I wanted to display a tangent, similar to before. I create an x variable, which is a linear space; note that requires_grad is set to True here. Then I set up a quadratic equation. I'm only interested in looking at the gradient at one point within our linear space, so I set that up, and finally I call the backward function on y, using the gradient tensor set up here. The main thing we're interested in is the value of the gradient at the point I've identified, indicated by this index here. I know the value here should be minus four; I actually get roughly minus 3.9, because of the discretization, and from this information I can calculate the tangent. I then plot the original quadratic function together with the gradient and tangent, and it just confirms what we'd expect: we have our point at minus one on the quadratic function and a nice tangent there, as you can see in the plot. That's great.

Okay, so I'll finish off with a simple hello-world example: linear regression. In linear regression you have a cloud of data points, and you try your best to fit a line to that cloud, minimizing the distance between the line and all of the points. The model for this is quite simple, just a linear relationship: y = alpha * x + beta, where alpha is a weighting and beta is an intercept; x is called the independent variable and y the dependent variable. This model is actually already implemented in PyTorch, so you don't have to implement it yourself; it's in the nn module, alongside the other kinds of layers and models you can see there, convnets and so on.

So that's our model, and this is what our data looks like. I've created a noisy example, so I've introduced some noise; alpha is 2 and beta is 3, and you can see the intercept is around 3.

Now, we've been talking about models, which is great, but if you're interested in learning you need other things as well: you need a cost function, which I'll discuss, and you also need to specify a learning algorithm to update your model parameters. The cost function we use is a pretty standard one, mean squared error. The value of this cost function is zero when your predicted values and your actual values are exactly the same, which is what you want; it's a very common choice. And this is where we use our gradients: we take the gradients of our cost function, the MSE, with respect to the model parameters, and those gradients are used within the optimization algorithm, here stochastic gradient descent. So it's a further step after you've calculated your gradients.

Okay, so let me show you the code. This is how I set up my data, my x and y, and then we set up our linear regression model. This is how you set up a model in PyTorch: in the init method you specify the different layers or modules that you need, in our case just one simple linear layer, and then in the forward method you describe how these layers are combined. For us it's very simple, but you can imagine that for very big models, with a lot of layers and a lot of interactions between them, there's much more to specify in the forward method. Finally we instantiate it, and because we're only interested in one-dimensional regression, that's what we specify for the model. I mentioned we need a cost function; I said we'd use MSE, so I specify that here, and I also specify stochastic gradient descent and the learning rate. Then we have our training data, which is basically the cloud of points I showed.

Before we do any training, you can actually inspect the model parameters to see what their values are. On the model there's a method called named_parameters, which is basically a generator over the different parameters. Here we can see a weight, which is our alpha, and a bias, which is our beta; those are what will be changed, and then we'll have our line for the data.

This is the main training loop that actually trains the model; this is the way PyTorch does its training, within loops. There are five steps here. The first step is to zero the gradients on the model parameters: gradients accumulate, and if you don't zero them they'll just continue to accumulate. The second step is to take your model and your inputs and get the outputs, so you're just running the model on the input. Then, in the third step, you compare the outputs to the labels using the criterion, and this gives you your loss. The loss is a scalar; when you look at convergence plots in machine learning, this is the single value being plotted. In the fourth step, using this loss, you compute the gradients by calling backward. And once you have your gradients, you can then update the model parameters using the optimizer, which you do by calling step on the optimizer. So you can see there's a lot going on within the loop; the steps are very explicit, and this is the define-by-run paradigm, so just keep that in mind.
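Putting the five steps together, a hypothetical re-creation of the whole example might look like this; the data generation, learning rate, and epoch count are my own stand-ins rather than the values from the slides:

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

# Noisy data around y = 2x + 3 (alpha = 2, beta = 3)
x = torch.linspace(-1, 1, 100).view(-1, 1)
y = 2 * x + 3 + 0.1 * torch.randn(x.size())
inputs, labels = Variable(x), Variable(y)

class LinearRegression(nn.Module):
    def __init__(self):
        super(LinearRegression, self).__init__()
        self.linear = nn.Linear(1, 1)     # one layer: y = alpha*x + beta

    def forward(self, x):
        return self.linear(x)

model = LinearRegression()
criterion = nn.MSELoss()                  # cost function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(200):
    optimizer.zero_grad()                 # 1. zero the accumulated gradients
    outputs = model(inputs)               # 2. run the model on the inputs
    loss = criterion(outputs, labels)     # 3. scalar loss against the labels
    loss.backward()                       # 4. back-propagate the gradients
    optimizer.step()                      # 5. update the model parameters

for name, p in model.named_parameters():
    print(name, p.data)                   # weight ~ 2, bias ~ 3
```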
After training, we can see the change in our model parameters. Another feature of the model is that you can actually introspect the different layers, so I just pretty-print the individual layers of our model, which for simple linear regression is just the one. Then I plot the line, and we can see it fits pretty well to the original data, so we can say that our model has now learned from the data.

As I mentioned, for comparison I've set up the same problem in Theano, which is another deep learning framework, and a define-and-run one. Here in Theano we import theano and set up our x and y data in a similar way to the previous example, but it deviates in the way it calculates gradients: you have to set up symbolic variables. Similar to PyTorch, you set up your cost function, then you calculate your gradients, and you can see this is done in a very different way; you specify your updates, and then there's a step where you take all of this information and compile it into a Python function. The final thing is the training loop, and you can see it's very sparse, basically a one-liner. That contrasts a lot with PyTorch: there's no opportunity here to make changes within your learning loop, no way to introduce dynamic behavior, whereas in PyTorch you can make changes within each iteration. So Theano basically creates the computation graph once, before training, and PyTorch creates it at each iteration.

So, just to finish up: I discussed PyTorch tensors and the NumPy bridge, I discussed computation graphs and variables, I talked about gradients and autograd, gave a simple hello-world example, and at the end covered define-by-run versus define-and-run. So that's me finished; I don't know if we have any time for questions.

Well, thank you very much, Paul, it was a great talk. Unfortunately we do not have time for questions; two thirty exactly, right on time. Thank you very much again, let's thank the speaker again.