Introducing neural networks, please welcome Mike Craig. Hello. How's it going? Cool. So as you said, I'm Mike Craig. I'm a data scientist at Warby Parker here in New York. And this is Introduction to Neural Networks with TensorFlow. So I wanted to split my talk half between introducing the concepts behind neural networks and how they work, and then diving into the code so you can actually use this stuff. But I didn't want to make any assumptions about what people know. So I'm going to have a really high-level overview of machine learning and then the problem, but really with a focus on how to set up these problems operationally. So what is machine learning? Machine learning is a subfield of artificial intelligence that deals with pattern recognition, prediction, forecasting, and things like that. It's much more statistical than traditional artificial intelligence algorithms. I mean, when you think about Minimax or A*, these are heuristic algorithms for picking an optimal action. Whereas machine learning is a little more fuzzy. It's a little more statistical, using data to get an approximation of what the optimal solution could be. And so one way to think about it is that all of machine learning is really just function approximation. So basically any function that is computable, where the form isn't that obvious, machine learning can be a good candidate for. So here's a couple of examples. Handwriting recognition: a human could look at that and read 504192. But how do you tell the computer to encode these? For example, this one right here is slightly tilted, whereas this one is not as much. All these very fuzzy differences mean you're going to end up with this crazy rule set if you try to program it normally. Facial recognition is another one.
So let's say you had a facial recognition app on your phone and you've also never taken a US history course or something. You could put that image through the facial recognition and it would use machine learning to find the face in the image, find all the different interesting key points about that face, and then look it up in a database. And it'll tell you that it's George Washington, and that you're an idiot for not knowing that. Speech recognition. So Siri, Google Now, Amazon Echo, right? Taking audio signals and turning them into human words is something that takes a lot of machine learning. And I guess the other half of that too: natural language processing, taking words as we speak them and turning them into something more structured and useful for a computer to understand. And the last example: knowledge discovery. This is something that made the rounds on the internet six months ago. Maybe you've seen it. Somebody created a neural net where you give it two images, and it learns the artistic style of one image and can redraw the other image in that style. And they use convolutional neural networks to do this, which we're 100% not going to get into in this 25-minute talk. But those are just some examples of what you can do with this stuff. So, really high level, I'm going to give a more formalized definition of the machine learning problem. There are two main types of machine learning, supervised and unsupervised, and we're going to deal with supervised. That's basically: you collect data. You have input data that describes all the information for a particular data point that you're observing, and then the observed label for that. So for example, in handwriting recognition, the input would be the pixels in the image, and the output would be: it was a one, it was a two.
Versus unsupervised learning, where you only have inputs, and the objective is to find patterns: what kind of clusters do you see? What kind of hierarchies do you see in the data? But we're going to deal with supervised learning in this talk. So, a little more formally, this is how you want to structure your supervised learning problem. You want to collect input and output matrices like this, where each row is an observation of something that you see, and each column is one of the different variables, the different pieces of information about that point. So like I said, for handwriting recognition, one row could be an image, sort of flattened out, and then each column is one of the different pixels in the image. And similarly, your output is also a matrix. In many cases it's just one column, but you can have more than one column in your output. So for example, with handwriting recognition, you might have 10 columns of outputs that are all binary, 0 or 1: whether that particular image was a 2, whether that particular image was a 3, and so on and so forth. And that's how you want to do it. This is the prerequisite before you even throw any algorithm at this: you want to be able to collect this type of data about your problem. OK, let's get into neural networks then. So this is what a neural network looks like. You basically have all these input nodes, and they propagate through the network to produce these outputs. The inputs are defined by that X matrix. So that gets fed in and goes through the network through these arrows, these connection weights, to the next layer of units. And then those units are activated in a certain way, and then they move on to the next layer, and so on and so forth.
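The input/output matrix setup just described can be sketched in NumPy. The data here is entirely made up (a hypothetical 4-sample, 6-pixel, 3-class toy), not the talk's actual digits:

```python
import numpy as np

# Hypothetical toy problem: 4 "images", each flattened to 6 pixel columns.
# Each row of X is one observation; each column is one pixel intensity.
X = np.array([
    [0, 3, 7, 7, 3, 0],
    [1, 5, 2, 0, 4, 6],
    [7, 7, 0, 1, 2, 2],
    [0, 0, 6, 6, 0, 0],
])

# The observed class label for each row.
labels = np.array([2, 0, 1, 2])

# One-hot encode into an output matrix with one binary column per class,
# like the 10-column 0/1 matrix described for the handwriting digits.
n_classes = 3
Y = np.zeros((len(labels), n_classes))
Y[np.arange(len(labels)), labels] = 1
```

Each row of `Y` now has a single 1 in the column matching that sample's class, which is exactly the shape a supervised learner wants.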
And so, some of the terminology for this, I think I've already said some of it, right? Neurons are activated at some level, which is then propagated forward through the different layers. The strength of this activation is controlled by these arrows, the connection weights, which are what we're going to try to train during this process. And in fact, you can see that the weights between each pair of layers can be described as a matrix: there would be a weight matrix here, and there'd be a weight matrix here. These are the matrices that we're trying to optimize to have the network learn something useful. One other thing: after the connection weights and the inputs are fed forward into the next layer, the unit is passed through what's called an activation function, which I'll describe in a second. So, more formally, this is how a neuron's computation works. If I have some unit z, it's going to be a function of the previous layer's inputs multiplied by these weights and then passed through this activation function. And this would then be the input into the next layer, and so on and so forth. So what are these activation functions? Well, an activation function allows you to define nonlinear relationships between your input and your output. If we didn't have an activation function, it would just be this equation right here without that f. It would just say that each unit in the layer is a function of the inputs multiplied by some weight. And if that's the case, then this would be a function of this input multiplied by this weight. So this is a linear function of a linear function, which is a linear function.
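The neuron computation just described (multiply the previous layer's activations by the connection weights, add a bias, and pass through an activation function f) can be sketched like this; the shapes and the sigmoid choice are arbitrary for illustration:

```python
import numpy as np

def sigmoid(x):
    # The activation function f: squashes (-inf, inf) onto (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

x = rng.normal(size=(1, 4))   # activations of the previous layer (4 units)
W = rng.normal(size=(4, 3))   # connection weights into a 3-unit layer
b = np.zeros((1, 3))          # bias for the new layer

# z = f(xW + b): feed the inputs forward through the weights,
# add the bias, then pass through the activation function.
z = sigmoid(x @ W + b)
```

Stacking more layers just repeats this step, with each layer's `z` becoming the next layer's `x`.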
So there would actually be no point in having this middle layer at all if we didn't have this activation function. That being said, you could have a linear activation function if you wanted, but that totally defeats the purpose. I mean, sometimes you could put a linear activation function at the last layer, at the output layer. But again, you would lose that nonlinear relationship there. Another choice for an activation function is a sigmoid, which looks like this. These are known as squashing functions because they take the entire domain from negative infinity to infinity and squash it onto this 0-to-1 range here. And this is used a lot because it can, very hand-wavily, be interpreted as a probability. So you might want to throw a sigmoid activation function at the output layer, and then if you were classifying something and it came out with an output of, like, 0.7, you could sort of say that the network is 70% sure that this is the case. Another choice is the hyperbolic tangent, which, as you can see, looks similar to the sigmoid. Maybe it's a little steeper, but it goes from negative 1 to 1 rather than 0 to 1. And this is sometimes preferred in the hidden layers because it's symmetric around 0. With the sigmoid, there's this kind of inherent bias, since it's symmetric around 0.5, right? And a lot of the math just works out better if you can assume that a lot of these activations are normally distributed and symmetric around 0. Another choice that is popular nowadays is the rectified linear unit. It looks like a linear unit from 0 on, and then there's that nonlinearity at 0. It's literally defined as the max of 0 and x. And this allows for a lot of the expressive power that you get from a linear unit. I mean, we're not squashing anything, and you can go up to infinity here.
But it also has this nonlinearity that keeps it from just being a linear function of a linear function. Also, it has a nice property: it actually is 0 for any activations less than 0, right? So you can think of it as a way of turning off a neuron. You can have some neurons that receive some input, are deemed not that useful, and are just turned off, so to speak. So how do you train these neural networks? Well, basically, you want to define some loss function. This is a function that's saying: how bad is the network right now? So you collect this prediction, which we'll call y hat; this is what the network predicts for a given point. And then you have your observed variable, and you define some kind of loss function between them. Here it's just mean squared error: the mean of all the squared errors between the observed and the predicted. But there are other possibilities you could use, too. And then the idea is you run these gradient descent approaches, which I'm not going to get into the math behind, also because you don't need to know it; that's one of the benefits of TensorFlow. But the basic idea is that you update the weights according to their contribution to the error. So you're descending down this error curve by picking better and better weights as time goes on. And there's so much literature about how to optimize this, all these different algorithms and optimizers that you could use. But that's the basic idea for a lot of these. Let's get into code. That's just a bunch of imports. OK, so we're going to run through an example using that handwriting recognition data set that I described before. I'm going to load in the data from scikit-learn, because it's in there. So here I am just loading it in.
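A minimal sketch of the pieces just discussed: the three activation functions, a mean squared error loss, and a single hand-computed gradient descent step. The one-weight toy model and the learning rate are invented purely for illustration:

```python
import numpy as np

# The three activation functions discussed:
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes everything onto (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes onto (-1, 1), symmetric around 0

def relu(x):
    return np.maximum(0.0, x)         # literally max(0, x); "turns off" negatives

# Mean squared error loss: how bad is the network right now?
def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)

# One gradient descent step on a trivial one-weight "network" y_hat = w * x,
# with a single invented data point (x, y) = (2, 6).
x, y = 2.0, 6.0
w = 0.0
lr = 0.1                              # the learning rate
y_hat = w * x
grad = 2.0 * (y_hat - y) * x          # d(mse)/dw for this one sample
w -= lr * grad                        # step down the error curve
```

Real optimizers repeat that last step over many weights and many samples; TensorFlow computes all the `grad` terms for you.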
And then the only thing I'm going to do to change it is I'm going to use pandas to get the dummy variables out of here. So beforehand, the y variable was just a list of integers from 0 to 9, and I'm transforming that into a matrix with 10 binary columns, which are 1 if that's the case. So if this particular sample is a 3, then the third index would be 1 and everything else would be 0. That's all I'm doing there. So let's run this. As you can see, there's about 1,800 samples and 64 columns, so there are 64 pixels in each of the images. And like I said, this output matrix is 1,800 by 10. And here are some samples of what they look like. Very, very low res, but you can see that they're grayscale, and each pixel has a value describing the intensity at that spot. So let's get into TensorFlow. TensorFlow is an open source library by Google for building a lot of these models. And one of the really cool things about it is that it's kind of like Theano, if you've ever played with that. It's symbolic. So you can just tell it the equations: rather than actually computing the thing on the data, you just tell it what equations you expect, and then you can say, run this on this data, and it'll plug the data in and run it. And that's really useful for computing gradients, like we were talking about before, for training these networks. We don't even have to worry about that. We just have to worry about how to specify the network, and then say: update it. So, oh, sorry, this is yelling at me because I already have a session open. But you create a session in TensorFlow, and that's what you run all of these operations on. Basically, a session runs this computation graph that describes all of these operations, and then you can just say: run it. So let's get into some of this.
So in TensorFlow, you want to define all of your inputs as variables. Here, I'm defining the input and the output as placeholder variables. And I happen to know the shape ahead of time, right? I know the shape of the data: it's 64 dimensions for the input, 10 for the output. And then I put None here because that's basically saying: I don't care how many samples I have. So later down the road, if I want to plug in a subset of this data set, that's fine. Obviously the number of dimensions needs to be fixed, but that's what that None means. It allows for batching and things like that, if you want to do that. Here is the network right here. We're just going to define the computation for the network. This is actually a network with no hidden layers; we'll add the hidden layer in a second. But basically, we define this weight matrix, and this is the weight matrix going from the input to the next layer, which in our case is the output. So we know the size of that. And then there's also a bias unit. So if you think of a linear regression, like y equals mx plus b, this is the b. And that's usually needed in a lot of these, right? And then here is the equation for the network. We're doing a matrix multiply on the input and the weight matrix, adding in the bias, and then passing that all through a softmax function. I haven't described softmax, but it's just another type of activation function that basically lets you see probabilities across many different choices. Since there's no hidden layer, this is actually pretty much equivalent to logistic regression, if you're familiar with that, or softmax regression, I guess, technically. Because it's really the hidden layer in a neural network that actually makes it neural-networky. But anyway, what we're doing here is setting this variable y hat to be the output of the network on these inputs. And notice we have not given it any data yet at all.
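The no-hidden-layer network described here (a matrix multiply plus bias, passed through a softmax) can be approximated eagerly in NumPy. TensorFlow builds the same equation symbolically; the shapes below mirror the 64-pixel, 10-class digits setup, but the inputs are made up:

```python
import numpy as np

def softmax(z):
    # An activation across many choices: exponentiate each output,
    # then normalize so every row is a probability distribution.
    e = np.exp(z - z.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 64))   # 5 made-up samples, 64 pixels each
W = np.zeros((64, 10))         # weight matrix: input straight to the 10 outputs
b = np.zeros(10)               # the bias unit, the "b" in y = mx + b

# The whole no-hidden-layer network: matrix multiply, add bias, softmax.
y_hat = softmax(X @ W + b)
```

With the weights still at zero, every class comes out equally likely (0.1 each); training is what pushes those probabilities apart.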
We're just describing how it would work. Then we define our loss function. I'm just gonna use mean squared error loss: the square of the difference between y hat, which is our prediction out of the network, and y underscore, which is this placeholder that I'm using for the actual output. Just square it and take the mean. I'm also gonna define some functions here, too, if I wanna know how well the network is doing. So this is just computing the accuracy. We're taking the argmax, because with the softmax, the output's gonna be 10 numbers for each sample, and I just wanna pick the one that has the highest probability. So I'm passing that through argmax here, and we're basically asking: is my predicted choice equal to the actual observed choice? Then taking the mean of that gives us the accuracy. Then at the very end here, I'm saying: OK, let's pick an optimizer to use. I'm using this Adam optimizer. There's a gradient descent optimizer, there's AdaGrad; these are just different optimization algorithms. And you give it a learning rate. This is basically how fast you wanna descend down this curve. Because a lot of these error curves are not smooth at all, they're pretty bumpy, you can overshoot if you have a really high learning rate. Or it'll take forever to train if you have a really low learning rate. So this is a parameter that the programmer has to take into account. And we're just saying: OK, give us this optimizer with these parameters, and just minimize this error. And this error is a function of this y hat, and here's the equation for that, right? So it follows it backwards, and it updates the weights such that it minimizes this error. And then at the very end, I'm just saying initialize all these variables: session.run, initialize the variables.
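The accuracy computation just described (argmax over the class outputs, compare with the observed class, take the mean) looks like this in NumPy; the predictions and labels are made up:

```python
import numpy as np

# Made-up softmax outputs for 4 samples across 3 classes.
y_hat = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
])

# One-hot observed outputs (the data behind the y_ placeholder).
y = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
])

# argmax picks the highest-probability class per row; compare with the
# observed class and take the mean of the matches to get the accuracy.
correct = np.argmax(y_hat, axis=1) == np.argmax(y, axis=1)
accuracy = correct.mean()
```

Here the third sample is the only miss, so the accuracy comes out to 3 out of 4.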
So like I said, you run all these things on the session, and I'm just telling it to initialize it all. OK, and now let's actually get into training it. So I'm gonna go through 1,500 of these training iterations, and I'm just gonna say: this optimizer that I defined up here, right? Run it. And I have to feed it; you have to tell it what data it expects, right? So remember, any of these placeholders you can pass in as something. So here I'm saying: OK, every time you see x, give it the actual input data we're talking about; every time you see y, give it the actual output data. And then every 100 iterations, I'm just gonna print the accuracy. So we're doing OK, I guess. 77% accuracy is not bad, I guess, but if you hired someone who couldn't recognize handwriting 23% of the time, what are you doing with your life? So let's add a hidden layer to this. The way you do that, basically, is we're just going to create two weight matrices now and two bias units, and then define the equation in two steps. You can name them whatever you want; I'm being extremely explicit and verbose right now. But here's the weight matrix going from the input to the hidden units. Before, I was initializing the weights to zero; now I'm initializing them to these random values, but other than that, there's nothing too crazy here. And then here's the weight matrix from the hidden units to the output, same with the biases. And then here I'm defining the activation up to the hidden units. So this is multiplying the input by that first weight matrix, adding the bias, and passing it through a sigmoid activation function.
Then I'm just taking that, multiplying it by the second weight matrix, adding in that bias, and passing it through a softmax activation function. The rest of this code is the same. I do have to redefine it with all these new changes, but the rest of the code is the same, and now we have a hidden layer that is doing better. It's a little slower to train a lot of times, just because there are a lot more parameters to train, but 95% accuracy is much better. I mean, we could definitely improve that even further, but that's good enough, I think, to start. So I want to give a sense of the intuition behind the hidden layers of these neural networks. Here's a toy data set from scikit-learn. It's basically just an inner circle surrounded by this outer circle: the inner circle is its own class, and the outer circle is its own class. And the task is to approximate this function to be able to separate the two. A human looking at that would say: OK, just put a circle in between them and you're good. We'll see if the neural network can do that. Basically, I just took all this code and turned it into a couple of functions here, which do mostly the same thing, and then I added some code to plot as it trains. So this is no hidden layers. It's having a real tough time. With no hidden layers, remember, it's basically trying to find a linear function, a line that could separate these. It's kind of a mean thing to ask it, right? Because it's not gonna be able to find one. Now if we add in the hidden layer, it's actually able to find it. So the hidden layer allows it to find, in math terms it's called a kernel, the shape of this kernel, and it's able to eventually learn that. So I'm starting to run out of time, but really quickly, here are some other features of TensorFlow. They have GPU computing built in right out of the box.
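The hidden-layer idea can be sketched end to end in plain NumPy on a toy two-circles problem like the scikit-learn one mentioned, with the gradients written out by hand instead of letting TensorFlow derive them. The layer sizes, learning rate, and data generation below are all arbitrary choices for this sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Toy data like the scikit-learn example: an inner circle (class 0)
# surrounded by an outer ring (class 1), with a little noise.
n = 200
theta = rng.uniform(0.0, 2.0 * np.pi, n)
radius = np.where(np.arange(n) < n // 2, 0.3, 1.0) + rng.normal(0.0, 0.05, n)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
y = (np.arange(n) >= n // 2).astype(float).reshape(-1, 1)

# Two weight matrices and two biases: input -> 8 hidden units -> 1 output.
W1 = rng.normal(0.0, 1.0, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 1.0, (8, 1)); b2 = np.zeros(1)

# Loss at the random starting point, for comparison after training.
loss0 = np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - y) ** 2)

lr = 1.0
for step in range(2000):
    # Forward pass: input -> hidden -> output, sigmoid at each stage.
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # Backward pass: chain rule by hand for the mean squared error loss.
    d_out = 2.0 * (y_hat - y) / n * y_hat * (1.0 - y_hat)
    d_hid = (d_out @ W2.T) * h * (1.0 - h)
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_hid); b1 -= lr * d_hid.sum(axis=0)

# Final forward pass, loss, and accuracy after training.
h = sigmoid(X @ W1 + b1)
y_hat = sigmoid(h @ W2 + b2)
loss = np.mean((y_hat - y) ** 2)
accuracy = ((y_hat > 0.5) == (y > 0.5)).mean()
```

With no hidden layer this problem is hopeless, since no single line separates a circle from the ring around it; the hidden layer is what lets the network bend the decision boundary.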
You don't really have to specify anything; you have to set it up, but you don't have to change anything else in your code. Also cluster computing, which they released pretty recently. I personally haven't messed around with it, but again, you don't even have to think about it. It figures out how to parallelize all these operations; you just have to set up the cluster and everything. And this notebook is on my GitHub. My username is mcraig2. I don't know who took mcraig, but. And in the PyGotham talk repo, I have my code. Are there any questions? Oh, yes. Sure, it is a lot of trial and error, and it does depend project to project. So one way to think about these hidden layers is: if your hidden layer is smaller than your input layer, then you're basically asking the network to reduce the dimensionality of your input. And if your problem is set up in such a way that you think that's OK to do, that you're not worried about it being that lossy, then that's fine. When I'm trying these out, there's definitely no one number, but I try anywhere between two-thirds of the number of input dimensions and two times it, and then just see, performance-wise, how it works. You can also add in many, many hidden layers, and all this stuff, but I usually try to start pretty simple, one hidden layer, and then, yeah. So, do you have experience with other deep learning frameworks, and if so, can you speak to the performance differences? Yeah, so TensorFlow did come under fire at the beginning because it was not as performant as some of the other ones, but I think it's better now. I've messed around with Theano, and it's pretty performant. I've messed around with Caffe and some other ones, and I just personally prefer the interface for this one.
I'm also not running ginormous networks, so I'm not too concerned about performance, but TensorFlow is definitely fast now compared to the other ones, if that makes sense. Just following up on that first question: have you seen any examples of using gradient descent on the parameters, so like how you were saying, training the different network sizes? Because I've been having trouble finding, specifically, gradient descent on the parameters, and then training that off of the network so that you don't have to self-tune. Gradient descent on, like, how many units? Yeah, have you seen any examples of that with TensorFlow? So, that's the whole field of hyperparameter optimization. And yeah, I haven't personally used that. There are a couple of other ways you can do it. There's a grid search type of approach, where you just try a bunch and pick the one that does best. There's random search, which actually turns out to be even better sometimes than grid search. But running gradient descent on those hyperparameters, I don't personally have experience with, but it could be something interesting, yeah. Anyone else? Cool, thank you.