Softmax is an activation function that turns an array of values into a probability mass function where the weight of the maximum value is exaggerated. It's tailor-made for multi-class classification problems like the MNIST or CIFAR datasets, and it's ideal for converting the result of a linear layer into a vote for a category. It works well across a wide range of input values, so it can take the place of other activation functions like the logistic (sigmoid) function or the rectified linear unit. Softmax emphasizes the strongest vote and focuses the learning on the parameters that will strengthen that vote, and as a bonus it's relatively cheap to compute.

One reason the softmax is so appealing is that it fits in well with statistics and information theory. Its output is a probability mass function: a collection of values, each associated with a separate and exclusive outcome, that all fall between 0 and 1 and together sum to 1. This lets us interpret the output of the softmax as degrees of belief. If a softmax layer gives us the values 0.2, 0.7, and 0.1 for the categories cat, dog, and elephant, we can read that as the model believing with 20% strength that the image contains a cat, 70% that it contains a dog, and 10% that it contains an elephant. Alternatively, if we were to place bets using this model, we would estimate the odds of there being an elephant at 1 to 9, a cat at 1 to 4, and a dog at 7 to 3. All the tricks we can do with probabilities we can now apply to the results of the softmax, treating them as a probability mass function over the category membership of the example.

You can make a probability mass function out of any set of values, provided none of them are negative. Counted values are a classic case. Imagine I was sitting on a street corner watching traffic go by and counting vehicles by type: car, truck, bicycle. Suppose after an hour I had counted 100 bicycles, 300 trucks, and 600 cars, and now I want to figure out how likely it is that the next vehicle will be a bicycle. I can turn my histogram of counts into a probability mass function by dividing each count by the sum of all the observations I've made: 100 plus 300 plus 600, or 1,000. This gives a PMF with bicycles at 0.1, trucks at 0.3, and cars at 0.6. This method of converting counts, or frequencies, to probabilities is a hallmark of the frequentist approach to statistics, as opposed to the Bayesian approach.
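Here is what that conversion looks like in code, a minimal sketch using numpy and the traffic counts from the example above:

    import numpy as np

    # Observed counts for bicycles, trucks, and cars
    counts = np.array([100.0, 300.0, 600.0])

    # PMF normalization: divide each count by the sum of all the counts
    pmf = counts / np.sum(counts)

    print(pmf)  # [0.1 0.3 0.6]

Dividing by the total is all it takes. The result is non-negative and sums to 1 by construction.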
Expressed in equations, we start with an array of values x_1, x_2, and so on up to x_n, which we can call x for short. To convert x into the probability mass function p(x), we can perform PMF normalization:

    p(x_i) = x_i / (x_1 + x_2 + ... + x_n)

Softmax is very like a PMF normalization, with the exception that all of the elements of x are first passed through an exponential function: e raised to the power of each element of x. This does two good things. First, it makes sure that all the values being normalized are positive, since e raised to any power is a positive number. This lets us apply PMF normalization to any array of numbers without worrying about whether they're positive or negative. Second, it stretches out the difference between the largest value and the second largest value, compared to the differences between the smaller values. This elongation of the upper part of the value range is what makes the function maximum-like. It takes the largest value and further isolates it from the rest of its cohort. It extends the lead of the front runner over the rest of the pack. Because it leaves the other elements with some non-zero value, it earns the name softmax, as opposed to a maximum function, a "hardmax". After taking the exponential of x, the PMF normalization can proceed normally. The softmax of an element of x is e raised to that element, divided by the sum of e raised to all of the elements of x:

    softmax(x_k) = e^(x_k) / (e^(x_1) + e^(x_2) + ... + e^(x_n))
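To see this exaggeration in action, here's a minimal sketch comparing plain PMF normalization to softmax on the same array (the input values are invented for illustration):

    import numpy as np

    x = np.array([1.0, 2.0, 4.0])

    # Plain PMF normalization: divide by the sum
    pmf = x / np.sum(x)

    # Softmax: exponentiate first, then normalize
    softmax = np.exp(x) / np.sum(np.exp(x))

    print(pmf.round(3))      # [0.143 0.286 0.571]
    print(softmax.round(3))  # [0.042 0.114 0.844]

The plain normalization gives the front runner a modest majority. The softmax hands it a commanding lead.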
Softmax is an activation function in its own right. If you use it together with another activation function that limits the range of its outputs, like the logistic function or the hyperbolic tangent, its effectiveness will be greatly diminished. Softmax is sensitive to the differences between values, so you don't want to compress them into a small range before applying it.

Because it produces a PMF, softmax is appropriate for classification tasks where each example belongs to exactly one category. The whole premise behind the softmax is that it emphasizes the element with the highest value, the category with the greatest affinity, at the expense of all the others. Multi-label classification problems, say, where a photo could be tagged as both "person" and "dog" if it contains both a person and a dog, break this assumption. For multi-label problems, the logistic activation function is your best bet in theory: it represents the affinity for every label separately on a 0-to-1 scale. For some medical examples, check out this article by Rachel Draelos. However, some recent work suggests that softmax works well in this situation too. It may be worth trying both and comparing the results. Thanks to Dmitry Omishkin for pointing this out.

To use softmax in a neural network, we need to be able to differentiate it. Even though it doesn't have any internal parameters that need adjusting during training, it's responsible for properly backpropagating the loss gradient so that upstream layers can learn from it. We'll start by working through the derivative of PMF normalization before extending it to the full softmax function.

Because x is an array, the partial derivative of p(x) with respect to x is a collection of derivatives, one with respect to each element of x. And because p(x) is also an array, the partial derivative of p(x) with respect to x is a two-dimensional array: the derivative of each element of p(x) with respect to each element of x. We can express this collection with some indexing shorthand, the partial of p(x_k) with respect to x_i, for all values of i and k between 1 and n.

Then we use the quotient rule:

    d(f/g)/dz = ((df/dz) g - (dg/dz) f) / g^2

We can break the PMF normalization equation into its numerator and denominator and work through the derivative. In this case f is x_k, g is the sum of x_j over all j between 1 and n, and z is x_i. The derivative of f with respect to z is the partial of x_k with respect to x_i. The derivative of g with respect to z is the sum of the partials of x_j with respect to x_i, which equals 1 for any x_i, since only the j = i term contributes. Putting it all together, the quotient rule gives

    ∂p(x_k)/∂x_i = ((∂x_k/∂x_i) Σ_j x_j - x_k) / (Σ_j x_j)^2

There are two possibilities here. First, we can solve it when i and k are different. In that case the partial of x_k with respect to x_i is zero; the elements of x are independent of one another. (Treating inputs as independent is the assumption we make whenever we calculate a gradient.) The expression then simplifies to -x_k over the square of the sum of the x's, and substituting in our expression for p(x_k) gives

    ∂p(x_k)/∂x_i = -p(x_k) / Σ_j x_j        when i ≠ k

Second, we can solve it when i and k are the same, in which case the partial of x_k with respect to x_i is one, because they are one and the same. Simplifying gives

    ∂p(x_k)/∂x_i = (1 - p(x_k)) / Σ_j x_j   when i = k

A really slick way to represent this is the Kronecker delta, δ_ik, which just happens to be 1 when i equals k and 0 whenever i is not equal to k. It captures both cases in a one-liner:

    ∂p(x_k)/∂x_i = (δ_ik - p(x_k)) / Σ_j x_j

It's just a shorthand way to write the same thing. Notice that the value of the PMF itself appears in its own derivative; this is a recurring theme. And keep in mind that this one-liner is shorthand for the full matrix of partial derivatives.

Now we can take this one step further and get the derivative of the softmax. The full collection of partial derivatives of a function with respect to all of its input elements is also called its Jacobian, and having the Jacobian of the PMF normalization gives us a strong start on calculating the Jacobian of the softmax. We'll retrace our steps, making the modifications we need for the softmax's exponential functions. We'll use σ(x_k) to represent element k of the softmax and σ(x) to represent the whole array of softmax results for the entire x. Substituting the softmax definition in for our probability mass function,

    σ(x_k) = e^(x_k) / Σ_j e^(x_j)

we can again break the numerator and denominator out for differentiation using the quotient rule. The difference is that f is now e^(x_k), and g is now the sum of e^(x_j) for all j between 1 and n. We work through this as we did before, using our trick that the partial of x_k with respect to x_i is the Kronecker delta δ_ik, and stepping through a couple of simplifications. When we're done, we notice that the definition of softmax occurs several times in the result. We get σ(x_k) δ_ik minus σ(x_i) σ(x_k), which we can further simplify by factoring out σ(x_k):

    ∂σ(x_k)/∂x_i = σ(x_k) (δ_ik - σ(x_i))

So now we have a very concise representation of the derivative of the softmax, a terse statement of its partial derivative with respect to each of its inputs. Having the derivative of the softmax means that we can use it in a model that learns its parameter values by means of backpropagation.
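Before building it into a layer, it's reassuring to check this formula numerically. Here's a minimal sketch comparing the analytic Jacobian to a finite-difference approximation (the test values are arbitrary, and the checking approach is my own, not from any particular framework):

    import numpy as np

    def softmax(v):
        return np.exp(v) / np.sum(np.exp(v))

    x = np.array([1.0, 2.0, 4.0])
    s = softmax(x)

    # Analytic Jacobian: element [i, k] is s[k] * (delta_ik - s[i])
    jacobian = s * (np.identity(x.size) - s[:, np.newaxis])

    # Finite differences: nudge each input and watch all the outputs move
    eps = 1e-6
    approx = np.zeros((x.size, x.size))
    for i in range(x.size):
        nudged = x.copy()
        nudged[i] += eps
        approx[i] = (softmax(nudged) - s) / eps

    print(np.allclose(jacobian, approx, atol=1e-5))  # True

A check like this takes a minute to write and catches sign and indexing mistakes before they get buried inside a network.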
During the backward pass, a softmax layer receives a gradient, the partial derivative of the loss with respect to its output values, and it's expected to provide upstream layers with the partial derivative of the loss with respect to its input values. These are related through the softmax derivative by the chain rule: the input gradient is the output gradient multiplied by the softmax derivative, the softmax Jacobian.

Luckily, the Python code for the forward pass, given a one-dimensional array of input values x, is short. Import numpy, and the softmax is e to the x divided by the sum of e to the x. numpy broadcasts the arrays in an intuitive way, which makes the code very concise.

The backward pass takes a bit more doing. The derivative of the softmax is natural to express as a two-dimensional array, and that really helps in calculating it too. We can make use of numpy's matrix multiplication to keep the code concise, but this requires us to keep careful track of the shapes of our arrays. The first step is to make sure that both the outputs of the softmax and the loss gradient are two-dimensional arrays with one row and many columns. Then we can create the Jacobian matrix, d_softmax: the softmax values times an identity matrix of the softmax's size, minus the transpose of the softmax array matrix-multiplied by the softmax itself. The first part of this expression propagates the softmax values down the diagonal, the Kronecker delta term of the equation, and the second part creates the other term, the product of the softmax values associated with each row and column index. (The @ symbol calls numpy's matrix multiplication function, and the matrix product of the n-by-1 softmax transpose and the 1-by-n softmax is an n-by-n two-dimensional array.) Thanks to all this careful setup, we can calculate the input gradient with just one more matrix multiplication: input_grad equals grad matrix-multiplied by d_softmax.

When working with exponentials, there's a danger of overflow errors if the argument gets too large. This can easily happen when working with the output of a linear layer. To protect ourselves, we can use a trick described by Paul Panzer on Stack Overflow. Because softmax(x) equals softmax(x - c) for any constant c, we can calculate softmax(x - max(x)) instead. This ensures that the largest input element is zero, keeping us safe from overflow. And even better, it doesn't affect the gradient calculation at all. Thank you to Antonin Raffin for pointing this out. A sketch that pulls all of these pieces together appears at the end of this article.

This implementation was lifted from the collection of activation functions in the Cottonwood machine learning framework. You can see the full implementation in the Softmax class at the link. This story was synthesized from a rich collection of online content on the topic, including the Wikipedia article, Victor Zhou's post, Eli Bendersky's post, Aerin Kim's post, and Anthony Paredes's Cross Validated post. To walk through the process of implementing this in a Python machine learning framework, come join us for End-to-End Machine Learning Course 322.
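As promised above, here is a minimal sketch that pulls the pieces together: a forward pass with the overflow guard and a backward pass built around the Jacobian. The function names and calling conventions here are my own for illustration, not those of the Cottonwood framework:

    import numpy as np

    def softmax_forward(x):
        # Shift so the largest exponent is e**0 = 1, guarding against
        # overflow. This is harmless because softmax(x) == softmax(x - c)
        # for any constant c.
        exps = np.exp(x - np.max(x))
        return exps / np.sum(exps)

    def softmax_backward(softmax_out, loss_grad):
        # Work with 1 x n two-dimensional arrays so the matrix
        # multiplications line up.
        s = softmax_out.reshape(1, -1)
        grad = loss_grad.reshape(1, -1)

        # The Jacobian: softmax values down the diagonal (the Kronecker
        # delta term) minus the n x n outer product of softmax values.
        d_softmax = s * np.identity(s.size) - s.T @ s

        # Chain rule: input gradient = output gradient times the Jacobian.
        return grad @ d_softmax

In use, softmax_backward(softmax_forward(x), loss_grad) returns the gradient to hand to the next layer upstream.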