Hyperparameter optimization is the problem of adjusting the high-level settings of an algorithm: for instance, the values of the L1 and L2 regularizer constants in a neural network, or the optimizer's learning rate. To do this well, you have to take a step back from the mechanism you used to train the low-level parameters of the model, the weights, and run another loop for training these higher-level parameters, the hyperparameters.

I built Evolutionary Powell's method for use as a hyperparameter optimization method as part of Course 314 in the End-to-End Machine Learning curriculum. It's inspired by the original Powell's method, but it has a stochastic element borrowed from evolutionary approaches. There's a Python implementation of the algorithm in the Ponderosa optimization package, which is available under an MIT open-source license. The links to all of these resources are available through the post linked in the comments below.

Evolutionary Powell's method tries to find the global minimum of a loss function (also known as a cost function or an error function) over a discrete space. That just means that each variable can only take on a finite number of unique values, usually a reasonably small number.

We'll illustrate the method by walking through the example in this animation. The loss function we'll be working with is a two-dimensional variant of a sinc function. It has a single, very pronounced peak and several shoulder peaks and dips to either side of it. The code for creating it and running the test can be found in the GitLab project linked below. The optimizer actually tries to find the lowest point in the negative of this function, the global minimum, but for ease of visualization we'll flip everything upside down before we plot it, so it looks like we're trying to get to the top of the peak. Both x and y are allowed to take on 10 distinct values between 0 and 3. The method can be applied to spaces with any number of dimensions, but a two-dimensional space like this gives us the richest visualization. Once we add in the error itself, we get a 3D visualization.

Evolutionary Powell's method has a few main steps:

1. Randomly select a few points and evaluate them.
2. Make a list of the hyperparameters in random order.
3. Choose a small number of previously evaluated points to be parents. This is a random choice, but weighted by performance, so that the points with the lowest error are more likely to be chosen.
4. For the first parent, start at the top of the list of hyperparameters and look for a small number of un-evaluated child points along that dimension: points that are identical to the parent except in the value of that particular hyperparameter. If no children can be found, check the next hyperparameter, and so on until they've all been tried. For each subsequent parent, repeat this process, but shift the list of hyperparameters by one so that a different hyperparameter is attempted first for each parent.
5. Evaluate all the children found, then start again at step 3.

For a little background, this is a little like Powell's method, which, starting from a single point, finds the minimum along one dimension at a time and iteratively moves the best guess to the new location of that minimum. There's more to Powell's method than that in the details, but those are the most relevant parts here.
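To make the toy problem concrete before stepping through the method, here is a minimal sketch of a test setup like the one described above. The exact function lives in the GitLab project; the sinc variant, its center, and the grid below are illustrative stand-ins, not the project's actual code.

```python
import numpy as np

# Illustrative stand-in for the test surface: a two-dimensional sinc
# variant with one pronounced central peak and shoulder ripples.
# The optimizer minimizes, so return the negative of the peak.
def loss(x, y, center=(1.5, 1.5)):  # center is a guess, for illustration
    r = np.sqrt((x - center[0]) ** 2 + (y - center[1]) ** 2)
    return -np.sinc(r)  # np.sinc(r) = sin(pi * r) / (pi * r)

# Each of x and y is allowed 10 distinct values between 0 and 3,
# giving a 10 x 10 grid of 100 candidate points.
allowed_values = np.linspace(0, 3, 10)
```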
Powell's method typically assumes smooth, continuously varying functions and seeks only to find a local minimum, so on its own it isn't a good fit for our hyperparameter optimization problem, where we're looking for a global minimum in a discrete space. But it's at least nice in that it doesn't require differentiation, and it handles a large number of dimensions gracefully.

Evolutionary algorithms are the other basis for this method. If you have a high-dimensional discrete space, evolutionary algorithms can be really useful. They don't have to assume anything about the smoothness of the function, and they can be used as an anytime algorithm, meaning that you can interrupt them at any point and have a workable best-so-far answer to get going with. Evolutionary algorithms are especially useful when there are far more points in the space than you could ever explore completely. They make the assumption that the lowest-error points will have at least some traits in common with other low-error points. In our case, that assumption is expressed as: successful points will share at least some hyperparameter values. Choosing the most-successful-so-far points as parents and varying one of their hyperparameter values to create children is a way to exploit this assumption. This is also where the similarity to Powell's method comes from. Exploring one dimension at a time, starting from a high-performing point, looks a lot like the iterative one-dimensional optimization that Powell's method performs.

It's worth taking a detailed tour of the method step by step. The first thing we need to do is choose a few points to start with. We don't want to assume anything about which dimensions are important or whether we prefer high or low values in each dimension, so the safest way to avoid systematic bias is to randomly choose a few points in our space. In this example we start with just two random points, but experience from playing around with the algorithm suggests that two times the number of hyperparameters is a good number of points to start with. That tends to sprinkle guesses across the space and gives a reasonably rich population of points to start from.

To make sure our results aren't sensitive to the order in which we search the hyperparameters, we generate a shuffled list of them. Each hyperparameter represents one dimension in our space, and we don't want to give any of them preferential treatment. Later, we'll also rotate through the list so that we start searching along a different dimension each time.

From all the points we've tried and evaluated so far, we choose a few candidate parents. Right now this number is set to three in the code. We expect that high-performing, low-error points are likely to make the best parents, but we don't want to make that assumption too rigid. To allow room for accidental discovery, we randomly select our parents from the population, but we weight better-performing points more heavily. We also want to avoid the pathological situation where one point is very strong compared to all the rest, but because there are just so many of the others, it gets overwhelmed by their collective weights and never gets selected as a parent. To handle this, the weight of each point is calculated as the square of its normalized position between the highest error and the lowest error, so the weights fall on the interval zero to one. Zero is the weight of the worst-performing point, and one is the weight of the best.
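That weighting scheme is simple enough to sketch directly. Here is a minimal Python version of the calculation as described; the function name is mine, not Ponderosa's actual API.

```python
import numpy as np

def parent_weights(errors):
    """Weight each evaluated point by the square of its normalized
    position between the highest error (weight 0) and the lowest
    error (weight 1)."""
    errors = np.asarray(errors, dtype=float)
    span = errors.max() - errors.min()
    if span == 0:
        return np.ones_like(errors)  # all points tie; weight equally
    return ((errors.max() - errors) / span) ** 2
```

Squaring pushes the weights of mediocre points toward zero, so the selection step that follows leans harder toward the best performers.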
Then a random number is drawn from a uniform distribution, and the point with the next highest weight on the zero-to-one interval is selected as the parent. This ensures that parents are selected with stratification based on their error, rather than on their relative prevalence.

For the first parent selected, start with the hyperparameter at the top of the list we created in step 2. Randomly select points along that dimension, up to a maximum of a set fraction of all the possible values it can take. Right now this fraction is set to 30%. If at least one un-evaluated child is found for that parent, then stop evaluating potential parents for this iteration and move on. Each time through the child selection process for a new parent, rotate the list of hyperparameters to the right by one, so that what was the last dimension becomes the first, what was the first becomes the second, and so forth. That way each new child selection process starts looking along a different dimension.

If no un-evaluated children are found along a dimension, try the next one on the list. If no un-evaluated children are found for that parent along any dimension, try the next parent. If no un-evaluated children are found for any of the parent candidates, then terminate the search. This is how Evolutionary Powell's method knows that it's done looking. It's a fairly conservative stopping criterion, especially for a high-dimensional space: for a parent to have no un-evaluated children means that a reasonable fraction of the space has been exhaustively explored. On optimization runs with many parameters, it's more likely that other constraints, such as time or computing expense, will cause the researcher to terminate the search before it runs its course.

If a new lowest error is found, update the best-so-far error and the best-so-far combination of hyperparameters. These will be available as a result if the algorithm is terminated before the next, better solution is found. When all the selected children have been evaluated, return to step 3 and go through the child selection process again, until no more children can be found for the chosen parents.

Here's the set of evaluated points after two iterations through this algorithmic loop. Already, it's found a relatively high-performing point. In the third iteration, it chooses a high-performing point (not the highest) and explores the x-axis to find a still better one. After a few more iterations, it can be seen that mostly the higher-performing points have been chosen as parents, and they have generated children along both axes. By the time 30 points have been evaluated, the global optimum has been found. That's not too bad considering that there are 100 points in this space. We would expect random exploration to find the global optimum on its 50th point, on average, so this is a real improvement. We would expect the benefit to be magnified in higher-dimensional spaces, where the fringes occupy a larger and larger fraction of the space.

After exploring about two-thirds of the space, the algorithm is unable to find any un-evaluated children for the three parent candidates selected that iteration, and it terminates. If you look carefully, you may have noticed that the algorithm spent a lot of its evaluation effort on poorly performing points far from the optimum. If, instead of choosing children randomly along one of its parent's dimensions, the algorithm chose the nearest un-evaluated points, it might have found the best solution much faster.
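To pin down the stratified parent draw described above, here's a sketch in Python. It pairs with the parent_weights function from earlier; again, this is an illustration of the described behavior, not Ponderosa's actual code.

```python
import numpy as np

def select_parent(weights, rng=None):
    """Stratified parent selection: draw u uniformly on [0, 1], then
    pick the point whose weight is the smallest one at or above u.
    Each point's chance of selection is the gap between its weight
    and the next lower weight, so one very strong point can't be
    drowned out by a crowd of mediocre ones."""
    rng = rng or np.random.default_rng()
    weights = np.asarray(weights, dtype=float)
    u = rng.uniform()  # in [0, 1); the best point has weight 1.0
    eligible = np.flatnonzero(weights >= u)  # never empty
    return int(eligible[np.argmin(weights[eligible])])
```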
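The child-selection step can be sketched the same way. Here I assume hyperparameter settings are stored as dicts and evaluated points are tracked as a set of sorted item tuples; both data structures are my own choices for illustration, not how the package is organized.

```python
import random

def children_along_dimension(parent, dim, allowed_values, evaluated,
                             fraction=0.3, rng=None):
    """Find un-evaluated children of `parent`: points identical to it
    except in hyperparameter `dim`. Return a random sample, capped at
    `fraction` of that dimension's possible values (currently 30%)."""
    rng = rng or random.Random()
    children = []
    for value in allowed_values[dim]:
        if value == parent[dim]:
            continue  # identical to the parent, so not a child
        child = dict(parent)
        child[dim] = value
        if tuple(sorted(child.items())) not in evaluated:
            children.append(child)
    rng.shuffle(children)
    cap = max(1, round(fraction * len(allowed_values[dim])))
    return children[:cap]

def rotate_right(dims):
    """Shift the hyperparameter list right by one for each new parent:
    the last dimension becomes the first, the first becomes the second,
    and so on, so each parent's search starts along a different axis."""
    return dims[-1:] + dims[:-1]
```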
However, when faced with a loss function choppier than this one, such a nearest-point approach might get stuck in a local minimum and fail to find a much better solution. The best approach depends on the problem being solved. The no-free-lunch theorem suggests that no optimizer is great at solving all optimization problems. The trick in choosing an optimizer is to place your bets on the type of problems you expect to be solving. Will the loss function be smooth? Convex? Bounded? Low-dimensional? Discrete? Differentiable? Will it have loosely coupled dimensions? Depending on what you expect to find, you can tweak your optimizer to take advantage of those expectations. Just be prepared: if your assumptions are wrong, your optimizer might fail badly.

If you'd like to use Evolutionary Powell's method in your next project, I implemented it in the Ponderosa package, an open-source collection of optimizers designed to be lightweight and easy to experiment with. The installation instructions and the code for this demo are available through the links below, and if you're curious about how it's implemented, there's code for that at the link as well. I wish you lots of luck making your machine learning models the best they can possibly be with hyperparameter optimization.