One of the most annoying aspects of deep learning is hyperparameter tuning. As we've seen, there are lots of hyperparameters: not just L1 and L2 regularization or dropout, but how many layers? How many nodes per layer? Which activation function, ReLU or something else? So how does one pick those? Well, hyperparameter tuning and architecture selection is basically search, so you can use any search mechanism.

If you have a relatively small number of hyperparameters to tune, you can do a grid search: pick the smallest and largest plausible value for each of them, based on prior experience, and try all possible combinations. But this gets big pretty quickly. If you try five values for each of four different hyperparameters, that's 5^4 = 625 training runs. It's expensive. Almost as good, sometimes better, sometimes worse, is a randomized search: just try different random combinations of hyperparameters and see what works. (Both are sketched below.)

I like doing coordinate-wise gradient descent. Gradient descent is the answer to everything, and we can do something like it in hyperparameter space. Start with what you think is a reasonable set of hyperparameters, based on prior experience or what you find in the literature. Then change each hyperparameter a little bit bigger, a little bit smaller. If the change improves the test accuracy, or the convergence speed, or whatever you care about, accept it; if not, reject it. Repeat until you run out of compute time. (A sketch of this follows below as well.)

Finally, it has recently become quite popular, particularly if you're at Google or somewhere else with lots of compute power, to do Bayesian optimization. Bayesian optimization starts from the same premise as many of these methods: begin with hyperparameters that worked well for similar problems, say a similar vision recognition problem or a similar NLP problem, often using pre-trained embeddings or partial models. Start with the architecture and hyperparameters that worked well, and then do localized search in the spirit of the coordinate-wise descent, often using some sort of local model to guide that search. The details are beyond this course, but you can click on the link and get more information.
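To make grid search versus random search concrete, here is a minimal sketch in plain Python. The `train_and_evaluate` function and the particular hyperparameter values are placeholders made up for illustration; in practice it would build your network with the given hyperparameters, train it, and return validation accuracy.

```python
import itertools
import random

# Hypothetical placeholder: in practice this would train a network with the
# given hyperparameters and return validation accuracy. A dummy score is
# used here only so the sketch runs end to end.
def train_and_evaluate(params):
    return -abs(params["learning_rate"] - 1e-3) - 0.1 * params["dropout"]

# Five candidate values for each of four hyperparameters: 5**4 = 625 grid points.
search_space = {
    "learning_rate":   [1e-4, 3e-4, 1e-3, 3e-3, 1e-2],
    "num_layers":      [1, 2, 3, 4, 5],
    "nodes_per_layer": [32, 64, 128, 256, 512],
    "dropout":         [0.0, 0.1, 0.2, 0.3, 0.5],
}

def grid_search(space):
    """Try every combination of candidate values."""
    best_params, best_score = None, float("-inf")
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_and_evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def random_search(space, n_trials=50, seed=0):
    """Try only n_trials random combinations instead of the full grid."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = train_and_evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

print(grid_search(search_space))    # 625 evaluations
print(random_search(search_space))  # 50 evaluations
```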
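And here is a rough sketch of the coordinate-wise procedure described above: nudge each hyperparameter a little bigger and a little smaller, and keep the change if the score improves. It reuses the `train_and_evaluate` placeholder from the previous sketch and, for simplicity, treats every hyperparameter as continuous; the multiplicative step size is an assumption, not something from the lecture.

```python
def coordinate_search(start_params, step=0.2, budget=20):
    """Greedy coordinate-wise search: perturb one hyperparameter at a time,
    accept the change if the score improves, otherwise revert."""
    params = dict(start_params)
    best_score = train_and_evaluate(params)
    for _ in range(budget):
        improved = False
        for name in params:
            for factor in (1 + step, 1 - step):  # a little bigger, a little smaller
                trial = dict(params, **{name: params[name] * factor})
                score = train_and_evaluate(trial)
                if score > best_score:           # accept the improvement
                    params, best_score = trial, score
                    improved = True
        if not improved:                         # no coordinate helps any more
            break
    return params, best_score

# Start from hyperparameters that seemed reasonable on similar problems.
print(coordinate_search({"learning_rate": 3e-3, "dropout": 0.3}))
```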
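For completeness, here is one way Bayesian optimization might look with an off-the-shelf library. This sketch assumes scikit-optimize (`skopt`), whose `gp_minimize` fits a Gaussian-process surrogate model of the objective and uses it to decide which hyperparameters to try next; the objective below is the same made-up placeholder as above, negated because `gp_minimize` minimizes.

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

# Search space: log-uniform learning rate, integer layer count, uniform dropout.
dimensions = [
    Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),
    Integer(1, 5, name="num_layers"),
    Real(0.0, 0.5, name="dropout"),
]

def objective(values):
    learning_rate, num_layers, dropout = values
    # Placeholder score; return its negation since gp_minimize minimizes.
    score = -abs(learning_rate - 1e-3) - 0.1 * dropout
    return -score

result = gp_minimize(objective, dimensions, n_calls=30, random_state=0)
print(result.x, -result.fun)  # best hyperparameters found and their score
```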