Then, using the combination of the one-hot representation, the actual label, and the predicted label, we can count, for every pairing of a given actual label and a given predicted label, how many matches there were. This is called a confusion matrix, and what we want to see is a very strong diagonal, like we have here. In perfect performance, every time the actual label is a 5, we want the predicted label to be a 5 too, and every time it's a 1, we want the predicted label to be a 1 also. When this is the case, all of those counts fall on the diagonal and everything else is 0. Everything off the diagonal is a miss, a misclassification. (A sketch of this counting in code appears at the end of this section.)

This is interesting because it lets us see which types of errors are most common. Here, for instance, is a 0 showing that when the actual label was a 4, it was never predicted to be a 3. That makes sense; 4s and 3s don't look much alike. However, if we look here, of the times the actual label was a 4, 20 of them, 20 out of those 1,000 test examples, it was predicted to be a 9. So 2% of the time a 4 was predicted to be a 9. We can look at the other high numbers and see how they line up. Of the times it was actually a 3, 16 out of 1,000 were predicted to be a 2. Of the times it was actually a 7, 21 out of 1,000 were predicted to be a 2. That makes sense too: it's easy to picture how a 3 and a 2, or a 7 and a 2, could look similar if you draw them a little quick and sloppy.

Then you can sum up the totals of actual and predicted labels and find the precision: of all the times I predicted a 2, how often was I right, what fraction of the time? And the recall: of all the examples that were actually a 2, how many did I guess were a 2? You can see that we have very high recall on 0s, almost every time it's a 0 we guess 0, and somewhat lower recall on 9s, about 94%. So out of every 100 9s, there were about 6 that the algorithm did not guess were 9s. This also shows which classes are easier to guess and where the algorithm tended to be strong or weak. (Precision and recall are also sketched in code below.)

One of the little goodies with this case study is that it comes with a pre-trained model. train.py has already been run for a million time steps, and the resulting model has been saved as mnistclassifier.pickle. This lets you, right out of the box, run the test.py or report.py modules to evaluate the results and see how they play out, how they stack up, what they mean, without having to retrain the model yourself.

Just a bit of a heads up: I ran the train.py module and it took two days to train on my laptop, which is not crazy for training a deep neural network. So if you're planning to train it, allow for some time. While it's training, you can go into the reports directory, where it keeps a running, updated loss plot. The plot refreshes about every thousand iterations, so you can watch the loss decrease and track the training's progress. And if you choose to stop it early, say after training for 2 or 4 or 10 hours, you can make that call based on where the loss is at that time.
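To make the confusion-matrix counting above concrete, here's a minimal sketch. It assumes integer labels that can be recovered from the one-hot rows with argmax; the array names and toy values are illustrative, not taken from the case study's code.

```python
import numpy as np

def confusion_matrix(actual, predicted, n_classes=10):
    """Count, for each (actual, predicted) label pair, how many test
    examples landed in that cell. Rows are actual labels, columns are
    predicted labels, so misclassifications sit off the diagonal."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        counts[a, p] += 1
    return counts

# Integer labels come back out of one-hot rows with argmax.
one_hot = np.eye(10)[[4, 4, 3, 7]]      # toy actual labels: 4, 4, 3, 7
actual = np.argmax(one_hot, axis=1)
predicted = np.array([4, 9, 2, 7])      # toy network guesses
print(confusion_matrix(actual, predicted))
```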
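Precision and recall then fall out of the same matrix by comparing the diagonal against the column and row sums. A minimal sketch, assuming the rows-are-actual, columns-are-predicted layout above; the 2-class example matrix is made up to mirror the roughly 94% recall quoted earlier.

```python
import numpy as np

def precision_recall(counts):
    hits = np.diag(counts).astype(float)
    # Precision for class c: of all the times c was predicted
    # (column sum), what fraction were actually c?
    precision = hits / counts.sum(axis=0)
    # Recall for class c: of all the examples that were actually c
    # (row sum), what fraction were predicted as c?
    recall = hits / counts.sum(axis=1)
    return precision, recall

# Made-up 2-class matrix: 94 of 100 actual class-1 examples were
# caught, so recall for class 1 comes out to 0.94.
counts = np.array([[990, 10],
                   [  6, 94]])
print(precision_recall(counts))
```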
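Loading the saved model is a one-line pickle call. This is a guess at what test.py and report.py presumably do internally, not their actual code, and the path assumes you run it from the case study's top-level directory.

```python
import pickle

# Deserialize the pre-trained classifier saved by train.py.
with open("mnistclassifier.pickle", "rb") as f:
    model = pickle.load(f)
```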
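The running loss plot could be produced along these lines. Only the reports directory and the thousand-iteration cadence come from the description above; the filename, function name, and plotting details are assumptions.

```python
import matplotlib.pyplot as plt

def refresh_loss_plot(losses, path="reports/loss.png"):
    """Overwrite the running loss plot with the latest history.
    The filename is hypothetical; only the reports directory is
    mentioned in the case study."""
    plt.figure()
    plt.plot(losses)
    plt.xlabel("iteration")
    plt.ylabel("loss")
    plt.savefig(path)
    plt.close()

# Inside the training loop, something like this keeps the plot
# current without noticeably slowing training:
#     if iteration % 1000 == 0:
#         refresh_loss_plot(loss_history)
```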