The output is a binary value C and a set of word vectors. But to train the network, we need a loss to minimize. So two key details to note here: all of these word vectors have the same size, and all of them are generated simultaneously. We take each word vector and pass it into a fully connected output layer with a number of neurons equal to the number of tokens in the vocabulary, so in this case that would be an output layer with 30,000 neurons. Then we apply a softmax activation, which converts each word vector into a probability distribution over the vocabulary. The target for this distribution is the one-hot encoded vector for the actual word. And so we compare these two distributions and train the network using the cross-entropy loss.
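To make this concrete, here is a minimal sketch of that output step in PyTorch. Only the 30,000-token vocabulary comes from the description above; the hidden size of 768, the batch of eight token positions, and the randomly generated word vectors are placeholder assumptions for illustration.

```python
import torch
import torch.nn as nn

# Assumed dimensions for illustration: hidden size 768,
# plus the 30,000-token vocabulary mentioned above.
hidden_size = 768
vocab_size = 30_000

# Fully connected output layer: one neuron per vocabulary token.
output_layer = nn.Linear(hidden_size, vocab_size)

# A batch of word vectors produced by the model (random placeholders here),
# one vector per token position, all of the same size.
word_vectors = torch.randn(8, hidden_size)           # 8 token positions
true_token_ids = torch.randint(0, vocab_size, (8,))  # the actual words, as indices

# Project each word vector to one score per vocabulary token.
logits = output_layer(word_vectors)                  # shape: (8, 30000)

# cross_entropy applies log-softmax internally, turning each row of logits
# into a distribution and comparing it against the one-hot target implied
# by each true token id.
loss = nn.functional.cross_entropy(logits, true_token_ids)
loss.backward()
print(loss.item())
```

Note that PyTorch's `cross_entropy` takes the true token index directly rather than an explicit one-hot vector; the comparison against the one-hot distribution described above happens implicitly inside the loss.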