We already know how to embed words into a vector space. Embeddings assign similar vectors to words with similar meanings, so given a reference word, we can use these vectors to find semantically similar words from our word list. Now we would also like to use these embeddings to find clusters of similar words. The important thing here is that we're interested in words' meanings, not their structure: we're looking at words as a whole, not just as sequences of characters.

OK, so first let's load the list of 150 words using the Corpus widget. There they are: Apple, Arrow, Banana, Bed, and on and on the list goes. This time, we will embed our words using fastText. Let's check it out in a Data Table. OK, every word is represented by a vector of 300 features.

Next, we can measure the distances between each pair of words using the Distances widget. I'll make sure that the Distances widget uses the cosine distance. Then we can take a quick peek at the output of this widget with the Distance Matrix widget, and show the words in the headers of the columns and rows. I get a square matrix of distances between words. The matrix tells me that the distance between Apple and Banana is smaller than the distance between Apple and Book. Makes sense.

Now I want to feed this distance matrix into the widget for hierarchical clustering, then check out the results in the dendrogram. OK, great. It seems musical instruments make a nice cluster. Then there's a cluster of words about writing and taking notes, a cluster of sharp objects that includes knife and rake, and another large cluster related to food at the very bottom. These results make a lot of sense: the words clustered together are definitely semantically related.

I can also use any dimensionality reduction approach to try to visualize the word space in two dimensions. Say, t-SNE. I'll connect the t-SNE widget directly to the embedding output.
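Outside Orange, the same pipeline the widgets build (embeddings → cosine distances → hierarchical clustering) can be sketched in a few lines of Python with NumPy and SciPy. This is only a minimal sketch: the word list and the random vectors below are placeholders, since in practice each word's 300-dimensional vector would come from a trained fastText model.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder embeddings: in the video, each of the 150 words is a
# 300-dimensional fastText vector; random vectors stand in for them here.
words = ["apple", "banana", "book", "pen"]
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(words), 300))

# Pairwise cosine distances, as in the Distances widget.
dist = pdist(emb, metric="cosine")   # condensed form
matrix = squareform(dist)            # square matrix, words on rows/columns

# Hierarchical clustering on the distance matrix (average linkage is one
# of the options the Hierarchical Clustering widget offers).
Z = linkage(dist, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
```

With real fastText vectors, `matrix` would show, for instance, a smaller Apple–Banana distance than Apple–Book, and `labels` would group the semantically related words together.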
Here in our t-SNE plot, every point represents a word. Let's select a group of points in the right part of the plot and feed the output of t-SNE to a Corpus Viewer to neatly display my selection. Nice. This is the cluster of musical instruments. This group contains animals, and the group at the top represents food items.

Now I can also check how well the clusters in the dendrogram correspond to the groups in the t-SNE plot. I'll connect Hierarchical Clustering and t-SNE so that every selection from the dendrogram is sent to the t-SNE widget as a data subset. Then I'll arrange the two widgets side by side and select the cluster with musical instruments in the dendrogram. Great. There they are in the t-SNE plot, all grouped up, except for one outlying word. Let's click on it to see what it is. Hmm. Whistle. Well, whistle is an odd word; it isn't really a musical instrument. So it's nice that I can find these inconsistencies by looking at the t-SNE plot.

I could do more here and use other clustering algorithms or data visualization methods like MDS or PCA. You're welcome to check out our videos on some of those algorithms in our Introduction to Data Science video series; the link will be in the description below. But now we need to move on to our next video, on word classification.
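The t-SNE projection itself can also be reproduced in code, here using scikit-learn's `TSNE` as a stand-in for Orange's t-SNE widget. Again a sketch under stated assumptions: the random matrix is a placeholder for real fastText embeddings, and the perplexity value is just an illustrative choice.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder for fastText embeddings: 20 "words", 300 features each.
rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 300))

# Project the 300-dimensional embeddings down to 2-D for plotting.
# Perplexity must be smaller than the number of samples.
xy = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(emb)
```

Each row of `xy` is one point in the scatter plot, so selecting a group of points in the plot corresponds to selecting a subset of rows, which is exactly what the widget passes downstream to the Corpus Viewer.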