In the previous video on text mining, we talked about text preprocessing. Now our data should be ready for machine learning, right? Well, not quite. After preprocessing, Orange still sees only lines and lines of text. For machine learning, we need to transform the text into a numerical representation. A simple way to do this is to count how many times each word appears in a document. This approach is called bag of words.

Let us reuse the workflow from our previous lesson. Corpus reads the collection of text documents, and Preprocess Text removes stop words and delimiters. Now we will extend the workflow with the Bag of Words widget. Bag of Words outputs a data table where word counts are the newly added features. You can always inspect the output of Bag of Words in a Data Table.

Great! Now we have our data matrix, and we can look for interesting groups of documents. Connect Distances to Bag of Words. Here we will use cosine distance, as it usually works best for corpora. We feed the computed distances to Hierarchical Clustering. To estimate the distances between clusters, we will select Ward linkage.

Now drag the cutoff line at the top of the visualization left and right. What is the appropriate number of groups? Two seems to make the most sense.

The nodes in our dendrogram also have labels. In folkloristics, Grimm's tales are labeled with the Aarne-Thompson-Uther (ATU) index, which defines the topic of the tale. If the tale talks about animals, it is an animal tale; if it is more about dragons and princesses, it is a tale of magic. It looks like the tale type corresponds quite well with our clusters, except for one part where animal tales and tales of magic are mixed.

Can we figure out why they are mixed? Select the cluster and connect Corpus Viewer to Hierarchical Clustering. It seems that some tales of magic still mention animals. Perhaps clustering got it right after all.

Clustering is a great way to uncover similar documents in unlabeled text. But here we actually have labels, the ATU topic.
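For readers who prefer code to widgets, the same pipeline can be sketched in Python with scikit-learn and SciPy. This is a minimal illustration, not Orange's internal implementation; the four-sentence toy corpus stands in for the Grimm tales and is entirely made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy corpus: two "animal tale" texts and two "tale of magic" texts
corpus = [
    "the fox and the cat met in the forest",
    "the wolf and the fox shared their prey",
    "the princess kissed the frog by the well",
    "the prince fought the dragon for the princess",
]

# Bag of words: count how many times each word appears in each document,
# with English stop words removed (the preprocessing step from the video)
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus).toarray()

# Cosine distance between the document vectors
distances = pdist(X, metric="cosine")

# Hierarchical clustering with Ward linkage, then cut the tree into two groups
Z = linkage(distances, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

On this toy corpus the two animal texts share the word "fox" and the two magic texts share "princess", so cutting the dendrogram into two groups separates them, mirroring the two clusters we found for the Grimm tales.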
In the next video, we will talk about classification and try to predict the type of the tale on fresh data.