Every machine learning method needs numbers, whether they come in spreadsheet form from the start, arrive via feature construction, or, in the case of text mining, come from a bag of words. Today we're going to look at an alternative way of describing documents with vectors: document embedding.

First, let us prepare the data. We will start with the Corpus widget and the preloaded dataset of Grimm's tales. A quick look in the Corpus Viewer gives us an idea of what the data is about. These are the fairy tales of the Brothers Grimm, such as Cinderella, Little Red Riding Hood, and Snow White. Now we wish to describe the content of the tales with numeric features, but first we need to prepare the core units of our analysis. This will be a quick setup; for a detailed guide, see our text preprocessing tutorial, linked in the description below. The presets we have are already quite good. Our text is transformed to lowercase and split into words, with punctuation excluded. We also remove English stop words, that is, words that carry little meaning on their own. Finally, since we have long texts, we remove words that appear in fewer than 10% or in more than 90% of the documents.

Now we are ready for word embedding. Embeddings are a low-dimensional representation of high-dimensional data. They start with a single token, which in our case is a single word. They are based on a pre-trained model for the selected language, which takes the word and places it in the corresponding vector space. In other words, it embeds it. In this way, words that have a similar meaning, or come from the same family, will be placed close together and will have similar embedding vectors. Once all the words are embedded, the procedure averages the word vectors to produce a single document vector.

In practice, the procedure is simple. Connect Document Embedding to Preprocess Text, and Orange will compute the result server-side. You can change the language of the model or the aggregation of word vectors. A quick look at the Data Table shows that we now have 300 additional numeric features describing our documents.

Let us now do a simple clustering with cosine distances and hierarchical clustering with Ward linkage. I will label the tales by their title. A quick glance at the dendrogram shows that animal-themed tales cluster together based on the type of animal they talk about.

Document embedding is a great tool for describing documents with numbers. It is usually more accurate than bag of words, since synonyms are placed close together, so the model doesn't rely on the words alone but also on their meaning. Bag of words, on the other hand, is easier to interpret. The choice of technique is up to you. Below are three small Python sketches that mirror the steps we just clicked through in Orange.
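
The preprocessing presets described above (lowercase, word tokens without punctuation, English stop words, a 10%/90% document-frequency filter) can be approximated with scikit-learn's CountVectorizer. This is a rough stand-in, not Orange's actual preprocessing pipeline, and the three toy documents are made up for illustration.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "The wolf ran into the deep forest.",
        "The queen looked into the mirror.",
        "A wolf waited in the dark forest.",
    ]

    vectorizer = CountVectorizer(
        lowercase=True,        # transform to lowercase
        stop_words="english",  # drop English stop words
        min_df=0.1,            # keep words in at least 10% of documents...
        max_df=0.9,            # ...and in at most 90% of documents
    )
    vectorizer.fit(docs)       # default token pattern already skips punctuation
    print(vectorizer.get_feature_names_out())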
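
To make the averaging step concrete, here is a minimal sketch of how a document vector is built from word vectors. The tiny three-dimensional embedding dictionary is a made-up stand-in for the 300-dimensional pre-trained model that Orange queries server-side.

    import numpy as np

    # Toy stand-in for a pre-trained embedding model; in Orange, each word
    # maps to a 300-dimensional vector instead of these 3-dimensional ones.
    word_vectors = {
        "wolf":   np.array([0.9, 0.1, 0.0]),
        "forest": np.array([0.4, 0.8, 0.1]),
        "queen":  np.array([0.1, 0.2, 0.9]),
    }

    def embed_document(tokens, vectors):
        """Average the vectors of all known tokens into one document vector."""
        known = [vectors[t] for t in tokens if t in vectors]
        if not known:  # no token found in the model
            return np.zeros(len(next(iter(vectors.values()))))
        return np.mean(known, axis=0)

    doc = ["wolf", "forest", "wolf"]  # preprocessed tokens of one tale
    print(embed_document(doc, word_vectors))

Averaging is only one possible aggregation; as mentioned above, the widget lets you choose how the word vectors are combined.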
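
And a sketch of the clustering step, assuming SciPy as a stand-in for Orange's Distances and Hierarchical Clustering widgets. The random matrix is a placeholder for the real 5 documents x 300 embedding features.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(0)
    X = rng.random((5, 300))  # placeholder document-embedding matrix

    dist = pdist(X, metric="cosine")  # pairwise cosine distances
    # SciPy defines Ward linkage for Euclidean input but will compute it on
    # any condensed distance matrix; Orange likewise pairs it with cosine.
    Z = linkage(dist, method="ward")
    dendrogram(Z, labels=[f"tale {i}" for i in range(5)])
    plt.show()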