There are two prominent techniques for representing text documents numerically: bag of words and document embedding. Today we'll be talking about embeddings. As a quick recap of our previous videos, we've learned that embeddings are numeric representations of words, and that after embedding, each word corresponds to a location in the embedding space. There, words with similar meanings, such as mom and dad, lie close together, while unrelated words, such as car and butter, lie far apart.

Okay, now let's get started by setting up some data. We can use the Datasets widget to load the BBC News data set, a collection of BBC news articles. Then we pass it through the Corpus widget to ensure we use the text from both the title and the content. Now, let's see what we have in the Corpus Viewer. The first few articles are from the business section, with topics ranging from jobs in the US to dollar values and Greenspan. I also want to take a quick look at the Box Plot widget to check the distribution of document classes. There are about 500 articles on business and sports, and just short of 400 on entertainment.

Okay, now we'll embed the documents with the Document Embedding widget, just like we've embedded words in the past. Here, there are two embedding models to choose from: SBERT and fastText. SBERT is a sentence-based transformer model that works for over 50 languages, while fastText is a lightweight collection of models trained with a continuous bag of words. Since we're now working with more extensive texts, not just individual words, we'll use the default, SBERT. A nice feature of SBERT is that it doesn't require any pre-processing; it just handles entire documents. The data is now embedded, meaning each document has a vector representation, and we can examine those representations in a Data Table. There they are: each document described by 384 features.
Now, let's try to classify the documents using their embeddings. Remember, classification means we want to predict the class of a data instance from its description. In our case, the embeddings describe our documents, and the classification task is, given these embeddings, to predict a document's category. So, we will use the set of 1,400 embedded documents to train a logistic regression model, and then feed the model to the Predictions widget. Then, I can go to the BBC website, copy a part of some article on sports, and place the text in a new Create Corpus widget. And just for good measure, I can take another piece, this time on business, and make a new entry in Create Corpus. Okay, I'll have to embed these two articles as well. Then, I can feed the resulting embeddings into the Predictions widget. And there we go: both documents are correctly classified as sports and business, respectively.

Now, since we used logistic regression, we can even inspect the classification model in a Nomogram. Hmm, this doesn't look very informative. The business category seems to be represented by features with odd names, like Dimension 73 and Dimension 322. This is a document embedding issue: while text embeddings can certainly represent text, interpreting them can be difficult. All we get from document embedding is a set of numbers to which, at least at this stage, it's very hard to associate any meaning.

Let me show you a quick way of characterizing a group of documents with associated words. We can project all the documents into two dimensions, using something like t-SNE. In the t-SNE visualization, each document from our data is represented by a single point. Documents with similar embeddings, and therefore similar semantics, should appear nearby. So, using t-SNE, we should be able to uncover groups of similar documents. Now, let's color the documents according to their category.
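The train-and-predict step above can be sketched with scikit-learn. Since we don't have the actual BBC embeddings here, the vectors below are synthetic stand-ins: three well-separated clusters of 384-dimensional points playing the role of embedded documents.

```python
# Hedged sketch of the classification step, with synthetic stand-ins
# for the document embeddings (not the real BBC data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, dim = 1400, 384

# Three categories, each drawn around its own center in embedding space.
centers = rng.normal(size=(3, dim))
idx = rng.integers(0, 3, size=n)
X = centers[idx] + 0.5 * rng.normal(size=(n, dim))
y = np.array(["business", "entertainment", "sport"])[idx]

# Train logistic regression on the 1,400 "embedded documents".
model = LogisticRegression(max_iter=1000).fit(X, y)

# Two "new articles": fresh points near the sport and business centers,
# standing in for the embedded BBC snippets.
new = centers[[2, 0]] + 0.5 * rng.normal(size=(2, dim))
print(model.predict(new))  # expect sport, then business
```

As in the video, the model itself is hard to read: its coefficients refer to anonymous dimensions, not interpretable words.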
Great, we can already see how the clusters represent different news categories, but there is some overlap. For example, business and entertainment are pretty close, and some business documents lean far toward sports. Still, we can attempt to explain the documents in the t-SNE plot. I'll feed the output of t-SNE to the Annotated Corpus Map and make sure all the data is passed through by rewiring the connection. Now, to find document groups, I'll use a Gaussian mixture model. The Annotated Corpus Map determines clusters in the visualization and labels them with keywords. These keywords tell me that the sports section, for example, mentions England and Wales a lot. The entertainment news covers the best films and music, and the business news tends to talk about prices, percentages, and the economy. Now, I have no idea why "I" is specific to sports; I would need to investigate this further, or simply instruct Orange to remove particular words from document characterization. Coincidentally, we will be learning this in future videos, when we cover bag of words.

Document embedding is a great technique for extracting numeric representations of documents. It enables us to use any machine learning technique downstream, from clustering to classification. Its main downside is that it's difficult to interpret, but as we saw, with the help of interactive maps like t-SNE, we can try to alleviate some of those issues.