Welcome to the text mining series. Here we'll go over the basics of applying machine learning techniques to text and to collections of text documents. Previously, in the Introduction to Data Science series, we mainly dealt with tabular data, that is, objects of interest described by several features. For example, we represented iris flowers by the width of their petals and sepals, and animals by the number of legs. Now we'll be considering plain text. Say we have a list of words: how do we process it, what kind of machine learning, clustering, or classification can we use, and what information can be inferred from the textual data?

We can explore all of this in Orange using the Text mining add-on, so let's install it: go to the Add-ons menu, find Text, check it, and install the add-on by clicking OK.

Let me start with a simple example. I have an Excel table with a list of 150 words, ranging from apple, arrow, and banana to zoo and zucchini at the end. You can also find a link to this data set in the description. The data has only one column, which I've conveniently named words. I can drag the file icon onto an empty canvas to load this list of words into Orange. There it is: 150 data instances and a single column of type text. The File widget also reports that this column is a meta column. I can double-check the data in a Data Table; there it is, just a column of words.

To do anything with text in Orange, I first have to pack the data into a corpus, which is simply a collection of documents, and then find a way to represent these documents with numbers. For the first step I'll use the Corpus widget. By default it annotates the data by choosing the words variable as the title, and lets me select the variables that will collectively represent the text content of the documents. Since the only variable in my data is words, the corpus definition is simple and I'll leave it as it is.

I still need to represent the text with numbers, though, and all I have is a list of words. To turn these words into numbers, I'll use a pre-trained deep neural network. This technique is available through the Document Embedding widget, and I'll use SBERT (Sentence-BERT) as the embedding method. SBERT transforms text into fixed-length vectors while preserving semantic meaning, so that related words have similar vectors. SBERT is trained on sentences and considers context, while fastText, another option available in this widget, is trained on words alone. For now, I'll stick with SBERT. I can check the output of the embedding in another Data Table: I still have all my words, but now each one is described by an additional 384 numbers. I don't know exactly what these numbers mean, but for now I just hope they're helpful.

Now let's check whether our embeddings make any sense. I plan to select a word like banana and find its semantically closest words in my collection. To do this, I'll use the Neighbors widget, give it our corpus of embedded words as input, and make sure that similarity between embeddings is measured with cosine distance. I'll also limit the number of neighbors to three. I then choose a reference word in my Data Table, say banana, and pass the selection from the Data Table to the Neighbors widget. Be aware that for Orange to assign the inputs correctly, I have to connect the whole data set first and only then the selected word, so that the full corpus lands on the Data input and the selection on the Reference input.
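Before looking at the results, here is a rough sketch of what the embedding step is doing conceptually, written in Python with the open-source sentence-transformers library. This is not Orange's internal implementation, and the model name all-MiniLM-L6-v2 is just one SBERT-family model that happens to output 384-dimensional vectors; Orange's Document Embedding widget may use a different model on its server.

```python
# Conceptual stand-in for the Document Embedding widget, NOT Orange's own code.
from sentence_transformers import SentenceTransformer

words = ["apple", "arrow", "banana", "zoo", "zucchini"]  # a few of the 150 words

# all-MiniLM-L6-v2 is a small SBERT-family model producing 384-dimensional vectors
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(words)

print(embeddings.shape)  # (5, 384) -- one 384-number vector per word
```

Each word ends up as a row of 384 numbers, which is exactly the extra block of columns that shows up in the Data Table after the embedding widget.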
Okay, so let's see what's on the output of the Neighbors widget. This time, instead of a Data Table, I'll look at the results in a Corpus Viewer, and just for now I'll minimize its control panel. And there we go: the semantically similar words to banana are fruit, lemon, and yellow.

Now let's place the Data Table and Corpus Viewer widgets side by side, so I can select a reference word and see the results immediately. How about the word bed? I get mattress, pillow, and blanket. And for car, I get van, engine, and wheel. It seems to work like a charm. Note, though, that my collection contains only 150 words, so I can easily encounter some strange relations. For example, the list of semantically similar words to dog, according to SBERT, includes ostrich, which, okay, is an animal, but still not one I would really consider close. Let's try one more: drum. I get harp, ukulele, and trumpet, all the other instruments in my collection. I'm displaying just three similar words, and I can easily extend this by setting the number of neighbors to, say, five; now I also get the additional words jam and saxophone.

I find it quite amazing how well embeddings work on words. We found similar words by using SBERT, a pre-trained model that turns text into numbers, and then searching the embedding vector space for similar vectors. As it turns out, we can apply all standard machine learning techniques to text just by turning it into numbers, and I'll present some more examples of this in our upcoming videos.
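To make the "searching the embedding vector space" step concrete, here is a minimal sketch of the nearest-neighbour lookup that the Neighbors widget performs with cosine distance. It assumes that words and embeddings hold the full word list and its embedding matrix, as in the earlier sketch; closest_words is a hypothetical helper, not part of Orange.

```python
import numpy as np

def closest_words(reference, words, embeddings, k=3):
    """Return the k words whose embeddings are closest to the reference
    word under cosine distance (1 - cosine similarity)."""
    ref = embeddings[words.index(reference)]
    # cosine similarity of every embedding with the reference vector
    sims = embeddings @ ref / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(ref))
    order = np.argsort(1 - sims)                 # smallest cosine distance first
    return [words[i] for i in order[1:k + 1]]    # skip position 0: the reference itself

print(closest_words("banana", words, embeddings, k=3))  # e.g. ['fruit', 'lemon', 'yellow']
```

Cosine distance compares only the direction of the vectors and ignores their length, which is why it is a common choice for comparing text embeddings.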