Today on the menu we have word classification. Let me start with a list of words for, well, foods: grapefruit, cauliflower, green bean, gorgonzola, and so on. I also have a column that describes the category of each food: fruits, vegetables, dairy, you get the picture. My goal is to build a classifier that, given a food, can predict its type. So I'd like it to tell me whether a clementine is a fruit or a vegetable, and what kind of food camembert is.

The first step is to load the data into Orange. I just drag in the Excel file and open the File widget. You can also use the Datasets widget in Orange; just search for food words. There I have 108 data instances: a text column called food and an extra feature called category, which I'd like to use as my designated target. So I have to change its role to reflect this, and remember to actually press Apply for the change to take effect.

Just like in the other text mining videos, we'll have to represent our words as numbers in order to build a classifier, and we already know that we can do this with embeddings. First I need to tell Orange I'm working with a corpus of documents, so I'll bring up the Corpus widget. Here I'll use the food feature as the document title and as a text feature, and I'll make sure the language is set to English. Now I can embed my words with fastText and check the output in a Data Table. It looks fine: there's a category column, a text column with the names of various foods, and then 300 numerical features representing each word.

Now I'll do two things. First, I'll estimate how accurately logistic regression models my data. Second, I'll classify some new foods using the model built from the existing data. A quick warning: I'll be using some widgets that we covered in our Introduction to Data Science series, so go check those out if you want a more thorough explanation. The links are in the description below.

Let's get started with accuracy estimation. I'll use cross-validation in the Test and Score widget and attach the data and the Logistic Regression learner. We already have some results: five-fold cross-validation estimates the classification accuracy of logistic regression at 98%. That means there are just a handful of misclassifications, so I should be able to take a closer look at them with the Confusion Matrix widget. And indeed, the model misclassified two vegetables. I can select the cell with the misclassifications and add another Data Table to the output to see that our model had trouble with ginger and pumpkin. Remember, we only have 108 words, or foods, in our data; we could further improve the performance of the classifier by adding more items. You're free to try this on your own.

My second goal for the day is to classify some foods that are outside my current dataset. I could create another data file in Excel, but instead I'll use the Word List widget. Let's try to predict the types of date, clementine, camembert, and gouda. I'll send these words to a Corpus widget (yes, they are also in English) and then get their embeddings as well. Remember, it's always a good idea to double-check results in a Data Table. I have the words and their embeddings, plus an extra unused column that shows whether each word was selected in the Word List. Now it's time for classification.
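(A quick aside before we move on: if you ever want to reproduce the accuracy check above in a script rather than on Orange's canvas, here is a minimal sketch using scikit-learn and the fasttext package. It rests on some assumptions that are mine, not the video's: a spreadsheet named food.xlsx with food and category columns, and a pretrained English fastText model downloaded to cc.en.300.bin.)

```python
# Minimal sketch: embed food words and cross-validate logistic regression.
# food.xlsx and cc.en.300.bin are placeholder names, not files from the video.
import numpy as np
import pandas as pd
import fasttext
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, confusion_matrix

data = pd.read_excel("food.xlsx")            # columns: food, category
ft = fasttext.load_model("cc.en.300.bin")    # 300-dimensional English vectors

# get_sentence_vector also copes with multi-word entries like "green bean"
X = np.array([ft.get_sentence_vector(w) for w in data["food"]])
y = data["category"].values

# Five-fold cross-validation of logistic regression
pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Classification accuracy:", accuracy_score(y, pred))
print(confusion_matrix(y, pred, labels=sorted(set(y))))
```

Here cross_val_predict plays roughly the role of the Test and Score widget, and confusion_matrix stands in for the Confusion Matrix widget.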
I'll remove the Test and Score and Confusion Matrix widgets and pass all my training data directly to Logistic Regression, letting it build a model. Then I'll send that model to the Predictions widget. The data I'd like to make predictions on will come from the other pipeline, the one that starts with Word List. Great: date and clementine are fruits, and camembert and gouda are dairy products, just as you'd expect. There are a few more things I could do here, but that's always the case. For example, I could use t-SNE to visualize the data and see how the groups correspond to my categories, or check whether embedding with SBERT changes the accuracy (it does), or try even more words to see where my model succeeds and where it fails. These are all things I urge you to try for yourself. But in the meantime, I have to move on to text mining documents instead of just words. So I'll see you next time.
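(And if you want to script this prediction step as well, here is a hedged continuation of the sketch above: fit logistic regression on all 108 embedded words, then predict the types of the new words. The list of new words simply stands in for the Word List widget.)

```python
# Continuation of the sketch above: train on all the data, then predict new words.
model = LogisticRegression(max_iter=1000).fit(X, y)

new_words = ["date", "clementine", "camembert", "gouda"]   # stand-in for Word List
X_new = np.array([ft.get_sentence_vector(w) for w in new_words])

for word, label in zip(new_words, model.predict(X_new)):
    print(word, "->", label)
```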