In our previous videos, we discussed techniques for reducing data dimensionality, including principal component analysis, multidimensional scaling, and t-SNE. The ultimate goal of these techniques was to represent the data in two dimensions. Now, I'll refer to this kind of depiction as a data map, and try to briefly discuss how to characterize the clusters that appear in such maps. Let's start with the zoo dataset. I'll pretend for the moment that I have no information on the type of animal I'm looking at. I can easily do this using the Select Columns widget, where I select the feature I want to ignore. Now, a quick look at t-SNE tells me that there may be several clusters in my data. For instance, consider the group at the top. To find out what characterizes these animals, I can use a box plot. Remember, when I send a data signal from a widget that supports subset selection, Orange adds an additional feature to my data that marks the selected instances. This feature is called Selected, and I can use it to compare the distributions of the selected subset and the rest of my data. Now, what we want to do is find the features with the largest difference between these two distributions. So, I can tell Orange to order by relevance to subgroups, and take a closer look at the features at the top of the list. As it turns out, all the animals in the selected cluster have feathers, most of them have two legs, and none of them have teeth. Judging by that, they're probably birds. And indeed, if I look at the data in a table, I see a chicken, a crow, a dove, and a duck. Now, I'll switch it up a bit and take a look at another cluster. Using the same process as before, I see that most of these animals give milk, they have hair, and they don't lay eggs. So, I'm guessing we're looking at some mammals. Again, I can verify this very easily by looking at a table.
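To make the "order by relevance to subgroups" idea concrete, here is a minimal sketch of what such a ranking could look like in plain Python. The animal table and the scoring rule (absolute difference in feature means between the selected subset and the rest) are simplifications for illustration, not the actual statistic Orange's Box Plot widget uses.

```python
# Rank features by how differently they are distributed in the
# selected subset versus the rest of the data. The toy table below
# is hypothetical, zoo-like data with binary features.

def rank_by_subgroup_relevance(rows, selected, feature_names):
    """Return feature names sorted by |mean(selected) - mean(rest)|."""
    sel = [r for r, s in zip(rows, selected) if s]
    rest = [r for r, s in zip(rows, selected) if not s]

    def mean(values):
        return sum(values) / len(values)

    scores = []
    for j, name in enumerate(feature_names):
        diff = abs(mean([r[j] for r in sel]) - mean([r[j] for r in rest]))
        scores.append((diff, name))
    scores.sort(reverse=True)  # largest distribution difference first
    return [name for _, name in scores]

# Toy data, one row per animal: [feathers, milk, teeth]
rows = [
    [1, 0, 0],  # chicken
    [1, 0, 0],  # crow
    [1, 0, 0],  # duck
    [0, 1, 1],  # bear
    [0, 1, 1],  # wolf
    [0, 0, 0],  # tortoise
]
selected = [True, True, True, False, False, False]  # the "bird" cluster

print(rank_by_subgroup_relevance(rows, selected, ["feathers", "milk", "teeth"]))
# "feathers" comes out on top: it perfectly separates the subset from the rest.
```

The feature that splits the selected cluster from everything else gets the highest score, which is exactly why "feathers" surfaces first when the bird cluster is selected.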
If you've been following our previous videos, you'll know that the zoo dataset has a very clear and sensible cluster structure. But what happens if I try to use a dataset that isn't as forgiving, like employee attrition? This one is also available through the Datasets widget. I'll take a quick peek to get a sense of what we're looking at. The first column, the class or target variable, shows whether an employee has left the company. The rest of the data characterizes employees by age, travel frequency, department, education level, and so on. Again, I'll ignore the class information and focus only on the clustering structure revealed by t-SNE. I can see a very clear cluster in the top left; again, I make sure that the box plot subgroups by the Selected feature. It seems that the feature that best distinguishes this cluster from the rest of the data is the department. Furthermore, all the employees in this cluster work in HR. I can also take a look at the cluster on the right and find that it includes all the salespeople. Looks like this company has a lot of sales executives and some sales representatives. Obviously, they all work in the sales department. Now, all the remaining data seems to form one big cluster. I can select it all using a modifier key to add multiple rectangular regions; on a Mac, this would be the command key. Now, I'll run t-SNE on this data again. It doesn't look like anything crazy happened, but I do get one extra little cluster that might be interesting. So, again, I send the data to the box plot, make sure that I subgroup by Selected, and check order by relevance. I find more people from sales: the majority have a marketing background, and most of them are managers. There are other ways to characterize clusters in data maps and to find features that distinguish between groups, and we will talk about some of them when we get to classification algorithms.
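Selecting several regions at once, as with the modifier key above, amounts to taking the union of rectangular selections on the 2-D map. Here is a minimal sketch of that logic; the points and rectangles are hypothetical, and each rectangle is given as (xmin, ymin, xmax, ymax).

```python
# Combine several rectangular selections on a 2-D data map:
# a point is selected if it falls inside ANY of the rectangles,
# mirroring how holding a modifier key adds regions to a selection.

def select_in_rectangles(points, rectangles):
    """Return indices of points inside the union of the rectangles."""
    selected = []
    for i, (x, y) in enumerate(points):
        if any(x0 <= x <= x1 and y0 <= y <= y1
               for x0, y0, x1, y1 in rectangles):
            selected.append(i)
    return selected

points = [(0.5, 0.5), (2.0, 2.0), (5.0, 5.0), (5.5, 4.8)]
regions = [(0, 0, 1, 1), (4, 4, 6, 6)]  # two separate rectangles

print(select_in_rectangles(points, regions))  # -> [0, 2, 3]
```

The point at (2.0, 2.0) lies outside both rectangles, so it stays out of the selection even though it sits between them.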
But until then, ordering features by how much their distributions differ and viewing them in a box plot is a simple and effective method that tends to yield pretty good results.