In this video, I will talk about embedding data into two-dimensional space. While this may sound complicated, I assure you it's not. I'll start with a simple example and develop the idea to the point where I can embed the zoo dataset from my previous videos into a two-dimensional map. I've found a website that reports travel distances between some European cities. It says, for example, that Barcelona is about 1,500 km from Belgrade, which in turn is another 300 km from Budapest. I'd like to use this distance matrix to devise a two-dimensional plot where my cities are represented by points and the distances between these points are proportional to the distances in my matrix. In other words, I want to construct a map that faithfully represents the distances in the input matrix. To do this in Orange, the first step is to prepare the data. I copy the distances into Excel using Paste Special as text to get rid of the formatting, and save the file to the desktop. Now, to load the distance matrix into Orange, I use the Distance File widget. It tells me that the data contains 24 labeled data instances, which I can examine in a Distance Matrix widget. Good, it looks like nothing has changed. Now for the main trick. I will use an embedding technique called multi-dimensional scaling to turn the distances into a two-dimensional map of European cities. All I need to do is connect the MDS widget to the Distance File widget. There we go. I see Belgrade is indeed close to Budapest; Prague is close to Vienna, Munich, and Berlin; and Moscow is way off on one side, with Barcelona and Madrid off on the other. Multi-dimensional scaling seems to have recreated a map of European cities, except it's flipped upside down. This geographic information was not in our original data; we only provided the distances. A map of European capitals oriented the way we're used to would look more like this: Madrid on the bottom left, Dublin to the north, and Moscow far to the right. 
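Outside the Orange canvas, the same trick can be reproduced programmatically. Here is a minimal sketch using scikit-learn's MDS on a precomputed distance matrix; the city names and round-number distances are illustrative stand-ins, not the website's actual figures:

```python
import numpy as np
from sklearn.manifold import MDS

# Illustrative (made-up) road distances in km between four cities;
# the matrix must be symmetric with a zero diagonal.
cities = ["Belgrade", "Budapest", "Vienna", "Barcelona"]
D = np.array([
    [   0,  300,  600, 1500],
    [ 300,    0,  250, 1500],
    [ 600,  250,    0, 1350],
    [1500, 1500, 1350,    0],
], dtype=float)

# dissimilarity="precomputed" tells MDS to treat D as ready-made
# distances rather than computing them from feature vectors.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)  # shape (4, 2): one (x, y) per city

for name, (x, y) in zip(cities, coords):
    print(f"{name:10s} {x:8.1f} {y:8.1f}")
```

As in the video, the resulting layout recovers the relative positions of the cities, but its orientation is arbitrary: any rotation or flip of the points preserves all pairwise distances equally well.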
I can help Orange by flipping the MDS coordinates. Note that MDS can also output the entire dataset with the embedding, that is, the inferred two-dimensional coordinates, stored as meta features. Anyway, I use the Feature Constructor to introduce two new variables, X and Y, that flip the computed MDS coordinates. Now, when I show my data in a Scatter Plot, it places the cities the way we're used to seeing them on a map. Take a second to appreciate what multi-dimensional scaling is doing here: I gave it the driving distances between pairs of cities, and Orange used that data to place the cities in a way that closely resembles a proper map of Europe. We can do this with any data, as long as we can obtain or construct a distance matrix; say, the zoo dataset. Let's restart with an empty canvas. Now, as in my previous videos, I load the zoo dataset using the Datasets widget. Remember, the data contains information on 100 animals, described by 16 features, such as their number of legs or whether they have hair or feathers. Using this data, I construct the distance matrix with the Distances widget. As we've already used this widget in every workflow for hierarchical clustering, I'll skip the details for now, but feel free to re-watch those videos. I can examine my computed distances by feeding them to the Distance Matrix widget. And, finding them intact, I continue by feeding the distances to MDS. Taking a look at the plot, it makes a lot of sense. The mammals are clumped together, and so are the fish and the birds. The three mammals that are closest to the fish are, no surprise, the dolphin, the platypus, and the porpoise. If you compare this plot to the principal component plot from my previous videos, you might also notice that the insects and birds aren't as intermixed this time. However, Orange still indicates with a line that some insects should be closer to some birds than depicted on the map. 
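The zoo workflow, feature table into Distances into MDS, can also be sketched in code. This is only an approximation of what the widgets do, with SciPy and scikit-learn in place of Orange and a handful of made-up binary animal rows standing in for the real dataset:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Tiny stand-in for the zoo data: animals described by binary features
# (hair, feathers, eggs, milk, aquatic, predator) plus legs scaled to 0-1.
# These rows are illustrative, not the actual zoo dataset.
animals = ["dolphin", "porpoise", "sparrow", "hawk", "herring", "bear"]
X = np.array([
    [0, 0, 0, 1, 1, 1, 0.00],  # dolphin
    [0, 0, 0, 1, 1, 1, 0.00],  # porpoise (identical to dolphin here)
    [0, 1, 1, 0, 0, 0, 0.25],  # sparrow
    [0, 1, 1, 0, 0, 1, 0.25],  # hawk
    [0, 0, 1, 0, 1, 0, 0.00],  # herring
    [1, 0, 0, 1, 0, 1, 0.50],  # bear
])

# The Distances step: pairwise Euclidean distances as a square matrix.
D = squareform(pdist(X, metric="euclidean"))

# The MDS step: embed the 6x6 distance matrix into two dimensions.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)

for name, (x, y) in zip(animals, coords):
    print(f"{name:8s} {x:6.2f} {y:6.2f}")
```

Animals with identical feature rows, like the dolphin and porpoise above, get a distance of zero and end up on top of each other in the embedding, just as similar animals clump together in the video's plot.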
Remember, we estimated the distances in a 16-dimensional space, and multi-dimensional scaling can only approximate these distances in two. The distances in two dimensions will never be entirely faithful to those we estimate from 16-dimensional data. If they were, all that 16-dimensional data would lie on some two-dimensional plane, making 14 dimensions completely redundant. You can get an impression of this faithfulness by changing the threshold at which Orange links two data instances, like this. Again, the lines connect the points that represent instances that are close together in the original data space; in a faithful representation, the lines should connect nearby points in the MDS visualization. Multi-dimensional maps remind us of the plots we constructed with principal component analysis. Remarkably, these two entirely different algorithms can result in very similar visualizations. PCA finds a two-dimensional projection that retains the most variance, while MDS aims to preserve all the pairwise distances. PCA is a projection-based approach, while MDS iteratively optimizes the placement of points. MDS embeds the data in a two-dimensional space whose resulting coordinates, unlike principal components, have no particular meaning. The main advantage of multi-dimensional scaling is that it can handle distance matrices directly, while PCA requires a tabular data representation.
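This faithfulness argument can be checked numerically. A rough sketch, using random 16-dimensional data as an assumption: compare the original pairwise distances with those measured in the two-dimensional embedding. The residual between the two, the raw stress, is what MDS minimizes but cannot drive to zero:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X16 = rng.normal(size=(30, 16))   # 30 points in 16 dimensions
D16 = squareform(pdist(X16))      # "true" high-dimensional distances

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D16)
D2 = squareform(pdist(coords))    # distances in the 2-D embedding

# Raw stress: sum of squared differences between the two distance sets
# (divided by 2 because the square matrices count each pair twice).
stress = np.sum((D16 - D2) ** 2) / 2
print(f"stress = {stress:.1f}")
```

For generic 16-dimensional data the stress stays strictly positive; it would only vanish if the data happened to lie on a two-dimensional plane, which is exactly the point made above.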