Let's take a second and think about the stars. A typical image of the night sky, according to Google, is maybe this one here. See the bright line across the sky? That's our galaxy, the Milky Way. Or rather, a two-dimensional projection of our galaxy as we see it from Earth.

Okay, that's an odd way of thinking about the stars. However, based on the night sky's appearance, one could model our galaxy as linear, meaning that most of the Milky Way's stars are positioned side by side on a straight line. Well, it turns out we know this isn't actually so. Astronomers have measured the distances to stars in our galaxy and figured out that it has more of a spiral shape. This image is also a two-dimensional projection of the stars in our three-dimensional galaxy.

So, which model of our galaxy is better, a line or a spiral? Which tells us more about the Milky Way? And if we pick a specific star, which of our models tells us more about its location in relation to other stars? Well, the more informative projection is obviously the spiral.

In data science, we also like visual depictions of data. Think of each data instance as a star, but positioned in a multi-dimensional space. Now, just like visualizing the Milky Way as a spiral, we would like to find a two-dimensional projection of our data that retains as much information as possible. A popular technique for finding such a projection is called principal component analysis, or PCA for short.

Let's use PCA on some example data. I'll use the zoo data set from the Datasets widget. The data describes animals like bears, boars, and catfish with features such as having hair or feathers, laying eggs, or giving milk. The data also includes information about the type of animal. Now, I'll feed this data into the PCA widget. PCA aims to find the axis, or principal component, in the multi-dimensional space along which the data varies the most.
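For readers who want to see what happens under the hood, here is a minimal sketch of the idea in Python with NumPy. It is not the PCA widget's actual implementation. The helper name `pca_2d` and the tiny table of animals are made up for illustration and are not the real zoo data. The sketch centers the data, finds the directions of greatest variance with a singular value decomposition, and reports how much of the variance each direction explains.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the two directions of greatest variance.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center every feature at zero
    # Rows of Vt are the principal directions, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = S**2 / (len(X) - 1)                # variance along each direction
    ratio = var / var.sum()                  # share of the total variance
    coords = Xc @ Vt[:2].T                   # PC1 and PC2 for every row
    return coords, ratio

# Made-up toy table, NOT the real zoo data.
# Columns: hair, feathers, eggs, milk
animals = np.array([
    [1, 0, 0, 1],   # mammal-like
    [1, 0, 0, 1],   # mammal-like, identical to the row above
    [0, 1, 1, 0],   # bird-like
    [0, 1, 1, 0],
    [0, 0, 1, 0],   # fish-like
    [0, 0, 1, 0],
])

coords, ratio = pca_2d(animals)
print(coords.shape)   # one (PC1, PC2) pair per animal
print(ratio[:2])      # variance explained by the first two components
```

Each later component is perpendicular to the earlier ones and captures the largest share of the variance that remains, which is why the shares come out in decreasing order.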
For our zoo data set, PCA tells us that the first, most informative component explains 28% of the variance, and the second component explains another 19%. Together, they account for almost half of the total variance in the data. For now, we're only interested in two-dimensional plots, so we'll only be concerned with the first two principal components.

On the output of the PCA widget, we get our data with two additional columns, PC1 and PC2. These define the position of each data instance, that is, each animal, in the projection found by our analysis. We can see these two coordinates in a scatter plot. I need to use PC1 for the x-axis and PC2 for the y-axis. I'll also instruct my scatter plot to color the points according to the type of animal.

Nice! As we might have expected, all the mammals appear close to each other in the scatter plot, and so do the fish and the reptiles. The insects, however, are a little intermixed with the birds. Let's select the mammals here closest to the fish and take a look at them in the data table. Unsurprisingly, we find the dolphin, porpoise, and seal. Dolphins and porpoises have the same values for all features, so they're projected to the same coordinates in the PCA plane. We can reveal this overlap, though, by jittering the points a little bit.

So, in this video, we used PCA to reduce our data to two dimensions. This allowed us to present the data in a scatter plot and visualize its structure. Notice how we can reason about the similarity between data points in the projected space and find exciting new clusters.
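The jittering trick mentioned above can be sketched in a few lines of Python. This is only an illustration of the general idea, not how the scatter plot widget does it. The `jitter` helper, its `amount` parameter, and the sample coordinates are all assumptions made for the example. Points that share exactly the same projected coordinates get a small random nudge, scaled to the spread of each axis, so they no longer sit on top of each other.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable

def jitter(points, amount=0.02):
    """Nudge each point by a small random offset (hypothetical helper).

    `amount` is the maximum offset as a fraction of each axis's range.
    """
    points = np.asarray(points, dtype=float)
    span = points.max(axis=0) - points.min(axis=0)
    span[span == 0] = 1.0        # avoid a zero-width axis
    noise = rng.uniform(-amount, amount, size=points.shape) * span
    return points + noise

# Made-up projected coordinates (PC1, PC2). The first two rows overlap
# exactly, like two animals with identical feature values.
projected = np.array([
    [0.5, -0.2],
    [0.5, -0.2],
    [-1.0, 0.8],
])

shaken = jitter(projected)
print(shaken)                    # the first two rows are now slightly apart
```

The offsets are kept small on purpose: the goal is to reveal overlapping points without changing which points look close to each other.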