Not all data is created equal. Some points sit near the center of their group, while others stick out. There are inliers and there are outliers. So today I'd like to explain these concepts, especially with regard to clustering.

The first thing to do is to create a data set. I'll paint three clusters: a blue one, a red one, and a green one off to the side. Now, consider the blue cluster. Between these three points, let's call them A, B, and C, which one is most representative of its cluster? Intuitively, you'd probably say B. It's in the center of its cluster and far from the other two. We could also call point B an inlier. Next, consider point A. While it is far away from the red and green clusters, it's also very near the edge of its own cluster, so it's probably not very representative of the entire blue cluster. And point C is a clear outlier: it's on the very edge of the blue cluster and very near the red one.

Now, we probably want to quantify this vague idea of how central a point is to its cluster and give it a numeric value. This way, we can actually measure how well a point represents its cluster. In our example, we can estimate the cohesion of point B by computing the average distance to every other point in its cluster. Call this value lowercase a, to avoid confusing it with point A. We can also estimate point B's separation from other clusters by averaging the distance to each of their points. The nearest foreign cluster to B is the red one, so we'd be considering those distances. Let's call the separation between B and the red cluster lowercase b.

Using these two scores, we can say a point is central to its cluster if a is small and b is large. We can even compute the difference between b and a and normalize it by the larger of the two to get what we call the silhouette of the point: s = (b - a) / max(a, b). For point B specifically, we'd expect the silhouette to be near one, because b is much larger than a, so we're dividing something only a little smaller than b by b itself.
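The calculation described above can be sketched in a few lines of Python. The clusters and the coordinates here are made up for illustration; the function name `silhouette` is my own, not from any particular library.

```python
import math

def silhouette(point, own_cluster, nearest_cluster):
    """Silhouette of one point: s = (b - a) / max(a, b).

    a = average distance to the other points in its own cluster (cohesion),
    b = average distance to the points of the nearest foreign cluster (separation).
    """
    dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    a = sum(dist(point, p) for p in own_cluster if p != point) / (len(own_cluster) - 1)
    b = sum(dist(point, q) for q in nearest_cluster) / len(nearest_cluster)
    return (b - a) / max(a, b)

# Hypothetical "blue" cluster with a central point B, and a distant "red" cluster.
blue = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
red = [(10.0, 0.0), (11.0, 0.0), (10.5, 1.0)]
B = (0.5, 0.5)

print(silhouette(B, blue, red))  # near 1: a is small, b is large
```

For point B, a is about 0.71 and b is about 10, so the silhouette comes out around 0.93, close to one as expected.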
Keeping this in mind, we can see that for inliers, meaning points near the center of a cluster and far away from neighboring clusters, silhouette scores should be close to one. On the other hand, for outliers, points at the edge of a cluster, the silhouette should be near zero. It's also possible to have a negative silhouette. This can happen when the point is actually closer to the neighboring cluster than it is to its own.

Now, we can apply this definition to our running example. The silhouette of point A should be positive but smaller than the silhouette of point B. And the silhouette of point C should be near zero, as its distances to the blue and red clusters are fairly similar.

Luckily, I can check this with Orange's Silhouette Plot widget. So here are the scores. I can see an outlier in the blue cluster has a negative silhouette. Selecting this bar, I can plot my data on a scatter plot by connecting the Selected Data output to the Data Subset input. The outlier turns out to be point C, the point right next to the red cluster. I can also use the silhouette to do the opposite and look at the inliers: these are the three centermost points of the blue cluster, and this is the central point of the red cluster. You might also notice that the green cluster has the largest silhouettes on average. That's because the green cluster is the most separated from the other two.

Finding inliers and outliers is trivial in two dimensions and can honestly be done visually, without computing silhouettes at all. The real usefulness of silhouettes becomes much more apparent in higher dimensions. And as we've seen, all I need to compute them are the distances between data points. Coincidentally, that's exactly what we've been talking about in our previous videos. So next time, I'll combine these two ideas and try to find the inliers and outliers of some much more complex data.
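To see how an outlier like point C ends up with a near-zero or even negative silhouette, here's a sketch that scores every point across several clusters. The coordinates are made up, and b is taken as the smallest average distance to any other cluster, which is how the nearest foreign cluster is chosen; in practice a tool like Orange computes this for you.

```python
import math

def silhouettes(clusters):
    """Silhouette score for every point, keyed by (cluster index, point index).

    For each point: a = mean distance within its own cluster,
    b = smallest mean distance to any other cluster, s = (b - a) / max(a, b).
    """
    dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    scores = {}
    for ci, cluster in enumerate(clusters):
        for pi, p in enumerate(cluster):
            a = sum(dist(p, q) for q in cluster if q is not p) / (len(cluster) - 1)
            b = min(
                sum(dist(p, q) for q in other) / len(other)
                for oi, other in enumerate(clusters) if oi != ci
            )
            scores[(ci, pi)] = (b - a) / max(a, b)
    return scores

# Hypothetical data: the last "blue" point plays the role of the outlier C,
# sitting right next to the red cluster while still labeled blue.
blue = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0), (8.0, 0.5)]
red = [(10.0, 0.0), (11.0, 0.0), (10.5, 1.0)]
green = [(0.0, 20.0), (1.0, 20.0), (0.5, 21.0)]

scores = silhouettes([blue, red, green])
print(scores[(0, 3)])  # "C": negative, it is closer to red than to its own cluster
```

Running this, the misplaced point gets a negative silhouette, while the well-separated green points all score near one, matching what the widget shows.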