Welcome to this deeper dive into unsupervised learning. Unsupervised learning is a machine learning task in which you take unlabeled data and try to learn the underlying structure of the data. If you're not already familiar with the core concepts of machine learning, I strongly recommend that you go watch my short introduction to that first. In this presentation I'll start by covering clustering, then go on to talk about dimensionality reduction and autoencoders, and then finally talk a bit about how we can evaluate the results and the big picture of how the many different methods relate to each other.

But let's first have a look at the input data. When you're doing unsupervised learning, you start from a set of n points that live in some high-dimensional space. You can then take this and, for example, calculate pairwise distances; you could use Euclidean distance for that and that way get an n-by-n distance matrix. Alternatively, you could calculate all pairwise similarities, using for example cosine similarity, and get an n-by-n similarity matrix instead. And if you apply a cutoff to this one, you would obtain a network with n nodes. These are all different views of the same input data, and different algorithms start from different views.

The first type of unsupervised learning is clustering. The goal of clustering is to discover groups in the data, and there are two main approaches to that. One is hierarchical clustering, in which we're trying to build a tree, or dendrogram, like this, that shows which input points are most closely related, and then gradually build up bigger and bigger groups. The other is partitional clustering, in which we're trying to explicitly take the points and divide them into a number of clusters. There are several algorithms for this; the best known is probably the k-means algorithm, in which we work in the original input coordinates and try to define centroids that correspond to the clusters. Another popular clustering algorithm is Markov clustering, also called MCL, which takes the network view of the data and tries to do community detection. The small code sketches below illustrate the different data views and these clustering approaches.
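To make the three views of the data concrete, here is a minimal Python sketch. The data matrix X is made-up placeholder data, and the similarity cutoff of 0.8 is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))      # n = 100 points in a 50-dimensional space

# View 1: n-by-n Euclidean distance matrix
D = squareform(pdist(X, metric="euclidean"))

# View 2: n-by-n cosine similarity matrix
S = cosine_similarity(X)

# View 3: apply a cutoff to the similarities to obtain a network with n nodes
A = (S > 0.8) & ~np.eye(len(S), dtype=bool)   # arbitrary cutoff; drop self-similarities
G = nx.from_numpy_array(A.astype(int))
```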
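And here is what the two clustering approaches look like with scikit-learn and SciPy, again on a hypothetical data matrix X; the choice of three clusters is arbitrary. The silhouette score at the end previews the cohesion-versus-separation idea that I'll come back to in the evaluation section.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))      # placeholder data

# Partitional clustering: k-means with an (arbitrary) choice of 3 clusters
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical clustering: build the dendrogram, then cut it into 3 clusters
Z = linkage(X, method="average", metric="euclidean")
hier_labels = fcluster(Z, t=3, criterion="maxclust")

# Intra-cluster cohesion vs. inter-cluster separation in one number (higher is better)
print(silhouette_score(X, kmeans_labels))
```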
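Markov clustering itself is simple enough to sketch directly: it alternates expansion (taking random-walk steps on the network) with inflation (boosting the strong transitions) on the column-normalized transition matrix, until the walk breaks apart into disconnected attractors. This is a bare-bones version of the idea, without the pruning and convergence checks a real MCL implementation would have.

```python
import numpy as np

def mcl(adjacency, inflation=2.0, n_iter=30):
    """Bare-bones Markov clustering (MCL) on an adjacency matrix."""
    M = adjacency.astype(float) + np.eye(len(adjacency))  # add self-loops
    M /= M.sum(axis=0)             # column-normalize: random-walk transition matrix
    for _ in range(n_iter):
        M = M @ M                  # expansion: take random-walk steps
        M = M ** inflation         # inflation: strengthen strong transitions
        M /= M.sum(axis=0)         # re-normalize the columns
    # Each surviving row is an attractor; the nodes it reaches form one cluster
    clusters = {frozenset(np.flatnonzero(row > 1e-6)) for row in M if row.max() > 1e-6}
    return [sorted(c) for c in clusters]
```

On the toy network A from the first sketch you would call mcl(A.astype(float)); with random data the clusters will be trivial, but on a real network the inflation parameter controls how fine-grained the communities are.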
Another class of unsupervised learning is dimensionality reduction. The goal here is very different, namely to take the high-dimensional input data and compress it to produce a lower-dimensional space that captures the information. This lower-dimensional space is also called a latent representation of the input data. There are several methods. Some are linear, the best known being, of course, principal component analysis, and most of them are nonlinear, including t-SNE and UMAP, both of which are commonly used for visualizing single-cell data, and also multidimensional scaling and force-directed layouts. All of these algorithms do similar things, but they have different objective functions that are being optimized and different transformations that are allowed. So, for example, in PCA we try to maximize the variance captured, in t-SNE we're focusing mainly on preserving the local structure, and UMAP tries to preserve both the local and the global structure.

Another approach to dimensionality reduction is autoencoders. Autoencoders are in a way supervised learning, but instead of trying to learn to predict the output from the input, they try to predict the input from the input. You might, of course, object that that's trivial, and that's true. But what we do is we introduce a bottleneck in the neural network architecture. So we work with an architecture that looks like this, where you have an input layer, then possibly a hidden layer with lower dimensionality, the small code layer in the middle, and then, mirroring the architecture in the other half, to again produce an output layer with the same size as the input layer. The trick here is that when we train the model, it has to learn in the code layer a low-dimensional representation of the data that allows the high-dimensional version to be reproduced faithfully. After having trained it, we can throw away the decoder half of the network and are left with an encoder that takes the data in the input layer and converts it into a low-dimensional representation in the code layer. In other words, the latent representation.

There are many variants of autoencoders. Autoencoders can be linear, but most autoencoders are nonlinear. In addition to normal autoencoders, you have denoising autoencoders, which take a noisy version of the input data and try to reproduce the clean version of the input data. And you have variational autoencoders, which have become very popular recently, and which, instead of learning a single vector representation in the middle, rather learn a probability distribution over the latent space. The sketches below show dimensionality reduction and a small autoencoder in code.
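As a concrete illustration, here is what PCA and t-SNE look like with scikit-learn, once more on a hypothetical data matrix X; the two-dimensional target and the perplexity value are common choices for visualization, not recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # placeholder high-dimensional data

# Linear: PCA, which maximizes the variance captured per component
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)
print(pca.explained_variance_ratio_)    # variance captured by the first two components

# Nonlinear: t-SNE, which focuses on preserving local structure
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```

The explained-variance ratio printed here is the same quantity I'll mention again when we get to evaluation.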
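And here is a minimal autoencoder sketch using Keras. The layer sizes are arbitrary and X is placeholder data, so a real model would need tuning; the point is just the mirrored architecture with the bottleneck code layer, and the fact that after training we keep only the encoder half.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(50,))                      # input layer
hidden = layers.Dense(16, activation="relu")(inputs)   # hidden layer, lower dimensionality
code = layers.Dense(2, name="code")(hidden)            # small code layer in the middle
hidden2 = layers.Dense(16, activation="relu")(code)    # mirrored hidden layer
outputs = layers.Dense(50)(hidden2)                    # output layer, same size as input

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")      # predict the input from the input

X = np.random.default_rng(0).normal(size=(500, 50))    # placeholder data
autoencoder.fit(X, X, epochs=20, batch_size=32, verbose=0)

# Throw away the decoder half; the encoder maps data to the latent representation
encoder = keras.Model(inputs, code)
latent = encoder.predict(X)                            # shape (500, 2)
```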
So let's say we've done some unsupervised learning. How can we evaluate the quality of the results? The short answer is: it's difficult. And the reason is that we don't know the ground truth; that's why we're doing unsupervised learning in the first place. If you have labels for some of your data points, it's very wise to check for consistency, in other words, whether the structure you've found in the data is consistent with what you know about those data points. Alternatively, if you're doing clustering, you can use metrics like intra-cluster cohesion and inter-cluster separation, in other words, looking at whether the points within clusters are indeed much closer to each other than points in different clusters. If you're doing dimensionality reduction, for example PCA, you will want to look at how much of the variance you manage to capture in the first few dimensions. And if you're doing autoencoders, which are a form of self-supervised learning, you can, of course, steal the tricks from supervised learning: leave out some of the data and thereby have an independent test set to see if your autoencoder indeed works also on new data. But in the end, it typically comes down to expert judgment, that is, having somebody who knows about the data look at your unsupervised learning results and see if they make sense. Alternatively, something that I like to do is what I call downstream benchmarking. Typically, when you're doing unsupervised learning, that is the starting point for doing something else. So simply move on, do the task you're trying to accomplish, and benchmark the end results instead of trying to benchmark the unsupervised learning itself. It is only a means to an end.

So let's end on the big picture. How do these many different methods relate to each other? If you take a hierarchical clustering and cut the tree, it becomes a set of distinct clusters; in other words, hierarchical clustering plus a cut is the same as doing partitional clustering. If you train a linear autoencoder, you're in fact doing something very similar to principal component analysis, because it's linear. If you make t-SNE or UMAP plots, as is often done for single-cell data, you are effectively visualizing the clustering of the data. And finally, when you run layout algorithms on networks, laying out the nodes in 2D space, you are in fact doing dimensionality reduction, taking the high-dimensional network and reducing it to two dimensions.

That's all I have to say about unsupervised learning. If you want to learn also about supervised learning, I suggest you go look at this presentation next. Thanks for your attention.