Hello, everyone. Hope you're having a wonderful day today. So today we are going to look at topic modeling using BERT. Basically, with a topic model, we have a document as the input to our model, and the output is the cluster, or topic, to which it belongs. We achieve this via clustering, and we can do it with BERT: there's a library built on BERT called BERTopic that lets us perform clustering, where each cluster represents a topic. So we can put in a document, see which cluster it belongs to, and from that cluster also extract what it's talking about.

We're going to break this entire flow into three passes. Pass 1 will just be using BERTopic, which is very straightforward, very simple; this is actually how you would code it in practice. In Pass 2 we'll break BERTopic down and code out the intermediate steps ourselves, since otherwise it's one huge black box. And in Pass 3 we break those concepts down even further; I've listed a couple of them here. This way you get a holistic picture of what BERT is actually doing when it does topic clustering.

So let's get started. Let me expand the cell; it's all in a Colab notebook that I will share at the end of this video. First of all, a lot of the code I'm going to show you, at least in Pass 1, is borrowed from the BERTopic repository right over here, which you can clone or download, and which has very clear-cut installation instructions. But let's run through all of it inside our Colab notebook; I put it in here because it's just easier to condense all the information in one place so I don't have to keep hopping around links during this tutorial.

First, we import BERTopic, which is the essence of the video. Then we import one of our datasets from scikit-learn: the 20 Newsgroups dataset, a collection of documents spanning 20 different newsgroup topics. These are arranged in folders; let me just show you. Within the folders we have different topics, and each topic contains a bunch of documents; each document talks about something like hardware, for example. So over here, all we're doing is reading all of these documents. The documents have headers, footers and so on that I want to remove, so we just extract the core text. If you look down here, this is an example of what a document looks like: "My brother is in the market for a high-performance video card that supports..." and so on. And like this, there are 18,846 documents distributed across the 20 topics. That's the nature of the dataset we're going to do topic clustering on with BERTopic.

Now, honestly, the core of everything you need to know when doing topic modeling here is just calling BERTopic's constructor and then fitting it on the documents to cluster them. What actually happens is that the first line creates something called a sentence transformer, which I will get to in a bit.
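For reference, Pass 1 really does boil down to a few lines like the following, a sketch based on the BERTopic README; the exact arguments in the notebook may differ slightly:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Load the 20 Newsgroups corpus, stripping headers, footers, and quotes
# so we keep only the core text of each of the 18,846 posts.
docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes"))["data"]

# language="english" selects an English sentence transformer under the hood;
# calculate_probabilities=True is slower but enables the visualizations later.
topic_model = BERTopic(language="english", calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)
```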
So essentially, you have a document, and that document is made up of words; each of those words is converted into a word vector, and those word vectors are condensed into a single sentence vector, one vector that represents all of that information. Once you have that vector, BERTopic performs a dimensionality reduction followed by the actual topic clustering. That's the entire overview of what happens when you call these two lines of code. Now, we're dealing with the English language, so we pass that in, and it instantiates a specific type of sentence transformer, which I'll get to later. I'm also passing in the flag that asks for topic probabilities to be calculated; it takes a little more time with this argument, but I'm passing it in because we have some cool graphics down below that I wanted to render, so you can actually see the embedding space for these topics. Then we just pass in our list of documents and get back the corresponding topic assignments.

From here, we get the most frequent topics. This is the kind of DataFrame that's returned: Topic is the topic number, and Count is the number of documents within that topic. Topic -1 represents the outliers, the set of documents that haven't been clustered anywhere because they don't belong anywhere specifically. Then we have, say, topic number 50, whatever that represents, with a count of 862, meaning 862 documents fall within topic 50. So you can see that about 8,626 documents have not been assigned a topic, and the other roughly 10,200 have been distributed across 134 topics. It did a pretty good job.

Now, since the model just predicts a topic number, what exactly is topic 50? What are those 862 documents actually talking about? Internally, BERTopic performs something called a class-based TF-IDF (c-TF-IDF), and using that, we can characterize the topics themselves: from the vectors, we try to determine the inherent meaning of what these documents are collectively talking about. And it looks like they're talking about hockey and the NHL: all of the top words that come out of this category-based, or cluster-based, TF-IDF are about hockey. So we have some interpretability in these clusters too.

To add further interpretability, we can plot the visuals I was talking about. We have a bunch of topics here, around 134 like I mentioned, numbered from topic 0 upward, and if you hover over each of these you can see what that topic is actually talking about. I think 50 is over here... there, hockey. And of course, you see a lot of these overlapping each other because they talk about very similar things.
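The inspection calls look roughly like this (the method names are BERTopic's actual API; topic 50 coming out as hockey is just what this particular run produced):

```python
# Most frequent topics; Topic -1 collects the outlier documents.
print(topic_model.get_topic_freq().head())

# Top words for one topic, scored by the class-based TF-IDF.
print(topic_model.get_topic(50))
# e.g. [("hockey", ...), ("nhl", ...), ...] in this run

# Interactive inter-topic distance map shown in the video.
topic_model.visualize_topics()
```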
And because of this, you can see there's definitely room for joining some of these clusters to form even larger clusters. I'll leave that as an exercise: play around with some of the hyperparameters you can pass into BERTopic. There's also a primer right here for BERTopic with more details if you want more clarity on fine-tuning these topic clusters. Overall, this covers exactly what we would do and how we would code out topic modeling with BERT. But obviously there's a huge black-box element in all of this, which is BERTopic itself. So let's actually dive into that code in Pass 2.

Pass 2 is breaking down BERTopic. Let me show all the cells over here. Okay, so like I mentioned before, BERTopic uses something called sentence-transformers, which I'll get into. It also uses a library called umap-learn, whose UMAP algorithm performs dimensionality reduction, and HDBSCAN, which performs the actual clustering, so we want to install all of these. How do I know it breaks down this way? Well, you just go to the main repository, go to the BERTopic code, and look at what we're calling: the fit_transform function. If you look at its definition and scroll down into the function, you see three major steps happening: the documents are embedded, dimensionality reduction is applied with UMAP, the reduced embeddings are clustered with HDBSCAN, and then a class-based TF-IDF extracts the topic representations. All of that is what we're breaking into individual parts in Pass 2. Aside from those, the CountVectorizer here is just used to build the word counts we'll need for the TF-IDF step, NumPy is the math library, and pandas is the DataFrame manipulation library.

So first of all, we create a sentence transformer, passing in a specific model name. Essentially, a sentence transformer is a model whose input is an English sentence, like, I don't know, "the dog ran in the field" or something, and whose output is a sentence embedding: a single vector that encodes the representation of the whole sentence. Since we have 18,846 documents, it will create 18,846 vectors, one vector representing each document. As for the shape: this specific BERT architecture outputs a 768-dimensional vector, so each document becomes 768 dimensions, and that's why the final embedding matrix is the number of documents by the embedding size per document, 18,846 × 768.
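In code, this embedding step is just a couple of lines (the model name is the one BERTopic selects for English input, which we'll decode in Pass 3):

```python
from sentence_transformers import SentenceTransformer

# The sentence transformer BERTopic uses for English input.
model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

# One 768-dimensional vector per document.
embeddings = model.encode(docs, show_progress_bar=True)
print(embeddings.shape)  # (18846, 768)
```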
Now, the thing is, 768 dimensions per document is still very high dimensional, and we need fewer dimensions to actually perform some sort of clustering here. That's why we use UMAP: essentially a nonlinear form of dimensionality reduction that squeezes that 768-dimensional vector down to a very small vector; in this case the number of components is five. It's analogous to principal component analysis, but PCA is a linear dimensionality reduction algorithm, and a linear projection down to just five dimensions isn't going to retain enough of the information. UMAP is a more complex, nonlinear dimensionality reduction technique that is just better at preserving the structure of these embeddings. We'll get into slightly more detail in the next pass, but it's a very math-heavy concept, so I'll link a video on it down below. So we pass in all the embeddings and get the UMAP embeddings out: what was 768 dimensions for every single document now becomes only five, small enough for our clustering algorithm to process.

The third part is the clustering algorithm itself, HDBSCAN. Essentially, the idea is to find groups of points that are densely packed together. I'll get into more detail in the next pass; it's analogous to k-means clustering, but better suited to large, messy, irregularly shaped data, which is exactly what you tend to get in NLP, as I'll show. The input is the reduced embeddings we just created, and the output is the cluster assignment for every single one of those points. In this run, HDBSCAN produced 105 clusters. I take the cluster labels for all the samples, get the unique labels, and subtract one, because one of the "clusters" is the -1 label that holds the outliers, and I don't want to count that. So the overall number of clusters is 105, fairly similar to the 134 we saw before; the difference comes down to some small hyperparameter choices.
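Both steps are short. Here is a sketch using the hyperparameters from the blog post this section draws on; n_neighbors, min_cluster_size, and the metrics are that post's choices, not anything sacred:

```python
import umap
import hdbscan

# Squeeze each 768-dim document vector down to 5 dimensions.
umap_embeddings = umap.UMAP(n_neighbors=15, n_components=5,
                            metric="cosine").fit_transform(embeddings)

# Density-based clustering; points that fit nowhere get the label -1.
cluster = hdbscan.HDBSCAN(min_cluster_size=15, metric="euclidean",
                          cluster_selection_method="eom").fit(umap_embeddings)

labels = cluster.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # ignore outliers
print(n_clusters)  # 105 in this run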
All right, cool. Now that we have the clusters themselves, we need to actually interpret them, and for that we use the category-based TF-IDF, what BERTopic calls class-based TF-IDF, or c-TF-IDF. For context, regular TF-IDF produces a matrix of every word against every document, where every cell is the term frequency times the inverse document frequency for that word in that document. The higher that number, the more important that word is to that document: the document is largely about that word. But here, instead of individual documents, we have a bunch of clusters, and the goal is to interpret what each cluster is saying. So what we can do is, for every single cluster, combine all the documents within it to form one mega document. With 105 clusters we get 105 documents, each just a concatenation, and then for each of those documents we can run TF-IDF.

And that's exactly it: this becomes the modification of TF-IDF called category-based TF-IDF. The rows are still the words, but the columns are now the clusters instead of the documents, so every cell represents how important a word is to a cluster. That's how we get interpretability for our clusters. And that's exactly what we're doing here: this line takes all of our documents, groups them by topic, meaning all the documents in the same cluster, and joins them with a space. That's it. Then we perform a simple TF-IDF, where I've commented in the shape of the output for each of the variables I've described, so that when you're going through the code manually you can relate it back to the same formulation; it's literally the same formulation, coded out. Hopefully there's no confusion there.

Next: we get the TF-IDF matrix over here, and then we call two functions. The first extracts the top n words per topic; I think that's pretty self-explanatory. Here I'm passing in 10, so it extracts the top 10 words. The second just gets the sizes of every topic. Again, this is code I didn't write myself; it's actually from a blog post that I grabbed some code from, which I'll link below, so check that post out for even more details. But essentially it's pretty self-explanatory: I'm just recreating the same interpretability matrix we had right over here. That's the essence of the category-based TF-IDF.

So if you scroll down here, we get the same kind of chart: topic number 11 has 683 documents, and what is topic 11 about? Its top words are things like "astronauts" and "interstellar", so you can guess it's about astronomy in general. Then there are 593 documents talking about topic number 78, which is mostly about, probably, Christianity. So cool: we're able to interpret every single cluster even though we picked everything apart from the code. Hopefully this gives even more detail into what we're really doing with BERT.
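For reference, the c-TF-IDF step condenses to something like this, following the blog post's formulation; m is the total number of documents, and the variable names are mine:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# One "mega document" per cluster: concatenate all docs sharing a label.
docs_df = pd.DataFrame({"Doc": docs, "Topic": labels})
docs_per_topic = docs_df.groupby("Topic", as_index=False).agg({"Doc": " ".join})

def c_tf_idf(documents, m):
    # Rows end up as words, columns as clusters.
    count = CountVectorizer(stop_words="english").fit(documents)
    t = count.transform(documents).toarray()          # counts: clusters x words
    tf = np.divide(t.T, t.sum(axis=1))                # term freq within each cluster
    idf = np.log(np.divide(m, t.sum(axis=0))).reshape(-1, 1)
    return np.multiply(tf, idf), count                # words x clusters

tf_idf, count = c_tf_idf(docs_per_topic["Doc"].values, m=len(docs))
```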
But there is still some fuzziness here, the biggest being: what is a sentence transformer in the first place? And what is this crazy model name that keeps showing up? For that, let's go to Pass 3 and break down everything.

First of all, the sentence transformer. This is the simplest kind of sentence transformer you can create. You pass in a sentence, say "This is a good day", which is five words. Remember, BERT is a stack of transformer encoders: the input is a sentence, and the output is a set of word embeddings. If it's a five-word sentence, BERT will output five vectors of the same size, one word embedding for each word in the sentence. But what we want from a sentence transformer is not word embeddings; we want a single sentence embedding. So a simple thing you can do is pooling, here mean pooling: you just take the mean of all the vectors that come out. If there are five vectors, you take the average of those five vectors, and that final vector, call it u, is the embedding of the sentence. That's the simplest sentence transformer you can make (I'll sketch it in code at the end of this section).

The biggest problem is that this kind of sentence transformer, especially with plain mean pooling, doesn't really yield good sentence embeddings, even though BERT makes great word embeddings. A way around this is to train a BERT Siamese network. I've talked a little about this in one of my videos, I think the one on few-shot learning, where we trained a Siamese network for facial recognition, but let's go over what that is here. Siamese basically means twins: there are two networks with the same architecture that you train together, coupled with some output layers that inject additional prior knowledge into the network. What all that means is: the sentence transformer we just saw is right here on one side, and we have the same architecture on the other side. We take the sentence vector u from this side and the sentence vector v from that side, concatenate those vectors together, and feed them into a softmax.

Let me give some context on what these two setups are really doing. The whole idea is to get good sentence embeddings in general: we want u and v to be really representative of what sentence A and sentence B actually mean. There are two problems we can train on. The problem on the right is called natural language inference (NLI): we pass two sentences into the network, and the output is one of three categories. Does sentence A entail sentence B? That's entailment. Does sentence A contradict sentence B? That's contradiction. Or, third, there's no relation between them. The way we would generate our dataset is, well, we could randomly pair up all our sentences, and in fact there is a dataset, the NLI (natural language inference) dataset, that tells you whether one sentence entails, contradicts, or has no relationship with the other. And the same setup can be used to train on another type of problem too; again, we just want to end up with good sentence embeddings.
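Here's the mean-pooling sketch I promised, using plain BERT from Hugging Face; this is my own minimal illustration of the idea, not the sentence-transformers library's internals:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def sentence_embedding(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        token_vecs = bert(**inputs).last_hidden_state   # (1, seq_len, 768)
    # Mean-pool the token vectors (the mask excludes padding when batched).
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (token_vecs * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 768)

u = sentence_embedding("This is a good day.")
print(u.shape)  # torch.Size([1, 768])
```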
So how do we get them? We can train on different problems, and these two problems just happen to be pretty standard. One is natural language inference, and the other is Semantic Textual Similarity, or STS, where we pass in two sentences and the output is essentially a cosine similarity between them. Ideally, if it's scored between -1 and +1, an output toward +1 means the sentences are very similar in meaning, zero means they don't have much relationship, and -1 means the sentences are basically opposites. You can think of it as a continuous version of what we're doing for natural language inference, and it's more of a gold-standard human rating of similarity.

Okay, so this is great and all: this setup represents NLI, and this one represents STS. But here's the problem. Even if you trained the network on randomly generated pairs of sentences, and you'd have a huge number of training examples because there are so many pairs you can make, most of those pairs are not going to be very challenging for the model. For probably 99% of the pairs you generate, it's going to be easy for the model to distinguish them: sentence A could be "the dog is running in the field" and sentence B could be "it's a wonderful day to have ice cream", some totally unrelated things, and the classifier isn't going to have too tough a time with that. To combat this, we want to make things harder for the model, so we use something called a triplet loss. In this case, instead of making pairs of sentences, we make triplets: an anchor, something that contradicts the anchor, and something that entails the anchor.

Here's an example I just came up with. The anchor, sentence A, could be "say hello to me", and sentence B could be "don't say anything to me"; we pass them into the network, and the softmax output should be "contradiction". Then the very next example we pass has the same anchor, "say hello to me", paired with the entailment, "say something to me", and the output should be "entailment". What makes this more challenging is that in the contradiction case, the only difference is literally the word "don't". Because the wording of everything else in the two sentences is the same, BERT really needs to understand the actual semantics of the sentence; it can't tease the difference out just by looking at tokens, because the tokens are nearly identical. Chances are a dumb model would just say, hey, they're the same, when in actuality it needs to understand negation in this case, and so many other parts of speech and connotations it would need to understand fundamentally. This is just one example of a triplet; we can construct triplets like this from our entire dataset, and when we train the model this way, with triplets and the triplet loss, the sentence embeddings definitely come out much better.
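As a reference point, here is a minimal sketch of the distance-based form of the triplet loss, the standard formulation; the Euclidean distance and the margin of 1.0 are my illustrative choices:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor,
                 positive: torch.Tensor,
                 negative: torch.Tensor,
                 margin: float = 1.0) -> torch.Tensor:
    # anchor/positive/negative: batches of sentence embeddings, shape (B, 768).
    # Pull the anchor toward the entailing sentence and push it away from
    # the contradicting one until the gap is at least `margin`.
    pos_dist = F.pairwise_distance(anchor, positive)
    neg_dist = F.pairwise_distance(anchor, negative)
    return F.relu(pos_dist - neg_dist + margin).mean()
```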
We can use the triplet loss the same way for Semantic Textual Similarity too. And the cool thing is that we can train on natural language inference first and then fine-tune the model on Semantic Textual Similarity. In fact, remember the name of the model we passed into our sentence transformer: distilbert-base-nli-stsb-mean-tokens, this weird string. Let me show you where it comes from; it was way at the top. When you look at the code for BERTopic and you pass in "english", it's going to use this particular sentence embedding model, and that's why I introduced exactly that sentence transformer in Pass 2. There are so many other sentence transformers like it; the name is just describing the architecture and training recipe of this particular one.

So, picking it apart: distilbert-base-nli-stsb-mean-tokens. It uses a DistilBERT base architecture, it's trained on natural language inference (the "nli" part), it's further fine-tuned on the Semantic Textual Similarity benchmark (the "stsb" part), and it uses mean pooling over the tokens to create the sentence embeddings (the "mean-tokens" part). Hence the weird name, and that's exactly what's happening. This is just one type of sentence transformer; we could have trained on any other task, in any other way, for both pre-training and fine-tuning, and you'll find a lot of these sentence transformers on Hugging Face's site, a whole bunch depending on the context, the data, and your needs for processing the documents in your dataset. I hope that explanation wasn't too hand-wavy; I've included a few reference links as proof of why I think this is the case. All right, so that's sentence transformers; I hope all of that is really clear.
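You can actually see that decomposition by printing the model, since sentence-transformers exposes its pipeline as modules; the exact repr may differ by version, so treat this as approximate:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
print(model)
# Expected to show something along the lines of:
# SentenceTransformer(
#   (0): Transformer(... DistilBertModel ...)
#   (1): Pooling({... 'pooling_mode_mean_tokens': True ...})
# )
```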
Next is UMAP. Like I mentioned before, UMAP is used for the dimensionality reduction of your document embeddings into a much smaller space, and it's a very mathematically involved technique. I'm not sure if you've heard of another dimensionality reduction technique called t-SNE; let me pull up Jukebox, because it's a really cool example of where dimensionality reduction gets used. One use is visualization. This is Jukebox, which I explained in another video; it's a music generator, and internally it also builds an embedding space of artists, one vector representing each artist. If we smash that space down into two dimensions using t-SNE, which is another nonlinear dimensionality reduction technique just like UMAP, you can see that all the R&B stars sit together, the classical artists sit together, and Hans Zimmer and the other film composers, I believe, sit together. From that you can tell it's a good embedding representation: the dimensionality reduction is really good, and it looks super useful. UMAP is a similar type of dimensionality reduction that also learns a nonlinear mapping for reducing dimensions, and it can also be used as a processing step in a machine learning pipeline like the one we're using right now, for example reducing to five dimensions, because the clustering that happens next with HDBSCAN needs that. I hope that explanation is good enough to whet your appetite; if you want the mathematical details, I do recommend the video by UMAP's creator that I've linked, so do check it out.

Well, other than that, let's say all your documents are now represented as those five-dimensional feature vectors. What do we do next? We want to do some document clustering to find the distinct topics your entire dataset is talking about. There are several ways to perform clustering, and two ways of categorizing them are given on the rows and columns here. One axis is centroid-based versus density-based. Centroid-based methods basically assume the shape of your clusters is Gaussian, that typical circular or elliptical blob. Density-based methods instead say: the shape of a cluster depends only on where points are concentrated, so we won't arbitrarily draw a circular region around them. Good examples of both are here. The other axis is flat versus hierarchical clustering. Flat means a predefined number of clusters: you want five clusters, you run your algorithm, you get five clusters. Hierarchical clustering is more flexible and can be much better: depending on how many clusters you want, or how close together you want them, you build a hierarchy and choose the appropriate number of clusters based on your data. Clearly, the best of both worlds is HDBSCAN, which combines the density-based and hierarchical ideas, and which I'm going to talk about next, at least in terms of how the data looks (there's also a small toy comparison sketched right below).
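Here's that toy comparison; the crescent-shaped data and all the parameters are hypothetical choices of mine, just to make the contrast runnable:

```python
import numpy as np
import hdbscan
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two crescent-shaped clusters plus some uniform background noise.
X, _ = make_moons(n_samples=500, noise=0.08, random_state=42)
noise = np.random.RandomState(0).uniform(-1.5, 2.5, size=(50, 2))
X = np.vstack([X, noise])

# k-means partitions everything: every noise point is forced into a cluster.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# HDBSCAN follows density: crescents become clusters, strays get label -1.
hdb_labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X)

print(np.unique(km_labels))   # [0 1]            -> no outlier concept
print(np.unique(hdb_labels))  # e.g. [-1  0  1]  -> -1 marks the noise
```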
So first, if you look at k-means on data that's very noisy, like this, k-means identifies regions that are circular and quite rigid. Even points that are way out there, which, if each point represented a document, would probably be off-topic, still get partitioned into a cluster. This green point over here, for example, is way off compared to all the other green points, but it's still in the cluster, when honestly it could just be an outlier. With HDBSCAN, on the other hand, the clusters aren't circular and the data isn't fully partitioned: clusters take the shape of wherever the points are actually dense. The densest points form the cores of the clusters, and all the other gray points can represent documents that are outliers. And on top of that, since it's a hierarchical clustering too, it identifies the appropriate number of clusters on its own. So that's the intuition for what HDBSCAN does, and honestly that completes the entire pipeline: you run HDBSCAN, get your appropriate number of clusters, and then perform the category-based TF-IDF to interpret what each of those clusters is actually talking about.

I hope that wasn't too hand-wavy. I've linked the research papers that the intuition, notes, and especially the figures and explanations come from; they're everywhere in this notebook, so please do check them out. I'm also going to credit the authors of the original code, mainly the repositories and the blog post I picked this apart from; all of that will be in the description down below and also in the README. Hope you all enjoyed what you saw today and learned something new about topic modeling with BERT. It's super exciting stuff and I'm so glad to do it. The sun is going down right now, but that's great, because this video is over and I'll see you in the next one. Take it easy. Bye.