And this is my thanks to the team at ODSC; this is what every speaker wants. Since this is right after lunch, let's keep it as interactive as we can. Feel free to stop me anywhere, ask any questions, and I'll be happy to take them up. And this is also the topic of the talk today.

Here are a couple more examples of what sarcasm is. I've shown an example of a tweet about a livable wage, a person trying to express his feelings in a very subtle way. Now, what is sarcasm? Let's understand that. Sarcasm has roots in French and Greek words that essentially mean speaking bitterly about something or someone without being explicit about it. You are being nuanced in your language; you are not explicit about the fact that you are being sarcastic. And, as some of you might feel, "this talk looks like great fun" could be a real feeling, or it could be a sarcastic one.

Now, why is sarcasm detection even important? Why is somebody even talking about it? From a business perspective, organizations look at social media in great detail, trying to mine various kinds of information. One thing they are always after is how their users and customers feel about their products and services. For this very reason, take any CRM tool around the world and you will see an offering called sentiment analysis. For every conversation, at a weekly or monthly level, it tries to give you a better understanding of whether your customer base is coming closer to you or moving away from you.

Since the rise of social media, it has become a primary channel for CRM and customer grievances, and when people get fed up they tend to be very sarcastic about how they feel. Here I've shown an example where somebody is complaining about American Airlines messing up the entire schedule in December, when everybody knows that because of snow and other issues, travel is typically hard at that time.

Now, what implication does this have? Most sentiment analysis systems that exist today do very badly on sarcastic tweets, or on any kind of sarcasm. Here I've shown two of the best-known current sentiment analysis systems: one comes from Stanford, which works at the word and phrase level, while the other comes from the Allen Institute. If you give them a sentence like "I love the pain of a breakup", notice that both of them get it wrong. Both are confused by the word "love" and call it a positive emotion, when it is not.

This places an additional burden on the customer-representative teams working behind these systems, and many times they miss sarcasm altogether. In this particular case, the team thought the person was congratulating American Airlines and responded in a positive fashion, and the person had to come back and say, hey, you actually got it completely wrong. There have been instances where this kind of fiasco has happened even with social influencers, leading to PR disasters for the brand.

What I've talked about so far is the business perspective on this problem. There's an alternative side to it, which is the research perspective.
If you look at NLP as an area and rank its problems by difficulty, certain problems come right up front: question answering, summarization, sarcasm, machine translation. These are believed to be the harder nuts to crack within NLP, because they require models that are much more nuanced and can understand the subtleties of language, which many other problem statements within NLP do not require. Any progress you make on any of these problems is a positive step for NLP as a whole.

Sentiment analysis has been around for many years, and people have been able to build commercial solutions that are really good. That is when people started looking into the deeper aspects of sentiment, like aspect-based sentiment. For example, at a particular restaurant you might feel great about the ambience and the food while not being very happy with the service. That review is neither simply negative nor simply positive: there are aspects of the product and service you are happy about, and aspects you are not. The story with sarcasm is similar. So whether you take the business side or the research side, it is worth investing in how one can build systems that handle sarcasm.

Now, as I said: what is sarcasm? It's a sharp, bitter, cutting expression. It's a taunt. It always has a negative connotation; you are never sarcastically positive. You can be positive, but never sarcastically. So by definition, the sentiment of sarcasm is always negative. What makes it difficult? It is deliberate, it's a play on language, and it is often very subtle. A small change in punctuation, or a word here and there, can completely flip the meaning of the statement. Sarcasm on Twitter is a different level of difficulty altogether, because Twitter comes with its own challenges: very short texts, all kinds of acronyms, and an ever-evolving grammar and vocabulary, with new words appearing every day. And this is what we will try to address here: sarcasm on Twitter.

So the business problem is that we want to build a sentiment system capable of handling sarcasm. We have a text, and we would like to know whether it is sarcastic or not. If it is not, we continue to use a basic sentiment analysis system. If it is, then we know by definition that the sentiment involved is negative. The abstract problem statement is: given a tweet T from a user U, our solution should detect whether T is sarcastic or not. I am not going to focus on the sentiment analysis part, which is a standard system, but this routing has to happen over and over; a minimal sketch of it follows below. Any doubts? Everybody good? This detection step is the focus of the entire talk, so let us quickly define its scope.
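Here is that routing in a minimal sketch. The two helper functions are hypothetical placeholders for the models discussed in this talk, not a real library API:

```python
def is_sarcastic(tweet: str) -> bool:
    """Stub for the sarcasm detector this talk is about."""
    raise NotImplementedError

def basic_sentiment(tweet: str) -> str:
    """Stub for an off-the-shelf sentiment system."""
    raise NotImplementedError

def analyze(tweet: str) -> str:
    # Sarcasm gate first: a sarcastic tweet is negative by definition,
    # so we can skip the ordinary sentiment model entirely.
    if is_sarcastic(tweet):
        return "negative"
    return basic_sentiment(tweet)  # "positive" | "neutral" | "negative"
```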
Look at this sarcastic example: "If Hillary wins, she surely will be pleased to recall Monica each time she enters the Oval Office." This is pretty nasty. Even as a human, to figure this out you have to do a couple of things. You have to understand that "she" refers to Hillary. There are facts embedded in it: Hillary is running for the presidential election, and if she wins, it gives her the right to enter the Oval Office. There is a history associated with the Oval Office and her husband, and it does not bring good memories to her. There's a lot going on here, and the current set of systems cannot handle it; we have not reached a level where NLP systems or architectures can handle reasoning of this complexity. So for the purposes of this talk, we will not involve text of this nature. We will focus only on sentences where all the information needed to say whether they are sarcastic is in the text itself, with no outside references. And on the audience question about double positives: a double positive does not always mean sarcasm, but it can be, so yes, we would ideally like to address that as well.

Now, how did we go about building a dataset for this? Twitter has some very interesting sources: there are a couple of hashtags and a couple of handles whose only job is to serve up exactly this kind of data. So that is what we did: we crawled a lot of data from those sources, and it gave us one side of the picture, what sarcasm is. How do you collect data for what is not sarcasm? Everything else. And that "everything else" is a huge domain, which is one of the challenges we ran into: how do you collect data that gives good coverage of the other side of the picture? To build a more nuanced system, we also added the standard sentiment analysis datasets to the data we had collected. After cleaning, we were left with roughly 50k data points per class: 50k instances of sarcasm and 50k instances of non-sarcasm. Note that any effort to build a comprehensive dataset of what is not sarcasm will always fall short of covering it entirely; there will always be parts missing, and that was true here as well. We also built another dataset of 20k points, but from a different timeline.

Now, this is not the first time people have attempted problems like this. For sarcasm detection, people have always known it is a nuanced phenomenon, and they have worked on it over many years, but mostly with hand-crafted features. You look at unigrams and bigrams. You look at features like: are there emoticons in it? What is the capitalization? Are there emojis? You look at things like how many words are positive in nature, how many are negative, how many times a positive word appears right next to a negative word, and so on. These are all hard-coded features. I assume most of us understand that the defining difference between ML and DL is that hand-coded features are no longer needed: you run a DL network and get features which are not human-interpretable, or at least not hard-coded, and they tend to be more robust. Essentially, what every DL model deep down does is map your data to a latent space, and the points in that space become your features. What is the interpretation of these latent features? We are still a couple of years away, at least in text. In computer vision we have a fairly good understanding of what is happening inside, but in text there is no clean interpretation of whether the model is looking at a verb or a noun, or what exactly is happening.

So the first thing we did was establish a baseline.
The way we did it was to treat this as a simple text classification problem. You have 50,000 examples of one class, which is sarcasm, and 50,000 examples of the other class, and you treat it as binary classification. How do you typically start on binary text classification? You use RNNs. What is an RNN? RNNs are models with the ability to process a sequence, and text can be treated as a sequence of characters, words, or phrases; ultimately, it's a sequence. An RNN processes the whole text and produces what is called a representation, or memory vector, and that becomes the latent representation on which you do the final classification. We did exactly that, but we got very bad numbers. Why? One reason is that RNNs typically need more data than CNNs. If you look at the number of parameters, the rule of thumb in DL is that the number of data points has to be far greater than the number of parameters in the model.

So instead, we applied what are called CNNs for text. How do you apply a CNN to text? I'll brief you quickly. In computer vision, where you have an image as a grid, you run what is called a convolution: a small filter matrix that you slide across the entire feature map. This is called a 2D convolution, because the filter moves across both the width and the height. In text, you typically represent words using embeddings. So if I have a sentence like "I like this talk", say five words where each word is a 300-dimensional vector, I have a straightforward 5 x 300 matrix, and from that starting point we try to replicate what happens in computer vision. But if I run a 2D convolution on it, I am looking at only some dimensions of the word embedding at a time, and in NLP, as of today, there is no clear interpretation of what each sub-dimension means; the meaning of a word is smeared across the entire D-dimensional space. So the change you make is that instead of a 2D convolution, you use a filter whose width equals the embedding dimension, and you slide it across the words only. You are moving along a single axis, the tokens, which is why this is called a 1D convolution. Then you apply the usual idea of max pooling and take it forward, exactly as in computer vision; a minimal sketch of such a network follows below. We did exactly this: we applied a CNN and got a much better score. These two systems were the baselines we started with.

Now, how do you improve from there? Machine learning is not about getting the solution in one go; you build one version and improve from it. The literature has pointed to a few signals that tend to be good indicators for sarcasm: sentiment, emotion, and personality. Let's look into each of them. Sentiment is about whether the text is negative or positive, but sarcasm has the hallmark that as you read through the text, the sentiment flips. Look at this sentence: "I love the pain present in the breakup." It starts with a positive tone and then changes into a negative one. This flipping of sentiment is a hallmark of sarcasm.
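As promised, here is a minimal Keras-style sketch of such a 1D-convolution text classifier. The vocabulary size, sequence length, and layer widths are illustrative assumptions, not the talk's exact settings:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20000, 300, 50  # illustrative sizes

model = models.Sequential([
    # Each tweet arrives as MAX_LEN token ids; the embedding layer
    # turns it into a MAX_LEN x 300 matrix of word vectors.
    layers.Input(shape=(MAX_LEN,), dtype="int32"),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # 1D convolution: the filter spans the full embedding dimension
    # and slides along the token axis only.
    layers.Conv1D(filters=128, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),            # max-over-time pooling
    layers.Dense(100, activation="relu"),   # latent feature vector
    layers.Dense(2, activation="softmax"),  # sarcastic vs. not
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```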
People have always known about this flip, and one of the standard ways to use it was to count the number of positive and negative words, positive words followed by negative words, and so on. We would like to avoid that kind of fragile feature and build a more robust set; a sketch of the fragile hand-coded style we are moving away from follows at the end of this passage.

What is emotion? Emotion is, in some sense, a more fine-grained sentiment. Sentiment is coarse: it is negative, positive, or neutral. Emotion is a mix; it can be one or more of happiness, anger, jealousy, grief, and so on. And sarcastic sentences are always rich in emotion. As an example some of you might relate to: "My stellar programming career: job offer, Ctrl+C, Ctrl+V, resignation, repeat." This carries a lot of emotion: there is pain, there is sadness, there is anger, there is disgust. All of them are there; you can't say one is present and another is not. This is the other hallmark of sarcasm, and it is what we will be exploiting.

The third piece of the puzzle is personality. People have done a lot of studies in human behavior and psychology, and they have concluded that some people have a greater ability to understand, and even express, sarcasm than others. More generally, given a particular person, the likelihood of their being sarcastic varies, and can vary dramatically. And this matters a lot: maybe I am somebody who is very sentimental, very emotional, but typically not very sarcastic. It matters when you build models, because the history of the user you are dealing with has implications for the performance of your model. Typically, the way this is handled is that you look at the history of the user the tweet came from and derive certain features and signals from it. However, from a practical point of view, this is not a research prototype; it is a system we wish to deploy in a real-world setting and make work. If for every call I have to go and fetch that user's history, a lot of effort goes in at runtime. The other option is to store the history of every user I have already interacted with, which has serious implications for storage. Both points make it a bad choice from a pragmatic standpoint, which is why we decided, at least for now, not to go down this path: we do not deal with the persona part of sarcasm. One option would be to store a few summary features, whether the user tends to do this or not, and so on. A more nuanced view is that a persona is not fixed: it is not something you are born with, it evolves over your history, and that also comes into play. If you want to capture that, you have to look at the entire history; in the language of Twitter, you are essentially saying you should look at the last 3,200 tweets this person has made, which is very expensive.

Any other doubts? Yeah, sure. On the question of additional features: absolutely, but first you have to consider whether they even correlate with sarcasm, and second, if they do, how you actually incorporate them. Machine learning is not about using all possible features. It is about building the simplest system, with the smallest set of features, that does the best.
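Here is the sketch of that fragile, hand-coded feature style, the kind of thing prior work relied on and we wanted to move away from. The two lexicons are toy assumptions:

```python
POSITIVE = {"love", "great", "awesome", "happy"}   # toy sentiment lexicons
NEGATIVE = {"pain", "hate", "awful", "sad"}

def flip_features(tweet: str) -> dict:
    tokens = tweet.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    # Count positive words immediately followed by negative ones: a
    # crude proxy for the sentiment flip that marks sarcasm.
    flips = sum(a in POSITIVE and b in NEGATIVE
                for a, b in zip(tokens, tokens[1:]))
    return {"n_pos": pos, "n_neg": neg, "n_flips": flips}

print(flip_features("I love the pain of a breakup"))
# {'n_pos': 1, 'n_neg': 1, 'n_flips': 0}
```

Note how brittle this is: "love" and "pain" are not adjacent, so the flip counter misses the very flip a human sees instantly.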
That is the computer-science, or optimization, way of looking at it: resources are expensive, so let us try to build the simplest thing that pushes the boundary that far. In the same way, you could ask why we look only at Twitter; I could look at a person's other social profiles and pull data from there. You can, but you dramatically increase the complexity of a production system.

On the next question, a couple of things. One, we are not dealing with examples like the Hillary Clinton one; those are too complex, and the system cannot handle them. Second, when we look at a tweet, we look at it in isolation, with no other history. Maybe this person has tweeted several times and all of those tweets are in our dataset, but we ignore that fact altogether, because if we start using those dependencies, they get embedded in the model, and then at runtime you also need all of those dependencies, which we want to completely avoid from a pragmatic viewpoint.

When I said "cleaning up": cleaning has a lot of parts to it, and there are various ways to do it. At the end of the day you have to do some of it manually, but before that you can use a number of techniques. For example, you can build a small hand-curated set, train a model on it, and do what is called outlier detection: look at the instances flagged positive, add them to the training set, and keep repeating the process. The quality of the dataset grows very well over time, and the human in the loop comes in only at the last step. But yes, it is there at the end.

Okay. If I had to sum up the entire solution in one diagram, this is it. We have a text, and we build a bunch of models, each doing a separate job. The green model only says what the sentiment of the text is. The blue one only says what the emotion of the text is. And the last one, the gray, is the baseline model, which says whether the text is sarcastic or not. We don't actually take the judgments of these models. A common technique in deep learning, used extensively to automate feature engineering, is to take the last-but-one layer of a network and use its activations as features. That is exactly what we do here: we remove the judgment part, which is typically a softmax layer at the end, and take the last-but-one layer. If I do this for the green model (I'll come to the details of green, blue, and gray shortly), I get features which in some sense encode the sentiment information present in the text. The blue gives emotion features, and the gray gives sarcasm features. We combine all three into our final feature vector; a sketch of this fusion follows below. The third one was optional: we ran experiments both with and without the baseline features, and I'll show how the results come out. Once you have the final feature vector, you apply a classifier on top of it; we tried several, from logistic regression to an SVM to a shallow neural network. Am I clear on the overall architecture? Good, so now I will go into the details of each piece.
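A minimal sketch of that fusion, assuming three trained Keras models under the hypothetical names sentiment_model, emotion_model, and baseline_model, plus training arrays X_train and y_train:

```python
import numpy as np
from tensorflow.keras import models
from sklearn.linear_model import LogisticRegression

def penultimate(model):
    # Chop off the softmax head; the last-but-one layer's activations
    # become a learned feature extractor.
    return models.Model(inputs=model.input,
                        outputs=model.layers[-2].output)

extractors = [penultimate(m)
              for m in (sentiment_model, emotion_model, baseline_model)]

def featurize(token_ids):
    # token_ids: (n_tweets, max_len) matrix of token indices.
    # Concatenate the three feature blocks into one fused vector.
    return np.concatenate([e.predict(token_ids) for e in extractors],
                          axis=1)

clf = LogisticRegression(max_iter=1000)  # or an SVM, as in the talk
clf.fit(featurize(X_train), y_train)     # final sarcasm decision
```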
First, the sentiment model. What we essentially want is to extract the features relevant to the sentiment of the text, and for this we have a CNN. Why a CNN? As I said, the amount of data we could collect was fairly small, and a large part of it turned out to be noise, so we had to discard a huge portion of it. The clean data we were left with was much, much smaller, so we built a CNN on it. What does this CNN do? It predicts the sentiment of the text: negative, positive, or neutral. For this we used the sentiment datasets already available on the web. To sum it up in a couple of lines: you take the text, do some preprocessing, tokenize it, and convert each token into a word vector. As a starting point we used the pre-trained Twitter word vectors available from GloVe. For the tokenizer, rather than the standard idea that a space splits English text, there are sophisticated tokenizers available for Twitter data; one such tokenizer was written a couple of years ago by a researcher who had done a lot of work on social-network text, and we used it directly.

So we took sentiment analysis datasets: three classes, each instance labeled negative, positive, or neutral. We took a lot of public datasets and augmented them with some custom data once we saw how the final system was doing and what kinds of cases it was missing. On top of this, all we have is a CNN, exactly as I explained: 1D convolution, max pooling, 1D convolution, max pooling, until somewhere you have a single flattened vector, and that becomes your feature vector.

For those who have not followed that thread, I'll repeat once more. Say you have a sentence like "I like this movie very much", which is seven words, and imagine I am working with a five-dimensional word embedding space where the word vectors are already available. I just look up the corresponding vector for each word, and now I have a 7 x 5 matrix as the starting point for that sentence. We are replicating how CNNs work in computer vision. These are our filters: the width of each filter equals the embedding dimension, but the height varies. Say I have two filters of height four, two of height three, and two of height two. When you apply a 2 x 5 filter, it covers two words, moves down by one, moves by one again, and keeps extracting features; the convolution operation is the same as it always is in computer vision. On top of that you apply max pooling: you take the max from each feature map, those maxima together become your final feature vector, and then you do classification on top of it. Am I clear?

On the embedding question: yes, we used that as well. It has implications in terms of whether you can find an embedding for a token or not, but for most tokens it is there. The GloVe embeddings for Twitter data were trained on close to 2 billion tweets, about 27 billion tokens, so it is a pretty rich embedding space in that sense; a loading sketch follows below.
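A small sketch of loading those GloVe Twitter vectors into an embedding matrix. The file name matches the published glove.twitter.27B release; word_index is an assumed token-to-id mapping produced by the tokenizer:

```python
import numpy as np

EMBED_DIM = 200  # glove.twitter.27B ships in 25/50/100/200-d flavours

def load_glove(path="glove.twitter.27B.200d.txt"):
    """Read pre-trained GloVe vectors into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype="float32")
    return vectors

def embedding_matrix(word_index, vectors):
    """Row i holds the vector for token id i; OOV rows stay zero."""
    matrix = np.zeros((len(word_index) + 1, EMBED_DIM), dtype="float32")
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = vectors[word]
    return matrix
```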
The second piece of the puzzle was the emotion model, built in exactly the same way. The only difference is that the objective is now predicting one of six classes rather than one of three: anger, disgust, surprise, sadness, joy, and fear. There are public datasets available for these six classes, which is the reason for choosing them. Ideally you might want more, but as a starting point we didn't want to invest in building an emotion dataset ourselves, so we used what was available on the web directly.

Now to the details. We have three models: baseline, sentiment, and emotion. These are not the very deep CNNs you typically see in computer vision, with 100 or 200 layers; each had only a handful of layers. You have convolution, max pooling, convolution, max pooling, then a simple linear layer, and on top of it the softmax. The convolution kernel sizes we used were 3 x D, 4 x D, 5 x D, and 6 x D, where D is the embedding dimension; these gave us feature maps, and on top of those we applied pooling kernels of sizes five and three. These are simply the hyperparameters that worked best for us; we tried a lot of combinations, and I am showing the final result. The last hidden layer had 100 units for the baseline and 128 for sentiment and emotion. The softmax for the baseline is of size two, since it decides sarcasm or not; for sentiment it is of size three (negative, positive, neutral); and for emotion it is of size six. A sketch of one of these heads, with these numbers, appears at the end of this passage. In this entire work we did not fine-tune the word embeddings, because we never had enough data. Whether to fine-tune embeddings at all is a tricky decision in itself; you have to factor in a lot of things, and the idea here was to build the simplest model possible without getting into all of those nuances. And to the question from the audience: no, this was not multi-label; it was one of the six classes.

So now that you know the green, blue, and gray boxes: all we do is take these three models, remove the last layer, run the text through each of them, take the last-but-one layer as the feature vector, and finally train a bunch of linear models on top of it.

Let us look at some of the numbers. Please do not expect the kind of numbers you typically see in computer vision; that is not going to happen. The baseline got us close to the mid-70s. By adding the sentiment and emotion features we did much better, reaching the upper 80s. The combination of baseline, sentiment, and emotion did slightly worse, actually. I don't have a clean answer for why that third setup underperformed even though we are explicitly feeding in features relevant to sarcasm. It could be a quirk of the dataset itself, because, from a production point of view, you train a system, some time passes before you deploy it, and then it starts to run. There is a time gap between training and test data, and we replicated that here: the training data was pulled from one timeline and the test data from a quite different one. One problem you often face with Twitter, because the language keeps evolving, is that over time you see more and more tokens for which there are no embeddings; these become what are called out-of-vocabulary words.
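Here is that head as a sketch, using the hyperparameters quoted above. One simplification to note: the talk's actual stack interleaves convolution with pooling windows of five and three, while this sketch uses parallel convolutions with global max pooling to stay short:

```python
from tensorflow.keras import Input, initializers, layers, models

def build_head(embed_matrix, max_len, penult_units, n_classes):
    """penult_units: 100 (baseline) or 128 (sentiment / emotion).
    n_classes: 2 (sarcasm), 3 (sentiment) or 6 (emotion)."""
    vocab, dim = embed_matrix.shape
    inp = Input(shape=(max_len,), dtype="int32")
    # Pre-trained embeddings stay frozen, as in the talk.
    emb = layers.Embedding(
        vocab, dim, trainable=False,
        embeddings_initializer=initializers.Constant(embed_matrix))(inp)
    # Convolutions of widths 3..6 tokens (the 3xD .. 6xD kernels),
    # each max-pooled over time.
    pooled = [layers.GlobalMaxPooling1D()(
                  layers.Conv1D(100, k, activation="relu")(emb))
              for k in (3, 4, 5, 6)]
    x = layers.Concatenate()(pooled)
    x = layers.Dense(penult_units, activation="relu")(x)  # feature layer
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)
```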
Now, this work can be extended in many directions. You can always ask: what if you train your own word embeddings, are you likely to do better? If you have a lot of data, yes; training word embeddings well typically requires a dataset of Wikipedia scale. A better way to handle the vocabulary problem is to move toward character embeddings, or rather character n-gram embeddings. How do you typically deal with out-of-vocabulary words? If you have an embedding for every character, then no word is ever new to me: I can break any word into its constituent characters, I already have an embedding for each one of them, and from those I can build an embedding for the word. The only problem is that characters by themselves have no meaning, and therefore character embeddings have no meaning: the usual notion that similar words cluster together in the embedding space, which is the fundamental principle behind word embeddings, fails there. The ideal unit would be the morpheme, the smallest unit of text that carries meaning, but given a word, breaking it into morphemes is a very hard problem in itself. So the workaround, and computer-science people are always good at finding workarounds, is to break the word into character n-grams: you look at three, four, five, six characters at a time. As a matter of fact, this is the fundamental idea behind fastText, if some of you have used it: one of the most effective text classifiers available, competitive with DL models while training in under a minute. So the way to handle out-of-vocabulary words is to build embeddings from character n-grams; a small sketch of the idea follows at the end of this passage.

Beyond that: how well do you do with RNNs, since RNNs are typically the way to deal with text? Can you apply notions like attention, so that you can go to a level where you not only say that a tweet is sarcastic but also say why, which aspect of the sentence makes it sarcastic? That would be amazing. But these are much heavier-duty models that require massively more data, so the bigger challenge there, more than building models, is building a more robust and comprehensive dataset. And the last part: can you factor in the human side, the persona, without introducing a runtime dependency? One way is to maintain a few priors that describe how likely a given user is to be sarcastic, and update those priors after every conversation. The updates can be done offline, so at runtime you are only using a couple of numbers.

Any questions? On features that did not work: I don't have them in the slides, but some I can recall, like double negations and answers of that kind, never worked; they went out of the window. On whether the model relies on syntax or semantics: that's a great question, but difficult to answer. If you could look inside the model and see what is happening in there, you could answer in a better way whether it is going for the syntactic or the semantic part. All of it is combined together in a latent space, which is your feature, so you can't exactly pin it down. There were no clean answers there, but we clearly saw that the model did much, much better on tweets that had a strong sentiment component, where people were disgruntled.
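The character n-gram idea, in a minimal fastText-style sketch. The ngram_vectors table is an assumed lookup learned beforehand, not a real fastText API:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of a word, with boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def oov_vector(word, ngram_vectors, dim):
    """Embed an unseen word by averaging its n-gram vectors."""
    vecs = [ngram_vectors[g] for g in char_ngrams(word)
            if g in ngram_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(char_ngrams("yolo"))
# ['<yo', 'yol', 'olo', 'lo>', '<yol', 'yolo', 'olo>',
#  '<yolo', 'yolo>', '<yolo>']
```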
That behavior on disgruntled tweets is what we wanted from a business-objective point of view: to capture a greater number of tweets where people are fed up and end up being sarcastic, and to actually take care of those. So that part worked out well for us. But no, we have not gone to the level where I can answer in a clean manner whether it is syntactic or semantic.

Any other questions? On how the data was chosen: that was primarily based on the verticals and the brands we deal with. The data was not pulled specific to a geography, because once you start doing that, you get into the nuances of language. In India, for instance, people tend to mix Hindi and English rather than writing plain English, and that is a beast of a different level altogether. This was already a difficult problem, so we tried to address the most simplistic version to begin with. It's as simple as that.

On feeding such signals explicitly: we actually wanted to avoid anything of that kind being fed in explicitly. Did the model capture it latently? You can't say; these are latent-space features that you cannot interpret at that level. You can't even interpret them at the basic level of whether the model is looking at a comma, or which word it is looking for, and so on.

On single-word sarcasm: I would say a single word is not sarcasm; that is being blunt. I have not seen many examples of single-word sarcasm. While your question is a good one, a lot of this was framed from a business viewpoint and a production-fit viewpoint. If you can give me examples of single-word sarcastic tweets, I might be able to give you some insight, but the example you gave, if it is not sarcasm, is being blunt and being pissed off at a different level. And history, as I said, we are avoiding altogether anyway.

On responding sarcastically: given the history of how airlines have recently dealt with their customers, I think that would be disastrous; at least the person sitting at the machine and responding would be fired for sure. You typically don't do that. Humor is a different side of it, and it is played at the brand level, as a well-thought-out strategy from the marketing and PR side; customer reps typically don't play it. And why would you, as a system, evaluate the responses going out from your own team? You want to understand the sentiment, or the sarcasm if involved, in the inbound data, not the outbound data.

On similar sentences: that's interesting, but I have not seen examples of it. At the end of the day, when you use embeddings, the fundamental assumption is that if two sentences have a big overlap in their words, and you use a simple sentence representation like a sum, then both sentences are likely to be mapped to a similar region of the space. That is the underlying assumption, and in that space you want to learn a linear or non-linear boundary that does the separation, which is precisely why you use whole-sentence sentiment rather than word-based sentiment: to avoid those kinds of pitfalls. You cannot do that unless you have access to the datasets. There are a couple of them, but they don't necessarily meet your business goals, so you always have to augment the datasets to your own needs.

You'll have to be louder, there is a lot of noise. ... I'm not sure there is a basis for that.
At least from what I know of NLP, I am not aware of a theory that says such a thing should or should not happen, or what it means if it does. How do you know it is not a fluke? Maybe it only holds on that dataset. I have not seen any study that does this kind of thing, because your dimensions need not even be the same: your word embeddings might live in a dimension X while your document embeddings live in a dimension Y. No, the space in which the final vector comes out may not be the same, because that is a hyperparameter. Which is what I'm saying. Yeah, I think we can take it offline. Thank you so much for staying awake and not sleeping through it. Thanks.