Welcome everybody to this session. My name is Dakshina Murthy, and for the next 45 minutes or so I will be talking about understanding understanding language. I changed the title from what is mentioned in the brochure; I felt this one was catchier and deeper. The idea is: how do machines understand language, and how do we benefit from understanding how they understand it? Post-lunch talks are always tricky, so I will do my best to keep you all up and running for the next 45 minutes.

One slide about the organization I represent. I am from the International School of Engineering. We are an AI and machine learning research and education institution, present in India in Hyderabad, Bangalore and Mumbai, with academic partners in France (Rennes School of Business), in the USA (Case Western Reserve University) and in Canada (Carleton University). We have 60-plus faculty and scientists, all interested in AI and ML and working on it, two products already launched and two in the pipeline, 75-plus patents and around a thousand publications, so we are very actively involved in research. Faculty either research technology areas like blockchain, big data, IoT and analytics, or applications of analytics in healthcare, financial markets and so on. We teach and we do research; that is what we do.

In that institute I am one of the investigators in the AI for Business group. The theme of our research is that we are not that keen on inventing new models. What we acknowledge is that really smart people are creating really smart models, really fast; every day there is a new model. And most of these models are very academic: because they are developed in a university setting, accuracy and performance are given a lot of importance. But when a business wants to apply them, accuracy is only one of the indicators. You have to look at productionizability, maintainability, stability; a bunch of things come into the picture. So businesses need separate metrics and frameworks, and our group tries to create those metrics and frameworks and facilitate that process.

Today, while we look at a bunch of AI and machine learning models, I want to focus on natural language processing. Why is that? I started seriously looking at NLP in 2010-2011. At that time the state-of-the-art models were conditional random fields: super cool mathematics, log-linear models, stochastic gradient descent, and they allowed domain experts to input their features in a very smart way. Machine learning at its best: you needed domain experts, you needed super tough math, an ideal setting for a PhD to feel good about. I thought, I can apply this; I did a few use cases and told myself I would live happily ever after.
Then one of my clients called and said: you know what, two interns we hired from an undergrad institute built something called an LSTM that broke whatever you did in six months. They just gave it more data; no features, nothing. My ego got crushed. For a long time I used to call these Starbucks neural networks: large, short, tall, medium neural networks, whatever. I was upset, obviously, but I got to live with it. I broke up with CRFs and started dating LSTMs; they were super cool, they could do all of that, and by 2013-2014 I said: yes, LSTMs, I think I can apply these, I know how to do this.

And then I saw Google doing something odd. One day I saw a paper where my LSTM was one part of one half of the picture. They said: there is an LSTM, then we run another LSTM in the reverse direction, then we stack eight of them one on top of the other and provide residual connections; that half we call the encoder. Then we have eight layers on the other side, which we call the decoder, and we put something in between that we call attention. What is happening here? It took me some more time to figure out, and then I thought, okay, partly understood it. Then in the last 12 months they said BERT, they said XLNet, they said ERNIE. Where am I going here? So I said: wait, I cannot spend my life fine-tuning hyperparameters of these models; I have to understand something more about how they do these cool things.

Whenever you want to understand something much more deeply, it probably helps to ask very simple questions. So I asked myself: how do we humans understand language? Is there anything I can correlate with the way these models understand it? I tried observing a few kids and started taking notes. First, when you are learning a language you learn a few words, and you start guessing the meaning of a sentence from the one or two words that you know; if you had to pick up a completely new language, that is probably what you would do. Once we understand a few words, we start forming some naive rules: if there is a dollar sign, most likely it is a currency; if there are ten digits, probably it is a mobile phone number. We make up these simple rules, and probably after some time we build fairly complex parsing: we split a sentence into pieces, combine them, and start understanding the language. Then, once we become comfortable with the language, we start guessing trends; we build beautiful autoregressions. "The clouds are in the ___": you and I will probably say "sky", and someone in the IT department will say "server room", whatever that is. Somewhere we start filling in the gaps, learning these autoregressive trends; that is another very important dimension. And lastly, we seem to get better with what we learn: we do not forget, for the next task, what we learned in the previous task; we keep building. So these are the paradigms, or dimensions, along which humans learn languages.

What I found interesting is that, whether it thought about it explicitly or not, the machine learning community actually went about solving the natural language problem in more or less the same way. Let us see how. First, learning words. The naivest thing was the one-hot vector; the machine learning community of the 1950s and 60s might have influenced this, because whenever you saw a categorical attribute, you just dummified it.
So whenever you see a word: depending on the vocabulary size, if your vocabulary is ten thousand words, you create a ten-thousand-dimension vector. Put your words in alphabetical order; the first word is 1 0 0 0..., the second is 0 1 0 0 0..., and so on. A vector of V dimensions, where V is the size of the vocabulary, with V minus one zeros and just one hot entry; there is only one 1 hanging around in there. That is one-hot. But the problem is: book, library, cricket, any word at all, they are all equidistant from each other, because each is just one 1 and several zeros; the Euclidean distance between any two is always root two. So one-hot vectors did not have even the basic semantics needed to understand language. While this is a representation we invented, I do not think it is relevant for solving any major problem.

I am going to plot the models on a specific graph, so let me spend a minute on the strategy I adopted. I should work with some visualization experts to come up with something better, but this is what I have currently. On the x-axis I just write the models in chronological order, to the extent possible; I am not an AI historian, so if I miss a year or two, do not worry, but in general it is chronological. The y-axis is the model's ability to understand words. The color is how well it does trend prediction, that is, how well it looks at past words and guesses future words. The shape, the number of sides, represents how well it parses and applies rules. And the overall size is the performance of the model. The one-hot vector does not understand word meanings; it is a very dumb representation, so it sits at zero. There is no color, so it predicts nothing; it is a circle with no sides, so no parsing and no rules; and it is almost a point. I could have drawn it smaller, but I was worried it would not be visible. It is sort of nowhere. As we discuss more and more models, I will fill in this graph. Okay, that is where one-hot vectors are.

Then in the 1960s a bunch of machine learning people made a really smart hypothesis about understanding words. They said: words that occur in few documents, but occur with high frequency when they do occur, are really useful words. If a word is all over the place, it must be "the" or some space filler; if it appears in a document only once, maybe it is a fluke; but if it appears in a few documents, and a lot of times when it does, it must be a really useful word. So they created what you probably know, definitely know: term-document matrices. They took a lot of documents and put them in columns, a lot of terms and put them in rows, and at the base level you use a binary entry: 1 if the term is present in the document, else 0. Or, if you want, you scale by the number of times it occurs across documents and within the document, with some nice smart scaling factors, and the entries become real numbers. Now this vector for a word is a lot better than the one-hot vector, because similar terms will most likely appear in similar documents; so when two word vectors are similar, it means the words occurred in similar documents, and most likely they are similar to each other. So this got better than one-hot.
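To make this concrete, here is a minimal sketch (not from the talk's slides) contrasting one-hot vectors with a term-document count matrix, on a toy corpus with plain numpy:

```python
# Toy illustration: one-hot vectors carry no semantics; a term-document
# matrix at least reflects which words occur in which documents.
import numpy as np

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "ai and data science papers"]
vocab = sorted({w for d in docs for w in d.split()})

# One-hot: every word is a V-dim vector with a single 1.
def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Any two distinct one-hot vectors are exactly sqrt(2) apart: no semantics.
print(np.linalg.norm(one_hot("cat") - one_hot("dog")))     # 1.414...
print(np.linalg.norm(one_hot("cat") - one_hot("papers")))  # 1.414...

# Term-document matrix: rows are terms, columns are documents,
# entries are raw counts (you could TF-IDF-scale them instead).
td = np.array([[d.split().count(w) for d in docs] for w in vocab])
print(td.shape)  # (V, num_docs): word vectors now reflect where words occur
```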
But the problem with this model is that it is still extremely sparse; not as sparse as one-hot, but very sparse, with only a few non-zero entries. More importantly, it is much larger in dimension. Think of Google: that is a billion documents. Your one-hot vector is the size of the vocabulary; this is the size of the corpus. That is scary. So what people said is: this number in the matrix is a combination of two things. It is there because the document wants to express some concepts, and the term is capable of expressing those concepts; only when there is a match between the concepts the document wants to express and the concepts the term is capable of expressing do you see that term in that document. So they believed this matrix is the product of two sub-matrices: one relating terms to concepts, the other relating concepts to documents. The linear algebra people went ahead, did some really cool things, and figured out a way to split this matrix into those two matrices, term-concept and concept-document. What is the advantage? The concept dimension can be a very small number: the original matrix can be 100,000 by a million, but the factors can be 100,000 by 300 or 200, and 200 by a million, and their product reconstructs the matrix. So the term-concept vectors are much shorter, number one, and much denser, which is itself an advantage. But what really clinched the deal is this: term-document entries are very document-specific, so if there is a document on data science and it does not contain the word "AI", I get a zero there. Term-concept, on the other hand, when you factorize, figures out that "AI" and "data science" occurred together in many other documents, so they are probably similar concepts; even though "AI" is not in this particular document, it is most likely relevant, and the factorization figures that out. So: long-range interactions, short dimensions, and density instead of sparsity clinched it. These were very, very popular; I think in the 1990s they were the thing.
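Here is the factorization idea in miniature, an LSA-style truncated SVD; the matrix and the choice of two "concepts" are made up for illustration:

```python
# A minimal sketch: split a term-document matrix into term-concept and
# concept-document factors with a truncated SVD. Toy numbers throughout.
import numpy as np

td = np.array([[2., 1., 0., 0.],   # "ai"
               [1., 2., 0., 0.],   # "data"
               [0., 0., 2., 1.],   # "cricket"
               [0., 0., 1., 2.]])  # "batting"

U, s, Vt = np.linalg.svd(td, full_matrices=False)
k = 2                                   # keep only k "concepts"
term_concept = U[:, :k] * s[:k]         # short, dense word vectors
concept_doc  = Vt[:k, :]                # documents in concept space

# "ai" and "data" now land near each other even in documents where one of
# them never literally appeared: the long-range interaction he mentions.
print(np.round(term_concept, 2))
```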
Okay, now I understand the words well; but that is only one part, right? What else can I do: can I parse, can I build trends, can I build rules? These models have an inherent problem with building trends, simply because I dropped many words that I thought were unnecessary, there is no notion of the previous two words, and I lost the order of the words; I put them in alphabetical order, so I am gone for good. The only thing they could try was building rules. Regular expressions were one way, with a human supplying the heuristics; I do not even think we can call that machine learning, because if you hand-write the rules, it is not machine learning. The more popular method: these term-concept or term-document matrices are structured data, so you apply your logistic regression or support vector machine on top of them to learn the rules. This is the second generation. (Can we take questions after the talk? Yes: TF-IDF is the weighting on the original term-document, bag-of-words matrix; term-concept is the SVD kind of matrix factorization.) So: do ML on it, and you find these simple rules. On my graph these models still do not learn any trends, hence I have not colored them; but they identify a few rules through ML, so I made the shape a bit more complicated; the number of sides is a sort of indication of how much rule-learning is going on. They are better than one-hot in word understanding, and better than one-hot in that they combine word understanding with rules; still, they are not very far along with respect to rules.

That is okay. In word understanding, as you probably know, the ultimate thing happened in 2013: word embeddings. King minus man plus woman is queen, all of that. Google is actually very good at marketing some of these things; there was enough work even before that, but somehow Google just became so prominent with it. By the way, last year someone really studied this and found that word2vec did something crazy: when you ask, if father is a doctor, what is mother, it said nurse. That is a different discussion altogether; it became sexist without realizing it. Anyway, let us not get into that. GloVe and word2vec sort of opened a new era. GloVe went with co-occurrence matrices: they took a vocabulary-by-vocabulary matrix, looked at every word, what precedes it, what succeeds it, what is two words away on either side, came up with probability distributions of the words, and then factorized it. That was GloVe. word2vec, on the other hand, is a pure brute-force neural network: you give a one-hot vector here, you predict the next word, and somewhere in the middle the neural net keeps creating better and better representations; you pick the representation layer that you want. word2vec is extremely popular, but in the era of Twitter it has this problem: even if I start with a million words, someone invents a new word, and when there is a new word I have to randomly initialize my whole word2vec again and retrain the entire model. Many models came up to solve that problem, Facebook's fastText and so on. My favorite is spaCy's Bloom embeddings, which are quite good. Essentially, they embed portions of the word: Bloom embeddings take the prefix and embed it, the suffix and embed it, and the word itself and embed it (if it is not there, it gets a random row). They also create categories: if there is an initial capital letter, the word goes into an init-cap group, so all words starting with a capital go under that group; all digits into a digits group, all-caps into another, and so on. They created some 10,000 groups into which the entire vocabulary falls, so even a word with a spelling mistake lands in one of those groups, and they embedded the groups as well. So the embedding of a word is the embedding of the prefix, the suffix, the word, and the group, all added up. It handles out-of-vocabulary words and spelling mistakes a lot more elegantly. Bloom is only one example; fastText does something similar, and there are many other embeddings that do.

But all these embeddings are context-free. If I have a sentence like "he went to a prison cell with a cell phone to extract the DNA cell samples of the inmates inside", all the "cell"s will have the same embedding, because you get one embedding per word. That is the problem: these are context-less. Still, having said that, this was like a new era, and people were super excited about it. In terms of understanding words and doing ML on top of that, people did some really simple things that worked beautifully. I saw a sentiment analysis paper where they took the word vectors of each word, added them all up, called that the sentence vector, attached the sentiment label, built an SVM on top of it, and it gave state-of-the-art results. Really understanding words can take you a long way, and this is probably the ultimate of understanding words.
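Here is a minimal sketch of the Bloom-embedding idea he describes: hash the prefix, the suffix, the word, and a shape "group" into a shared table and sum the rows. The table size, dimensions, and shape rules are made-up illustration values, not spaCy's actual configuration:

```python
# Hashed subword embeddings: new or misspelled words still get sensible
# vectors because their prefix, suffix, and shape hit familiar rows.
import numpy as np

ROWS, DIM = 10_000, 64
table = np.random.default_rng(0).normal(size=(ROWS, DIM))

def shape_group(word):
    if word.isdigit():      return "<digits>"
    if word.isupper():      return "<all-caps>"
    if word[:1].isupper():  return "<init-cap>"
    return "<lower>"

def bloom_embed(word):
    keys = [word, word[:3], word[-3:], shape_group(word)]
    rows = [hash(("key", k)) % ROWS for k in keys]
    return sum(table[r] for r in rows)

# A typo shares 3 of its 4 hashed keys with the correct word,
# so the two vectors end up highly similar (cosine close to 1).
v1, v2 = bloom_embed("running"), bloom_embed("runnning")  # typo on purpose
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```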
Now, in parallel, there was another community focusing on understanding language only through regression: I want to fill in the blanks; given the past two words, how will I guess the next word? The approach there is: look at large corpora and compute, what is the probability that I see this word; what is the probability that I see this word given I saw that word; what is the probability that I see this word given I saw these two words; and so on. Learn all these unigram, bigram, trigram probabilities, as they call them, and then use Bayes' theorem to predict the next word. For some time Google's "I'm Feeling Lucky" was using something like that; probably now they use neural nets, I really do not know what technology they use, but this Markovian model was the workhorse of regression-based language models from the 1980s into the 2000s.

The culmination of that line is hidden Markov models, which are really smart; it is sad that they lost to neural nets, because the mathematics is so nicely developed. They use Bayes' theorem for the joint probability of a word and its part of speech, and then a Markov model on top. I see the words, but I do not see the parts of speech; so the Markov model is over the parts of speech: what is the probability that this is a noun, given that the previous one is a verb, given that the one before that is a determiner. The Markov model is built not on the words but on the hidden state of each word. So there is Bayes' theorem for the joint distribution, and there is a Markov model. Now, say one word has three plausible states: it could be a noun, verb or determiner with very close probabilities. If I have a sentence with ten words and keep the top three parts of speech for each, that is 3 to the power of 10 combinations; that is humongous. So they used really smart dynamic-programming-style counting to actually solve that problem. Hidden Markov models are mathematically extremely elegant; I could get romantic about them, but they are dead, okay, well, more or less. Still, they were very, very popular from the 1980s to 2000. This is a parallel track of NLP where you only look at regression to solve the problem; all the previous models were only looking at the meaning of the word, so I sort of cheated and squeezed this track in. Again, I am not a historian; whether this came first or that came first, I really do not know, I just had to put it somewhere, so let us say these came around here. There is no word meaning, so I brought it down on the y-axis, and there are no rules other than the regression, so there is no shape to it; but I colored it, because trending is beginning to be learned (trend, we decided, is indicated by color). So the graph tells you there is regression in it, but no word understanding and no rules in it.
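Here is a minimal sketch of that dynamic-programming trick (Viterbi): instead of scoring all 3^N tag sequences, keep only the best path into each tag at each step. Tags and probabilities are toy numbers, not a trained model:

```python
# Viterbi over hidden POS states: linear work per word instead of
# exponential enumeration of tag sequences.
import numpy as np

tags = ["NOUN", "VERB", "DET"]
start = np.log([0.4, 0.2, 0.4])                 # P(tag at position 0)
trans = np.log([[0.3, 0.5, 0.2],                # P(next tag | prev tag)
                [0.6, 0.1, 0.3],
                [0.9, 0.05, 0.05]])
# P(word | tag): the "hidden" part, since we see words, not tags.
emit = {"the":  np.log([0.01, 0.01, 0.98]),
        "dog":  np.log([0.90, 0.05, 0.05]),
        "runs": np.log([0.30, 0.65, 0.05])}

def viterbi(words):
    V = start + emit[words[0]]          # best log-prob ending in each tag
    back = []
    for w in words[1:]:
        scores = V[:, None] + trans     # (prev, next) path scores
        back.append(scores.argmax(axis=0))
        V = scores.max(axis=0) + emit[w]
    path = [int(V.argmax())]
    for bp in reversed(back):           # walk the backpointers
        path.append(int(bp[path[-1]]))
    return [tags[t] for t in reversed(path)]

print(viterbi(["the", "dog", "runs"]))  # ['DET', 'NOUN', 'VERB']
```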
Good. Then, around 2002-2003, came conditional random fields. Did any of you study them? Oh my goodness, this was the ultimate in machine learning at some point; anyway, do not worry about it, they lost the war. But conditional random fields are beautiful, because they were the first attempt to combine trending with rules: two of the four dimensions connected. The ML-on-term-document models combined meaning with rules; this combined trending with rules. A CRF had everything a hidden Markov model could do, all the probability-of-this-given-the-previous-one machinery, but it also had a very elegant way for a domain expert to add whatever features he or she thought were important. Say, in "statistical model", I want to predict the part of speech of "model". I can add features like: is "model" capitalized? Is the suffix of the previous word, "statistical", equal to "ical"? Yes or no. These are normally called indicator functions in CRF language, and you create hundreds of them; I read somewhere that the state-of-the-art POS tagger at the time had a hundred thousand hand-coded features. You need a lot of grad students to do that. A hundred thousand features, hand coded, and they built it. And then it is elegant machine learning: you have your hundred thousand features; the x1, x2 here are not words but the features, and you learn the weights w1, w2, w3 from data, classic machine learning. A weight can be a negative or a positive number, so you exponentiate the weighted sum and softmax it; that is why they are called log-linear models. Even if the whole sum is minus 100, e to the power of minus 100 is still a positive number, and when I softmax it, I get a probability. So conditional random fields had this beautiful trending, an elegant framework to fit in the rules, and the super mechanism of gradient descent to solve the problem; they really were state of the art. On my graph they sit somewhere around here: there is definitely a lot of trending in them, as good as hidden Markov models, and many, many more rules; an HMM only looks at the trend, whereas here I can add my own rules, so I made the shape more complicated. Still, word meanings are not added: there is no bag-of-words or anything, words are just treated as categorical entities. But performance-wise, these were entirely comparable.
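Here is a minimal sketch of that log-linear scoring: hand-written indicator features, learned weights, exponentiate, softmax. The features and weights are made up, and a real linear-chain CRF also scores tag-to-tag transitions; this only shows the per-position idea:

```python
# Log-linear scoring with indicator features: even a large negative score
# exponentiates to a positive number, so softmax gives a valid probability.
import math

def features(sent, i, tag):
    word = sent[i]
    prev = sent[i - 1] if i > 0 else ""
    return {
        f"is_capitalized&{tag}": word[:1].isupper(),
        f"prev_suffix_ical&{tag}": prev.endswith("ical"),
        f"word={word.lower()}&{tag}": True,
    }

weights = {"prev_suffix_ical&NOUN": 2.1,   # "statistical ___" looks noun-ish
           "is_capitalized&PROPN": 1.7,
           "word=model&NOUN": 0.9}

def tag_probs(sent, i, tags=("NOUN", "VERB", "PROPN")):
    scores = [sum(weights.get(f, 0.0)
                  for f, on in features(sent, i, t).items() if on)
              for t in tags]
    Z = sum(math.exp(s) for s in scores)
    return {t: math.exp(s) / Z for t, s in zip(tags, scores)}

print(tag_probs(["statistical", "model"], 1))   # NOUN should dominate
```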
Then came the era of deep learning. I will call it deep learning 1.0; 4.0 is the fad nowadays, you see Industry 4.0, Web 2.0, so I thought I would also number a few things, but this was about seven years back. The first efforts in deep learning were about combining meaning and trending. I have GloVe or Google word2vec embeddings; they go in as the inputs, and on top I build an LSTM, which can look at much longer trends than an HMM: I am not just looking at two words back, I can look at much longer sequences. That is why I call it autoregression 4.0, much larger than HMM. So you combine meanings and you combine trend; the hidden Markov model was only doing trend, this is meaning cum trend, and the trend is much longer. Then they said: combine the trend from this direction with the trend from the other direction and concatenate them, and you get trend from both directions. That is the bidirectional LSTM. Even today most people think of this as the most fundamental way to learn regression-based patterns in text: the meaning is fed in from below, and then there is a bidirectional LSTM on top.

But there are quite a few instances, and I have personally experienced this too, where even convolutional neural nets do a very good job of getting the context. The problem with LSTMs is that they go too long: is it really good to have such long trends? Do you want to predict tomorrow's Google price by looking at the Google price from 15 years back? No, you may not want to. Convolutional neural nets, on the other hand, look much more locally. In a typical convolutional neural net with local connections, words w1, w2, w3 form one embedding, w2, w3, w4 form the next, w3, w4, w5 the next, and then those three combine at the next layer. If you think about it, that vector is not only an embedding for word 3 but also includes the context information of words 1 and 2 on one side and 4 and 5 on the other: I am getting short-range, bidirectional context fed into my word embedding. So CNNs actually work very well in practice here. By the way, people call this encoding: embedding is normally when you just map a word to a meaning; encoding is when you add context to it, by doing a convolution or an LSTM to give it a shade of the past and future words. Both are fairly popular, though I would say the LSTM is probably better known; for some reason some commercial products actually only use the CNN version, but they do not publicize it much, I do not know why.

Then of course there is ELMo, another very popular model. It starts with one-hot encodings of characters, uses a convolutional neural net to create a word-level embedding, and then puts those words through LSTMs to get an encoding that is context-rich. So the CNN gives a word embedding and the LSTM gives a context embedding; that is very, very powerful. These models are adding meaning and trend, and what is interesting is that they do not deal with the entire word: they deal with only small fractions of the word. Even if your vocabulary is 10 million, they can divide it into, say, 30,000 pieces that you can combine in multiple ways to create as many words as you want. That is also a very powerful representation. So these were state of the art: in this phase of deep learning, we were combining meaning and trend.

And we had almost forgotten the last thing we humans use: we get better over time. Until now there was no way to incorporate that. In language, transfer learning was a very late entrant; nobody was transfer-learning. But in the past two years it became extremely popular, and people use it in two ways. One: train on the whole of Wikipedia as a language model. "Language model" is a very funky way of saying: give it the past words, predict the next word. I do not have to tag anything; it is unsupervised. I take all of Wikipedia, write it as previous-words-to-next-word pairs, predict, and learn all the weights of the English language; then I take my problem and tweak the weights just a bit. The other technique: build a neural net, keep most of it common and only the last layers different, and learn multiple tasks: classification, language modeling, translation, sentiment analysis, whatever. Most of the weights get tweaked and tweaked and tweaked because they are learning multiple tasks on the same data, but the last layers are task-specific while the initial layers stay language-specific. So people did pre-training. Bidirectional, so trending is covered; word embedding, so meaning is covered; pre-training, so getting better over time is covered. Three out of four, and we had really powerful models on the horizon.
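Here is a tiny sketch of why language-model pre-training needs no labels: raw text is its own supervision. The window size and text are toy choices:

```python
# Every position in unlabeled text gives a (context, next-word) training
# pair for free; that is the whole "unsupervised pre-training" trick.
text = "the clouds are in the sky and the servers are in the server room"
tokens = text.split()

window = 3   # how many past words the model may look at (a toy choice)
pairs = [(tokens[max(0, i - window):i], tokens[i])
         for i in range(1, len(tokens))]

for ctx, nxt in pairs[:4]:
    print(f"context={ctx!r} -> predict {nxt!r}")
# Train any sequence model on millions of such pairs from Wikipedia,
# then fine-tune its weights on your small labelled task.
```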
Almost done: if only we could add rules to it. We had to figure out a way to add rules, and we did not know how; manually you can, but deep neural nets are not very friendly to manual inputs. Then Yoshua Bengio's group came up with attention, and with it, parsing and rules were effectively cracked. He probably did not even realize, when he proposed it, that he was building a parser; at least the paper talks about doing something else, which is also a very tough problem. But attention, essentially, is this mechanism: until now, with my word embedding plus context, each word is represented as a vector. So in a sentence of, say, four or five words, I have one row per word: these are the vectors h1, h2, h3 and so on, capital H for the whole stack. Attention is a way to learn, for a given task, how important each of these words is: I learn a weight for each of them. Essentially the model may say: at this particular moment, this word and this word are very important, I can ignore all the others; at the next moment, these other two. And you use a neural net to learn it; obviously you do not write those rules yourself. So attention is a very powerful way to learn every vector's importance in a given task, and that involves taking elements from this vector and that vector and combining them: exactly the parsing and rules I talked about. CRF rules are human-generated: a human reads the text, understands the patterns and writes the rules. Attention rules are much more "deepish": the neural net learns those rules, because the neural net already understands embeddings that are highly context-rich. So attention sort of sealed the game.
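Here is a minimal sketch of attention as learned importance weights: score each word vector, softmax the scores, and collapse the sentence into one vector. The vectors and the scoring probe are random stand-ins; a real model learns them:

```python
# Attention in one screen: a learned probe scores each word's relevance,
# softmax turns scores into weights, and the weighted sum is the output.
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))   # h1..h5: one context-rich vector per word
w = rng.normal(size=8)        # learned "what matters for this task" probe

scores = H @ w                                   # one score per word
alpha = np.exp(scores) / np.exp(scores).sum()    # softmax: weights sum to 1
sentence_vec = alpha @ H                         # important words dominate

print(np.round(alpha, 3))     # e.g. two words may soak up most weight
print(sentence_vec.shape)     # (8,): the whole sentence as one vector
```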
The founder of spaCy wrote a very influential blog: embed, encode, attend, predict. This is the blueprint of NLP, and it actually fits very well with the framework I have talked about so far: embed is about understanding words, encode is about understanding trends, attend is about understanding rules. What is missing? What is the flaw in this approach, if you can think about it? Only one thing: it is not learning over time; there is no transfer learning in the model. If I had to update it, I would say pre-train, embed, encode, attend, predict, because we said four things are needed for a perfect language model. That blog is from 2016, and we are three years on, so we can add this. It is a very powerful recipe, and I think pre-training will improve the performance even further. How am I doing with time? Ten minutes? Okay, I am almost done. So on my graph, embed-encode-attend-predict plus pre-train sits at the very end. It is perfect in word meaning; it is pitch dark, so it really understands trends extremely well; I wanted to draw some more complicated shape, but I thought a star symbolically shows it is super cool, so I left it at a star; it can do really complex rules and patterns; and it pre-trains, so it learns and gets better with each task. If you have time, look at an architecture from 2016-17 called hierarchical neural networks, which is built exactly on this recipe: they take characters and words and build a sentence-level vector through embed, encode, attend; then they take sentences and create a paragraph-level vector the same way; and then they do a document classification task. Very systematic. I personally believe that for many business applications this could be good enough; it has everything in it. So that is where it stands, and it did not cheat, it played well.

Then what does BERT do differently? What does XLNet do differently? What are they doing? I was also surprised. When BERT came, the Google people more or less said: yes, this is great, we will do a bit of pre-training, but we will brute-force the parsing and the rules. They created an amazingly complicated shape on my chart. Each layer has six different rule generators acting independently; they call it multi-head attention. Each head builds its own set of rules; you combine all of them, send the result to another six heads, combine again, and so on for 12 layers. Oh my goodness: that is rules on rules, by rules, with rules, rules all over the rules. So obviously this thing wins. You have a good batting side, a good bowling side, good fielders, you think you are doing well, and then the other team brings in Virat Kohli and you lose, right? Okay, that is fine; but this is still the way to build a team, so I am still very much with this framework. Then Carnegie Mellon brought in their own Jos Buttler. They said: you attend with brute force? We will regress with brute force. XLNet is all the colors at once: they did not just take two words back or two words forward, they calculated all kinds of permutations and combinations of the context. "The clouds are in the ___", "the clouds are in", "in the clouds are": every possible ordering of the past is used to make a prediction, and all of it is concatenated. Similarly, ERNIE is brute force on the pre-training: I will pre-train on many, many more tasks than anyone ever did before.

So what will the trend be? My guess is that some researchers will come up with insane embeddings: take the prefix, the suffix, the part of speech, the named entity, who invented the word, when it was first written on Twitter; embed every possible facet and come up with a much better embedding. I am sure there will be research on that, and it will be fun to read and fun to test. Similarly, there will be crazy regressions: why only words within a sentence? I will figure out a way to embed the sentence before and the sentence after, and come up with all kinds of permutations and combinations at the sentence level to predict. I am sure there will be papers on that. Then maddening attentions: why six heads? Take Wikipedia plus all of YouTube's transcriptions put together, run 15 attention heads for 25 layers, whatever. And pre-training: come up with very smart ways to pre-train further, use GANs to create more data and pre-train on that, I am sure. And once in a while someone will come along and combine all of these into one model.

The whole idea here: I was quite confused about how to understand why these things work, and this framework really helped me put the models in place. Machines, like humans, only need those four things: they have to understand the words better, they have to build the ability to predict the trends, they have to create some rules, and they have to be pre-trained. Once you understand each model along these four dimensions, you can figure out what you really need for your application, and, more importantly, figure out when to stop. Given the computing power and the huge data available nowadays, it is difficult for a data scientist to stop and move on to the next problem; that extra 0.1 percent is so tempting. But maybe you are overdoing it on one front, and you can just stop and move on to simpler things. Anyway, that is what I wanted to talk about.
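Before the questions, here is the multi-head idea from the BERT discussion as a toy sketch: several attention "rule generators" act independently on the same vectors and their outputs are concatenated. Dimensions and head count are illustration values; a real transformer layer adds learned output projections, residuals and layer normalization:

```python
# Multi-head self-attention in miniature: each head attends in its own
# subspace, and the heads' outputs are concatenated back together.
import numpy as np

rng = np.random.default_rng(0)
N, D, HEADS = 5, 24, 6        # 5 words, 24-dim vectors, 6 heads
X = rng.normal(size=(N, D))
d = D // HEADS                # each head works in a smaller subspace

def head(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)        # every word attends to every word
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)    # row-wise softmax
    return A @ V                         # each word becomes a mix of all words

outs = [head(X, rng.normal(size=(D, d)),
                rng.normal(size=(D, d)),
                rng.normal(size=(D, d))) for _ in range(HEADS)]
Y = np.concatenate(outs, axis=1)         # back to (N, D); stack 12 such layers
print(Y.shape)                           # (5, 24)
```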
Q: Thanks, it was a nice talk. I have one query. You just mentioned EEAP plus pre-training. In a business application you do not know what new terminology is going to come up in your communication with the customer; there is always a new word. For training, we can take an existing corpus, say from a treebank or something already available in NLTK, but that is not going to suffice, because every day, in every transaction, every communication, your customer is using new terminology. How can this model work with new terminology? How do you train for that?

A: As I mentioned under word understanding, even this EEAP stack uses Bloom embeddings, not word2vec, and those can handle out-of-vocabulary words; that is your question, new terminology. They do not look at a word only as a word: they look at the prefixes, and the prefixes are part of the vocabulary. It is very difficult to create a new word with a completely new prefix, a completely new suffix, a completely new form, and one that does not fit any of the word-shape categories either. So people have figured out smart ways to come up with a meaningful embedding for new words; it is just that word2vec, the oldest version, falls over, while the later ones can handle it, and BERT also has a way to handle it. You could even take BERT's word embeddings alone and put them into your EEAP stack, if you think that is a simple enough model.

Q: I was asking about the situation where the same word plays multiple POS roles, like noun versus determiner.

A: Because you have an LSTM learning the trend and the context, this handles that too: if a word is used in a different context, the model learns a context-driven embedding, an encoding. So this works even in that situation. Thank you. By the way, I saw a few people taking photographs; they have collected my slides and I think they will upload the deck, so you do not really have to. Any other questions? The one thing they cannot upload is the Q&A, so just ask me the questions now.

Q: I wanted to ask, how would you approach a document classification problem where documents have a lot of words, maybe on the scale of 10,000, so many sentences?

A: You know what, Jeremy Howard of fast.ai actually did this: he first pre-trained, using only the document text as a language model, and then took a small fraction of it and classified; he got state-of-the-art results.

Q: I was asking about something like classifying multiple novels into their genres. A novel can have 10,000 or 20,000 sentences, and I do not think current neural networks can handle that, right?

A: Just check out hierarchical neural networks. What they do: they start with each word in a sentence and embed that sentence; now they take these 10,000 or however many sentences, embed each of those sentences, and collapse them into one final vector, which they use for classification. The attention is about consolidating all these rules, collapsing the entire thing into one vector that respects all the required rules in one go. So check out hierarchical neural networks; they actually solve this problem.
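Here is a minimal sketch of that hierarchical idea: words collapse into sentence vectors, and sentence vectors collapse into one document vector, both via the same attention trick. Everything here is a random stand-in for a trained model:

```python
# Hierarchical pooling: word-level attention per sentence, then
# sentence-level attention over the whole document.
import numpy as np

rng = np.random.default_rng(0)

def attend(H, w):
    a = np.exp(H @ w)
    a /= a.sum()
    return a @ H                  # collapse a sequence into one vector

D = 16
embed = lambda tok: rng.normal(size=D)   # stand-in: a real model looks tokens up
w_word, w_sent = rng.normal(size=D), rng.normal(size=D)

doc = [["the", "butler", "did", "it"],
       ["the", "detective", "smiled"]]   # a 10,000-sentence novel, in spirit

sent_vecs = np.stack([attend(np.stack([embed(t) for t in s]), w_word)
                      for s in doc])
doc_vec = attend(sent_vecs, w_sent)      # one vector for the whole novel
print(doc_vec.shape)                     # (D,): feed this to a classifier
```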
Otherwise, what people normally do is take the first paragraph or the first couple of paragraphs and just solve with it. But if you think about it, a human also does not have to read the entire book to classify it; otherwise the librarian would be the wisest person around, having to read the entire library before classifying any book. You do not do that: you take the first page, the title, a few things, and make the right guess. So that is also not a bad approach; but if you really want to use the entire text, hierarchical is a good way. Yes, one last question we can take... okay. Thank you very much.