Good morning everyone. We had a nice discussion on math and Kalman filters, so we are now moving to language and natural language processing. To make the transition easier, let's have a short interactive quiz first. In five years, how many people will interact with an NLP application daily? Take a guess. Come on, give me a number in millions. That's pretty close. It's 7.5 billion, because almost everyone will be interacting with an NLP application daily in five years.

Here's the next one. What do you estimate the size of the NLP market will be in the next five years? Again in billions, but pick your number. Come on. Any other guesses? 3.14? Okay, you're a bit behind pi, I think. We want a bigger guess. Come on, we're all NLP people. Great, very close. It's around 16 billion.

Okay, last and final question. What is the percentage of AI and NLP projects that fail in moving from idea to production? Everyone has to get this one right, folks. Okay, it's 87 percent, so you're quite close.

You might be wondering what the reason for this quiz was. The reason is this: natural language processing and AI have enormous potential, there's no doubt about that, but there are a lot of challenges. You saw it, right? 87 percent of projects don't make it from idea to production. I personally cannot think of a greater or more exciting time to be working on NLP or AI than right now. We all know that. It's like the information age and a new industrial revolution, but there are a lot of challenges involved too. One of those challenges is: how do we make NLP projects move from idea to production? How do we take them from research to the real world? This is the topic we are going to be discussing over the next 30 minutes, so stay with me on this. And one small request: I recently sprained my knee, so I'll be sitting down in between while giving this talk. Please bear with me.

And this is what I want us to remember: we have a responsibility to deliver robust and responsible AI and NLP. It is our mandate.

To give a quick intro, I'm Sandhya. I have been in this industry for around 19 to 20 years, and I worked previously at HP, Microsoft, IBM, and Xerox. Recently I have been an independent researcher and consultant.

So, the agenda. As I said, the main topic is understanding how robust the current state-of-the-art NLP models are. We will look at a few examples and use cases, and then we will talk about how we can make them more robust. And just to get a sense of our audience: how many of you have worked in NLP or have an interest in it? Oh, great, we have a very big chunk. So it's going to be quite exciting.

Okay, I don't have to say this: NLP applications are everywhere, from web search to stock market trading to detecting cancer. Last month, I think we had the news that BERT was powering Google search results, right? But is that sufficient?

Here's another question. I have four headlines here; these are news headlines from the recent past. I want you folks to guess what the commonality is, other than the fact that they are all related to NLP. Come on, guesses please. The first one is about an automatic essay grading tool. Yes, we have a guess here? Please. Okay, can you elaborate a little more? Good. Okay, let's give somebody else a chance as well. Great, exactly. I think both of you got the answer: all of these are NLP models that failed in production.
So the first one was an automatic essay grading tool deployed in the US. The second one was from Amazon: an AI-based recruiting tool that discriminated against resumes from female applicants versus male ones. The third one is a bit older: Microsoft had this Twitter chatbot called Tay, and it started spouting offensive language; it went and said Hitler was great. And the fourth one, I think somebody here got right: whenever the actress Anne Hathaway was in the news or one of her movies premiered, Berkshire Hathaway stock used to go up, because the named-entity disambiguation was incorrect. All of these are NLP models that failed in production.

But remember, these are not toy models. These were built by engineering teams with heavy R&D; we are talking about the likes of Microsoft and Amazon here. So the challenge is not trivial. We know how to build NLP models, but we still don't know how to make sure they are robust, how to evaluate them, and how to take them to production.

So let's take a step back and see how we really build an NLP model today. The current recipe seems to be something like this. Take a representative dataset and split it into train, test, and validation. Use the latest BERT or RoBERTa (for those of you who might not be familiar with NLP, these are very popular pre-trained models; I'm not talking about Sesame Street characters here), or build your own fancy deep-net architecture, make your choice. Then you show that the model can achieve, say, 90 percent on your test dataset, and so you happily declare: my model is ready to go into production. But is that sufficient?

This is just a picture of what I'm describing. We build the model, test it on a test dataset, show that it achieves good performance, and deploy it. But what happens is that the model fails in production, so we go through this iterative cycle: deploy, fail, fix, redeploy. There was a recent Gartner survey which said that two-thirds of NLP models fail after they have gone live. This is something we definitely want to avoid.

So the question we should be asking is: why do these NLP models fail? After all, we got 90 percent and above on our test dataset. The answer, I think all of us know: real-world data is a lot more diverse and a lot messier than the nice private datasets we use in our labs and organizations to train our models.

Take a very simple example, sentiment analysis, where we try to predict the sentiment associated with a sentence. If you go and look up the papers, they'll say we can achieve 95 percent accuracy on this. Try real-world utterances and most of them fail. For instance, I just want to say "I like this movie," and there are so many different ways of saying it; I've listed some of them on the left-hand side there. Take the last example, "Arnie killed it." If I feed that in, I don't think any of the models, including Google Cloud NLP or even Stanford CoreNLP, are going to get it right. The reason is the inherent complexity of language: a single word can mean different things, and there are a hundred different ways to say the same thing.
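To make that concrete, here is a minimal sketch of this kind of paraphrase probe. I'm assuming the Hugging Face transformers library and its default sentiment checkpoint; any off-the-shelf sentiment classifier would do in its place.

```python
# Minimal paraphrase probe: feed several ways of saying "I like this
# movie" to an off-the-shelf sentiment model and compare predictions.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # default pre-trained checkpoint

paraphrases = [
    "I like this movie.",
    "This movie is right up my alley.",
    "Arnie killed it.",  # slang that a human reads as strongly positive
]

for text in paraphrases:
    result = classifier(text)[0]
    print(f"{text!r:40} -> {result['label']} ({result['score']:.2f})")
```

If the predictions disagree across paraphrases of the same opinion, that is your first robustness signal.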
So we do know that real-world data is a lot harder and a lot messier. The first step, then, is to figure out how we evaluate our models. We said we need an objective way of evaluating them, so how do we really do that? As in other fields of computer science, one way is to use benchmarks. In NLP there are two well-known ones. The first is called GLUE. Let me give you a quick history of it. GLUE was introduced in 2018, and when it came out, even the deep-learning-based neural network models were only getting around 68 to 69 percent. Then last year BERT came and got around 80 percent. And this year a variant of BERT beat human performance on the benchmark. So the NLP research community went back and created SuperGLUE. But again, another BERT variant beat that one too, quite recently.

So the question is: these models do very well on the benchmark, but why don't they do as well on real-world data? This is an inherent problem with today's learning paradigm, whether machine learning, deep learning, or NLP. The models are based on associative learning: correlation, not causation. We don't tell the model how to solve a task. We give it a dataset, and the model learns the patterns present in that dataset. If the patterns are representative of the actual task, the model picks up the right patterns and learns well. If the patterns are not representative, the model obviously picks up the wrong patterns. It's very similar to the maxim you have all heard: if you throw in good input, you get a good prediction; if you give it garbage, you are going to get garbage.

So we know NLP models work great on these benchmarks. The question is: how much of that translates to real-world performance? That is what we want to examine with a few examples. Let's pick something very simple, sentiment analysis, and this is one of my favorite examples. The sentence is: "A great white shark bit me on the beach." Positive or negative? Negative, right? Every human being gets it on the first shot. I was bitten by a shark; why would I consider that positive? Yet it is predicted as highly positive, and not just by some toy model: by Google Cloud NLP, Microsoft's Cognitive Services Text Analytics, and Stanford's deep-learning-based sentiment model.

Why do the models get misled? Because of the surface pattern "great white." In the training data they have seen, these words were always associated with a positive sentiment, so the model has nothing better to learn: the data we gave it always had a positive sentiment associated with these words. Because of these surface patterns, these surface cues, the models get misled.

Maybe sentiment analysis was just one example, so let's take another: machine reading comprehension. I think everybody understands reading comprehension. You are given a passage and a question; the answer is in the passage, and the model just has to extract it. And NLP models work brilliantly on this task. In fact, last year there were headlines all over the world: Alibaba's model beats humans on machine reading comprehension, machines can beat humans at reading, and so on. So here we have one question, which is straightforward, and the answer is picked up from the passage itself.
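As a quick sketch of what such a model looks like in code, here is an extractive-QA probe. I'm assuming the transformers library with its default SQuAD-style checkpoint, and the passage is a toy one I'm writing for illustration.

```python
# Extractive QA: the model selects an answer span from the passage.
from transformers import pipeline

qa = pipeline("question-answering")  # default SQuAD-trained checkpoint

context = (
    "The Huguenot colonists arrived on the coast in 1564 "
    "and built a small fort to protect their settlement."
)
question = "When did the Huguenot colonists arrive on the coast?"

result = qa(question=question, context=context)
print(result["answer"], result["score"])

# To reproduce the failure mode discussed next, append a distractor
# sentence that shares words with the question but is irrelevant,
# and watch whether the predicted span moves.
```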
Now let's try to evaluate whether the model really understood the question. We want to test the model, to confuse it if you like. The way we do it is to add one more sentence that is irrelevant to the question and should not change the answer, and then make sure the machine still answers the right question. So we added one more sentence that is not directly related to the question: the passage talks about the Huguenot colonists, and the added sentence talks about Acadian colonists. Now the machine gives a different answer. Why? The problem is that the model was learning a shortcut, a heuristic. What was the shortcut? It looks at the question and finds the sentence in the passage that is most similar to the question. The newly added sentence was now the most similar to the question, so it goes and picks up the answer from that sentence. This is another example of the model picking up a superficial pattern present in the data and making a mistaken decision.

Okay, let's look at something more complicated but very important for understanding meaning. This is called natural language inference. Basically, the idea is about reasoning. You are given two sentences: the first is called a premise and the second a hypothesis. The task is to find out whether you can infer the hypothesis from the premise. Let me give you an example. The premise is "a man is standing in front of the statue on a beach," and the hypothesis is "a man is sleeping on the beach." Straightforwardly a contradiction, because he is either standing or sleeping. Somebody could say you can sleep while standing, but let's keep it simple. The second one: "a man is standing on the roof" and "there is a man on the roof." It's an entailment, because the hypothesis follows directly from the premise. And when you cannot say anything either way, for example "a man is standing on the roof" and "he has a hammer in his hand," the label is neutral. The reason for using this task is that it tries to figure out whether machines can do some amount of human-style reasoning.

So let me go back to the slide. You can see the performance is around 80 percent for different state-of-the-art NLP models. Let's again test how robust these models are. Take this example. The premise is "the judge was paid by the actor" and the hypothesis is "the actor was paid by the judge." What is the right answer? Is the hypothesis implied by the premise? No, right? But look at what the model says. The model says it entails, which means yes, it follows. We can look at the other examples as well. Take something very simple, the fourth one. The premise is "it was still night," and the hypothesis is "the sun has not risen yet and the moon was still shining." This is actually an entailment, whereas the model predicts a contradiction. The reason, again, goes back to the kind of data the models were trained on. Whenever the model sees the word "not," it thinks contradiction, because that is the kind of example it has been trained with. And similarly for the first one: when there is very high word overlap between the premise and the hypothesis, the model thinks it is always an entailment, because that is the kind of example it has been trained on.
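You can probe this yourself. Here is a minimal sketch, assuming the public roberta-large-mnli checkpoint and a transformers version whose text-classification pipeline accepts premise/hypothesis pairs as text/text_pair dicts.

```python
# NLI probe: does the model fall for word overlap and the word "not"?
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

pairs = [
    ("The judge was paid by the actor.", "The actor was paid by the judge."),
    ("It was still night.", "The sun has not risen yet."),
]

inputs = [{"text": p, "text_pair": h} for p, h in pairs]
for (premise, hypothesis), pred in zip(pairs, nli(inputs)):
    # Labels for this checkpoint: CONTRADICTION / NEUTRAL / ENTAILMENT.
    print(f"{premise!r} => {hypothesis!r}: {pred['label']} ({pred['score']:.2f})")
```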
So that is the challenge here, okay? We could talk about one more example, argument reasoning comprehension, where the same problem occurs, but it is a much more complicated task, so I won't go into it.

So the summary is this. State-of-the-art NLP models do great on training data, test data, and benchmarks, but much of that performance does not translate to the real world. That is because they are picking up surface cues, superficial patterns present in the dataset that are not representative of the task. This is what we call the Clever Hans effect. Does anyone know the story of Clever Hans? Please. Great, exactly, he got it right. Let me repeat it for the audience. Clever Hans was a horse, in the early 20th century, who was supposed to answer arithmetic questions correctly by tapping the right number of times. If you asked him five plus two, he would tap seven times, and he became a very famous horse; everybody called him Clever Hans. But later they found that he was picking up non-verbal cues from his trainer, who was standing next to him. The trainer would react, tense up, as Hans reached the right answer. It was not intentional, but the horse was picking up this cue from the trainer and stopping exactly at the correct count. This is exactly what is happening today with many NLP models. They are picking up the superficial or surface cues present in the dataset and using those to gain their performance, and that is why the performance does not translate well into real-world deployment. This is exactly what we need to avoid. So next time you get a model, run it on your dataset, and see 85 percent or 89 percent: question it first. Be skeptical.

Okay, so far we have talked about the issues that keep our models from being robust in the real world. Now let's move ahead and see how we can improve robustness. Before we can make a model robust, we need to understand why it fails; we need the root cause. One way of understanding what your model is doing is interpretability. Interpretability is a big area in NLP, machine learning, and deep learning, and I think there are other sessions on it, but for the purposes of this talk, the idea is: can we use interpretability tools to see why my model made a decision? When a model makes a prediction, why did it make that prediction? I want to know that. There are a number of interpretability tools available. One is called LIME, a very popular library, very easy to use. The second is Interpret from AllenNLP, a very recent tool that came out around two months back; it puts together different techniques for interpreting models, and you can use it directly.
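To give you the flavor, here is a minimal LIME sketch. The tiny classifier is a stand-in I'm training on four toy sentences just so the example is self-contained; in practice you would pass your real model's prediction function.

```python
# LIME: perturb the input text, query the model, and fit a local
# linear explanation over the words.
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in model that bakes in the "great" -> positive cue.
texts = ["great movie", "great white acting", "terrible plot", "awful film"]
labels = [1, 1, 0, 0]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

explainer = LimeTextExplainer(class_names=["negative", "positive"])
explanation = explainer.explain_instance(
    "A great white shark bit me on the beach.",
    clf.predict_proba,  # LIME perturbs the sentence and queries this
    num_features=6,
)
print(explanation.as_list())  # [(word, weight), ...]
```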
Now let me give you a peek into AllenNLP Interpret. Remember the sentence that was mispredicted by many of the models: "a great white shark bit me on the beach." You can feed that to AllenNLP Interpret, and it gives you an explanation of why the prediction was made. It shows that the most important word for the prediction was the word "great"; you can see it colored in red.

The other one I want to show quickly is the textual entailment pair: "the judge was paid by the actor" and "the actor was paid by the judge," with the object and the subject simply swapped. Again, you can feed this to AllenNLP, and as we said, the model mispredicted it very confidently: with 81 percent confidence it said the two are equivalent. But if you look at the interpretation, you see that it treats the words shared between the premise and the hypothesis as highly important. It is the degree of overlap: if the two sentences are very similar, the model thinks the hypothesis can be inferred from the premise. The point of using these kinds of tools is to run them on your own examples and check whether the model is actually picking up surface cues in your dataset or actually finding the right relations.

Another way of understanding a model's decision is attention. Are people familiar with the attention mechanism in NLP? Okay. Attention is a very simple way of asking: when the classifier made its decision, which words or units did the model give the most importance to? There are much more complicated things behind attention, but that is the straightforward view. AllenNLP allows you to visualize the attention, and again you can see that the common words get higher attention, and that is the reason for the prediction.

The long and short of what I have been saying so far is that the data is very important, and you need to understand what is in your data, because that is what your model is going to learn. And it's not just me. Andrej Karpathy, who leads AI at Tesla, said it very nicely: a neural net is nothing but a compressed version of your data. Essentially, your classifier has learned its world knowledge only from the data you gave it. If you give it the right data, with patterns representative of the real world, yes, it will learn well. But if you give it superficial patterns, it will not be able to learn well.
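If you want to poke at this programmatically rather than through the demo UI, here is a hedged sketch of AllenNLP's saliency interpreters. The model path is a placeholder, not a real URL; substitute an actual archived model of your own.

```python
# Gradient-based saliency with AllenNLP Interpret (sketch).
from allennlp.predictors.predictor import Predictor
from allennlp.interpret.saliency_interpreters import SimpleGradient

# Placeholder path: point this at a real trained-model archive.
predictor = Predictor.from_path("path/to/sentiment-model.tar.gz")
interpreter = SimpleGradient(predictor)

saliency = interpreter.saliency_interpret_from_json(
    {"sentence": "A great white shark bit me on the beach."}
)
# Per-token gradient scores; the claim above is that 'great' dominates.
print(saliency)
```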
Okay, there are a few other techniques you can use to identify misleading cues in your dataset. One is pointwise mutual information (PMI). Again, the idea is that certain words or phrases are correlated with a specific label, and you can use this formula to figure out whether those words are representative of the task or whether they are just spurious patterns. It is a very simple check; there is a small sketch of it at the end of this section.

The other thing: we talked about words like "great" and "white" being superficial cues, but it is not just words. Sometimes even names carry biases; somebody mentioned this earlier. For instance, if I have a Twitter feed where Donald Trump is always mentioned in a negative context, and I use that to train my model, my model will think the name "Donald Trump" itself carries a negative connotation, which is not correct. Take a quick example: "I hate Taylor Swift" and "I hate Katy Perry." The two should have similar scores. But as you can see in the table, there is a difference in the negative scores, and that is because the data the model was trained on probably has higher negativity associated with Rihanna or Taylor Swift than with Katy Perry. So this is important, and it has to be looked for in your dataset and removed. One way of doing this is perturbation sensitivity analysis, where we perturb the named entities, swapping in other equivalent names, and check whether the classifier score changes. The algorithm is very straightforward; again, see the sketch below. And once you find that kind of unintended bias, you will need to retrain your model with the perturbed sentences.

The next thing is dataset ablation. People are familiar with model ablation analysis? Okay, the idea is very simple. A model gets its performance from multiple factors: your hyperparameters, your architecture, your dropout. You want to find out what exactly is contributing to the performance, and that is model ablation. In the same way, we need to do dataset ablation: test your model with partial inputs, test it with random labels. There is a sketch of that below as well.

Another way of improving your data is counterexamples. As I said, the model does not really know which phrase actually makes the difference in a prediction. So if you have a sentence like "every film student should see these things so they will know the very definition of a perfect movie," you keep the same sentence but change it minimally so that the label gets switched. Once you train your model with counterexamples, the model learns which phrases actually make the difference.

Another way to improve the model, which is very well known, is data augmentation. I won't go into details, but there are a number of libraries readily available, and you can even use back translation to create augmented data; there is a sketch of that below too.

So far we talked about figuring out whether the models are robust and about improving the data with counterexamples and augmentation. The final thing is: how do I prevent my model from learning these superficial patterns in the first place? One way of doing this is to train an ensemble. In simpler terms, you train two models. One model deliberately learns the biases, the superficial patterns, and then you train a second, main model on the examples that cannot be predicted by that bias-only model. At prediction time, you use only the stronger main model. This is very recent work, and there is a lot more to be done in this area. Sketches of several of these techniques (PMI, perturbation sensitivity, dataset ablation, counterexamples, back translation, and the ensemble trick) follow.
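First, the PMI check mentioned above. PMI(word, label) = log p(word, label) / (p(word) p(label)); a word with high PMI for a label that has nothing to do with the task is a suspect cue. A minimal sketch over whitespace-tokenized documents:

```python
# Pointwise mutual information between words and labels.
import math
from collections import Counter

def pmi(texts, labels):
    """PMI(w, l) = log( p(w, l) / (p(w) * p(l)) ), over documents."""
    n = len(texts)
    word_df, label_df, joint_df = Counter(), Counter(), Counter()
    for text, label in zip(texts, labels):
        label_df[label] += 1
        for word in set(text.lower().split()):
            word_df[word] += 1
            joint_df[(word, label)] += 1
    return {
        (w, l): math.log((joint_df[(w, l)] / n) / ((word_df[w] / n) * (label_df[l] / n)))
        for (w, l) in joint_df
    }

texts = ["great white shark attack", "a great movie", "great acting", "a dull movie"]
labels = ["neg", "pos", "pos", "neg"]
for (word, label), score in sorted(pmi(texts, labels).items(), key=lambda kv: -kv[1])[:5]:
    print(word, label, round(score, 2))
```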
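Next, perturbation sensitivity analysis. A hedged sketch: the `score` argument here is a stand-in for whatever function returns your classifier's sentiment score for a sentence.

```python
# Perturbation sensitivity: swap equivalent names into the same
# template; a robust model should be nearly indifferent to the name.
import statistics

NAMES = ["Taylor Swift", "Katy Perry", "Rihanna", "Adele"]

def perturbation_sensitivity(template, score):
    scores = [score(template.format(name=name)) for name in NAMES]
    return max(scores) - min(scores), statistics.pstdev(scores)

# Usage (with your own model's scoring function):
# spread, stdev = perturbation_sensitivity("I hate {name}.", my_score_fn)
# A large spread means the names themselves are moving the prediction.
```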
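For dataset ablation, a minimal sketch in the NLI setting: evaluate with the premise removed. `train_and_eval` is a stand-in for your own training loop.

```python
# Dataset ablation: if a hypothesis-only model stays well above
# chance, the labels leak through surface cues in the hypotheses.
def hypothesis_only_ablation(pairs, labels, train_and_eval):
    ablated = [("", hypothesis) for _premise, hypothesis in pairs]
    full_acc = train_and_eval(pairs, labels)
    ablated_acc = train_and_eval(ablated, labels)
    print(f"full input: {full_acc:.2f}  hypothesis-only: {ablated_acc:.2f}")

# The random-label variant is the same idea: shuffle `labels` and see
# how much apparent performance survives.
```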
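Counterexamples are just data, not code: minimal edits that flip the label, so the model is forced to locate the phrase that actually carries the sentiment. For the movie-review example above:

```python
# Counterfactually augmented pair: identical except for the one
# phrase that switches the label.
counterfactual_pairs = [
    ("Every film student should see this: the very definition of a perfect movie.",
     "pos"),
    ("Every film student should see this: the very definition of a terrible movie.",
     "neg"),
]
```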
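For augmentation via back translation, a hedged sketch assuming the transformers library and the public Helsinki-NLP MarianMT checkpoints:

```python
# Back translation: English -> pivot language -> English gives a
# label-preserving paraphrase you can add to the training set.
from transformers import pipeline

en_to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
de_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def back_translate(text):
    german = en_to_de(text)[0]["translation_text"]
    return de_to_en(german)[0]["translation_text"]

print(back_translate("I just want to say I like this movie."))
# The original label carries over to the paraphrase.
```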
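And finally the ensemble idea. One concrete version from the recent literature is a product of experts (e.g., Clark et al., 2019): combine the main model with a frozen bias-only model at training time, so the main model is pushed to explain what the bias cannot, then drop the bias model at prediction time. A hedged PyTorch sketch:

```python
import torch.nn.functional as F

def product_of_experts_loss(main_logits, bias_logits, labels):
    # Combine in log space; detach() keeps gradients out of the
    # bias-only model, so only the main model is trained.
    combined = F.log_softmax(main_logits, dim=-1) \
             + F.log_softmax(bias_logits.detach(), dim=-1)
    return F.nll_loss(F.log_softmax(combined, dim=-1), labels)

# Training step (sketch): the bias model sees only the cue, e.g. the
# hypothesis in NLI; the main model sees the full input.
# loss = product_of_experts_loss(main_model(batch.inputs),
#                                bias_model(batch.hypothesis_only),
#                                batch.labels)
# At deployment, predict with main_model alone.
```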
Okay, the key takeaways I want you to carry away from our discussion today are these. Remember: if you build a model and achieve performance greater than, say, 85 or 90 percent on your test dataset, that is not sufficient to say your model is production-ready. Be skeptical; that is the first thing. Understand your model using interpretability tools. Make sure you stress test your model with real-world data: the more diverse, the more different the data you can throw at your model, and the better you understand it with interpretability tools, the greater control over it and conviction in it you will have. And definitely continue monitoring after deployment. The final thing is this: all of us ML, AI, and NLP practitioners have a responsibility to deliver AI that not only meets the performance bar but also does not have any adverse impact in the real world. This is our mandate, and I want each of us to rise up to it. Thank you. Questions? Please.

One thing first: great talk, by the way. Everywhere you hear people talking about the abilities of AI and NLP, and here you are talking about the limitations, and I think it's very important to push back. One question: you have talked from the perspective of someone who is building these models. What I want to ask is from the perspective of someone at the end point, using these models. For example, maybe in a few years we will have IBM doing with NLP what a doctor or a lawyer does. Do you think we should trust it?

Okay, very big question. I think we should talk offline, but the short answer is this, and it is what I was trying to say: it is not enough that we build the models; we should own the responsibility for how they are deployed. The important point is that in many cases AI, ML, or NLP models can augment human intelligence; they are not replacements per se. In today's paradigm we are still doing associative learning; we have not reached artificial general intelligence. Until that happens, this has to be human plus machine, augmented intelligence. I prefer the term augmented intelligence to artificial intelligence. And you made a very important point: all of us, both the people who build these systems and the people who use them, need to take responsibility not just for building the systems but for driving policy around them. Talk to the people who represent the community on scientific boards, tell them your concerns, make sure we build AI and NLP that is democratic. I'm not just using surface words here, because in ten years this is going to impact all of us: facial recognition systems, banking, your doctor could be an AI. It's real for us. So we have to own this. We have to drive this. Thank you.

Can you expand on data augmentation? I have this sense that a lot of the time the data we parse is very structured text data. Now, as more and more data generation happens in audio-visual form, do you think there is room for extra-linguistic signals, where you can tag the tone of voice, voice modulation, flicker of the eyes, and so on? And maybe we can introduce some new kind of punctuation for sarcasm in written data too, so that accuracy improves.

No, totally a valid point, and I think it is being adopted in many places. The first thing you touched on is multimodal data. For any problem where multimodal data is useful, we should go ahead and use it. For instance, emotion recognition: for those use cases multimodal data is critical, so we should definitely use it. And the second thing is that data augmentation becomes critical because many of us end up with use cases where the dataset is pretty small. There are two factors to this.
I think Bikram is going to talk later about labeling, annotation, and data creation. Those are costly; we can find efficient and effective ways of doing them, but it is a costly effort. So if you can generate synthetic data where the labels carry over from your existing dataset, there is nothing like it to improve your model. It is cheap and cost-effective. That is the first thing. The second thing is that natural language generation has recently come of age; the research progress is good enough that you can use many of these techniques now to generate synthetic data. As you rightly said: I have a sentence, can I make it sarcastic? Yes, you can. Those are the ideas we need to pursue. Today we are using very rudimentary forms of data augmentation, and there is much more to be done there. We have to do that. And multimodal, right? If you have visual, if you have audio, sometimes other signals, even the metadata, can be used to solve some of these problems. So it's a valid question, and more work is needed on it.

I have time for one quick question. How many of our problems are related to the fact that we're using English? What if you had used, say, Chinese or another language?

Okay, good question. Much of the NLP world used to think English was the center of the world, but that has changed in recent years. Right now one of the biggest NLP markets is India, because the number of dialects in India is close to 7,000. But sadly, the tools have not caught up. If you want a good parser, a good segmenter, or a good lemmatizer for an Indian language, it is a lot tougher, and it is not just Indian languages; the same holds for other languages. Chinese is a little better because there has been a lot of work in that space. But for non-English languages, the tools and techniques are, I would not say in their infancy, but not mature. So there is a lot to be done there, but that is where the market is. If you look at the NLP market in India, it is much larger, and there is a lot of work being done to support Indic languages. Indic NLP is a very popular library; Anoop wrote it, and it is used by almost everyone. So there is a lot of work there, and since we know this area and this ecosystem from India, we should put in more effort to support Indic languages. Yes, we'll connect offline on that. And I'm out of time. Thank you. Please feel free to connect with me offline for any questions; I'll be around.