Okay, so good morning everyone, I'll just get started. My name is Shailesh and I give these talks almost every year, so this is a very déjà vu feeling for me. The only thing different this time is the stage is slightly thinner, but great crowd, great list of talks so far. So, Zainab called me a couple of weeks ago and said, why don't you give a keynote again? And I said, you know, I'm running out of things to say now. I've given four talks at different forums within Fifth Elephant, and I wasn't sure what I was going to talk about. Then one of these days I was talking to one of my non-geek friends, and he was curious about what I do, so he asked, what do you do? It was on the phone, and I started talking to him about this, that and the other. For about 45 minutes I was rambling and this guy was very quiet. I didn't realize that he wasn't a techie and I was going on and on. After 45 minutes I stopped and said, are you still there? Are you listening? And he said, yeah, I'm listening. Can you tell me what you do again? And then I realized: how do I summarize this in two words? So I told him, hey, I'm building thinking machines. And that's when he said, why didn't you say that before? It was so easy to say that, right? So that's how the title came about, and obviously we are not building thinking machines yet. What I'm going to talk about is working towards thinking machines. We have a long way to go, which is why I added the word "towards" later. What I'm going to talk about is all over the place: philosophy, science fiction, algorithms, deep learning, and how to think about things beyond deep learning. So let me give you a perspective and then we'll start. I'll take questions at the end. All right. So I ended last year's talk on this quotation.
So I thought I'll start with this quotation this time. I like this quotation because it puts a lot of things in perspective: what we are doing, how our civilization got here, and where we are heading. It says our technology, our machines, are part of our humanity. We created them to extend ourselves, right? And that is what is unique about human beings. If you look at dogs and cats and other animals, they don't create machines to extend themselves. They just have instincts and they follow those instincts. That's very unique about human civilization: we created the Taj Mahal, space flight, the internet. And we have come a very long way. Think about tools: the cavemen had tools, and now we have completely robotic assembly lines with no humans, where you could turn the lights off and nothing would change. The cars would still get produced. Look at our transportation: we have gone from bullock carts on the road to the massive amounts of transportation we can do now. Look at our ability to see further into space: since Galileo, we have made a lot of progress. Recently we saw the news of the Pluto flyby, so now we are able to send spacecraft deep into space. Look at the first computer we built and where we are today, with huge data centers. Really, if you look at the whole thing in perspective, we have made an enormous amount of progress over the centuries. And if you look at just the IT side, the intelligent machines (I'm not talking about mixies and other appliances), look at what AI and deep learning have produced. Today's machines can play chess, and there is no human on the planet who can play chess better than the machine. I want you to take a pause and think about where we are.
There is no human on the planet who can play chess better than a machine. There is no human on the planet who can play Jeopardy! better than a machine. And recently, Google came out with self-driving cars, so machines can drive cars, and the records show that under ideal conditions these cars are better than humans, with much lower accident rates. The accidents that did happen were caused by other human drivers, not by the cars. Recently, you also saw how machines are able to create pictures; this is one of the things we saw deep learning doing internally. Now think about all this. Think about where machines have gotten today, how many things they can do that were once way beyond our imagination. So obviously, they have done a lot. But can they do the following? We want to stretch the limits. One of the holy grails of AI is to have a machine hold a conversation with a human being. We all know the Turing test, and the repercussions of this will be huge. Think about how we talk to the Internet today: we carefully craft three-word, four-word queries. We allow the Internet to make mistakes. We craft the queries again. We take the suggestions or not. We talk to the Internet like we are talking to a three-year-old. In a day and age of massive data, compute, NLP and all this deep learning, imagine what a shameful thing it is to talk to a computer like a three-year-old. It has the capacity of thousands of people, but it can't understand language. We need to change that. Now, beyond keywords, what can happen? We can do question answering. But how do we do question answering today? We have created Yahoo! Answers. We have created Quora, where people type questions. We do a match between the questions and the answers, and then we, again, do retrieval.
We're still not answering questions. Now think about conversations. Conversation is an even more complex thing, and if it works out, what are the repercussions? I don't want to study physics from my physics teacher. I want to study it from Einstein or Feynman. We already have all the language, all the knowledge of these people. Can we not build a persona of Feynman or Einstein and have a conversation with that person? Just imagine the future if you are able to have conversations with machines. There's a long way to go between keyword search and conversations. Can we discover a cure for cancer? There are a lot of diseases out there. Obviously, pharma companies are doing a lot of research, and there are new initiatives in how to use AI and machine learning in pharma research. But my contention is that the cure for a lot of diseases is already out there in the medical literature. If somebody could actually read all of it, hold that knowledge in the brain, in RAM, and make the interconnections, we should be able to find a lot of things. But what is the problem? A single human expert, even in one field, cannot keep up with that much knowledge. We forget things. We don't read certain papers. So it's the inverse problem: we have too much knowledge, and our individual brains are not capable of forming those connections, because we can't even read that many documents. But if a machine could do it, the way NLP has progressed, can we not find cures for diseases? Can an AI crack the next IIT entrance exam? You're laughing today, but you never know what will happen five years from now. If Watson is a test of intelligence, if Deep Blue is a test of intelligence, could this not be a test of intelligence?
The ability of an AI system to actually solve an IIT paper and get rank one in it. What about searching all the video scenes in football videos that have a goal shot and nothing else? I don't want to watch the rest of it, a lot of balls going here and there. I just want to see the goal shots. Today, I cannot do that. Can my machines be intelligent enough, on the vision side, to find: this is a goal, this is a goal, the rest is something else? You can imagine the applications out there. We talk about sarcasm a lot, and we all understand sarcasm is a very hard thing to detect. Imagine if you could detect sarcasm, what all could you do? You're writing an email to your boss, you're angry, you've written a sarcastic comment in the heat of the moment, and Gmail says, hey, are you sure about this? Today Gmail can warn you about a missing attachment; can we detect sarcasm and things like that? And to me, the holy grail of AI is not really all these big things, but a very simple thing: get a machine to find a joke funny. I don't know if you watch Star Trek, but Data, an android 300 or 400 years from now, is capable of all these other things. He's a great supercomputer in human form, but he's still struggling with humor. That's how hard the problem is. So obviously, we have come a long way and we have a long way to go. This talk is really about the way forward. What do we imagine the future to be? We want something like this, good and bad, hopefully good. We all want a Jarvis, right? Someone to take care of the chores and so on. We all want a Jarvis.
So, if you watch these movies again after this talk, you'll have a very different perspective on what all we need to do to get there. It's not going to happen just because we make more and more Bollywood movies like this. I mean, Asimov wrote I, Robot back in 1950, and we're still not there. It's not going to happen just because we keep doing data science. And that's one of the reasons I wanted to give this talk: a lot of people think data science is the be-all and end-all, but there's a lot more beyond data science, and I want to see how we can go beyond it. This is not data science; this is artificial intelligence. I want to draw that distinction and say how we can move beyond data science. Nothing wrong with it, but it's a done deal. We have software you can download; you can code up whatever you want. Data science has been packaged already. Look at Microsoft Azure or some of the other platforms: it has already been packaged. All you have to do is download the right software, put your data in the right format, and you're done. So there's nothing great about data science anymore. Sorry about that, but we need to jolt ourselves out of this comfort zone of saying, okay, we are all data scientists, that's it. Will data science get us there? We'll get there by asking much deeper questions. Not questions like, why is this customer churning from Flipkart? Or what is the next product to recommend to somebody? Or which movie are you going to watch? These are not the questions that will take us to the next stage. The question that will take us to the next stage is: what is learning?
Fundamentally, philosophically, when we say that we are learning, children are learning, everybody's going to school, we think that machine learning is learning, but what is learning, really? What is understanding? What does that mean? What does the word "mean" mean? What is thinking? We keep saying, oh, I'm thinking about this. What are you doing when you're thinking? Today I'm going to show you an equation of thinking, so it'll be fun. I don't claim that this is the equation of thinking, but I'm trying to get to the point where we start thinking about thinking, and not just think. What is creativity? If you look at an artist or a musician or even a scientist, we create new inventions out of the knowledge we have; in a way, creativity is a manifestation of knowledge in a certain form. A poet creates, a musician creates. So what is creativity? And the last question I have here is: what is consciousness? Ultimately, if you look at movies like I, Robot, the title is not really about the robot's great abilities at mundane tasks; it's about the "I" in it. I am a conscious being, and now what are the consequences? So what is consciousness? And then we have sentient machines at the end of the day. We won't go there today; maybe if we have time, we'll watch a video. But I'll try to cover the bottom three and see if we can find something interesting. So, learning. Learning is one of the most basic things; we are all learning all the time, or at least we all claim to be. I'm going to use language, not vision so much, as my basis for all the examples. Learning really is many, many things. And the greatest example of a machine learning system, of an AI system, is a human child.
And all you have to do is observe how a baby grows up: how it picks up language, walking, swimming, tantrums. You learn so much about AI because you're looking at the real AI. So what is learning? I want to use that example and see how we pick up language. Imagine you're reading a novel, or imagine words are coming at you one at a time. You see the word United. What do you think the next word will be? United States, or United something. So we're predicting. When we are learning, we are also simultaneously predicting. And this is one of the flaws in current machine learning: we keep thinking that learning is separate and prediction is separate. We learn first, then we score. But a human brain is not like that. We don't learn for 60 years and then suddenly start behaving. We are constantly learning and constantly applying that learning. This is one of the fundamental reasons why I call the current model of machine learning a von Neumann-style architecture that is never going to become a data-flow architecture. So that is one of the problems. Now imagine what we are doing: we are predicting what will come next. If I say United Nations, you also do one more thing: you say these are not two words, this is one word, one phrase. And we saw how phrases cause a lot of problems. Then I say Security, and you say, I think the next word is going to be Council. So we're predicting, and the confidence in the prediction goes high or low. Even if the word Council is misspelled or misspoken, you'll still be able to fill the gap, and that is something an AI system should be able to do. And then what do you do? You say, United Nations Security Council: Security Council is a phrase, and the whole thing is a phrase.
Actually, I'm not looking at four words anymore; I'm looking at one word. This idea of tokenization, of syntactic composition or segmentation, is very common. Now if I add the word resolution, the whole thing changes again; it's a completely different thing. Now you're saying, oh, we are not looking at five words; we are looking at one unit. You see? So what is language? When I see five words, is it one phrase? Is it two phrases? This is not agglomerative clustering; this is phrasing. We do the same thing in images. We look at a scene and say: these are people, this is the stage, these are lights. We segment things. We like to reduce things to IDs, to tokens. All right. The next thing to do, and part of this was covered earlier, is semantics: how do we assign meanings to those IDs? We have identified the IDs; this object is different from that object. But what is this object? How do I assign meaning to it? So what we did was what everybody is doing these days: we took a bunch of Yelp reviews and we ran skip-grams, which we studied in the first talk, and we said, let's see what it learned. But before that, we did one more thing: we phrasified the whole corpus. Then we said, this is a review corpus, an opinion corpus, so the words similar to the word great should be positive. We looked at the words, and these are the top words that came up, on the publicly available Yelp data. Now, one of the problems with this kind of data is that there are a lot of errors, spelling mistakes. And to a computer, everything is a spelling mistake. Think about it: it doesn't have a dictionary of its own. Whatever you give it, it will learn as if it were real. To us, this was a spelling mistake.
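The pipeline just described (phrasify the corpus, then learn distributional vectors and query for similar words) can be sketched in miniature. This is a toy, count-based stand-in, not the actual skip-gram (word2vec) model run on the Yelp data; the corpus, window size, threshold, and bigram score are all made up for illustration (the score mimics the kind used by gensim's Phrases):

```python
from collections import Counter
from math import sqrt

# Toy corpus standing in for the Yelp reviews in the talk.
corpus = [
    "the food was great and the service was great".split(),
    "great food awesome service".split(),
    "awesome pizza terrible service".split(),
    "terrible food bad service".split(),
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

def phrasify(sent, threshold=0.1):
    """Merge frequent adjacent pairs into single tokens ("phrasifying")."""
    out, i = [], 0
    while i < len(sent):
        if i + 1 < len(sent):
            a, b = sent[i], sent[i + 1]
            score = bigrams[(a, b)] / (unigrams[a] * unigrams[b])
            if bigrams[(a, b)] > 1 and score > threshold:
                out.append(a + "_" + b)  # two words become one token
                i += 2
                continue
        out.append(sent[i])
        i += 1
    return out

# Crude distributional vectors: each word is represented by the bag
# of words seen in a +/-2 window around it (skip-gram-style context).
vectors = {}
for sent in corpus:
    for i, w in enumerate(sent):
        ctx = sent[max(0, i - 2):i] + sent[i + 1:i + 3]
        vectors.setdefault(w, Counter()).update(ctx)

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def most_similar(word):
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))
```

On real data you would train an actual skip-gram model; the point here is only the shape of the pipeline: merge frequent bigrams into single tokens first, then treat each token's context counts as its vector and compare vectors by cosine similarity.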
Then we said, what about the opposite? Can it learn the other sentiment? So we looked at the word worse and asked what words are similar to it. And we didn't even say this was a sentiment analysis task; it's completely unsupervised. We just said, hey, learn based on co-occurrences. A very interesting thing came up: "lack of better word". This is a phrase; we had tokenized the whole corpus into phrases. We are not using convolutional networks; we see these underscored tokens because the phrases were already merged. So this was easy. The next thing we did was look at the word ambience. This is Yelp, so restaurant reviews. Now, the word ambience is not an easy word. I'm sure there are a few people in this room who don't know what the word ambience means. And not only for that reason: to really understand the word ambience, you have to go to a restaurant and feel the ambience. That is the second big problem with AI as we do it: here's a bunch of text, go understand the meaning of the words. We never learn the meaning of words by just reading a bunch of text. We have to feel the ambience; we have to know what it is before we can attach it to the word ambience. So we said, this should be a hard thing for the machine to do. Let's look at what it learned, and this is what it came up with: atmosphere, the wrong spelling of ambience, relaxed atmosphere. Now, the words relaxed and atmosphere both need a physical experience of what it is to relax. Cozy atmosphere, very casual, dim light. Think about that. For a machine to say that dim light is ambience, or music played in the background, to really understand that, the machine would have to be at the restaurant, hear the music, feel the coziness. It should know what is trendy, and then it should be able to tell me all this.
But look at how far we can get anyway, because there's nothing great about the technique here. What is great is that a lot of people have put the right kinds of words around the right kinds of words. It's all the human experience of ambience in that language corpus which the machine has picked up; the machine does not really understand what ambience is. All right. Not only words; we can also do similarity between sentences. If I use this sentence, "garlic used on pizza was tasty" (and user-generated content has a lot of noise, so people somehow don't even type "was" completely), then these are the similar sentences, not based on word-to-word similarity but on what we call paragraph vectors. Paragraph vectors are what you use for sentence or paragraph similarity. It gave very interesting results. This is another review on Yelp, and these are the top results. If you used NLP and grammar parsing, you wouldn't even get half of the words right; that's the nature of the data. But to the machine, a misspelling is just another word, so it doesn't care; as long as you have enough occurrences of each word, you'll get it right. Okay. So machines now handle syntax and semantics to a stage that looks very promising in terms of catching phrases and catching meanings. Now we'll talk about thinking. What is thinking? Learning is one thing; understanding and thinking are other parts of a thinking machine. I call this thinking 1.0 so that next time I have a chance to come and talk about thinking 2.0; it's just a teaser. All right. What I'm going to do is use disambiguation as an example of what thinking is, and this is going to be a little bit mathy, but bear with me.
I'm going to produce an equation of thinking at the end of this section, and hopefully it will resonate with some of you. So let's look at this sentence, my favorite sentence whenever I talk about disambiguation: Apple filed a suit against Orange. If you don't know, Orange is a telecom company in France, the equivalent of Airtel in India. Why? Because if somebody could name a company Apple, why not Orange? Nothing wrong with that. But what does it do? It makes life harder for us. It adds more ambiguity to the language. Now, when you read that sentence, what is going on in your brain is called thinking. What are you doing right now? You're looking at the words suit and Orange and saying, an orange suit is a plausible thing; orange could be a color. But look at the words "suit against Orange": this is probably not the color. Orange must be a person or a company. So you're going through this exercise in your brain, and I want to create a mathematical way of looking at it. We're going to build a mathematical way to solve this kind of problem, what I call a joint disambiguation problem, which means that all the words have multiple senses. The word filed: you can file a paper, you can file your taxes, you can file a divorce. A suit: you can wear a suit, it can be a lawsuit, it can be a suit of cards. Against could mean many things. The more common the word, the more senses it has. And this is not necessarily about the senses in the dictionary, but about how the senses are used in the corpus. And Orange can now have three meanings. So the real problem is this: if only one word were ambiguous, it would be easy. Thinking is really about how you assign meaning to all the words simultaneously, because all the words are ambiguous.
And depending on what meaning you assign to one word, it will change the meaning you assign to the other words. That's a hard enough problem, and it's the one I'm going to use to explain what thinking is. So what are we doing when we assign the right senses to these words? I'm going to take three words, remove the rest, and focus on them. The goal is: if I assign one sense to the word filed, what should be the sense of the word suit, and if I assign some sense to the word orange, what should be the sense of the word suit? I'm going to make two assumptions in this discussion. One, I know all the senses already; I'm not discovering new senses while doing this, so the number of senses is not going to change. Two, I've read enough documents, I've done this enough, that I can store this in my knowledge base. So when I say suit-i, I really mean the i-th sense of the word suit. What I'm writing here is: what is the probability that if the word filed takes sense k, then the word suit takes sense i? So this is a matrix, four by three essentially. And imagine that I already have this knowledge from all the documents I have read. Now I'm just going to focus on these two words and ask how I use this kind of knowledge to actually think about the senses I should assign. Okay? So here is the first sub-equation of thinking. You'll recognize this equation very quickly. Let's go through it one step at a time.
This term is saying: if you had assigned the word filed the sense k, as in filing a lawsuit, then the word suit cannot mean what you wear; it should mean the lawsuit. That is the knowledge I have acquired over time. And remember, as we learn and read more and more, these knowledge bases get refined over time. Initially children will not disambiguate correctly, but slowly, as they read more, they'll be able to do this kind of thing. So that is this part. The next term is saying that in the previous iteration of thinking (imagine thinking as an iterative process) you used the prior. Initially you said Apple has to be the computer company and Orange has to be the fruit, because those are the priors. But slowly the priors change based on the context. So imagine that at iteration t, the word filed has sense k with this probability. These probabilities are what change over iterations. Using these two terms, and summing over the senses of the word filed, I can easily compute the probability of the word suit taking sense i in the next iteration, given the word filed. Now, can anybody recognize this equation? It's PageRank, right? Random walk is not just for Google search; you can use it in many, many different ways. The more I think about random walk, the more fascinating an algorithm it is. I don't think Larry Page realized everything he was doing when he created it, and of course the idea came from way before Larry; we had Markov chains and so on. But that's the idea: we can use a random walk to do this. And we can do the same thing on the other side.
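The update step just described is literally a matrix-vector product: the belief over the senses of suit at iteration t+1 is the knowledge matrix applied to the belief over the senses of filed at iteration t. Here is a minimal sketch; the sense-compatibility matrix and the current belief are made-up illustrative numbers, not values from the talk:

```python
# Hypothetical compatibility matrix learned from corpus co-occurrence.
# Rows = senses of "filed" (paperwork, taxes, lawsuit, smoothed with a tool),
# columns = senses of "suit" (clothing, lawsuit, suit of cards).
# Entry [k][i] = P(suit has sense i | filed has sense k).
P_suit_given_filed = [
    [0.60, 0.30, 0.10],  # filed = paperwork
    [0.50, 0.40, 0.10],  # filed = taxes
    [0.05, 0.90, 0.05],  # filed = lawsuit
    [0.80, 0.10, 0.10],  # filed = smoothed with a tool
]

# Current belief over the senses of "filed" at iteration t.
# It starts as the prior; here the lawsuit sense already dominates.
p_filed_t = [0.1, 0.1, 0.7, 0.1]

# One update step: marginalise over the senses of "filed".
# p_{t+1}(suit = i) = sum_k P(suit = i | filed = k) * p_t(filed = k)
p_suit_next = [
    sum(P_suit_given_filed[k][i] * p_filed_t[k] for k in range(4))
    for i in range(3)
]
```

With these numbers the lawsuit sense of suit wins, because the dominant sense of filed pushes probability mass onto it; that is the random-walk step the talk compares to PageRank.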
So now I'm focusing only on this part: what is the probability that the word suit takes sense one? And remember, suit-i is affected both by the word filed and by the word orange. All the other words are influencing all the words; that is how joint disambiguation is solved. It's like a team: everybody affects everybody else. So the word orange is also saying something. It's saying, no, no, orange and suit: I think orange is a color, so this must be the suit that you wear, for example. The word orange could be influencing it in a different direction than the word filed. And that's what thinking is. Thinking is really about resolving these conflicts and ambiguities, converging so that everybody is happy. That's what thinking is in this context. And again, we understand these numbers: this is the probability of the m-th sense of the word orange (color, company, or fruit) at the t-th iteration, and this is what it says about the probability of suit-i. So suit-i is influenced by the word filed, by the word orange, and by all the other words too. I'm just using these two words, but really it's affected by all the other words. So far so good; this is again PageRank. Now how do we put it all together? The word orange is saying: suit, you must be the clothing. The word filed is saying: suit, you must be the lawsuit. How do we resolve that conflict? How do we aggregate these two pieces of evidence? The goal is to compute the probability distribution over the senses of the word suit in the next iteration, and what influences it is the probability distributions over the senses of the word filed and the word orange, and all the other words, which we are not showing.
And those distributions are available to me from the previous iteration: a probability distribution over the four senses of filed and one over the three senses of orange. I want to combine these two to come up with a new distribution over the senses of suit. And if I do this recursively, it will, hopefully, converge, and then I'll have a process of thinking defined. Now the real question is: do you trust the word orange more, or do you trust the word filed more, to decide what the sense should be? How would you know? That's the next step: I need to aggregate, I need to find which word to trust. Say you want to know something about movies. You have two friends, one who cooks well and one who is a movie buff. Who do you trust more? Same thing here: you have two friends, filed and orange. Who do you trust more to tell you what the sense of suit should be? Pardon? The action word? Exactly. So what we do is look at the distributions themselves. If I'm very sure about the word filed right now, that means its distribution is very skewed towards one sense; I have almost converged on the meaning of the word filed. That is the basic idea we use: one minus the entropy of the distribution becomes the weight. I take the distribution from the previous iteration and compute one minus its entropy. So these are my context words, all the words in my context, and I say: all of you, tell me whom I should trust the most, based on your distribution's entropy.
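The trust weight described here fits in a few lines: normalise the Shannon entropy of a sense distribution to the range [0, 1] and subtract it from one, so a peaked (confident) distribution earns a weight near one and a near-uniform (undecided) one earns almost nothing. The example distributions are invented for illustration:

```python
from math import log

def entropy(p):
    """Shannon entropy in bits, normalised by log2(n) to lie in [0, 1]."""
    n = len(p)
    h = -sum(x * log(x, 2) for x in p if x > 0)
    return h / log(n, 2) if n > 1 else 0.0

def trust(p):
    # Peaked distribution -> weight near 1; flat distribution -> near 0.
    return 1.0 - entropy(p)

p_filed = [0.05, 0.05, 0.85, 0.05]   # nearly converged: high trust
p_orange = [0.34, 0.33, 0.33]        # still ambiguous: almost no trust
```

So in the Apple/Orange sentence, a context word that has already settled on a sense gets to speak loudly, and one that is still undecided barely votes.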
And based on that, I'm going to use your evidence, computed in the previous iteration, to compute this quantity. Now I'm able to aggregate the probability distribution of suit given the probabilities of all the context words. This is what I call the equation of thinking. You can think of thinking as a multipartite random walk: there are multiple small graphs, one over each word's senses, and each graph is influencing all the other graphs. When you put it together, this is what it looks like. It's not as pretty as Euler's equation or any of those beautiful ones. But what it says is: this is what I believe the sense of the word w-prime is, the sense s-prime at iteration t; this is what I know about what the sense of this word should be if that is the sense of that word; and this is the confidence I have at step t, which changes over time. And from all of that, I can compute the next iteration. So I used this joint disambiguation problem as a way of thinking about what thinking is. Now let's talk about understanding. I'm going to switch gears. We've had a lot of talk about word embeddings and paragraph embeddings and whatnot; let me give you a very different perspective. I'm going to subscribe to this very famous quote by Chomsky: the probability of a sentence is an entirely useless notion, under any known interpretation of the term. Now, if you look at n-gram models, what are they doing? What is the probability of the next word given the last few words? If you look at word embeddings, they are doing something very similar: what is the probability of seeing this word given these context words? So in a sense, they are both equally bad with respect to this philosophy that there is no such thing as the probability of a sentence. It's a weird notion to think about.
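Putting the pieces together, one full update of the belief over the senses of suit aggregates the evidence from every context word, each weighted by one minus the normalised entropy of its own current distribution. This is a sketch of that "equation of thinking" under the same assumptions as before: all compatibility tables and beliefs are made-up illustrative values, not data from the talk:

```python
from math import log

def trust(p):
    """Weight = 1 - normalised entropy of the sense distribution."""
    n = len(p)
    h = -sum(x * log(x, 2) for x in p if x > 0)
    return 1.0 - (h / log(n, 2))

# Hypothetical sense-compatibility tables: rows = context-word senses,
# columns = senses of "suit" (clothing, lawsuit, suit of cards).
knowledge = {
    "filed": [
        [0.60, 0.30, 0.10],   # paperwork
        [0.50, 0.40, 0.10],   # taxes
        [0.05, 0.90, 0.05],   # lawsuit
        [0.80, 0.10, 0.10],   # smoothed with a tool
    ],
    "orange": [
        [0.70, 0.20, 0.10],   # color   -> suggests clothing
        [0.20, 0.70, 0.10],   # company -> suggests lawsuit
        [0.40, 0.30, 0.30],   # fruit
    ],
}

# Beliefs over the context words' senses at iteration t.
beliefs = {
    "filed":  [0.05, 0.05, 0.85, 0.05],  # confident: lawsuit sense
    "orange": [0.34, 0.33, 0.33],        # still undecided
}

def update_suit(knowledge, beliefs):
    """One aggregated update over all context words (the multipartite walk)."""
    scores = [0.0, 0.0, 0.0]
    for word, table in knowledge.items():
        w = trust(beliefs[word])              # how loudly this word votes
        for k, p_k in enumerate(beliefs[word]):
            for i in range(3):
                scores[i] += w * table[k][i] * p_k
    z = sum(scores)
    return [s / z for s in scores]            # renormalise

p_suit = update_suit(knowledge, beliefs)
```

Because filed's distribution is peaked and orange's is nearly uniform, filed's vote dominates and the lawsuit sense of suit wins; iterating such updates over every ambiguous word at once is the convergence process the talk calls thinking.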
And that really inspired me to think very deeply about what understanding is. How do we do things that machines can't? So let's compare how we do these things and how machines do them today. Look at search; we do it all the time. We give it a query, we get back a bunch of documents. Simple. Do you think the search engines understand our query? At least the earlier ones didn't: they did keyword matching over an inverted index. There was nothing resembling understanding in it. What about question answering? We have Yahoo Answers, we have Quora, we have all this, and again, what we are doing is matching keywords to keywords; hopefully with deep learning we'll do something better. What about document summarization? You've heard of TextRank and algorithms of that kind. What do they do? They pick a few sentences from the document and say: these are diverse sentences, so they must summarize the document. What about conversation? If I say "I have a dog" and you say "my aunt lives in San Diego," that's not a conversation. How do I decide my next utterance? Currently, if you look at Cleverbot and some of the other chatbots, what they're doing is taking a large corpus of chat messages across many people and learning the probability of the next utterance given the previous utterance. Again, exactly what Chomsky said is a bad idea. What it ends up doing is overfitting on that corpus, and every time you say something different from the corpus, it answers "may I help you?" or something equally stupid, as if it didn't understand what you just said. Obviously that is going to happen. One of the classic examples of where we need to do something very different is machine translation. Now think about this: if you are a United Nations translator, how do you learn to translate? Do you take parallel corpora in two languages,
go through them sentence by sentence, look at statistical correlations between words and phrases, and learn that kind of model? We don't do that. It's not possible; humans cannot do that. So what should we be doing very differently in all these cases? None of these tasks, the way machines do them today, is how humans do them. We don't summarize a document this way. We don't produce our next utterance this way. So I'm going to give you a completely different perspective on how we do it. This is what I call language-to-language mapping: we take a bunch of keywords, or a question, or a document, and map it to another piece of language. But that's not how we do it, and we'll see how we actually do it. How does deep learning do it? It's not quite language to language anymore; it learns embeddings rather than raw IDs, so it's doing something better. What do the embeddings do? They learn word embeddings and paragraph embeddings, and with those they can compute similarities between things, they can do analogies, the king-queen kind of stuff, and so they can do slightly better. And maybe this is part of how we do it too. But what I'm going to contend now, and this is a very bold statement, is that there's no such thing as language. I'm going to say this in English: there's no such thing as language. Now think about this. When I say "dog," every person in this room has a different embedding, if that's what we carry, for the word dog, based on your own experiences of dogs. Some people hate dogs, some people love dogs, so everybody's brain sees "dog" differently. Now, the whole point of having a language was that you could communicate that concept to me. So I had to create either a shape, when we didn't have language, cavemen drew pictures, or a sound, a lot of languages are just phonetic, or a script.
And we had to agree that this is what we mean by "dog," even though my embedding of dog, which changes over time, is different from your embedding of dog, which is also changing over time. In order for the species to become social and communicate, language was invented. So language was an afterthought in evolution. What was happening before that? Could a tiger recognize a deer without having language? Could it tell whether the deer was running away or not? Animals could already understand the world without having to communicate about it. Only when the need for communication came did we invent language. Now what's happening today? We have NLP, we have statistical NLP, we have text mining, we have sentiment analysis, we have the internet. We are so hooked on this idea that language is the important thing. It is not. And that's the bold hypothesis I'm going to make: wean yourself off language as quickly as possible and go to something else, and that something else is knowledge. Knowledge, and again I'm going to have to describe it in language, is not language. Language is just one form of knowledge representation; knowledge is the key. So this is how an AI architecture should look, and how we should solve all those problems. Language should be converted to knowledge as quickly as possible. Knowledge is what we saw earlier: the probability that filed has sense i while suit has sense j. Think of that as knowledge, and using it, we ran the equation of thinking. So that is thinking. When you receive new knowledge, you merge it with existing knowledge, and your knowledge grows; that's what we do when we sleep on it. And then, whenever the need arises, you have to say something, give a talk, crack a joke, have a conversation, you can synthesize language again. Now let's see how this completely changes the way we think about what we have been doing.
Language to language is not the solution. If we keep going down that path, whether through LSTMs, or more complex LSTMs, or hierarchical LSTMs, whatever you do, I don't think language to language is the answer. You have to go via knowledge, and we'll see what that means. Now this is not just about communication and knowledge; it's about everything else we do. A stimulus: you see a car coming. You analyze it, your vision tells you something is going on, your knowledge says you have to survive, you think about what you need to do, and then you act: you run for cover. That whole process cannot go from stimulus to response directly; it again goes through the knowledge path. So to build this up, let me ask you a question. What is the atomic unit of language? Physicists are very hung up on the atomic unit of the universe, the God particle or something below it. Ask the same question for linguists: what is the atomic unit of language? Lexical units? You could give me all these answers. Letters, A, B, C, D, that's what children learn first. Or phones, the sounds, if you're doing speech. Unigrams, single words, phrases. Senses, like apple as in the fruit versus Apple as in the company. Concepts, the whole LDA line of work on extracting concepts. A sentence could also be a unit of language; so could a paragraph, or a document. And the same goes for images. When I say language, I don't mean only natural language anymore; I mean any form of data. Images are also a language; dance is a form of language; all of these are languages. So again, in images you could say the pixel is the smallest unit of that language. Or you could say no, we have SIFT features and HOG features, and those are the units. Or you can go further up and say no, until we can detect something like an eyebrow we're not there yet. And you can say the whole image is also a unit.
So we don't know. All right, now let's talk about knowledge. What is the atomic unit of knowledge? Any ideas? Pardon? An object? What else? A thought. An event. An experience. No, I'm asking how you represent knowledge at the most atomic level; experience is how you gain knowledge. Facts. A synapse. Emotions. Okay, no; we're confusing how we gain or express knowledge with how we represent it. We are talking about representing knowledge: what is the atomic unit for representing it? Reason. Again, reason is a way to increase knowledge, to create new knowledge from old; I'm asking how we represent it. Grammar. That's a language thing. Interconnections. Okay, now we're getting closer. So before somebody tells me the answer, let me start, and you'll see how difficult it is: a thousand people in the room, and it's very hard to say what the unit of knowledge is. Let me build this up. If I say the word "John," did I convey any knowledge? No? If I say the word "sun," did I convey any knowledge? If I say the word "poodle," did I convey any knowledge? Imagine a sentence that just stops there. You'd be like, what? You haven't received the whole thing yet; it's like a packet that has only partly arrived. If I say "apple" or "east" or "dog" by themselves, have I conveyed any knowledge? No. So although you might say that words could be atomic units of language, we haven't yet reached the atomic unit of knowledge. What about two words? If I say "works for," or "rises in the east," or "is a," you'd say: what do you mean, "is a"? These are not knowledge yet, so a phrase cannot be an atomic unit of knowledge either. What if I say "John works for"? Have I communicated any knowledge? Are you still waiting for something?
Right? And if I say "the sun rises in the," or "poodle is a," you could complete it: poodle is a pet, poodle is a cute dog, whatever. And if I take any two of these pieces, if I say "works for Apple," that's neither a sentence nor a communication of knowledge. Or if I say "John Apple": what do you mean, John Apple? Does John like apples? Does John hate Apple? Does John work for Apple? So what is a complete transfer of knowledge? Unless I say all three things, I have not communicated any knowledge to you. When I say "John works for Apple," now your brain is satisfied and calm: okay, I have added one more link to my knowledge graph. So a unit of language may be many things, but a unit of knowledge is a triple: subject, predicate, object. That is the minimum you need to have knowledge. So now, we talked about knowledge to language and language to knowledge. Say this is an article, one form of language representation. If I analyze it, I can come up with what we call a knowledge graph. We all hear about knowledge graphs, Freebase and all that. Really, all a knowledge graph is is entities and the relationships between them. This is knowledge, or rather one of the ways in which you can represent this knowledge. Now, is this mapping from text to knowledge fixed? It is many-to-one: I could say the same thing in many, many different ways and still get the same knowledge. That is the beauty of a knowledge graph; it's a canonical representation of knowledge, while language gives you so many problems: ambiguity, different sentence constructions, grammar errors. Language is really good for communication, but not for knowledge representation. Our brain immediately wants to get rid of the language and keep the knowledge. You won't remember the exact sentences I said in this talk an hour from now.
You will only remember the knowledge you converted them to. You can't regurgitate the text, because we don't memorize the language; our brain very quickly converts the one into the other. Now think about those problems we talked about, machine translation, summarization. Imagine I remove the language part, hand you just the knowledge graph, and say: translate this, say it in Russian. Can you do it? If you knew Russian, you could, very simply. If I said summarize this in three sentences, or tell me what to say next given that I already know this, you could do it. Did you depend on the text that originally produced this knowledge graph? No. You were completely decoupled from the original text; you came to the knowledge graph and did whatever you needed to do on the knowledge graph. You could also reason on it. I could say: I have to say something new, so maybe I should ask what other things aspirin can cure. I can have a conversation like that. Did I need the original language at all? And that is the beauty of this transformation: given the knowledge graph, I can synthesize whatever I want. So this is one of the conjectures I feel very strongly about: we have to go from the one to the other. So let me tell you the key to this. The real problem of NLP is not learning phrases or word embeddings and all that good stuff. The key is analysis and synthesis: how do you analyze a piece of text and convert it to a knowledge graph, and how do you take a knowledge graph and create text out of it? That's all we do, and that's what our language centers do. We don't carry gigantic language-modeling systems in our heads; the language centers' whole job is to convert language to knowledge. So there are four problems in mapping text to knowledge. One is entity equivalencing: there are so many different ways of writing the same entity. See how the degrees of freedom in language cause a problem?
But when I map them all to the same MID, a machine ID in Freebase, when I map them to a unique entity, the way your brain does, all those strings map to exactly one entity. It doesn't matter how I write it. Same thing with relationships: I can say works for, works at, is a software engineer at, is employed by, is paid by, has a job at. These different ways are called paraphrases; we paraphrase the same edge in many different ways. This is the real problem in text mining: if you could find a way to map all paraphrases of a relationship onto one edge and build a knowledge graph unambiguously, you'd really have cracked NLP the way humans do it. Then there's entity disambiguation, which is another problem. Will Smith, for example, was a comedian before he became a TV artist, before he became a movie actor. So which Will Smith are you talking about? If I just say "Will Smith was great," you don't know what I mean. But if I say "Will Smith was great in After Earth," then you know which sense of Will Smith I'm talking about. So there's ambiguity: even when the entity is known, the role of the entity is not. And my favorite is relationship disambiguation. We just talked about different ways of saying the same thing; now look at the opposite, the same words meaning different things. "He lives in Bangalore" means one thing; "he lives in poverty" means something else. What about "he lives in his own world"? "He lives in a fairy land"? "He lives in stress," or "he lives in the past"? These are all different meanings of "lives in." You see the problem? Again, it's language. That's why I say, if you're a very language-centric person, after this talk you'll realize you don't want to be: language has too many problems, and the sooner you move away from it, the better. All right? So learning to paraphrase is the key problem in NLP.
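A toy sketch of that idea: collapse surface paraphrases of a relation onto one canonical predicate, store the result as (subject, predicate, object) triples, and then work with the graph instead of the text. The paraphrase table, the canonical predicate name, and the sentences are all invented for illustration; real paraphrase learning is of course far harder than a lookup table:

```python
# Hypothetical paraphrase table: many surface forms, one canonical edge.
PARAPHRASES = {
    "works for": "employed_by",
    "works at": "employed_by",
    "is employed by": "employed_by",
    "has a job at": "employed_by",
    "is paid by": "employed_by",
}

def extract_triple(sentence):
    """Map a 'subject <surface relation> object' sentence to a triple."""
    for surface, predicate in PARAPHRASES.items():
        if f" {surface} " in sentence:
            subj, obj = sentence.split(f" {surface} ")
            return (subj.strip(), predicate, obj.strip())
    return None  # no known relation found

graph = set()
for s in ["John works for Apple",
          "John is employed by Apple",   # a paraphrase of the line above
          "Mary has a job at Google"]:
    graph.add(extract_triple(s))

# Both phrasings of John's job collapse into the same edge,
# so the graph holds only two triples.
print(sorted(graph))
```

The point of the exercise: once the two phrasings land on the same edge, any downstream task (translation, summarization, reasoning) no longer cares which surface form the text used.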
Forget all the other problems you're working on; focus on this and you're done. Okay? All right, five more minutes. So what we talked about is that there are really three knowledge representations, and they have very different purposes. The first representation is language, and its only purpose is communication. Why? Because I can't transfer my embeddings to you over Wi-Fi. If I could, that would be great, and then all hell would break loose, because your embedding for dog is very different from my embedding for dog. So we need language to communicate. The second is what embeddings do: they bring similar words to nearby places in the embedding space. This is probably something the brain does too. But the purpose of this representation is not thinking; its purpose is generalization. You have very little data and you want to maximize the benefit from it, and you can generalize very quickly this way. And the purpose of the third representation, the knowledge graph, is reasoning. If you want to think about reasoning, think in this representation. So now let me summarize the talk, and then there's one more slide after this. How do we build a thinking machine? There are three dimensions to it. One: wean off language, go to knowledge as soon as possible, figure out how to paraphrase, solve the entity and relationship equivalencing and disambiguation problems, and you're done. Like the example we heard earlier: "Flipkart is better than Amazon" versus "Amazon is better than Flipkart." The predicate is still "better than," but the subject and object are swapped, so it's completely clear which is claimed to be on top. Knowledge has no ambiguity; language has ambiguity. Okay? So that is one dimension. How do we get there?
You have to go from words to phrases to senses to entities and relationships: subjects, predicates, and objects. What are the things we are talking about? We keep talking about the Internet of Things, but we don't see text the same way. It's really not about words and phrases; it's about things and the relationships between them. The second dimension: move away from tokens and towards embeddings, and you will get much better generalization. Bag of words, naive Bayes, SVMs are all at the token end, and we've already seen a lot of improvement moving towards word embeddings. Can you go further, to sense embeddings? This is another big problem with embeddings today: apple gets only one embedding even though it has two meanings, and that's a bad idea. Can you figure out a way to do sense embeddings? And what about embeddings of entities and relationships? And the third dimension, I think, is that we have to go from prediction, which is what data science is all about, to reasoning. If we do these three things, I think we'll be much closer to building a thinking machine. We'll go from data science to what we can genuinely call artificial intelligence. So I'm just going to leave you with two questions. This is a preview of my next year's talk, but hopefully it will keep you busy for a year, because these questions have been bothering me for a long time. The first question: we've seen how great and how deep these vision systems are. Now tell me this. If, the moment you were born, I showed you an image of a cat and told you "this is a cat," then you shut your eyes, then I showed you an image of a dog and said "this is a dog," and then I showed you 10,000 cats and dogs in different orientations and said "these are all cats, these are all dogs," would you learn vision? Is that how we learn vision? No? But that's what the deep learning systems are doing now, right?
Think about that very carefully. Deep learning is great; hammers are good, guns are great; the question is how we apply them. What happens whenever a field is taken over by supervised learning people? This is what happens: they say everything is a supervised learning problem, so why don't we throw a lot of labeled data at it? We have Google, we have all this infrastructure, we have Mechanical Turk to get labels, why not just throw supervised learning at it? That's not how we learn. So think about that problem: how did we learn to see? We didn't learn it by looking at labeled images. My second big question is: does more data, or more parameters, or a bigger network, mean more intelligence? Think about it a different way. If you were a student of math and I had to teach you the same thing again and again and again, with more and more problems, big data, a lot of data, and you still weren't getting it, are you intelligent or dumb? I'll leave you with those two questions, and I'll take whatever questions you have. Thank you, Shailesh, for that lovely talk. We're all out of time, so no questions; please take them outside. So we're done? Oh, great. Thank you all.