Hey everyone, in this video we're going to talk about natural language processing, which is the ability of computers to understand natural language. A prime example today is ChatGPT. It can solve a host of problems: I give it a translation task and it translates; I give it a summarization task along with an entire text and it summarizes; it can also perform other tasks like question answering. This is just one model covering many different facets of natural language processing, and the one model behind ChatGPT is what we call a language model. What I want to explore in this video is how these different facets of natural language processing converged into what we simply call language models today.

Some of the pillars of natural language processing include machine translation, which is translation from one language to another; speech processing, which is the transformation of speech into text; text summarization, where a machine takes a large amount of text and condenses it into a small amount while preserving the key points; and question answering, which answers questions posed in natural language. Another pillar that's not mentioned here is language modeling. So why exactly is language modeling so much more prevalent now than all of these other tasks? To understand that, we need to go through some history.

The birth of natural language processing can be traced back to the 1950s and Alan Turing's paper "Computing Machinery and Intelligence", which posed the question: can machines think? This paper is said to be one of the birthplaces of artificial intelligence, and although it doesn't talk about natural language processing directly, it influenced a great deal of the research that followed, with an impact that lasts to this day. That's why you'll see many different parts of natural language processing trace their earliest research to around the 1950s.

Starting with machine translation, one of the earliest demonstrations was the Georgetown-IBM experiment, which during the Cold War era translated Russian sentences into English. It was predominantly a rule-based system and pioneered many rule-based systems to follow. But rule-based translation is extremely simplistic and can't really capture how complex the translation problem is. So toward the 1980s we moved to a new paradigm: example-based machine translation, which relies on a database of known translations. Say we have two English-to-French translations, one for the sentence "I love dogs" and another for "cats are cool". Given that we know these translations, can we determine the translation of "I love cats"? We can, fairly easily, just by looking at the examples in the database and piecing a solution together (I'll show a tiny code sketch of this idea in a moment). That's probably better than a rule-based system, but it's still pretty rudimentary in its own way. Then toward the 1980s and 1990s we saw the emergence of statistical machine translation, which uses statistical models: things like hidden Markov models, decision trees, and support vector machines.
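Here's that example-based sketch: a minimal Python toy where the phrase table, the greedy matching, and the French phrases are all illustrative assumptions on my part, not a reconstruction of any real example-based system.

```python
# A toy "translation memory": phrase pairs we might extract from the two
# known example translations ("I love dogs" and "cats are cool").
phrase_table = {
    ("i", "love"): "j'aime",
    ("dogs",): "les chiens",
    ("cats",): "les chats",
    ("are", "cool"): "sont cool",
}

def translate(sentence: str) -> str:
    """Greedily cover the input with the longest known phrases."""
    words = sentence.lower().split()
    output, i = [], 0
    while i < len(words):
        # Try the longest phrase starting at position i first.
        for length in range(len(words) - i, 0, -1):
            chunk = tuple(words[i:i + length])
            if chunk in phrase_table:
                output.append(phrase_table[chunk])
                i += length
                break
        else:
            # Unknown word: pass it through untranslated.
            output.append(words[i])
            i += 1
    return " ".join(output)

print(translate("I love cats"))  # -> "j'aime les chats"
```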
Coming back to statistical machine translation: we were able to use these statistical models because we finally had enough data to learn patterns from, but machine translation is still a fairly complex problem. So the way statistical models tackled it was by breaking the complex problem down into sub-problems. Specifically, machine translation was broken into a translation piece and a language modeling piece. For example, if we were translating from English to French, we would build a translation model that maps words and phrases between the two languages, and then a language model that makes sure the generated French pieces actually fit together. For context, a language model takes some input sequence, in this case a sequence of words, and tries to predict the next token in that sequence (I'll sketch a toy one in code at the end of this part). So statistical approaches solved machine translation by breaking it into sub-problems that are much easier to solve individually.

During the early 2000s we saw neural networks applied to many kinds of problems, including machine translation. Neural networks became hugely popular because we had a lot of data, we could use GPUs for faster training, and they could solve problems end to end. That means we could directly optimize for the complex objective of machine translation instead of breaking it into sub-problems that don't optimize that objective directly, which is why neural machine translation achieves even better performance. Because of the abundance of data and the availability of hardware, neural approaches dominate today.

Now let's take a look at speech processing. The earliest systems also date to the 1950s, with Audrey, which could recognize the spoken digits zero through nine, and IBM's Shoebox, which could also recognize spoken digits and perform arithmetic from spoken commands. These, however, were rule-based systems. Hidden Markov models were introduced in 1966, and a number of related advances followed within a few years, so by the 1970s we could take more statistical approaches to speech processing. As in the machine translation case, the complex problem was broken down into three parts: acoustic modeling, pronunciation modeling, and language modeling. Given a raw speech waveform, we would first chunk it up and train an acoustic model to recognize phonemes, the individual phonetic sounds; this was done with hidden Markov models. Then a pronunciation model would map phonemes to actual words; this could be as simple as a lookup table, but it could also be a statistical model. Finally, a language model would take the resulting words and check whether the sequence actually makes sense in the target language. All three models were used together to solve speech recognition. However, as mentioned before, with the advent of more data and hardware that could handle it, neural network approaches took flight in the 2000s. Neural networks are universal function approximators, meaning they can approximate almost any function very well.
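Before going further, since language modeling is the through-line of this whole story, here's that toy language model: the simplest statistical version, a bigram model that predicts the next word purely from counts. The tiny corpus is made up, and real systems of that era used far larger n-gram tables with smoothing.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, just to illustrate the idea.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

# Count how often each word follows each other word (bigram counts).
counts = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split()  # <s> marks the start of a sentence
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

def next_word_probs(prev: str) -> dict:
    """Estimate P(next word | previous word) from the bigram counts."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

print(next_word_probs("the"))
# e.g. {'cat': 0.33, 'mat': 0.17, 'fish': 0.17, 'dog': 0.17, 'rug': 0.17}
```

Chaining predictions like this, one word at a time, is conceptually all that "generating text with a language model" means; modern large language models replace the count table with a neural network.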
Back to the main thread: because neural networks can approximate almost any function, we could use them to solve speech processing end to end, optimizing directly for the speech recognition objective instead of breaking the problem into the sub-problems we just described. This is exactly what happened in machine translation as well, and it's why neural network approaches are so common today, even for speech processing.

Now the timeline for text summarization. As usual, we start in the 1950s with rule-based systems. The earliest methods were empirical and extraction-based: given a paragraph, you simply select the sentences you think are the most important, without any rewriting. It's not especially intelligent, but it is an early form of summarization. By the 1970s we moved on to rational approaches, where instead of reusing sentences directly we started trying to generate new sentences. A few decades later came statistical approaches, which used machine learning classifiers: for every sentence in the input text, a classifier decides whether that sentence should be part of the summary (I'll sketch a toy version of this in a moment). That's still an extraction-based method, which isn't ideal, but it does use a statistical model, some form of learned intelligence, to decide what goes into the summary. Then in the 2000s and 2010s, with the renaissance of neural networks, we started using them for text summarization too. They were initially used just to generate titles, headlines, or short abstracts for long texts, but with long short-term memory cells, which are a type of recurrent neural network, they could generate longer summaries.

For each natural language processing task we could build a timeline just like the ones we've walked through, and for all of them you'll see a very similar trajectory. They all start at different points in the 1950s, after the birth of artificial intelligence, with a phase of rule-based systems: essentially sets of if-then statements making the decisions. Then, from around the 1970s onward, with the advent of statistical models and far more collected data, we could use those models to tackle complex problems. But we couldn't solve those complex problems directly; the typical approach was to break each one into smaller sub-problems, because language is complicated. Then in the 2000s came the renaissance of neural networks and neural approaches, primarily because we now had so much more data to train them on, and because neural networks can directly optimize very complex objectives, being so good at approximating almost any kind of function. So all of these different NLP tasks, the different colored timelines, eventually converged somewhere in the 2000s, and they're still converging today.
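Before we move on to neural networks themselves, here's that toy extractive summarizer. It scores sentences by how frequent their words are in the document, loosely in the spirit of the early statistical approaches; the scoring rule and the example text are my own illustrative choices, not a specific published method.

```python
import re
from collections import Counter

def extractive_summary(text: str, num_sentences: int = 2) -> str:
    """Pick the sentences whose words are most frequent in the document."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)

    def score(sentence: str) -> float:
        # Average frequency of the words in this sentence.
        toks = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Keep the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)

article = (
    "Neural networks now dominate translation. Neural networks also "
    "dominate speech recognition. The weather was pleasant yesterday. "
    "Both fields moved from rules to statistics to neural networks."
)
print(extractive_summary(article, num_sentences=2))
```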
Now that we've seen all of these natural language approaches converge on neural networks, let's establish the timeline of neural networks themselves, and how language modeling ended up at the core of all NLP tasks today. The type of neural network that deals with sequence data is the recurrent neural network. RNNs are like feedforward neural networks, but unrolled over time, and they can solve a few types of problems. The first is sequence-to-vector modeling; sequences here are things like sentences. Given a sequence, can you output a vector? That's exactly the case of sentiment analysis: given a review of a movie, can you tell whether the person liked it or not? Another type is vector-to-sequence modeling, which is like image captioning: given a vector, say an encoded image, can you describe that image? And then there's sequence-to-sequence modeling, where the input is a sentence and the output is another sentence. Machine translation, language modeling, and question answering all fall under the sequence-to-sequence approach.

However, RNNs were very difficult to train. This was largely addressed in Ilya Sutskever's PhD dissertation back in 2013, which showed how RNNs could be trained effectively, and with that, RNNs really burgeoned across all of these sequence-related tasks. Recurrent neural networks, and soon long short-term memory cells, became the state of the art for dealing with sequences in general. But they had a few flaws. The first is that LSTMs are slow to train: inputs must be passed in sequentially and outputs are generated sequentially, so they don't make good use of modern GPUs. The second is that they don't truly capture the context of a word. This is the case even for bidirectional recurrent neural networks, because they learn a forward-direction context and a backward-direction context separately and then concatenate them, so the true combined context can be lost.

To address these two major issues with recurrent neural networks, a seminal 2017 paper introduced the transformer architecture. Where recurrent neural networks were slow to train, transformers are much quicker because they process the input in parallel rather than sequentially. And where recurrent neural networks can lose the true context of a word by learning forward and backward context separately and concatenating them, transformers use an attention mechanism, which attends to the forward and backward context simultaneously, so we get a better preserved context for the meaning of each word (I'll show what attention actually computes in a small sketch at the end of this section).

The main issue with transformers, though, is that they require a lot of data for each task. If you wanted to learn translation, you would need hundreds of thousands of examples; the same goes for, say, question answering, where you'd need another few hundred thousand examples. But all of these are very similar problems under the hood: they all require some understanding of how language works. And if a model does understand language, it should be able to solve these tasks without that much task-specific data. So in practice we first create a pre-trained model that learns the core essence of language, and then fine-tune it on different NLP tasks, which requires far less data. This is exactly what BERT and GPT do, and it's why they were introduced. The more popular variant today is the GPT architecture, and we pre-train the model to understand language using the problem of language modeling.
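Here's that attention sketch: a minimal NumPy version of scaled dot-product attention, the core operation from the 2017 transformer paper. Reusing the same matrix for queries, keys, and values is a simplification on my part; in a real transformer they come from separate learned projections of the input.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # how much each query matches each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # weighted mix of the value vectors

# Toy example: 4 tokens, each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8): every token's new representation mixes in all the others
```

Because this is one big matrix multiplication over the whole sequence, every token attends to every other token in parallel, which is exactly why transformers train so much faster on GPUs than recurrent networks do.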
So, to tie it all together: to create models like ChatGPT, we first train a model to understand language via language modeling, and then we fine-tune it for essentially any other task in natural language processing. This is why we see large language models, the concept of LLMs, thrown around so much more today than any individual NLP task: they are the foundation for solving all of those other tasks. That's the gist of what's happened from the 2000s to the present day. I know it's a lot of information to take in, but in the future I'm going to break this down into multiple videos so we can look at exactly which seminal works led to what we have in natural language processing today. I hope you still learned something from today's video.

Now, videos like this are fun yet challenging to make, so I want to take some time to talk about our sponsor, Taro. Taro is a social platform that helps software engineers grow in their careers. Say you land a software job: then what? It can be really hard to navigate your career, and it's tough to get good career advice. Taro facilitates these discussions whether you're entry level or senior, and you can take part in discussions to get advice from software engineers across many companies. There are plenty of non-technical questions I wish I could have asked someone earlier to advance my career, but I never found a good forum for them; I think Taro is that place. I'm a machine learning engineer, which does overlap with software engineering, and while the platform doesn't have too many machine learning engineering questions at the moment, I do my best to answer the ones that are there whenever I can, and the community is a really nice one to be part of. So if you're looking for a premium community of software engineers, consider signing up for Taro using my link in the description to get 20% off your annual purchase.