This video is sponsored by Taro. Stick around till the end to hear how you can join a premium community of software engineers. ChatGPT has become an oracle for language, but how did we get here? In this series of videos, we'll be talking about the evolution of language models over time.

So let's talk history. In 1948, Claude Shannon introduced information theory, a way to link language and mathematics. With the English character set, let's generate a sentence one character at a time, and let's say that each character has an equal probability of appearing. We might end up with a sentence that looks like this. Hmm, this doesn't quite look right. Now what if we were to change these probabilities a little bit and generate the sentence one character at a time again? Ah, so it looks like these probabilities, these numbers, control which characters get generated. So to generate higher-quality text, we need these probabilities to be as accurate as possible.

Now, how do we build this probability table? Intuitively, characters that are more frequent in the English language should be generated more frequently. So we can take some books, count the frequency of every character with good old-fashioned reading and counting, and turn those counts into probabilities. In fact, that's exactly what we did to get this table. This is a unigram model: "uni" is one, "gram" here is character, and the model, you can think of as the probabilities. We use single-character probabilities to create language. But language is complex, and simply counting characters without taking context into account isn't a great way to create probabilities. That's pretty obvious, since the generated sentence is gibberish. The characters and words in a sentence depend on those that come before and after them, but modeling that gets really complicated. Remember, this is the 1940s; there's no technology to represent information this way in a scalable manner.

So let's think really simple, and just say, for assumption's sake, that every character in an English sentence depends only on the character just before it. We can represent this as a conditional probability table. What this table shows is that if the previous character generated was an a, then we have a greater chance of generating a b or a c than another a. It also says that after a b, an a is more likely than another b or a c. And we see that if the last character was a c, the chance of generating an a next is much higher than that of a b or a c. Now, this is just a 3×3 table, but English has 26 characters, and if you include the space, that's 27 characters. So the full table is a 27×27 table of probability values. Using this table of probabilities, we can generate slightly higher-quality sentences. This is a bigram model: "bi" is two, "gram" here is character, and the model represents the probabilities. So we use two-character probabilities to create language.

While the bigram model is better than the unigram model, the sentences it generates are far from perfect. So instead of bigram models, we can create trigram models, where we use three-character probabilities to create language. Or we can extend this to four-gram models and use four-character probabilities to create language. And generally speaking, we can extend this to n-gram models, where language is created using n-character probabilities.
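To make this concrete, here is a minimal Python sketch of the bigram idea described above: count how often each character follows another in some training text, normalize those counts into conditional probabilities, and then sample one character at a time. The file name book.txt and the starting character are hypothetical placeholders, and real n-gram systems add smoothing for unseen pairs; this is just the core idea.

```python
import random
from collections import defaultdict

def build_bigram_table(tokens):
    """Count how often each token follows another, then normalize the
    counts into conditional probabilities P(next token | previous token)."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    table = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        table[prev] = {tok: count / total for tok, count in nexts.items()}
    return table

def generate(table, start, length=60):
    """Sample one token at a time, each conditioned only on the previous one."""
    out = [start]
    for _ in range(length):
        probs = table.get(out[-1])
        if not probs:
            break  # previous token was never seen in training; stop
        tokens, weights = zip(*probs.items())
        out.append(random.choices(tokens, weights=weights)[0])
    return out

# Character-level bigram model: tokens are single characters (spaces included).
text = open("book.txt").read().lower()  # hypothetical training text
char_table = build_bigram_table(list(text))
print("".join(generate(char_table, start="t")))
```

The same sketch also covers the word-level models discussed next: split the text into words instead of characters, e.g. build_bigram_table(text.split()), and join the generated output with spaces.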
Now, instead of using this character model and generating one character at a time, what if we generate one word at a time? In this case, we need to come up with the probability of generating each word. So let's start back at the unigram model case, but for words. A snippet of this probability table might look something like this; it shows just a few of the thousands of words that exist. When generating words one at a time, we reference this table, and in doing so we might get a sentence that looks like this. As in the unigram character case, you'll see that words with higher probabilities occur more frequently. The generated sentence also looks more sensible and of higher quality. In the bigram model case, we use two-word probabilities to generate language, and you can see the generated words look even more sensible, especially each pair of consecutive words.

Though this n-gram representation has a few downsides, the n-gram model was the first of many steps that eventually led to the widespread use of the language models we see today. In conclusion, n-gram language models show how we can represent and generate language using math. And once you have mathematics, that opens up opportunities for the digital age. Join us next time as we go through some of the feats that led to modern language models.

Now, videos like this are fun yet challenging to make, so I want to take some time to talk about our sponsor, Taro. Taro is a social platform that helps software engineers grow in their careers. Say you land a software job, but then what? It can be really hard to navigate your career, and it's tough to get good career advice. Taro facilitates these discussions, whether you're entry-level or senior. You can be part of discussions and get advice from software engineers across many companies. There are many non-technical questions I wish I could have asked someone in the past to advance my career, but I never found a good forum for them. I think Taro is that place. I'm a machine learning engineer, which does overlap with software engineering, and while the platform doesn't have too many machine learning engineering questions at the moment, I'm doing my best to answer the ones that are there whenever I can. This community is a really nice one to be part of. So if you're looking for a premium community of software engineers, consider signing up for Taro using my link in the description to get 20% off your annual purchase.