What happens when you enter a question into a chatbot? Suppose, for example, that you ask: what is Charles Darwin's most famous book? The sentence you enter is chopped up into parts, words or tokens, and each token is mapped to a series of numbers called an embedding. With these embeddings, the large language model starts to calculate, and the computations for each word happen in parallel. The following visualization of this process shows, for each word in parallel, which next word is expected at each processing step.

Let's look at one column in this graph. Here we see that the large language model guesses, almost immediately after seeing the word what, that the next word will be is. And as we move from the bottom, the start of the computation, to the top, the end of the computation, this prediction does not change much. Now let's look at another column, the one that processes the word Charles. Here the large language model does not have much of a clue about what will come next, but it eventually settles on Dickens. That makes sense: Charles Dickens might be the world's most famous Charles.

All the other columns are worth a look too. They are crucial for the proper functioning of the large language model, as we will see later, but their predictions are not really used, because I as a user have already typed the entire question into my prompt, as well as the start of the answer: Darwin wrote. So let's have a look at what happens when the large language model is asked to take over and answer my question. After receiving the word wrote, the large language model in the first couple of computation steps predicts the typical words that may follow wrote in English: in, it, that, or a. But in the later steps of the computation, the large language model has figured out that, given the question that was asked, the appropriate answer must start with the.

So how does this large language model, at its 21st layer, know it needs to predict the? At that stage in processing, it should have understood the query well enough to know that the appropriate answer is the name of the book that is Charles Darwin's most famous one. That means it must in some sense understand the grammatical construction what is X's most famous Y. And it should know that Darwin wrote On the Origin of Species, and that this book is more famous than the many other books Darwin wrote.

So how do large language models actually do it? Well, they are so large that it is in fact very difficult to tell exactly how they represent the knowledge they have acquired. But we do know how the basic components that do all the work function. These basic components are called attention heads and multi-layer perceptrons. Some components specialize in aspects of English grammar, such as the possessive 's that marks Darwin as the owner, or author, of the most famous book. Other components specialize in higher-order linguistic constructions, such as what is X's most famous Y. Yet other components store factual information: one might specialize in book titles, another in finding author names in the input. In current large language models, tens of thousands of these components work in parallel on each word that is received or generated.

These components communicate with each other in various ways, but one very crucial mechanism is called multi-head attention. It allows components to ask other components for information; such requests for information are called queries. It also allows components to offer information to other components; such messages are called keys. And if a key and a query match, the requested information, called the value, is passed on. For instance, we can imagine the component specialized in author names sending out a key that essentially means: I have an author name on offer. The component specialized in book titles might send out a query asking for author names. And because key and query match, the relevant information, Charles Darwin, is sent from the first component to the second, which in turn sends it on to yet another component that has stored the book titles associated with specific author names.
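To make these steps concrete, here is a minimal sketch in Python of the very first one: chopping a sentence into tokens and mapping each token to an embedding vector. Everything here is illustrative; the tiny vocabulary, the random embedding table, and the dimensions are stand-ins, not values from any real model.

```python
import numpy as np

# A toy vocabulary; real models use tens of thousands of subword tokens.
vocab = {"what": 0, "is": 1, "charles": 2, "darwin": 3, "'s": 4,
         "most": 5, "famous": 6, "book": 7, "?": 8}

d_model = 8                       # embedding width (real models use thousands)
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # learned, in reality

def embed(sentence):
    """Chop the sentence into tokens and map each one to its embedding."""
    ids = [vocab[token] for token in sentence.lower().split()]
    return embedding_table[ids]    # shape: (num_tokens, d_model)

x = embed("what is charles darwin 's most famous book ?")
print(x.shape)                     # (9, 8): nine positions, handled in parallel
```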
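Next, the per-layer prediction grid itself. The sketch below, again with made-up numbers, shows the idea behind the visualization: take the hidden state at every layer and every token position, project it onto vocabulary scores, and read off the best-scoring word. This readout trick is often called the logit lens; the layer count and dimensions here are invented for illustration.

```python
import numpy as np

words = ["what", "is", "charles", "darwin", "'s", "most",
         "famous", "book", "?", "the", "dickens", "wrote"]
n_layers, n_tokens, d_model = 24, 9, 16
rng = np.random.default_rng(1)

# Stand-in hidden states: in a real transformer each layer produces one
# vector per token position, refining the previous layer's representation.
hidden = rng.normal(size=(n_layers, n_tokens, d_model))
unembed = rng.normal(size=(d_model, len(words)))   # hidden state -> word scores

# The readout: decode a next-word guess at EVERY layer and position at once.
# This (layers x positions) grid is exactly what the visualization displays.
logits = hidden @ unembed                # shape: (n_layers, n_tokens, vocab)
best = logits.argmax(axis=-1)
print(words[best[20, -1]])               # the guess at layer 21, last position
```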
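And here is a toy version of the attention mechanism just described: a single attention head in which every position emits a query, a key, and a value, and matching queries and keys decide which values get passed on. This is the standard scaled dot-product formulation; the weight matrices are random stand-ins rather than learned ones.

```python
import numpy as np

def attention_head(x, w_q, w_k, w_v):
    """One attention head: every position emits a query ("what do I need?"),
    a key ("what do I offer?"), and a value (the information itself)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])      # how well queries match keys
    # Causal mask: a position may only look at itself and earlier positions.
    scores[np.triu(np.ones_like(scores, dtype=bool), 1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ v             # each position receives a blend of values

rng = np.random.default_rng(2)
n_tokens, d_model, d_head = 9, 16, 4
x = rng.normal(size=(n_tokens, d_model))             # one vector per token
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(attention_head(x, w_q, w_k, w_v).shape)        # (9, 4)
```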
What is important to know is that all these messages, keys, queries, and values alike, are just rows of numbers. They are difficult for humans to interpret, but the computer can quickly pass them from one component to another and apply the required mathematical operations to them. And the fact that they are just rows of numbers makes it possible to run standard machine learning algorithms on them, so that the computer can find the optimal set of numbers for performing a given task. In our case, that task is to predict the next word. The model finds that optimal set by going through an enormous amount of text, but once it is trained, a large language model no longer needs access to the internet or to its training set. We will see small sketches of this training objective and of the generation loop below.

So, large language models, the models underlying chatbots such as ChatGPT, work by generating their answers one word at a time. There are many layers in these models, and in each layer predictions about the next word are computed. Information from all the words in the prompt, as well as in the answer up to the current moment, is combined using a mechanism called attention. Across the layers, for all these words in parallel, very accurate predictions are often formed, combining knowledge about how language works with facts about the world. There is no magic here, but by being scaled up to enormous sizes, these models have acquired capabilities that have surprised the world.
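To make the training step concrete: the sketch below scores the model's predictions at every position against the word that actually comes next, which is the cross-entropy objective that training drives down. The logits and token ids are random placeholders standing in for a real model and real text.

```python
import numpy as np

def next_word_loss(logits, token_ids):
    """Cross-entropy for next-word prediction: the scores at position t are
    graded against the token that actually appears at position t + 1."""
    preds = logits[:-1]                    # predictions made before the answer
    targets = token_ids[1:]                # the answers: each actual next word
    preds = preds - preds.max(axis=-1, keepdims=True)    # numerical stability
    logp = preds - np.log(np.exp(preds).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

# Illustrative numbers only: 9 token positions, a 12-word toy vocabulary.
rng = np.random.default_rng(3)
logits = rng.normal(size=(9, 12))        # the model's word scores per position
token_ids = rng.integers(0, 12, size=9)  # the text the model is trained on
print(next_word_loss(logits, token_ids)) # training tunes the numbers to lower this
```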
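Finally, the generation loop from the summary: predict a word, append it, and feed the longer sequence back in. The fake_model function is a hypothetical stand-in for a real transformer; only the loop structure is the point.

```python
import numpy as np

words = ["what", "is", "darwin", "'s", "most", "famous",
         "book", "?", "wrote", "the", "origin", "of", "species"]

def fake_model(token_ids):
    """Hypothetical stand-in for a transformer: given the tokens so far,
    return a score for every word in the vocabulary."""
    rng = np.random.default_rng(len(token_ids))    # deterministic toy scores
    return rng.normal(size=len(words))

prompt = [words.index(w) for w in
          ["what", "is", "darwin", "'s", "most", "famous", "book", "?", "wrote"]]

# One word at a time: predict the next word, append it, feed it back in.
generated = list(prompt)
for _ in range(4):
    scores = fake_model(generated)
    generated.append(int(scores.argmax()))         # greedy: take the top guess

print(" ".join(words[i] for i in generated))
```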