In this post we will step out of the technical weeds and take a higher-level look at what is going on under the hood of modern sequence-prediction models, which include large language models such as ChatGPT, audio-generation models such as WaveNet, and image-generation models such as the original version of DALL·E. We will provide a basic treatment of the specific task these models busy themselves with and explore what it is that makes the most recent models so good at producing new, coherent content resembling—but not simply copying—their training data. We will see that the core task of these models, which defines their training objective, is extremely simple, but the structure of the models themselves is often highly complex. It is likely the structure and sheer size of these models—along with the volume of data that they are trained on—that are largely responsible for their outstanding generative capabilities.
20 Questions
To set the stage, let’s play a game. I am thinking of a word, and your job is to guess what it is. Since this is a game, and since it wouldn’t be much fun if it were allowed to go on forever, we will place a limit on the number of questions you can ask to inform your guess. If that limit is 20 questions, then I have just described the game of the same name, where the object is to identify a mystery object in as few yes-or-no questions as possible.
In our case, that “object” is a word, which we assume is in a dictionary that we have both agreed upon. There are good strategies and bad strategies for playing this game. A very bad strategy is to start at the beginning of the dictionary and guess one word at a time, since you will run out of questions before you have covered even half a page. A decidedly better strategy is to have your first question be whether the word is in the first half of the dictionary. Regardless of whether the answer is yes or no, you have cut the search space in half. You can continue to cut it in half by asking whether the word is in progressively narrower slices of the dictionary, and eventually in progressively smaller portions of a page. This strategy is known as binary search, and it is incredibly effective.
An essentially equivalent binary-search procedure is to number all the words in the dictionary from 1 to $N$ and then ask whether the word is in the interval $[1, \frac{N}{2}]$, since if it’s not, it’s in the interval $[\frac{N}{2} + 1, N]$, and so on. According to one estimate, $N = 171146$ for words currently in use in the English language, so it would take you about $\log_2 171146 \approx 17.4$ questions to identify any English word I might be thinking of (including some really weird and rare ones). Over 170 thousand words is a large search space, but it shrinks exponentially with each question, and you’ll finish with guesses to spare.
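To make the numbered-interval version concrete, here is a minimal Python sketch (the word indices and the mystery entry are purely illustrative) that counts how many yes-or-no questions the halving strategy actually takes:

```python
import math

# Hypothetical setup: the dictionary's words are numbered 1..N.
N = 171146
print(math.log2(N))  # ~17.38, so 18 questions suffice in the worst case

def guess_index(ask, n):
    """Identify the mystery word's index by repeatedly asking whether it lies
    in the lower half of the remaining interval. `ask(lo, mid)` answers yes/no."""
    lo, hi = 1, n
    questions = 0
    while lo < hi:
        mid = (lo + hi) // 2
        questions += 1
        if ask(lo, mid):      # "Is your word somewhere in [lo, mid]?"
            hi = mid
        else:
            lo = mid + 1
    return lo, questions

# Example: suppose the mystery word happens to be entry number 123456.
secret = 123456
index, count = guess_index(lambda lo, mid: lo <= secret <= mid, N)
print(index, count)  # recovers 123456 in at most ceil(log2(N)) = 18 questions
```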
Let’s say that we change the rules a bit to make it more interesting. Now you have to guess the word I’m thinking of one letter at a time.1 Since there are 26 letters in the alphabet, if you choose uniformly, you’ll have a one-in-26 chance of guessing the correct letter. But instead you have a clever idea and decide to choose the first letter based on how frequently it shows up at the beginning of words in English. You could estimate this by flipping through the dictionary to see how many pages are devoted to entries beginning with certain letters. If you do, you would be led to believe that S is the most probable letter to start a word. And if I had chosen my word by randomly picking a dictionary entry, this would be a pretty good strategy. But the number of words starting with a letter is not the same as the usage of words starting with a letter, so instead you consult a frequency table compiled from a survey of a large corpus of English texts and find that T shows up most often at the beginning of words. So you go with that, and you’re in luck: The first letter of the word I’m thinking of is T.
Now it’s time for your second guess. The most frequent letter in English is E, so you could guess that. But this guess would be independent of the knowledge that the word starts with T. After all, you now know that whatever the word is, it’s contained within the T section of the dictionary. Now you leaf through the T’s and notice that the section corresponding to words beginning with TH is particularly beefy. And indeed, TH is the most common two-letter combination, or digram (also called a bigram), in English, regardless of position within the word. (A relative frequency matrix for bigrams is shown below.2) And if H is your next guess, you’re right again.3
[Figure: relative frequency matrix of English bigrams]
You’ve got 18 guesses left, and you’re doing quite well. Each guess supplies a clue, and by conditioning your subsequent guesses on what you have learned about the word so far, you have a pretty good chance of figuring out that the word I am thinking of is TH******.4 And in the process, you have stumbled upon a few core principles of one of the most influential mathematical theories of the 20th century and the raw ingredients of a sequence-prediction algorithm that is one of the earliest versions of a statistical language model.
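If you wanted to automate this guessing strategy, it might look something like the following sketch, where the follower frequencies are made up for illustration (a real row would come from a bigram table like the matrix above):

```python
# Made-up relative frequencies for letters that follow a word-initial "T";
# only a handful of entries are shown, not a full 26-letter row.
followers_of_T = {"H": 0.35, "O": 0.12, "E": 0.10, "I": 0.09, "A": 0.06}

# Condition on what you already know (the word starts with T)
# and guess the most probable continuation.
best_guess = max(followers_of_T, key=followers_of_T.get)
print(best_guess)  # "H", reflecting the dominance of the TH digram
```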
Information, Bit by Bit
We have seen that in playing “20 questions,” there are good questions and bad questions. A bad question is one that has little expected payoff; it doesn’t get you much closer to the answer unless you get incredibly lucky on your guess. In the case of guessing any of over 170 thousand English words, you can see that with a scattershot approach, the odds are not in your favor. But by cleverly asking your questions, you can greatly reduce the search space each time, giving you “more bang for your buck,” as they say. Quantifying this “bang” was the subject of a hugely influential piece of applied mathematics that has come to be known as information theory.
In the late 1940s, Claude Shannon, a researcher at Bell Telephone Laboratories, studied the problem of communicating messages over a noisy channel. A classic example of the problem is that of a telegraph, which is not an unusual thing for a researcher at a telephone company to be interested in. From watching movies about the old days, we know that telegraph messages are sequences of electronic “symbols” made up of buzzes or beeps that are either long or short, dashes and dots, respectively. Different combinations of dashes and dots make up a code that stands for letters and numbers. You may already be familiar with the maritime distress signal SOS—dot dot dot, dash dash dash, dot dot dot—and it should be obvious that if your ship is going down, you want to make sure that this message gets across unambiguously even if there is noise on the line.
Despite the practical side of the problem, Shannon approached it from a purely mathematical perspective. His idea of a “message” was quite general, covering not only discrete symbolic messages such as those transmitted by a telegraph but also essentially any type of message that can be sent from a source to a receiver. That class of possible messages and range of possible sources and receivers are so broad that information theory quickly found application (and during its peak popularity, over-application) in almost every branch of the hard and soft sciences.
Chief among the ideas that information theory brought forth is that the information in a message can be quantified in a way that winds up being fairly intuitive. If you hear something that you already know—if someone says, “looks like rain” when it’s already pouring outside—you gain no information from hearing it. However, if a piece of news is unexpected—if your golf outing has been ruined by a monsoon when the forecast predicted little chance of rain—you register surprise, and your view of the world is changed to some degree. In Shannon’s treatment, the information in a message is related to how much it surprises the receiver, which in turn depends on how probable the message (or observation) was before it was received. Specifically, for a message symbol $x$ (a word, a letter, a weather event, etc.) with probability $p(x)$, its “surprisal” is quantified as $s(x) = -\log_2 p(x)$, where the subscripted 2 indicates that we are taking the base-2 logarithm.5 The smaller the probability $p(x)$, the greater the surprisal $s(x)$, which tends toward infinity as $p(x)$ approaches zero.
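In code, surprisal is a one-liner; the probabilities below are made up to mirror the weather examples:

```python
import math

def surprisal(p):
    """Surprisal, in bits, of an event with probability p."""
    return -math.log2(p)

print(surprisal(0.99))  # ~0.01 bits: "looks like rain" when it's already pouring
print(surprisal(0.01))  # ~6.6 bits: a monsoon on a day forecast to be dry
```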
Things get more interesting when we consider the average surprisal over a set of message symbols following a distribution $p$. This quantity, $$H = -\sum_x p(x) \log_2 p(x),$$ is known as information entropy (or Shannon entropy). Returning to our 20-questions game, information entropy tells us the average number of yes-or-no questions it would take for us to identify a mystery object if we were playing the game with a strategy that takes the distribution of symbols into account.
I said above that it would take about 17.4 steps of binary search—or about 17.4 yes-or-no questions—to identify a mystery word among 170 thousand options. But this implicitly assumes that all words are equally likely, which is not the case. Words that are very common will have high probability, which means low surprisal, because $-\log_2 p$ gets smaller as $p$ gets larger, reaching zero when $p=1$ (“looks like rain”). And because these common, low-surprisal words carry large weights $p$ in the average, they pull the entropy down; less likely words have higher surprisal but contribute far less weight to the average. As a result, the entropy for English words (when treated as symbols) is much lower than the binary-search example would have you believe. One estimate for English is 9.8 bits per word.
In fact, the maximum entropy of any set of symbols is achieved when they are all equally likely. This is not difficult to show mathematically, but it also makes intuitive sense. If every symbol is equally likely, then you have no edge in making a guess; there is nothing that would make one choice more likely than another. So the entropy of just over 170 thousand equally likely—or uniformly distributed—objects is a little over 17 bits, but change that distribution to anything but uniform, and the entropy will decrease. That is, it will become easier to identify a mystery object from that distribution with clever questioning.
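Here is a small sketch that illustrates both claims: the uniform distribution over roughly 170 thousand words tops out at about 17.4 bits, while a made-up skewed distribution over the same words comes in well below that.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

N = 171146
uniform = [1 / N] * N
print(entropy(uniform))  # ~17.4 bits: log2(N), the maximum possible

# A made-up skewed distribution over the same N words:
# 1,000 "common" words soak up 90% of the probability mass.
skewed = [0.9 / 1000] * 1000 + [0.1 / (N - 1000)] * (N - 1000)
print(entropy(skewed))   # ~11 bits, well below the uniform maximum
```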
What Comes Next?
There are many beautiful aspects of information theory and many practical applications, but I won’t cover them in this post. Instead, let’s look at a set of experiments Shannon reported in his 1948 paper, excerpted in the image below. These experiments contain at their heart the core principle behind even the most complex modern sequence-prediction models.
[Figure: excerpt from Shannon’s 1948 paper, “A Mathematical Theory of Communication,” showing his series of approximations to English]
Shannon looked at what sort of sequence would result if one randomly drew letters (or, more generally, symbols) according to certain probability models. The first probability model he tried was uniform, meaning that every symbol (26 letters and one space) was equally probable. The resulting sequence is, unsurprisingly, a mess. In the second case, Shannon used actual letter frequency estimates to decide the probabilities. Now the “words” have more appropriate length, but the results are still a mess.
In the third case, Shannon began looking at digrams, which came in handy to us earlier when we were playing the word-guessing game. The procedure for generating a sequence would go like this: Pick the first letter according to a letter-frequency table. Then consult a 27-by-27 table of digram probabilities—that is, the probability of choosing a symbol given that you had just chosen another symbol. So if the first letter you drew was a B, you would look at the second row of the table to find the conditional probability of drawing A through Z (plus a space) given that you drew B first. The results of this experiment aren’t English, but they’re at least becoming pronounceable.
By the time you get to trigrams, you need a 27-by-27-by-27 probability table, which has 19683 entries in it. The procedure is otherwise the same. But now it’s really starting to look like language. Not your language, perhaps, but language. By conditioning on more of the sequence history, the resulting output becomes much more coherent.
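A rough sketch of this sampling procedure might look like the following; the probability tables here are tiny placeholders, standing in for frequencies estimated from a real corpus, and the same function handles digrams (order 1), trigrams (order 2), and beyond.

```python
import random

def sample_from(table):
    """Draw one symbol from a dict mapping symbol -> probability."""
    symbols, probs = zip(*table.items())
    return random.choices(symbols, weights=probs, k=1)[0]

def generate(unigram, conditional, order, length):
    """Shannon-style approximation: seed with draws from the unconditional
    distribution, then repeatedly sample the next symbol conditioned on the
    last `order` symbols. Falls back to the unigram table for unseen contexts."""
    sequence = [sample_from(unigram) for _ in range(order)]
    while len(sequence) < length:
        context = tuple(sequence[-order:])
        sequence.append(sample_from(conditional.get(context, unigram)))
    return "".join(sequence)

# Toy usage with a three-symbol "alphabet" and invented probabilities:
unigram = {"A": 0.4, "B": 0.4, " ": 0.2}
digrams = {("A",): {"A": 0.1, "B": 0.7, " ": 0.2},
           ("B",): {"A": 0.7, "B": 0.1, " ": 0.2},
           (" ",): {"A": 0.5, "B": 0.5, " ": 0.0}}
print(generate(unigram, digrams, order=1, length=30))
```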
At this point, you can’t blame Shannon for moving on from letters to words, since the next probability table he would have needed for 4-grams of letters would have had over 530 thousand entries, and he was doing this in the 1940s, when computers far weaker than the phone you had several upgrades ago were bigger than industrial refrigerators. Shannon began by picking words independently (that is, with no conditioning) according to their default frequencies in English, and the results are predictably ungrammatical nonsense. Remarkably, though, when the model complexity is increased only to the point where the next word is chosen taking just the previous one into account, the results make a huge leap in quality. In fact, the sentence Shannon randomly produced would not have seemed out of place in certain texts I had to read in college.
The order (or degree) of a sequence model describes how much of the sequence’s history is taken into account as context. Current usage differs from Shannon’s terminology very slightly. A zeroth-order model treats elements of a sequence independently.6 The probability of a length-$T$ sequence is just the product of the $T$ probabilities of its individual entries. In notation, we would write $$p(w_1, w_2, \cdots, w_T) = p(w_1) p(w_2) \cdots p(w_T) = \prod_{t=1}^T p(w_t),$$ where $w_t$ is the word or symbol at position $t$ in the sequence.
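In code, the zeroth-order sequence probability is just a product (or, equivalently, a sum of log-probabilities); the toy vocabulary and probabilities below are invented for illustration.

```python
import math

def zeroth_order_log_prob(sequence, unigram):
    """log2-probability of a sequence under a zeroth-order model:
    every symbol is scored independently, so probabilities multiply
    (equivalently, log-probabilities add)."""
    return sum(math.log2(unigram[w]) for w in sequence)

# Invented unigram probabilities for a toy three-word vocabulary:
unigram = {"the": 0.5, "cat": 0.3, "sat": 0.2}
log_p = zeroth_order_log_prob(["the", "cat", "sat"], unigram)
print(2 ** log_p)  # 0.5 * 0.3 * 0.2 = 0.03
```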
A first-order model takes the previous entry—and only the previous entry—into context, which is something known as a Markov chain.7 In notation, the probability of a sequence in a first-order model would be $$p(w_1, w_2, \cdots, w_T) = p(w_1) \prod_{t=2}^T p(w_t|w_{t-1}),$$ where one would interpret $p(w_2|w_1)$ as “the probability of the second word given (or conditioned on) the observation of the first word.” This is just the mathematical way of representing the procedure described a few paragraphs above.
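The first-order version differs only in where each factor comes from: every word after the first is scored by a conditional table instead of the unconditional one. Again, the toy probabilities are invented.

```python
def first_order_prob(sequence, unigram, bigram):
    """Probability of a sequence under a first-order (Markov chain) model:
    p(w1) * p(w2|w1) * ... * p(wT|w_{T-1})."""
    prob = unigram[sequence[0]]
    for prev, curr in zip(sequence, sequence[1:]):
        prob *= bigram[prev][curr]
    return prob

# Invented probabilities for the same toy vocabulary as above:
unigram = {"the": 0.5, "cat": 0.3, "sat": 0.2}
bigram = {"the": {"the": 0.05, "cat": 0.80, "sat": 0.15},
          "cat": {"the": 0.10, "cat": 0.05, "sat": 0.85},
          "sat": {"the": 0.60, "cat": 0.20, "sat": 0.20}}
print(first_order_prob(["the", "cat", "sat"], unigram, bigram))  # 0.5 * 0.80 * 0.85 = 0.34
```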
The Broad Horizon
Models of progressively higher order would simply condition on more of the past, dramatically increasing the coherence of their output in the process. But as we saw with the $n$-grams Shannon was working with, conditioning on more things exponentially increases the number of probabilities we have to keep track of. Modern language models are also called large language models because they are indeed enormous. GPT-3, for instance, has about 175 billion parameters; the model itself takes up about 800 GB of space, which is well beyond the capacity of any current consumer machine to even load into memory. Such models have the capacity to predict the next element of a sequence conditioned on more and more of the sequence’s history.
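To get a feel for why an explicit lookup table is hopeless at that scale, here is a back-of-the-envelope calculation; the 50,000-token vocabulary is a rough, round figure used only for illustration.

```python
# With 27 symbols (26 letters plus the space), a model that conditions on
# n-1 previous symbols needs an explicit table of 27**n entries:
for n in range(1, 6):
    print(n, 27 ** n)  # 27, 729, 19683, 531441, 14348907

# With a vocabulary of roughly 50,000 tokens and a 2048-token context,
# an explicit table would need 50_000 ** 2048 entries.
print(len(str(50_000 ** 2048)))  # the entry count alone has about 9,600 digits
```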
Modern models work with tokens that are common words or chunks of words, which amount to far more symbols than Shannon’s 27. But it turns out that even models that generate only one letter at a time, like Shannon’s did, can produce amazingly coherent output in a variety of styles given enough conditioning context. This was beautifully demonstrated by Andrej Karpathy in a 2015 blog entry, using an LSTM model far less sophisticated than what is in common use today.
The current state of the art in sequence modeling, which is certain to be surpassed in the very near future, permits sequences whose context (what the model conditions on) can be up to 2048 tokens (symbols), which amounts to about 1500 words.8 That’s a lot. That’s about three pages of single-spaced text, which immediately brings back unpleasant memories of school essay assignments with length requirements.
Despite GPT-3’s gargantuan parameter count of 175 billion, that number is dwarfed by the number of parameters that would be required to store a lookup table for a 2048-gram sequence model. Instead, these models rely on their ability to learn to compress the conditional relationships among tokens, exploiting regularities and patterns that exist within the huge volume of text that they are trained on. Just as native speakers of a language don’t think about the grammatical details of a sentence but simply speak the way they are used to hearing the language spoken, modern sequence models get used to the statistical patterns embedded in the data they are exposed to.
How these patterns are learned is where the sophistication and complexity come in. Nevertheless, the objective employed by all modern language models that are used for text generation is exactly the same as what Shannon explored back in 1948: Given what you’ve seen so far, what comes next?
Notes and References
1. To be fair, we should limit the length of the words to be shorter than 20 letters. Let’s say 10 or fewer.
2. Table source: https://blogs.sas.com/content/iml/2014/09/26/bigrams.html
3. I swear that I chose the word before I knew this.
4. THINKING
5. When using the base-2 logarithm, the units of information are referred to as bits. Sometimes these are referred to as shannons in Claude’s honor. When using the natural (base-$e$) logarithm, the units are nats. In base 10, they’re hartleys, bans, or dits.
6. Shannon called this a first-order model, reserving zeroth order for a model that assumed both independence and uniform probabilities.
7. Markov chains can have order or degree greater than one, but when someone mentions a Markov chain, it almost always means a chain of order one.
8. This limit of about 1,500 words (2,048 tokens, the common words or chunks of words the model treats as symbols) is for the context plus the output. But since the output can be one token at a time, we can consider the full amount to be the upper limit of the context.