In the last section, we noted that the storage-retrieval model has dominated our understanding of cognition. We now turn to the alternative. The most powerful AI systems today—particularly large language models like GPT-4, Claude, and LLaMA—do not work by storing and retrieving information. Instead, they operate through a process called autoregression. This process, which has revolutionized AI capabilities, may offer a more accurate model of how human cognition works as well.
At its most fundamental level, an autoregressive process is built on two key elements: a core function that generates outputs from inputs, and an iterative mechanism that feeds those outputs back as inputs. This pattern appears across many domains—from statistics and economics to signal processing and, as we'll see, cognition itself.
The essence of autoregression is that each new value in a sequence depends on previous values in that same sequence. Think of it as a recursive loop: the system generates an output, uses that output as part of its next input, generates another output, and so on. The power of this approach is that complex, extended patterns can emerge from a relatively simple underlying function that only looks at one step at a time.
A Simple Example: The Fibonacci Sequence
To understand autoregression, consider one of the simplest examples: the Fibonacci sequence. This famous sequence (0, 1, 1, 2, 3, 5, 8, 13, 21...) follows a straightforward rule: each number is the sum of the two preceding numbers. In autoregressive terms, we can break this down into two steps:
A function that takes the two most recent numbers as input and outputs their sum
A mechanism that incorporates this output into the next input state
Let's see how this works:
Current input: [0, 1]
Function: Add the two most recent numbers
Output: 1
This output now becomes part of the input:
Current input: [1, 1]
Function: Add the two most recent numbers
Output: 2
And the process continues:
Current input: [1, 2]
Function: Add the two most recent numbers
Output: 3
This simple function, applied iteratively, generates an endless sequence with fascinating mathematical properties. The key insight is that nowhere in this system is the "entire Fibonacci sequence" stored. It doesn't need to be. The sequence exists only as it's generated, one step at a time, through the application of a simple function to its own previous outputs.
Let's represent this process more clearly with a table:
Step | Input  | Function                        | Output
1    | [0, 1] | Add the two most recent numbers | 1
2    | [1, 1] | Add the two most recent numbers | 2
3    | [1, 2] | Add the two most recent numbers | 3
4    | [2, 3] | Add the two most recent numbers | 5

Each row represents a single application of our addition function. The key insight is that each output becomes part of the next input. After generating "1," the system pairs it with the most recent number from the previous input, forming the next two-number input, and runs the function again. Rinse and repeat.
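For readers who like to see the mechanism spelled out, here is a minimal sketch of this two-part structure in Python: a core function that maps the two most recent numbers to an output, and a feedback loop that appends each output to the state. The function names are illustrative, not standard terminology.

```python
# A minimal sketch of the Fibonacci sequence as an autoregressive process:
# a simple core function plus a feedback loop that turns each output into
# part of the next input.

def step(state):
    """The core function: take the two most recent numbers and return their sum."""
    return state[-2] + state[-1]

def generate(state, n_steps):
    """The feedback mechanism: append each output to the state and repeat."""
    state = list(state)
    for _ in range(n_steps):
        state.append(step(state))
    return state

print(generate([0, 1], 8))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Note that nothing in this sketch stores the Fibonacci sequence; the list grows only because each output is fed back in as part of the next input.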
From Simple Function to Neural Network
The Fibonacci example uses an extremely simple function (addition), but the same autoregressive principle works with more sophisticated functions. This is where neural networks enter the picture.
A neural network is fundamentally just another function that maps inputs to outputs. While we'll explore the technical details in later sections, for now, all you need to know is that these networks consist of a series of simple mathematical operations—primarily multiplications, additions, and some basic non-linear transformations. Despite this relative simplicity, they can reliably take a given input and produce an appropriate output.
Importantly, neural networks are trainable, meaning there's a systematic way to adjust exactly which multiplications and additions are performed on the inputs. This training process allows the network to learn from examples, but we'll save those details for later discussions.
Before going further, though, it's worth understanding in broad strokes how these neural networks learn to perform their input/output functions in the first place. The training process involves exposing the network to many examples of inputs paired with their desired outputs. For a handwriting recognition system, this means showing it thousands of images of handwritten digits along with their correct labels, as in the famous MNIST dataset of 70,000 small images of handwritten digits from 0 to 9.
The training follows a simple conceptual pattern: the network is given an input (like an image of a handwritten "7"), it produces an output based on its current parameters, and then these parameters are algorithmically adjusted to make the actual output closer to the desired output (the label "7"). This process repeats thousands or millions of times across many examples, gradually refining the network's ability to map inputs to appropriate outputs.
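As a rough illustration of this adjust-and-repeat pattern, here is a deliberately tiny sketch in Python. The single-parameter model, the example data, and the learning rate are illustrative assumptions chosen to keep the loop readable; a real network adjusts millions or billions of parameters, but the conceptual cycle of output, comparison, and adjustment is the same.

```python
# Toy illustration of training as parameter adjustment: produce an output,
# compare it to the desired output, and nudge the parameter to shrink the error.

examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # inputs paired with desired outputs (here, y = 3x)

w = 0.0              # the model's single adjustable parameter
learning_rate = 0.05

for epoch in range(200):                 # repeat over the examples many times
    for x, desired in examples:
        actual = w * x                   # forward pass: produce an output
        error = actual - desired         # how far off was it?
        w -= learning_rate * error * x   # adjust the parameter to reduce the error

print(round(w, 3))  # approximately 3.0: the mapping was learned, not looked up
```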
Next-token prediction in language models works on exactly the same principle. The network is given a sequence of text as input (like "The capital of France") and trained to output the token that actually follows in the training data (like "is"). By exposing the model to vast amounts of text—billions of words from books, articles, websites, and other sources—it learns the statistical patterns of language: which words typically follow which other words in which contexts.
The remarkable thing is that this conceptually straightforward training process—"here's some text, now predict the next token"—ends up capturing not just simple word associations but complex grammatical structures, factual knowledge, and even reasoning patterns that are embedded in the statistical regularities of language. The network doesn't explicitly learn grammar rules or store facts in a database; it simply adjusts its parameters to get better at predicting what comes next. Yet from this singular focus emerges a system that appears to "know" an astonishing amount about language and the world it describes.
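To make the "predict the next token" setup concrete, here is a sketch of how training pairs can be carved out of a piece of text: every position yields one example consisting of a context and the token that actually follows it. Splitting on whitespace is a simplifying assumption; real models use subword tokenizers.

```python
# Turning raw text into (context, next-token) training pairs.

text = "The capital of France is Paris"
tokens = text.split()  # simplification: treat each whitespace-separated word as a token

training_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in training_pairs:
    print(" ".join(context), "->", target)
# The -> capital
# The capital -> of
# ...
# The capital of France is -> Paris
```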
Return to handwriting recognition for a moment. The input might be a digital image of a handwritten digit, and the output is a determination: "This is the number 7." The network doesn't store a catalog of all possible 7s; instead, it has learned parameters that allow it to recognize patterns associated with 7s. This single input/output function takes an image and produces an identification in one forward pass.
Now, let's focus on a specific type of output: generating the next element in a sequence. Given the first few words of a sentence, a neural network can be trained to output what word might come next. For example:
Input: "Once upon a"
Neural network function: (series of mathematical operations)
Output: "time"
The term "prediction" is often used in this context, but it simply means generating the appropriate output based on the network's training. This is still just a single input/output operation—nothing autoregressive yet. The network takes in a sequence of words and outputs a single next word, all in one forward pass.
Autoregression in Language: Looping the Next-Word Function
Here's where the magic happens. Just as with the Fibonacci sequence, we can create an autoregressive process by feeding the output back into the input. This creates a loop:
A neural network function that generates the next word given a sequence of words
A mechanism that appends this generated word to the original sequence to create a new input
Let's see how this works with our example:
Step | Input                        | Output
1    | "Once upon a"                | "time"
2    | "Once upon a time"           | "there"
3    | "Once upon a time there"     | "was"
4    | "Once upon a time there was" | "a"

Each row represents a single forward pass through the neural network. The key insight is that each output becomes part of the next input. After generating "time," the system doesn't start over from scratch. Instead, it appends "time" to the original input and runs another forward pass to generate the next token.
This looping mechanism—taking the output of one forward pass and feeding it back as input for the next—is the essence of autoregression in language models. The network never generates more than one word at a time, yet through this iterative process, it can produce coherent, lengthy texts that appear to reflect planning and forethought.
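Here is that loop as a Python sketch. In a real language model the next-word function is a neural network forward pass; below, a small hard-coded lookup table stands in for it purely so the loop can be run on its own. The table is an expository crutch, and emphatically not how an actual model produces its output; only the feedback loop around it is the point.

```python
# The autoregressive loop: one word out per call, appended back onto the input.

def next_word(sequence):
    """Stand-in for one forward pass: return a single next word for the given sequence."""
    continuations = {
        "Once upon a": "time",
        "Once upon a time": "there",
        "Once upon a time there": "was",
        "Once upon a time there was": "a",
        "Once upon a time there was a": "princess",
    }
    return continuations.get(sequence, "<end>")

sequence = "Once upon a"
for _ in range(5):
    word = next_word(sequence)         # one forward pass: one word out
    sequence = sequence + " " + word   # feedback: append the output to the input

print(sequence)  # "Once upon a time there was a princess"
```

Notice that the loop never produces more than one word per pass; the longer sequence exists only because each output is folded back into the next input.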
The Surprising Power of Next-Token Generation
Perhaps the most remarkable aspect of modern language models—and in my view, the greatest scientific surprise of my career—is how extraordinarily well this simple next-token approach works. Despite only being trained to generate one word at a time, these systems produce outputs that display sophisticated linguistic structure, factual knowledge, and even what appear to be reasoning capabilities.
What makes this so surprising is that the outputs often have the appearance of being planned many steps ahead. A well-constructed paragraph, a logical argument, or a coherent story seems to require knowing where you're going before you start. Yet these models have no capability to plan ahead—they genuinely generate one token at a time, with each token influencing what comes next.
This unexpected effectiveness suggests something profound: language itself, as a system, appears to have evolved to enable complex thought through simple next-token processes. The truly remarkable insight is that even though we can see the entire "magic trick"—we know there is no explicit planning mechanism beyond generating one token at a time—language somehow contains within itself the capacity to anticipate and construct extended, coherent thoughts. It's as if language evolved specifically to allow complex meaning to emerge from simple sequential prediction.
Maybe what we perceive as our careful planning and forethought in communication is actually an emergent property of a system that's generating language one unit at a time, with each unit influencing what comes next. The shocking implication is that our subjective experience of "knowing where we're going" with our thoughts might be an illusion—the system doesn't need to know where it's going to get there. If complex, coherent language can emerge from simple next-token generation in AI systems, perhaps our own linguistic abilities—and even our thoughts themselves—function in a fundamentally similar way.
The discovery that language can operate through simple next-token autoregression presents us with two profound mysteries that challenge our fundamental understanding of cognition and communication.
The first mystery, as we've just seen, is the appearance of planning and foresight in a system that has none. How can a process that only looks one step ahead create outputs that appear to be guided by long-term objectives? This "illusion of planning" suggests that what we experience as intentional, goal-directed thinking might actually emerge from much simpler sequential processes.
But there's an even deeper mystery: how does a self-contained generative system like language manage to communicate meaningfully about the external world? The parameters of a language model are shaped by exposure to text, not direct experience with reality. Yet somehow, the statistical patterns of language contain within them the capacity to represent and reason about objects, events, and concepts that exist outside of language itself.
This is truly astonishing—language, which appears to operate through purely internal statistical regularities, nevertheless provides a window onto reality. The words "apple," "gravity," and "democracy" don't just predict other words; they connect to actual things and concepts in the world. How does a system built solely on predicting the next word in a sequence develop representations that correspond to reality?
Both of these mysteries—the illusion of planning and the apparent connection to reality—suggest that language doesn't operate the way we've traditionally assumed. These realizations fundamentally upend standard notions of semantics and reference. If language functions through autoregressive generation rather than storage and retrieval, then perhaps words don't "refer" to objects or concepts in the naïve sense at all—whatever "refer" was supposed to mean in traditional theories.
Instead, what we're seeing is that language plays its communicative and thinking roles through some other mechanism entirely. The capacity of language to seemingly represent the world and support reasoning might emerge from statistical patterns of token generation rather than from symbolic reference. This demands a complete rethinking of how language does what it does—from how meaning arises to how communication succeeds to how thought itself operates.
Later in this manuscript, we'll explore these profound implications more thoroughly, but for now, it's enough to recognize that the autoregressive view of language doesn't merely offer an alternative mechanism—it challenges the very foundations of how we conceptualize language, thought, and their relationship to reality.
The Contrast with Conventional Computing
Now that we've introduced the basics of autoregression, we can appreciate how radically the transformer approach to 'storing' information differs from that of a conventional computer.
To appreciate how radical this approach is, it helps to understand how conventional computers actually store sequential information. When you save a text sequence like "Once upon a time there was a princess" on your computer, each character is encoded as a specific pattern of 1s and 0s in a designated physical location. This is genuine storage: the entire sequence exists in a discrete, fixed form. When you access this sequence later, your computer retrieves exactly those same bits from those specific locations, reconstructing the complete sequence as it was stored.
Now, one might argue that the forward pass of a neural network itself represents a kind of limited storage and retrieval, albeit in a distributed form rather than a simple memory file at an address. The network's weights could be seen as "storing" information—such as the correct class of an image or the next token after a given sequence—though in a highly transformed way that is very different from how a conventional computer stores it. However, this comparison completely breaks down in the case of autoregressive sequence generation: here, the "retrieval" depends on generating a single output and then looping it back as input to the single-forward-pass system. There is simply no way in which we can say the sequential information is "in there"—even in a distributed form.
When ChatGPT generates the sequence "Once upon a time there was a princess," that sequence wasn't stored anywhere in the system, not even in a distributed or encoded form. It emerged from the iterative application of the next-token function. The system doesn't contain the sequence or even subsequences—it contains only the capacity to generate tokens that, when strung together through autoregression, form coherent sequences.
This distinction is crucial: conventional computers store complete sequences; autoregressive systems generate sequences one element at a time without ever storing the whole.
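The Fibonacci example makes the contrast easy to state in code. In the sketch below, the stored list exists in full before it is ever read, while the generator holds nothing but a rule and the two most recent values, so each number exists only at the moment it is produced. This is a toy comparison, not a claim about any particular system.

```python
# Storage: the complete sequence exists, in fixed form, before it is ever read.
stored = [0, 1, 1, 2, 3, 5, 8, 13, 21]
print(stored[7])  # retrieval: read back exactly what was stored (13)

# Generation: only a rule and the two most recent values are held.
def fib_stream():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

stream = fib_stream()
print([next(stream) for _ in range(9)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21]
```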
From Storage to Generation: A Paradigm Shift
To appreciate the radical nature of this shift, consider how differently the storage-retrieval and autoregressive models would explain common cognitive phenomena:
Storage-Retrieval Model: When you recite your ABCs, you're accessing a stored representation of the alphabet sequence that exists as a complete entity in your memory. When you state that Paris is the capital of France, you're retrieving a stored fact from your semantic memory. When you express a political view, you're accessing stored beliefs. In theory, if we could perfectly decode the brain, we would find these entire sequences and facts stored holistically in some form.
Autoregressive Model: When you recite your ABCs, you're not retrieving a stored sequence; you're generating each letter through an autoregressive process. The first element "A" triggers the generation of "B," which triggers "C," and so on. There is no complete alphabet sequence stored in your brain. If neuroscientists could perfectly decode your brain, they wouldn't find the alphabet stored as a unit; they would only find patterns of neural parameters that predispose you to generate each letter given the preceding one. The sequence only exists when it's actively generated through this recursive chaining process. Similarly, "knowing" that Paris is the capital of France means having neural parameters that reliably generate this output when prompted with relevant inputs. Having a political "belief" means having a propensity to generate certain types of sequences when relevant topics arise.
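As a toy illustration of the autoregressive account, the alphabet recitation can be written as a rule that produces each letter from the one before it, with no complete sequence stored anywhere. Using character codes for the rule is, of course, an expository shortcut and not a claim about neural mechanisms.

```python
# Reciting the alphabet without a stored alphabet: only a next-letter rule.

def next_letter(letter):
    """Produce the following letter, or None once the letter Z is reached."""
    return chr(ord(letter) + 1) if letter < "Z" else None

letter = "A"
recited = []
while letter is not None:
    recited.append(letter)         # the sequence exists only as it is generated
    letter = next_letter(letter)   # each letter triggers the next

print("".join(recited))  # ABCDEFGHIJKLMNOPQRSTUVWXYZ
```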
Learning as Parameter Adjustment
If cognition operates through autoregressive generation rather than storage-retrieval, then learning takes on a different character as well. Rather than "storing" new information, learning involves adjusting the parameters of the generative system to increase the likelihood of producing appropriate sequences in response to relevant inputs.
When you learn a new fact, you're not adding data to a storage bank; you're adjusting network parameters to make certain sequence generations more likely. This is fundamentally different from how we typically imagine learning—there is no discrete "file" being created, no new entry added to a database. Instead, the entire network of parameters shifts slightly, altering its generative tendencies. This is why learning is distributed and associative rather than compartmentalized.
When you form a new memory, you're not creating a record but altering the generative tendencies of your neural networks. This view of learning helps explain why practice and repetition work: they strengthen the parameters that generate certain responses, making them more likely to emerge when similar contexts arise in the future. It also explains why learning is rarely all-or-nothing—the parameters continue to adjust with each exposure, gradually refining the generated outputs rather than suddenly creating a perfect "file."
The Illusion of Storage and Retrieval
If there are no stored memories or knowledge, why do we so strongly experience cognition as storage and retrieval? The generation process itself is typically unconscious—we experience only its product, which appears in consciousness as if retrieved. Well-trained generative systems produce consistent outputs given similar inputs, creating the impression of accessing the same stored information repeatedly.
Our metaphors for mind (like "memory storage") and the influence of computer technology have reinforced the storage-retrieval model in our thinking. But this model persists not because it accurately reflects cognitive architecture but because it aligns with our subjective experience. The autoregressive model requires us to look beyond this immediate experience to understand the generative processes that create it.
An autoregressive model of cognition also explains the prevalence of contradictory beliefs better than traditional models do: an individual simply has the propensity to espouse different beliefs in different contexts, with no single stored belief that must be kept consistent across them.