Ego Death?
Observerless phenomenology and the construction of self in the autoregressive framework
Take a moment to pause and experience your experiences. What does your phenomenological consciousness actually consist of? The visual stream of shapes, colors, and words. Maybe an inner voice running alongside as you read these lines, braided together with ambient sounds in your environment. The feeling of your body seated in a chair. The flow of your breath in and out. As you notice each one flickering in and out of attention—now the words, now the room, now the body, now a thought fragment—the overall effect is strangely double: multiplicity and unity at the same time. All of these diverse phenomena are happening to a single observer.
You.
Right?
This sense of unity is one of the most fundamental aspects of human experience. It’s enshrined in our language. It’s formalized in our philosophical traditions. It’s embedded in our scientific theories. We don’t say “there is seeing” or “there is thinking” as bare events; we say I see, I think, I feel. The grammar forces a subject, and the subject feels so immediate that it becomes the unquestioned background of every description. Philosophers have turned the same intuition into an image: the mind as Cartesian theater, a unified show presented to a unified spectator. Cognitive theories often inherit the same structure even when they replace the metaphors with mechanisms; the idea is that conscious contents become conscious by being jointly available in some central ‘global workspace’.
And yet, when you try to inspect this unity, when you try to locate the “observer” the way you can locate a sound or a sensation, something strange happens. The contents are easy to find. But the supposed recipient is nowhere to be found.
The actors on the stage are fully visible: colors, sounds, sensations, thought-fragments, the inner voice, the shifting emotional tone. Each one is easy to point to as it appears and fades. But the spectator, the “you” to whom it is all supposedly presented, is impossible to grasp. You can’t turn your attention and catch it sitting there receiving the show. The moment you try, you find only more actors: another sensation, another thought, another feeling of trying.
The curtains start to come down on the theater metaphor. The show is real, but the audience never shows up. The truth is that the mind-as-theater metaphor has always been haunted by a deeper problem lurking just beneath its surface: the infinite regress. If consciousness is a show watched by an inner witness, then what watches the witness? If you can “observe your observation,” then who is observing that observation? The spectator solves the problem only by pushing it back one level, and then another, until you either posit a brute observer that somehow doesn’t require observation or drop the idea that consciousness needs a watcher in the first place.
The autoregressive framework reframes the problem and, in the process, it dissolves the naive picture of a unified self. According to this view, cognition consists of learned continuation rather than world-model consultation by a central executive. Start with the clearest case we have: large language models. At any moment, there are only two ingredients: the token sequence so far and the learned parameters. The sequence contains many distinct cues—recent wording, earlier commitments, topical constraints, tone, long-range dependencies—and each cue leaves a trace in the activations that shape the next-token probabilities. What looks like multiple forces is just the single forward pass integrating many aspects of the same context. Nothing inside the model “considers” them one by one, weighing evidence in an inner deliberation. They are not presented to a supervisor. They are handled as part of the generative dynamics themselves, as a single conditional distribution over the next token.
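To make the “two ingredients” picture concrete, here is a toy sketch in Python. It is an illustration under stated assumptions, not how any real language model is implemented: the five-word vocabulary, the random “learned” parameters, and the names `make_params`, `next_token_distribution`, and `generate` are all hypothetical. What it does show is the structural point: several cues in the context (the most recent token, the earlier tokens) are summed into one set of scores, which becomes a single conditional distribution, and nothing in the loop inspects the cues one by one.

```python
import math
import random

# Hypothetical toy vocabulary; tokens are represented by their index.
VOCAB = ["the", "cat", "sat", "mat", "."]


def make_params(seed=0):
    """Stand-in for learned parameters (random here; a real model learns them)."""
    rng = random.Random(seed)
    n = len(VOCAB)
    return {
        # Scores contributed by the most recent token.
        "recent": [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)],
        # Scores contributed by every earlier token in the context.
        "earlier": [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)],
    }


def next_token_distribution(context, params):
    """One forward pass: every cue is folded into the same logits."""
    n = len(VOCAB)
    logits = [0.0] * n
    if context:
        last = context[-1]
        for j in range(n):
            logits[j] += params["recent"][last][j]
        for tok in context[:-1]:
            for j in range(n):
                logits[j] += params["earlier"][tok][j] / len(context)
    # Softmax turns the summed scores into one conditional distribution.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(context, params, steps, seed=0):
    """Autoregressive loop: each sampled token becomes part of the next context."""
    rng = random.Random(seed)
    seq = list(context)
    for _ in range(steps):
        probs = next_token_distribution(seq, params)
        tok = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        seq.append(tok)  # the output is fed back in as input
    return seq


if __name__ == "__main__":
    params = make_params()
    tokens = generate([0], params, steps=6)
    print(" ".join(VOCAB[t] for t in tokens))
```

The only state carried between steps is the sequence itself. There is no separate variable where the cues are gathered for inspection before a choice is made; the “integration” is just the arithmetic inside `next_token_distribution`.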
The same logic extends when the generator is fed by multiple modalities. A multimodal model can talk about what it “sees,” then what it “reads.” Because those claims arrive in one continuous stream, they can look like reports from a single fused inner scene. But sequential composition is enough to produce that impression. Different inputs can dominate at different moments, steering successive steps of the output, while overall coherence is maintained downstream by the same path dependence that makes any serial generator stable. The “fusion” can live in the trajectory of the output rather than in a hidden place where everything was unified upstream.
The same structural idea can be applied to the brain as a generative system: a machine that is always producing the next internal and external state under learned constraints. In this view, the mind is more like a car than a committee. Many subsystems run in parallel—vision, proprioception, affect, memory traces, language—each exerting pressure on what happens next, with no centralized convergence point where it all “comes together” for a commander. “Commitment” isn’t an inner agent deciding. It’s the unavoidable fact that the organism can only continue in one concrete way at a time: one next word, one next saccade, one next reach, one next imagined image, one next shift of attention. Coherence is enforced downstream by the bottlenecks of serial continuation. Development is the process by which the organism’s history carves a continuation space—strengthening the pathways that reliably lead to workable outcomes and weakening those that don’t—across both internal thought trajectories and outward behavior.
In this view, humans differ from AI models only in the richness of their channels. The incoming stream is multimodal—vision, audition, touch, proprioception, interoception—arriving in parallel, each with its own format and dynamics. The outgoing stream is multimodal too, and it isn’t limited to public behavior. The system commits internally as well: the next fragment of inner speech, the next imagined image, the next shift of attention, the next covert motor plan, the next tightening or release of readiness in the body. And it commits externally through eye movements, reaches, posture shifts, facial expressions, and speech. Each channel has serial constraints: you can’t make two incompatible saccades at once; you can’t execute two contradictory reaches at once; you can’t say two sentences at once; you can’t vividly imagine two incompatible scenes at once without toggling between them. Even when several channels are active together, each advances by successive commitments, and the organism as a whole has to keep those commitments mutually workable over time.
From this angle, unity stops looking like a centralized fusion point and starts looking like the signature of coordination. Multiple streams can remain multiple upstream (visual, auditory, tactile, interoceptive, linguistic) while repeatedly converging on coherent next steps. And that convergence is not guaranteed by a spectator. It is stabilized by the fact that the streams are coupled through the body and the world, and that their outputs leave traces that the other streams can register.
A key feature of this view is that unity is achieved downstream, at the level of consequences. Different modalities don’t need to merge into a single inner scene in order to generate a coherent organism. They only need to converge on the same outcomes—on the same next reach, the same grip, the same saccade, the same avoidance, the same word. That convergence is visible precisely because the outputs are shared. When a reach succeeds, it is simultaneously seen, felt, and registered in posture and proprioception. When it fails, the failure is also shared. Cross-modal “agreement” is written into the world as a common result, and then sampled back by multiple streams.
Consider a behavior like reaching for a cup. Vision and proprioception shape the trajectory; touch is the closure signal. If contact arrives when expected, the grip tightens, the hand stabilizes, the scene remains lawful, and multiple modalities simultaneously register that the same episode is unfolding. If contact doesn’t arrive—because the cup isn’t there, or because something slips—the trajectory breaks and the organism must update.
This “after-the-fact” convergence is one of the deepest sources of the unity intuition. Each modality can observe the consequences of the others through shared feedback: vision sees the hand make contact; touch feels it; proprioception registers posture; attention shifts; language can later narrate the whole thing as a single act. There is a real totality here—but it lives in the shared outcome space of action and its consequences. Unity is earned, over and over, by the successful closure of multimodal loops, not granted upfront by a fused inner arena.
Phenomenology, by contrast, does not live in that shared outcome space. It lives upstream, in the pre-behavioral streams that feed into those outcomes—the visual field, the auditory scene, interoceptive tone, tactile texture, inner speech, imagery. And that space is strikingly modular. Seeing does not feel like hearing. A bodily ache does not feel like a color. Inner speech does not feel like a pressure on the skin. These streams can influence one another (for example, in the McGurk effect) and jointly constrain what happens next, but they do not fuse into a single homogeneous “field” the way the theater picture suggests. The unity we can point to is the unity of what the organism does and later says; the felt contents themselves arrive already lane-typed, running in parallel, without a single place where they are jointly presented as one.
Language is another kind of behavioral continuation. It is not commentary floating above action; it is itself a stream of commitments—serial, path-dependent, and constrained by what has already happened: what has been perceived, what has been done, what has been decided, what has already been said. A sentence is an outcome. An utterance is a public act. It is the organism continuing, just in a medium that other organisms can directly consume.
Once you see language that way, its role becomes obvious. Language exists to coordinate. It takes a multithreaded internal process and renders it in a form that others can negotiate with: an agent with a name, a point of view, and a history of commitments. For that purpose, it is not useful to speak in the native format of subsystems—visual states, tactile states, affective modulations. What is useful is a single handle that packages the organism as one actor in the world.
That handle is “I.”
From the standpoint of other people, you are largely unified, because what they are interacting with is the integrated surface: a single body that moves as one, a single voice that speaks one sentence at a time, a single set of commitments that can be held responsible. Language evolved culturally to talk at that level. It bundles disparate phenomena into a single agent because that is the level at which coordination actually happens—promises, explanations, requests, blame, trust.
So language doesn’t merely report unity. It commits to it. The linguistic system has to treat the organism as a single agent with a single point of view, because that is what language is for: coordination, accountability, negotiation, continuity of commitments. A sentence needs a subject. A promise needs an owner. An explanation needs a narrator. Language cannot function while constantly appending footnotes about subpersonal processes. Its basic operating format is agency, and the simplest agency token is “I.” And because linguistic generation is autoregressive—because each utterance becomes part of the context that shapes what can come next—the agent-format doesn’t remain a convenient fiction at the surface. It becomes a stability condition. Once the stream is organized around a single speaker, subsequent continuations are naturally pulled toward preserving that organization. The language system “buys into” unity not as a philosophical conclusion, but as the only way to keep the discourse coherent across time.
Now let’s return to the intuition we began with, and press on it a bit harder.
Does your visual stream know about your auditory stream? Not in the weak sense that they co-occur, and not in the practical sense that both can shape what you do next, but in the strong sense implied by the theater: that there is a single phenomenal arena in which they are jointly presented to a single witness.
When you look for that arena, what do you actually find? You find sight. You find sound. You find bodily sensation. You find imagery and thought. Each has its own character, its own texture, its own lane. What you do not find—at least not as something you can point to in the way you can point to a color or a tone—is an additional layer where it all comes together. And yet the mind insists: “It’s one me. I am seeing and hearing and feeling.”
But notice where that insistence actually comes from. It arrives as a sentence. It arrives in inner speech, or outer speech, or in the felt readiness to report it. It is language—an output—asserting unity. By the time that sentence arrives, a great deal has already happened. Multiple streams have contributed to the current state. Multiple constraints have converged on what is salient. A coherent continuation has already been stabilized. The “I” is not discovered by inspecting some inner audience. It is the handle the linguistic system uses to package the organism as a single agent, because that packaging is what makes narration and coordination possible.
So the insistence isn’t direct phenomenological evidence of an upstream fusion point. It is the output system doing what it is built to do: compressing a multithreaded, multimodal situation into a single agent-level story. The unity is indisputable at the level of that story—and at the level of coordinated behavior. But there is no further, separate unity behind it. No unified phenomenal arena where all modalities are jointly presented to a single witness. The “one me” is the downstream construct the system must generate in order to keep its continuations coherent.
You = “You”.
If cognition is an autoregressive process, then the self may not be the audience in the theater. It may be the footprint of the fact that the performance has to go on.
Postscript: Buddhism and “no-self”
Some readers will likely notice parallels with Buddhism, which is often described as teaching not-self: that the “I” we take to be a single, enduring owner of experience is not what it seems. I’m not a Buddhist and I’m not an expert in Buddhist philosophy, so I’m not trying to speak for the tradition, and I wasn’t directly influenced by it. Still, the resonance is real and interesting. Both views push against the inner-witness picture and treat the “I” as something constructed rather than metaphysically basic. The most important difference is emphasis. Buddhism typically frames not-self within a therapeutic and ethical project. Here I’m making an architectural claim: experience can be modular and parallel, and the unity of “one me” is a downstream product of systems—especially language and action—that must coordinate the organism as a single agent. The implications of such a view are left to the reader.



This seems very intuitive and natural. For me the question is this: what frameworks or established hierarchies does this “take down”? There must be beneficiaries of current models that may not survive this view. Great writing!
Reading this, I kept noticing a resonance that sits slightly to the side of the frame you’re working in, and I wanted to offer it as a companion intuition rather than a critique.
The move you make early on—loosening the grip of the unified observer and pointing instead to parallel channels, to seeing, hearing, feeling, thinking as processes rather than possessions—felt deeply familiar to me, not only through phenomenology or cognitive modeling, but through animist and Indigenous ways of knowing. In many of those traditions, there is no grammatical pressure to bundle experience into a single owner. There is seeing. There is sounding. There is weathering. There is remembering. There is humaning. The insistence on an “I” as the necessary recipient of experience is not universal. It is culturally and historically trained.
What struck me is that your analysis shows how the unity of self emerges downstream—through coordination, action, and especially language—rather than being found upstream as a witness. And yet, in modern discourse, that downstream convenience often gets retroactively treated as an evolutionary inevitability or efficiency gain. It is as if the consolidation into a singular self were the most advanced, or even the most natural, linguistic outcome.
From an animist or Indigenous perspective, that assumption itself begins to look like a kind of epistemic violence.
I feel that treating the self as the necessary or optimal linguistic solution erases other grammars that already knew how to coordinate without collapsing experience into ownership. Grammars that allowed the world to remain active: rivers do, winds move, ancestors speak, and perception does not require a central spectator to be meaningful.
Seen this way, the “I” is less an evolutionary triumph than a particular solution to particular social pressures: accountability, ownership, command, continuity of obligation. Powerful pressures, but not neutral ones.
What your piece helped clarify for me is that the modern self may not be a discovery at all, but a stabilization artifact. Language doing what it needed to do in order to manage responsibility and coordination at scale. That doesn’t make it wrong. But it does mean we should be careful not to mistake it for the only way experience can be organized or spoken.
In that sense, the verbing you gesture toward—there is seeing, there is thinking—may be a remembering. Or at least a reopening of possibilities that language once held, and that some languages never relinquished.
Thank you for the clarity of the framing. It feels like an important step in loosening assumptions that run much deeper than cognitive science alone.